[MUSIC PLAYING]

LAURENCE MORONEY: Hi, and welcome back to this series on Zero to Hero with TensorFlow, where we're looking at natural language processing. In the last couple of episodes, you saw how to tokenize text into numeric values and how to use tools in TensorFlow to regularize and pad that text. Now that we've gotten the preprocessing out of the way, we can look at how to build a classifier to recognize sentiment in text.

We'll start with a dataset of headlines, where each headline has been categorized as sarcastic or not. We'll train a classifier on this, and it can then tell us whether a new piece of text looks like it might be sarcastic. We'll use Rishabh Misra's dataset from Kaggle, and you can find details on it here. The data is nice and simple. The is_sarcastic field is 1 if the headline is sarcastic and 0 otherwise. There is a headline, which is the text we'll train on, and then there's a URL to the article if you're interested in reading it. But we're not going to use the URL, just the headline text.

The data is stored in JSON format like this, pretty straightforward. We'll have to convert it to Python format for training, so it will look like this: every JSON element becomes a Python list element, and it's all encapsulated in square brackets. Python has a JSON toolkit that can achieve this. And here's the complete code. We'll go through it step by step. First of all, we import the JSON library. Then we can load in the sarcasm JSON file using the JSON library. We create lists for the labels, the headlines, and the article URLs, and as we iterate through the JSON, we load the requisite values into our Python lists. Now that we have three lists, one with our labels, one with the text, and one with the URLs, we can start doing the familiar preprocessing on the text. Here's the code. By calling tokenizer.fit_on_texts with the headlines, we'll create tokens for every word in the corpus. And then we'll see them in the word index.
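That loading step can be sketched roughly like this. A small inline JSON string stands in for the downloaded sarcasm.json file here, and the example links are placeholders, not real article URLs:

```python
import json

# A tiny inline sample standing in for the sarcasm.json file from Kaggle.
# The article_link values are placeholders for illustration.
raw = """[
  {"is_sarcastic": 1, "headline": "scientist unveils doomsday clock of hair loss",
   "article_link": "https://example.com/article-1"},
  {"is_sarcastic": 0, "headline": "new smartphone released with longer battery life",
   "article_link": "https://example.com/article-2"}
]"""

datastore = json.loads(raw)  # with a real file: json.load(open("sarcasm.json"))

# One list per field, filled by iterating through the JSON.
sentences, labels, urls = [], [], []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])

print(sentences[0])
print(labels)
```

With the real dataset, the only change is reading the file from disk instead of an inline string; the three lists are then ready for tokenizing.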
You can see an example of some of the words here. So "underwood" has been tokenized as 24127, and "skillingsbolle" (what is that, anyway?) as 23055. Now we can turn our sentences into sequences of tokens and pad them to the same length with this code. If we want to inspect them, we can simply print them out. Here you can see one tokenized sentence and the shape of the entire corpus: 26,709 sequences, each with 40 tokens.

Now, there's a problem here. We don't have a split in the data for training and testing; we just have a list of 26,709 sequences. Fortunately, Python makes it super easy for us to slice this up. Let's take a look at that next. We have a bunch of sentences in a list and a bunch of labels in a list. Slicing them into training and test sets is actually pretty easy. If we pick a training size, say 20,000, we can cut the data up with code like this. The training sentences will be the first 20,000, sliced with this syntax, and the testing sentences will be the remaining slice, like this. And we can do the same for the labels to get a training and a test set.

But there's a bit of a problem. Remember earlier we used the tokenizer to create a word index of every word in the set? That was all very well, but if we really want to test the model's effectiveness, we have to ensure that the neural network only sees the training data, and that it never sees the test data. So we have to rewrite our code to ensure that the tokenizer is fit only to the training data. Let's take a look at how to do that now. Here's the new code to create our training and test sets. Let's look at it line by line. We first instantiate a tokenizer as before, but now we fit the tokenizer on just the training sentences that we split out earlier, instead of the entire corpus. And instead of one overall set of sequences, we create a set of training sequences and pad them, and then do exactly the same thing for the test sequences. It's really that easy.
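The slicing itself is plain Python list slicing. Here's a minimal sketch; the variable names follow the description above, and a tiny training_size is used so the example is self-contained:

```python
# In the video the training size is 20,000; a tiny value here for illustration.
training_size = 3

sentences = ["headline one", "headline two", "headline three",
             "headline four", "headline five"]
labels = [0, 1, 0, 1, 1]

# Slice syntax: everything up to training_size, then everything after it.
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

print(len(training_sentences), len(testing_sentences))
```

The key point from the transcript then applies: the tokenizer should be fit on training_sentences only, and both training and test sentences are converted to sequences with that same tokenizer.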
But you might be wondering at this point: we've turned our sentences into numbers, with the numbers being tokens representing words, but how do we get meaning from that? How do we determine whether something is sarcastic just from the numbers? Well, here's where the concept of embeddings comes in.

Let's consider the most basic of sentiments: something is good, or something is bad. We often see these as opposites, so we can plot them as having opposite directions, like this. So then what happens with a word like "meh"? It's not particularly good, and it's not particularly bad; probably a little more bad than good. So you might plot it a bit like this. Or the phrase "not bad," which is usually meant to describe something as having a little bit of goodness, but not necessarily very good. So it might look like this. Now, if we plot these on an x- and y-axis, we can represent the good or bad sentiment as coordinates in x and y. Good is (1, 0); "meh" is (-0.4, 0.7); et cetera. By looking at the direction of the vector, we can start to determine the meaning of the word.

So what if we extend that into multiple dimensions instead of just two? What if words from sentences labeled with sentiments, like sarcastic and not sarcastic, are plotted in these multiple dimensions? Then, as we train, we try to learn what the direction in this multi-dimensional space should look like. Words that appear only in the sarcastic sentences will have a strong component in the sarcastic direction, and others will have one in the not-sarcastic direction. As we load more and more sentences into the network for training, these directions can change. And when we have a fully trained network and give it a set of words, it can look up the vectors for those words, sum them up, and thus give us an idea of the sentiment. This concept is known as embedding. So going back to this diagram, consider what would happen if I said something was "not bad, a bit meh."
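That vector intuition can be made concrete with a tiny sketch. The 2D coordinates below are the illustrative values from the diagram, not learned embeddings, and the "not bad" vector is an assumption chosen to match the walkthrough:

```python
# Illustrative 2D "embedding" vectors from the good/bad diagram.
# These are hand-picked for the example, not values a network learned.
vectors = {
    "good":    (1.0, 0.0),
    "meh":     (-0.4, 0.7),
    "not bad": (0.5, 0.0),  # a little bit of goodness, assumed for illustration
}

def sum_vectors(words):
    """Sum the component vectors, the way a trained network sums embeddings."""
    x = sum(vectors[w][0] for w in words)
    y = sum(vectors[w][1] for w in words)
    return (round(x, 2), round(y, 2))

print(sum_vectors(["not bad", "meh"]))  # → (0.1, 0.7)
```

The summed vector for "not bad, a bit meh" lands at 0.1 on x and 0.7 on y, which is exactly the "slightly on the good side of neutral" result described next.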
If we were to sum up the vectors, we'd have something that's 0.1 on x and 0.7 on y. So its sentiment could be considered slightly on the good side of neutral.

So now, let's take a look at coding this. Here's my neural network code. The top layer is an embedding, where the direction of each word will be learned epoch by epoch. After that, we pool with a global average pooling layer, adding up the vectors as I demonstrated earlier. This is then fed into a common or garden deep neural network. Training is now as simple as model.fit, using the training data and labels and specifying the testing padded sequences and labels as the validation data. At this URL, you can try it out for yourself. And here you can see the results I got training it for just 30 epochs. While it was able to fit the training data to 99% accuracy, more importantly, on the test data, that is, data the network has never seen, it still got 81% to 82% accuracy, which is pretty good.

So how do we use this to establish the sentiment of new sentences? Here's the code. Let's create a couple of sentences that we want to classify; the first one looks a little bit sarcastic, and the second one is quite plain and boring. We'll use the tokenizer that we created earlier to convert them into sequences, so the words will have the same tokens as the training set. We'll then pad those sequences to the same dimensions as those in the training set, using the same padding type, and we can then predict on the padded set. The results look like this. The first sentence gives me 0.91, which is very close to 1, indicating a very high probability of sarcasm. The second gives 5 times 10 to the minus 6, indicating an extremely low chance of sarcasm. It does seem to be working. All of this code is runnable in a Colab at this URL, so give it a try for yourself. You've now built your first text-classification model to understand sentiment in text.
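A sketch of that architecture in Keras looks like the following. The vocabulary size, embedding dimension, and dense-layer width are typical values for this exercise and are assumptions here, not a guaranteed match to the Colab:

```python
import tensorflow as tf

def build_model(vocab_size=10000, embedding_dim=16):
    """Embedding -> global average pooling -> small dense network."""
    model = tf.keras.Sequential([
        # Learns a direction (vector) for each word, epoch by epoch.
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        # Pools by averaging the word vectors across each sentence.
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(24, activation="relu"),
        # Single sigmoid output: close to 1 means likely sarcastic.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

model = build_model()
# Training would then be:
# model.fit(training_padded, training_labels, epochs=30,
#           validation_data=(testing_padded, testing_labels))
```

The binary cross-entropy loss pairs with the single sigmoid output: the model emits one probability per headline, which is why the predictions described above come back as values near 1 (sarcastic) or near 0 (not).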
Give it a try for yourself, and let us know what kind of classifiers you built. I hope you've enjoyed this short series, and there's more on the way. So don't forget to hit that Subscribe button and get the latest and greatest in AI videos right here on the TensorFlow Channel. [MUSIC PLAYING]
Training a model to recognize sentiment in text (NLP Zero to Hero, part 3). Published by 林宜悉 on January 14, 2021.