Training a model to recognize sentiment in text (NLP Zero to Hero, part 3)

  • [MUSIC PLAYING]

  • LAURENCE MORONEY: Hi, and welcome back

  • to this series on Zero to Hero with TensorFlow,

  • where we're looking at Natural Language Processing.

  • In the last couple of episodes, you

  • saw how to tokenize text into numeric values,

  • and how to use tools in TensorFlow

  • to regularize and pad that text.

  • Now that we've gotten the preprocessing out of the way,

  • we can next look at how to build a classifier

  • to recognize sentiment in text.

  • We'll start by using a dataset of headlines,

  • where the headline has been categorized

  • as sarcastic or not.

  • We'll train a classifier on this,

  • and it can then tell us afterwards

  • if a new piece of text looks like it might be sarcastic.

  • We'll use Rishabh Misra's dataset from Kaggle,

  • and you can find details on it here.

  • The data is nice and simple.

  • The is_sarcastic field is 1 if it's sarc-y, and 0 otherwise.

  • There is a headline, which is the text we'll train on,

  • and then there's a URL to the article

  • if you're interested in reading it.

  • But we're not going to use this, just the headline text.

  • The data is stored in JSON format

  • like this, pretty straightforward.

  • We'll have to convert it to Python format for training,

  • so it will look like this.

  • Every JSON element becomes a Python list element,

  • and it's all encapsulated in square brackets.

  • Python has a JSON toolkit that can achieve this.

  • And here's the complete code.

  • We'll go through it step by step.

  • First of all, we'll import the JSON library.

  • Then, we can load in the sarcasm JSON

  • file using the JSON library.

  • We can then create lists for the labels, headlines, and article

  • URLs.

  • And when we iterate through the JSON,

  • we can load the requisite values into our Python list.
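
A minimal sketch of that loading step, assuming the file is named sarcasm.json; the headline and is_sarcastic fields are named in the video, while the article_link key is an assumption:

```python
import json

# Load the sarcasm dataset; the filename "sarcasm.json" is an assumption.
with open("sarcasm.json", "r") as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []

# Pull the requisite values out of each JSON record into the Python lists.
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])  # assumed key for the article URL
```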

  • Now that we have three lists, one with our labels,

  • one with the text, and one with the URLs,

  • we can start doing some familiar preprocessing on the text.

  • Here's the code.

  • By calling tokenizer.fit_on_texts with the headlines,

  • we'll create tokens for every word in the corpus.

  • And then, we'll see them in the word index.

  • You can see an example of some of the words here.

  • So "underwood" has been tokenized

  • to 24127, and "skillingsbolle"-- what is that, anyway--

  • to 23055.
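
A sketch of that fitting step with the Keras Tokenizer; the out-of-vocabulary token choice here is an assumption:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the tokenizer on every headline, assigning a token to each word.
tokenizer = Tokenizer(oov_token="<OOV>")  # the OOV token is an assumption
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(len(word_index))  # number of distinct words in the corpus
```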

  • So now, we can turn our sentences

  • into sequences of tokens, and pad them

  • to the same length with this code.

  • If we want to inspect them, we can simply print them out.

  • Here you can see one tokenized sentence and the shape

  • of the entire corpus.

  • That's 26,709 sequences, each with 40 tokens.
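
That step might look like this, with pad_sequences doing the padding; padding="post" is an assumption:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn each headline into a sequence of tokens, then pad them all
# to the length of the longest sequence in the corpus.
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding="post")

print(padded[0])     # one tokenized, padded headline
print(padded.shape)  # (26709, 40) in the run shown in the video
```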

  • Now, there's a problem here.

  • We don't have a split in the data for training and testing.

  • We just have a list of 26,709 sequences.

  • Fortunately, Python makes it super easy

  • for us to slice this up.

  • Let's take a look at that next.

  • So we have a bunch of sentences in a list

  • and a bunch of labels in a list.

  • To slice them into training and test sets

  • is actually pretty easy.

  • If we pick a training size, say 20,000,

  • we can cut it up with code like this.

  • So the training sentences will be the first 20,000 sliced

  • by this syntax, and the testing sentences

  • will be the remaining slice, like this.

  • And we can do the same for the labels

  • to get a training and a test set.
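
A sketch of that slicing, with training_size = 20000 as mentioned above:

```python
# Plain Python list slicing: the first 20,000 items for training,
# the remainder for testing.
training_size = 20000

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
```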

  • But there's a bit of a problem.

  • Remember earlier we used the tokenizer

  • to create a word index of every word in the set?

  • That was all very good.

  • But if we really want to test its effectiveness,

  • we have to ensure that the neural net only

  • sees the training data, and that it never sees the test data.

  • So we have to rewrite our code to ensure that the tokenizer is

  • just fit to the training data.

  • Let's take a look at how to do that now.

  • Here's the new code to create our training and test sets.

  • Let's look at it line by line.

  • We'll first instantiate a tokenizer like before,

  • but now, we'll fit the tokenizer on just the training sentences

  • that we split out earlier, instead of the entire corpus.

  • And now, instead of one overall set of sequences,

  • we can now create a set of training sequences,

  • and pad them, and then do exactly the same thing

  • for the test sequences.

  • It's really that easy.
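
A sketch of that revised preprocessing; vocab_size and the padding settings are assumptions, not values quoted in the video:

```python
# Fit the tokenizer on the training sentences only, so the test set
# stays completely unseen.
vocab_size = 10000   # assumption
max_length = 40      # assumption, matching the corpus shape above

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length,
                                padding="post", truncating="post")

# Exactly the same for the test sequences, reusing the same tokenizer.
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding="post", truncating="post")
```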

  • But you might be wondering at this point,

  • we've turned our sentences into numbers,

  • with the numbers being tokens representing words.

  • But how do we get meaning from that?

  • How do we determine if something is sarcastic just

  • from the numbers?

  • Well, here's where the concept of embeddings comes in.

  • Let's consider the most basic of sentiments.

  • Something is good or something is bad.

  • We often see these as being opposites,

  • so we can plot them as having opposite directions like this.

  • So then what happens with a word like "meh"?

  • It's not particularly good, and it's not particularly bad.

  • Probably a little more bad than good.

  • So you might plot it a bit like this.

  • Or the phrase, "not bad," which is usually

  • meant to describe something as having

  • a little bit of goodness, but not necessarily very good.

  • So it might look like this.

  • Now, if we plot this on an x- and y-axis,

  • we can start to determine the good or bad sentiment

  • as coordinates in the x and y.

  • Good is (1, 0).

  • Meh is (-0.4, 0.7), et cetera.

  • By looking at the direction of the vector,

  • we can start to determine the meaning of the word.

  • So what if you extend that into multiple dimensions instead

  • of just two?

  • What if words that are labeled with sentiments,

  • like sarcastic and not sarcastic,

  • are plotted in these multiple dimensions?

  • And then, as we train, we try to learn

  • what the direction in these multi-dimensional spaces

  • should look like.

  • Words that only appear in the sarcastic sentences

  • will have a strong component in the sarcastic direction,

  • and others will have one in the not-sarcastic direction.

  • As we load more and more sentences

  • into the network for training, these directions can change.

  • And when we have a fully trained network

  • and give it a set of words, it could look up

  • the vectors for these words, sum them up, and thus, give us

  • an idea for the sentiment.

  • This concept is known as embedding.

  • So going back to this diagram, consider

  • what would have happened if I said something

  • was "not bad, a bit meh."

  • If we were to sum up the vectors,

  • we'd have something that's 0.7 on y and 0.1 on x.

  • So its sentiment could be considered slightly

  • on the good side of neutral.
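
A toy check of that arithmetic, assuming "not bad" sits at (0.5, 0.0) (a made-up coordinate) and "meh" at (-0.4, 0.7) as above:

```python
import numpy as np

not_bad = np.array([0.5, 0.0])  # assumed coordinate for "not bad"
meh = np.array([-0.4, 0.7])     # "meh" from the diagram above

# Summing the vectors lands slightly on the good side of neutral.
print(not_bad + meh)  # [0.1  0.7]
```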

  • So now, let's take a look at coding this.

  • Here's my neural network code.

  • The top layer is an embedding, where

  • the direction of each word will be learned epoch by epoch.

  • After that, we pool with a global average pooling,

  • namely adding up the vectors, as I demonstrated earlier.

  • This is then fed into a common or garden deep neural network.
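
A sketch of that architecture in Keras; embedding_dim and the size of the dense layer are assumptions:

```python
import tensorflow as tf

embedding_dim = 16  # assumption

model = tf.keras.Sequential([
    # Learns a direction (vector) for each word, epoch by epoch.
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    # Pools the word vectors into a single sentence-level vector.
    tf.keras.layers.GlobalAveragePooling1D(),
    # The plain deep network on top.
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```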

  • Training is now as simple as model.fit,

  • using the training data and labels,

  • and specifying the testing padded and labels

  • for the validation data.
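
That call might look like this; converting the label lists to NumPy arrays first is something newer TensorFlow versions require:

```python
import numpy as np

training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

# Train for 30 epochs, validating against the held-out test set.
num_epochs = 30
history = model.fit(training_padded, training_labels,
                    epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels),
                    verbose=2)
```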

  • At this URL, you can try it out for yourself.

  • And here, you can see the results

  • that I got training it for just 30 epochs.

  • While it was able to fit the training data to 99% accuracy,

  • more importantly, with the test data, that is, words

  • that the network has never seen, it

  • still got 81% to 82% accuracy, which is pretty good.

  • So how do we use this to establish sentiment

  • for new sentences?

  • Here's the code.

  • Let's create a couple of sentences

  • that we want to classify.

  • The first one looks a little bit sarcastic,

  • and the second one's quite plain and boring.

  • We'll use the tokenizer that we created earlier

  • to convert them into sequences.

  • This way, the words will have the same tokens as the training

  • set.

  • We'll then pad those sequences to be

  • the same dimensions as those in the training set

  • and use the same padding type.

  • And we can then predict on the padded set.
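
A sketch of that prediction step; the two sentences below are illustrative stand-ins for the ones shown in the video:

```python
# New sentences: one that sounds sarcastic, one that's plain.
sentences_to_classify = [
    "granny starting to fear spiders in the garden might be real",
    "the weather today is partly cloudy with light rain",
]

# Reuse the training tokenizer and padding settings so the tokens match.
new_sequences = tokenizer.texts_to_sequences(sentences_to_classify)
new_padded = pad_sequences(new_sequences, maxlen=max_length,
                           padding="post", truncating="post")

print(model.predict(new_padded))  # values near 1 suggest sarcasm
```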

  • The results are like this.

  • The first sentence gives me 0.91, which is very close to 1,

  • indicating that there's a very high probability of sarcasm.

  • The second is 5 times 10 to the minus 6,

  • indicating an extremely low chance of sarcasm.

  • It does seem to be working.

  • All of this code is runnable in a Colab at this URL.

  • So give it a try for yourself.

  • You've now built your first text-classification model

  • to understand sentiment in text.

  • Give it a try for yourself, and let

  • us know what kind of classifiers you built.

  • I hope you've enjoyed this short series,

  • and there's more on the way.

  • So don't forget to hit that Subscribe button

  • and get the latest and greatest in AI videos

  • right here on the TensorFlow Channel.

  • [MUSIC PLAYING]
