
  • [MUSIC PLAYING]

  • LAURENCE MORONEY: Welcome to episode 2

  • of this series of Zero to Hero with Natural Language

  • Processing.

  • In the last video, you learned about how to tokenize

  • words using TensorFlow's tools.

  • In this one, you'll take that to the next step,

  • creating sequences of numbers from your sentences

  • and using tools to process them to make them ready for teaching

  • neural networks.

  • Last time, we saw how to take a set of sentences

  • and use the tokenizer to turn the words into numeric tokens.

  • Let's build on that now by also seeing

  • how the sentences containing those words

  • can be turned into sequences of numbers.

  • We'll add another sentence to our set of texts,

  • and I'm doing this because the existing sentences all

  • have four words, and it's important to see

  • how to manage sentences, or sequences,

  • of different lengths.

  • The tokenizer supports a method called texts_to_sequences,

  • which performs most of the work for you.

  • It creates sequences of tokens representing each sentence.

  • Let's take a look at the results.

  • At the top, you can see the list of word-value pairs

  • for the tokens.

  • At the bottom, you can see the sequences

  • that texts_to_sequences has returned.

  • We have a few new words such as amazing, think, is, and do,

  • and that's why this index looks a little different than before.

  • And now, we have the sequences.

  • So for example, the first sequence

  • is 4, 2, 1, 3, and these are the tokens for I,

  • love, my, and dog in that order.
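
As a quick reference, here is a minimal sketch of that step in Python. The exact sentences below are illustrative, chosen to match the tokens mentioned in the video:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative corpus; the fourth sentence is longer than the others,
# so we can see how sequences of different lengths come out.
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)     # builds the word index
word_index = tokenizer.word_index     # e.g. {'my': 1, 'love': 2, ...}

sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)                      # e.g. [[4, 2, 1, 3], ...]
```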

  • So now, we have the basic tokenization done,

  • but there's a catch.

  • This is all very well for getting

  • data ready for training a neural network,

  • but what happens when that neural network needs

  • to classify texts containing words

  • that it has never seen before?

  • This can confuse the tokenizer, so we'll

  • look at how to handle that next.

  • Let's now look back at the code.

  • I have a set of sentences that I'll use

  • for training a neural network.

  • The tokenizer gets the word index from these

  • and creates sequences for me.

  • So now, if I want to sequence these sentences, containing

  • words like manatee that aren't present in the word index,

  • because they weren't in my initial set of data,

  • what's going to happen?

  • Well, let's use the tokenizer to sequence them

  • and print out the results.

  • We see this, I really love my dog.

  • A five-word sentence ends up as 4, 2, 1, 3,

  • a four-word sequence.

  • Why?

  • Because the word really wasn't in the word index.

  • The corpus used to build it didn't contain that word.

  • And my dog loves my manatee ends up

  • as 1, 3, 1, which is my, dog, my,

  • because loves and manatee aren't in the word index.
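
Continuing the sketch above, that experiment might look roughly like this (the test sentences mirror the ones in the video but are otherwise illustrative):

```python
# Sentences containing words the tokenizer has never seen
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
# Unknown words ('really', 'loves', 'manatee') are simply dropped,
# so each sequence comes out shorter than its sentence.
```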

  • So as you can imagine, you'll need a really big word index

  • to handle sentences that are not in the training set.

  • But in order not to lose the length of the sequence,

  • there is also a little trick that you can use.

  • Let's take a look at that.

  • By using the oov_token property, and setting it to something

  • that you would not expect to see in the corpus,

  • like <OOV>, the tokenizer

  • will create a token for that, and then

  • replace words that it does not recognize

  • with the Out Of Vocabulary token instead.

  • It's simple, but effective, as you can see here.

  • Now, the earlier sentences are encoded like this.

  • We've still lost some meaning, but a lot less.

  • And the sentences are at least the correct length.
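
In code, still assuming the illustrative corpus above, the only change is the oov_token argument when the tokenizer is constructed:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Any string you are confident will never appear in real text works here
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

test_seq = tokenizer.texts_to_sequences(test_data)
print(tokenizer.word_index)   # '<OOV>' typically gets index 1
print(test_seq)
# Unrecognized words are now replaced by the <OOV> token,
# so every sequence keeps the same length as its sentence.
```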

  • That's a handy little trick, right?

  • And while it keeps the sequence length

  • the same as the sentence length,

  • you might wonder, when it comes time

  • to train a neural network, how it can handle

  • sentences of different lengths.

  • With images, they're all usually the same size.

  • So how would we solve that problem?

  • The advanced answer is to use something

  • called a RaggedTensor.

  • That's a little bit beyond the scope of this series,

  • so we'll look at a different and simpler solution, padding.

  • OK.

  • So here's the code that we've been using,

  • but I've added a couple of things.

  • First is to import pad_sequences from the preprocessing module.

  • As its name suggests, we can use it to pad our sequences.

  • Now, if I want to pad my sequences, all I have to do

  • is pass them to pad_sequences, and the rest is done for me.

  • You can see the results of our sentences here.

  • First is the word index, and then

  • the initial set of sequences.

  • The padded sequence is next.

  • So for example, our first sentence is 5, 3, 2, 4.

  • And in the padded sequence, we can

  • see that there are three 0s preceding it.

  • Well, why is that?

  • Well, it's because our longest sentence had seven words in it.

  • So when we pass this corpus to pad_sequences,

  • it measured that and ensured that all of the sentences

  • would have equally-sized sequences by padding them

  • with 0s at the front.
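
As a sketch, and continuing with the sequences from above:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
print(padded)
# By default, each row is padded with 0s at the front ('pre' padding)
# until it matches the length of the longest sequence in the corpus.
```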

  • Note that OOV isn't 0.

  • It's 1.

  • 0 means padding.

  • Now, you might think that you don't want the 0s in front,

  • and you might want them after the sentence.

  • Well, that's easy.

  • You just set the padding parameter to post like this,

  • and that's what you'll get.
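
For example:

```python
# Put the padding 0s after each sequence instead of before it
padded = pad_sequences(sequences, padding='post')
```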

  • Or if you don't want the length of the padded sentences

  • to be the same as the longest sentence,

  • you can then specify the desired length

  • with the maxlen parameter like this.
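
For instance, with an illustrative maxlen of 5:

```python
# Cap the padded sequences at 5 tokens each
padded = pad_sequences(sequences, padding='post', maxlen=5)
```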

  • But wait, you might ask what happens if sentences are longer

  • than the specified maxlen?

  • Well, then, you can specify how to truncate,

  • either chopping off the words from the end with post truncation,

  • or from the beginning with pre truncation.

  • And here's what post truncation looks like.
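
A sketch of post truncation with the same illustrative values:

```python
# Sequences longer than maxlen lose tokens from the end ('post');
# use truncating='pre' to drop tokens from the beginning instead.
padded = pad_sequences(sequences, padding='post',
                       truncating='post', maxlen=5)
```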

  • But don't take my word for it.

  • Check out the Codelab at this URL,

  • and you can try out all of the code

  • in this video for yourself.

  • Now that you've seen how to tokenize your text

  • and organize it into sequences, in the next video,

  • we'll take that data and use it to train a neural network.

  • We'll look at a data set with sentences that are classified

  • as sarcastic and not sarcastic, and we'll

  • use that to determine if sentences contain sarcasm.

  • Really?

  • No, no.

  • I mean, really.

  • [MUSIC PLAYING]
