ROBBY NEALE: I'm Robby. I'm an engineer at Google, and I'm going to talk to you about how to leverage TF.Text for your preprocessing and language models inside of TensorFlow.

So, for those unfamiliar with language models -- all right, there we go, forward -- they're basically everywhere. You see them in query understanding, related keyword searches, article summaries, spell check, auto-complete, text-to-speech, spam filters, chat bots. You really can't get away from them. And it's a good time to be in NLP right now, because we're going through something of a renaissance.

Last year, the BERT paper was released, which uses attention and transformers. I'm not going to go too deep into it, other than to say that, traditionally, models don't work well with raw strings, so you convert the text into numbers. We've used embeddings -- GloVe, ELMo, and Word2Vec are different ways to create vector representations of your words -- and these worked pretty well. The one problem is that some words are ambiguous when you look them up in your vocab. "Bat": am I talking about an animal, or baseball equipment? Even words that don't sound the same, like "entrance" and "entrance," are spelled exactly the same. So when you try to represent these as vectors, you're forcing two different meanings onto the same sequence of letters. BERT gets around this: it's a model that uses the context of the sentence to create the vector for each word.

And it's done really well. On the Stanford question-and-answer data set, scores before BERT were in the low 70s. When BERT came out in late 2018, it jumped to around 82, and ever since then people have been iterating on the model with RoBERTa, XLNet, and ALBERT. I pulled the scores from last week, and you can see that the most recent model, ALBERT, is actually outperforming humans, which is pretty crazy. So it's a really exciting time to be in this field.

So let's jump right in. What is TF.Text? Our charter was basically to make programming these language models in TensorFlow easier. Traditionally, it's been very difficult. Say you have some data -- here's an example of queries -- and we want to train on it. Before we can do that, we need to do some preprocessing, which is basically tokenization, and we had to do it outside of TensorFlow, because that capability wasn't available inside it. Once we'd done this preprocessing, we had to somehow fit it into a tensor. So we'd take the preprocessed text, add it to the graph, and normally pad out our tensors to make them a uniform shape. Then we'd finally train our model, go to publish it, put it on our model server, and think we're ready to go. But when the serving data arrives, you can't just plug it straight in -- that same preprocessing has to happen first. So either you're relying on the client to transform the data, or you're doing it yourself, and a lot of times in a different language than your training scripts. I've seen it go wrong even when the preprocessing is nominally the same -- the exact same regex -- because the libraries differ: one might consider a character class to be punctuation where the other doesn't.
And so you get training skew when these preprocessing steps differ: when you actually go to serve the model, you don't get the same performance, and that's problematic.

So our charter was to make this as easy as possible -- to support text inside of TensorFlow. To do that, we want to do all the text processing in-graph. We do this through a series of text and sequential APIs that weren't previously available, and a new tensor type called the RaggedTensor that better represents text. If we go back to the painful workflow, what we really want to do is get rid of that external preprocessing step and put everything in the graph. Then all your preprocessing happens in-graph, and when you go to serve the model, you're not relying on the client to perform those same steps.

The main thing that was missing was tokenization. So last year we published an RFC with a tokenizer API, and we wanted to make it as simple and straightforward as possible. It's an abstract Tokenizer class with one method, tokenize. It takes a string input and gives you back your tokens. You can see it here: we have a couple of sentences, and we tokenize them into words. One thing I like to point out -- which isn't completely obvious until you see examples -- is that our input is a rank-1 tensor and our output is rank 2. The reason is that the tokens are grouped by the string they were split from, so it's really easy, from the engineer's perspective, to tell which tokens were pulled from which string in the original tensor.

The one thing you can't tell from this output is where in the originating string each token came from. For that, we have an extra abstract class, TokenizerWithOffsets, which adds tokenize_with_offsets. You give it a tensor of strings, and it gives you your tokens, but also where those tokens start and end. In the example here, "I" starts at offset zero and ends at one, and "know" starts at offset two and ends at six. So through these offsets, if you want to know where the tokens sit in your originating string, you can find out. And you'll notice the offsets have exactly the same shape as the tokens, so mapping tokens to starts and limits is very simple.

We provide five basic tokenizers. One of the questions when we first did the RFC was: why not just have one tokenizer to rule them all? The problem is, every model is different. You have different constraints and things you want to work around, and we don't want to push our opinion on you, because every case is different. We just want to build the tools and let you make the decision. A lot of these are very simple. Whitespace obviously just splits a sentence on whitespace. Unicode script: if you know Unicode, characters are grouped together into what are called Unicode scripts -- Latin, Greek, Arabic, and Japanese scripts are some examples -- and spaces, punctuation, and numbers are grouped as well, so this tokenizer splits on those boundaries. In the simplest case, if you're just working with English, the main difference from whitespace splitting is that it splits out the punctuation.
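To make that concrete, here is a minimal sketch of the tokenizer API just described, assuming the tensorflow_text package is installed; the sample sentences and the specific choice of WhitespaceTokenizer and UnicodeScriptTokenizer are only illustrative.

```python
import tensorflow as tf
import tensorflow_text as text

sentences = tf.constant(["I know this.", "Everything not saved will be lost."])

# tokenize() returns a rank-2 RaggedTensor: tokens grouped by source string.
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(sentences)
print(tokens.to_list())

# Tokenizers that implement TokenizerWithOffsets also report where each
# token starts and ends (byte offsets) in its originating string.
script_tokenizer = text.UnicodeScriptTokenizer()
tokens, starts, ends = script_tokenizer.tokenize_with_offsets(sentences)
print(starts.to_list())
print(ends.to_list())
```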
So, Wordpiece. This was popularized by the BERT model I mentioned earlier. It takes text that you've already tokenized and splits those words into even smaller sub-word units. This greatly reduces the size of your vocabulary: as you try to encapsulate more information, your vocabulary grows, and by breaking words down into sub-word units you can keep it much smaller and encapsulate more meaning in less data. To generate the vocab, we have a Beam pipeline in our GitHub so you can generate your own, or the original BERT model has a vocab you can use.

Sentencepiece is a very popular tokenizer. It was released previously and has its own GitHub repository that a lot of people have downloaded. It takes a configuration where you set up a bunch of preprocessing steps, you feed that to it, and it performs them; it does sub-word, word, and character tokenization. And finally, we're releasing a BERT tokenizer that does all the preprocessing the original BERT paper did, so you can use their Wordpiece vocabulary, and it will do the pre-tokenization steps, some other normalization, and then the Wordpiece tokenization.

So now that we have tokenizers, we really just needed a way to represent the results, and that's why we created RaggedTensors as a better representation for text. If we look at an example, we have two sentences, and as I said, your sentences are never the same length. When you try to create a regular tensor out of them, you get a ValueError: it needs to be a uniform shape. Traditionally, as I mentioned, we padded out the strings. In this example, three extra values doesn't seem so bad, but when you're actually writing these models you don't know how long your sentences will be, so you pick a fixed size. A lot of times I've seen a fixed size of 128 characters or 128 words, and you just pad -- so there's all this extra information inside your tensor that you don't really need. And if you make the size smaller, then any long sentences get truncated.

You might think: well, we have SparseTensor. But that's also not quite right, because you waste a lot of data supplying a SparseTensor. In TensorFlow, everything is made of tensors, and a SparseTensor is actually made of three: the values, the shape, and the indices of where those values sit within that shape. And you can see there's a pattern, because our strings aren't really sparse -- they're dense, they just have varying lengths. So it would be good if we could just say: the first row has indices 0 through 5, the second row has indices 0 through 2, and those make up our sentences.

That's what we did with RaggedTensors. They're easy to create -- you can just call tf.ragged.constant. Like a SparseTensor, a RaggedTensor is built from component tensors: the values and the row splits. This minimizes wasted information: all the values live in one tensor, and then we say where to split up that tensor to build the different rows. It's easier to see in this form, where the gray block on the left is the RaggedTensor's internal representation and on the right is how it looks logically.
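As a rough sketch of what that construction looks like -- a minimal example using made-up sentences rather than the ones on the slide:

```python
import tensorflow as tf

# Directly from nested Python lists of unequal length:
rt = tf.ragged.constant([["Everything", "not", "saved", "will", "be", "lost"],
                         ["Say", "something"]])

# The same tensor built from its underlying pieces: one flat values tensor
# plus row_splits marking where each row's values start and stop.
rt2 = tf.RaggedTensor.from_row_splits(
    values=["Everything", "not", "saved", "will", "be", "lost", "Say", "something"],
    row_splits=[0, 6, 8])                # row 0 = values[0:6], row 1 = values[6:8]

print(rt.shape)                          # (2, None) -- None marks the ragged dimension
print(rt.to_list() == rt2.to_list())     # True
```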
Down below on the slide is how you would actually make that call to build the tensor from its values inside TensorFlow. That was the original way, with row splits. Some people came to us representing rows in different ways, so we also provide row IDs, where each ID tells you which row a value belongs to, and row lengths, which give the length of each row. So in the example, the first row takes the first four values, you can have empty rows, then rows of length two, and so on.

And we want to treat these like any normal tensor. RaggedTensors have a rank, just as you'd see with normal tensors -- in this example, rank 2. They also have a shape; the question mark in the shape denotes the ragged dimension. It's not necessarily always the last dimension, but in this case it is. And we can use normal TensorFlow functions and ops just like we would with normal tensors. Here we're using gather, which grabs the second and then the first row; gather_nd, which grabs an element by index; and concat, which you can apply along different axes. And of course, since we built this for sequential and text processing, the string ops work well with RaggedTensors -- here we decode the strings into code points and encode them back into strings. Conditionals work as well; in this case, a where clause that uses RaggedTensors. The one constraint is that the RaggedTensors passed to where must have the same row splits, so the rows must be the same lengths.

It's also easy to convert into and out of RaggedTensors. You can call from_tensor or from_sparse to create a RaggedTensor, and to move back you just call to_tensor or to_sparse on it. And to_list gives you a Python list if you want to print it out.

We're also adding support to Keras. These are the layers that are currently compatible, and there's a lot left to do on this front -- we're pushing to get more layers working with RaggedTensors. If you're using them in your Keras model and come across something that isn't compatible, tensorflow_text, there at the bottom, provides a ToDense layer that will just convert the tensor for you. The really cool thing I want to point out is the RNN support. In our tests we see a 10% average speed-up, and with large batches 30% or more. This is very exciting, and I won't go into details, but it's intuitive: when you're looping through a RaggedTensor, you know when you're at the end of the ragged dimension, so you can stop computation. Before, with padded tensors, you used mask values, and the mask isn't necessarily at the end -- it can be in the middle -- so you have to keep computing until you reach the end of the full tensor width. With ragged, there's just a lot less computation, so you save a lot there.

All right. So I want to go over a couple of examples and show you how easy it is to work with. First, you can install tensorflow_text with pip. Our versions now map to TensorFlow versions, so if you're using TensorFlow 2.0, use tensorflow_text 2.0; if you're using TensorFlow 1.15, use tensorflow_text 1.15. Because of the custom ops, the versions must match. And you can import it like this -- we generally import tensorflow_text as text, so in these examples you'll see it written as text.
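Here is a small, hedged sketch tying a few of those pieces together: the import convention just mentioned, the alternate RaggedTensor factory methods, the conversions to and from dense and sparse, and the ToDense layer. All values are arbitrary examples rather than the ones from the slides.

```python
import tensorflow as tf
import tensorflow_text as text  # conventionally imported as "text"

values = tf.constant([1, 2, 3, 4, 5, 6, 7, 8])

# Three equivalent ways to describe the rows [[1, 2, 3, 4], [], [5, 6], [7, 8]]:
a = tf.RaggedTensor.from_row_splits(values, row_splits=[0, 4, 4, 6, 8])
b = tf.RaggedTensor.from_value_rowids(values, value_rowids=[0, 0, 0, 0, 2, 2, 3, 3], nrows=4)
c = tf.RaggedTensor.from_row_lengths(values, row_lengths=[4, 0, 2, 2])

# Normal ops work on RaggedTensors.
print(tf.gather(a, [1, 0]).to_list())      # [[], [1, 2, 3, 4]]
print(tf.concat([a, c], axis=0).shape)     # (8, None)

# Moving between ragged, dense, and sparse representations.
dense = a.to_tensor(default_value=0)       # pads the short rows with 0
sparse = a.to_sparse()
ragged_again = tf.RaggedTensor.from_tensor(dense, padding=0)
print(a.to_list())

# In a Keras model, tensorflow_text's ToDense layer can densify a ragged
# input ahead of layers that aren't ragged-compatible yet.
densify = text.keras.layers.ToDense(pad_value=0)
print(densify(a))
```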
So let's go over a basic preprocessing pipeline you might write. Normally you'll get your input text -- here we have a couple of sentences. We want to tokenize it, splitting those sentences into words, and then map those words to IDs in our vocabulary that we'll feed into the model. The preprocess function might look like this: we instantiate the tokenizer, create a RaggedTensor out of the input, and then map table.lookup over the values of that RaggedTensor to look each word up in our vocabulary. If you remember what the RaggedTensor looks like underneath -- the values in one tensor and the row splits in a separate one -- then when we map words to IDs we're keeping the same shape; we only want to map over the values. That's why "map over values" is there: we're doing the lookup on each word individually. The resulting RaggedTensor is there at the end, and you can see above what it actually represents.

So that's our preprocessing. Once that's done, using tf.data you'd typically create a dataset and map the preprocessing function over it. I won't go into model details, but you can create a model with Keras pretty simply, then fit that dataset on the model to train it. And you can use that same preprocessing function in your serving input function, so the preprocessing done at training time is identical to what's done at serving time for inference. This prevents the training skew we've seen multiple times in the past -- I have, at least.

Let's go over another example: a character bigram model. Before I jump in, I want to quickly go over ngrams. A bigram is an ngram with a width of two -- basically a grouping of fixed size over a series. We provide three different ways to join the groups together: string join, summing the values, and taking averages.

So let's pull up an example. Here we're doing a bigram of words: we have a sentence, we tokenize it to split it up into words, and then we call the ngrams function in TensorFlow Text, which groups those words together -- in this case by joining the strings. Every two words are grouped together, as you can see; that's a bigram. A trigram is three, so here we split our sentence into characters and grouped every three characters together by setting the width to three. The default separator is a space, so in this situation we pass the empty string instead.

The other two reductions work with numbers. If we have the series 2, 4, 6, 8, 10 as our tensor and we want to sum every two numbers: 2 plus 4 is 6, 4 plus 6 is 10, and so on. And there's also average, which is the mean reduction type. Generally, when you talk about ngrams, you're talking about a language context, but here's where the mean reduction is helpful: say you're taking temperature readings every 20 minutes, but what you actually want to feed your model is an average of those temperatures over an hour. You can do a trigram with a reduction of mean, and it takes the average of the 20-minute intervals, so you get a rolling hourly average every 20 minutes that you can feed into your model. But generally, as I said, bigrams and trigrams are most often used in NLP -- you can see a quick sketch of these calls below.
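Two small sketches of what was just described, assuming tensorflow_text's WhitespaceTokenizer, ngrams op, and Reduction enum; the vocabulary, sentences, and numeric readings are invented for illustration.

```python
import tensorflow as tf
import tensorflow_text as text

# 1) Tokenize and map words to vocabulary IDs over the values of the RaggedTensor.
vocab = ["[UNK]", "everything", "not", "saved", "will", "be", "lost"]
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab, values=tf.range(len(vocab), dtype=tf.int64)),
    num_oov_buckets=1)

def preprocess(strings):
  tokens = text.WhitespaceTokenizer().tokenize(strings)   # RaggedTensor of words
  # Look up each word individually while keeping the ragged row structure.
  return tf.ragged.map_flat_values(table.lookup, tokens)

# This function could also be mapped over a tf.data.Dataset for training and serving.
print(preprocess(tf.constant(["everything not saved will be lost",
                              "not everything"])).to_list())

# 2) Ngram reductions: string-joined word bigrams, plus sum and mean over numbers.
words = text.WhitespaceTokenizer().tokenize(["the quick brown fox"])
print(text.ngrams(words, width=2,
                  reduction_type=text.Reduction.STRING_JOIN).to_list())
# -> [[b'the quick', b'quick brown', b'brown fox']]

readings = tf.ragged.constant([[2.0, 4.0, 6.0, 8.0, 10.0]])
print(text.ngrams(readings, width=2, reduction_type=text.Reduction.SUM).to_list())
# -> [[6.0, 10.0, 14.0, 18.0]]
print(text.ngrams(readings, width=3, reduction_type=text.Reduction.MEAN).to_list())
# -> [[4.0, 6.0, 8.0]]
```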
In NLP, the way that usually works is you split the text up, either into words or characters, and then have a vocabulary dictionary you can look those groupings up in. In our example, we cheat a little bit: we get the codepoints from our input. As you can see, the rank increases again -- we had a shape of three, and now it's a shape of three with a ragged dimension. We use merge_dims to combine those two dimensions, because we don't care about the distinction in this case; it takes the second-to-last axis and the last axis and merges them. Then we just sum the codepoints to create a sort of unique ID that we'll feed into the model. Generally, as I said, you'd do string joins and look those up in a vocabulary, but for this example it works. We cast those values, and that's our preprocessing function. Again, we create a dataset -- this time with TFRecordDataset -- map our preprocessing function over the values, and then train the model that we've created with it.

Finally, I want to go over the BERT preprocessing. There's a bit more code in this one, so I want to say up front that we provide the BertTokenizer for you. Feel comfortable knowing that you don't have to write this yourself if you don't want to -- you can just call the BERT tokenizer's tokenize, and it does all of this for you. But I think there are a lot of good examples in here, and if you're doing text preprocessing, these are things you should probably think about and know about, so I wanted to walk through it. This is a slim version of it.

What BERT did in its preprocessing: it lower-cased and normalized the text, then did some basic tokenization, split out Chinese characters and emoji by character, and then ran Wordpiece on top of all that.

Lower-casing and normalizing is very common. When you're looking up words in your vocab, you want the words to match and not have duplicates. Capitalization gets in the way of that -- words at the beginning of a sentence are capitalized, so without lower-casing, the same word would appear in your dictionary or vocabulary twice. It's generally thought that you should lower-case. And normalization: a lot of Unicode characters with accents can be represented in different ways, so normalization makes sure the text is represented in a single, canonical way. Again, this keeps you from having the same word multiple times in your vocabulary, which would confuse your model and also make your vocabulary larger.

We provide case_fold, which is an aggressive version of to-lower. It lower-cases characters, it works with non-Latin and accented characters, and it doesn't mess up non-letters -- it keeps them as-is. It also does NFKC folding and normalization, which I'll talk a little more about. So we do that as our first step; in this example it really is just lower-casing the I in "It's". BERT actually normalized to NFD, and because case_fold leaves us in NFKC, we're going to normalize to NFD next. I won't go over the details -- just know that letters can have many different encoded forms, so it's good to pick a single normalization, so that when you're working with international characters they're not represented in different ways.
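A minimal sketch of that first step, assuming tensorflow_text's case_fold_utf8 and normalize_utf8 ops; the sample strings are made up.

```python
import tensorflow as tf
import tensorflow_text as text

examples = tf.constant(["It's a trap.", "Cafe\u0301 CAFÉ"])

folded = text.case_fold_utf8(examples)            # aggressive lower-casing (NFKC case fold)
normalized = text.normalize_utf8(folded, "NFD")   # then normalize to NFD, as BERT did
print(normalized.numpy())
```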
So here we are: we've just normalized to NFD. Now we do some basic tokenization -- we split on Unicode scripts to tokenize the text. What you might notice here is that while our sentence, "it's a trap", has been tokenized, the Chinese characters have not. That's because it's a single script throughout that whole sentence, and there are no spaces or any other separators marking word boundaries. So we want to split that up.

This is where a lot of the code comes in. You can follow along, but the main point is just to know that we've thought about these things, and if you run across them, there are ways to work around them -- I've tried to prepare you. It's all simple ops when we step through it. First we get the codepoints of the characters, and then the script IDs of those codepoints. You can see the first sentence is all script 17, which is the Han script -- Chinese -- our Latin characters are script 25, and emoji and punctuation are script 0. We can apply math.equal to the RaggedTensor just like anywhere else -- here we're checking whether it's Han script, so we get true -- and we use slice notation to grab just the first character, because we already know from the Unicode script tokenization that all the characters in a token share the same script.

We also want to check for emoji. In TensorFlow Text we provide wordshape, which lets you ask different questions about words -- basically a set of regular-expression-style predicates. Here we're asking: does this text have any emoji? Other predicates: is there any punctuation, are there any numbers, is my string all numbers? These are things you often want to find out about, and we provide you a method to do that. Then we just OR the two conditions together to decide whether we should split or not -- it works with ragged inputs. We go ahead and split everything into characters, and in the where clause of our conditional, if we should split, we take the character-split version; if not, we take the tokens from our original tokenization. Then we do a little reformatting of how the shape looks.

Once we've done that, we can finally Wordpiece-tokenize. We give the Wordpiece tokenizer our vocab table, it splits the words into sub-words, and since that adds an extra dimension, we get rid of it with merge_dims. All right -- we made it through that; it wasn't too bad. We apply this just as we did before: we have a dataset created with tf.data, we map our preprocessing across it, and here we can grab a classifier BERT model from the official BERT models and just train that classifier.
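For comparison, here is a hedged sketch of leaning on the provided BertTokenizer instead of hand-writing those steps. It assumes a BERT wordpiece vocabulary file at "vocab.txt" and an invented sample sentence.

```python
import tensorflow as tf
import tensorflow_text as text

# Build a lookup table from the wordpiece vocabulary file (path is an assumption).
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

# BertTokenizer bundles the lower-casing/normalization, basic tokenization,
# and Wordpiece steps walked through above.
bert_tokenizer = text.BertTokenizer(vocab_table, lower_case=True)

wordpiece_ids = bert_tokenizer.tokenize(["It's a trap."])
# Shape is [batch, words, wordpieces]; merge the last two dims to get one
# flat sequence of wordpiece IDs per sentence.
flat_ids = wordpiece_ids.merge_dims(-2, -1)
print(flat_ids.to_list())
```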
So I know that was a lot to go through. Hopefully you followed along, but the main thing to know is that with TF.Text, we're looking to bring all of that preprocessing inside the graph so you don't have that problem. You don't have to worry about training skew -- you can just write your TensorFlow and train. We do that by giving you what I consider a superior data structure for sequence data and text data, the RaggedTensor, along with the APIs required for this preprocessing. Again, you can install it with pip install tensorflow_text. And, thank you. Here are some links -- that's our GitHub. If there's anything you think is missing that we should add, feel free to file an issue. We also have a Colab tutorial on tensorflow.org that you should check out; it walks through some of this more slowly. Thanks.

[APPLAUSE]

ASHLEY: You still have seven minutes -- do you want to do Q&A?

ROBBY NEALE: Oh, yeah. OK.

ASHLEY: OK, so we still have about seven minutes left, so we can open this up for Q&A. There are some microphones on either side, but I can also help provide those if needed.

ROBBY NEALE: Do you want to go to the microphone, or grab these mics? OK. Go ahead.

AUDIENCE: Hi. Very nice to see that you have all this support. Just a quick question: can TF.Text handle Japanese text? It's a mixture of Hiragana, Katakana, Kanji, and Romaji all thrown in.

ROBBY NEALE: Yeah. So, like in this previous example where we go through the characters, a lot of the core Unicode support we've added to core TensorFlow -- I don't know exactly where we are in here. When we're searching for the scripts, those are ICU scripts, from the open-source Unicode library, and so you can just as well grab the Kanji; they have their own script tokens there.

AUDIENCE: Thank you.

ROBBY NEALE: Sure. We'll just switch sides, back and forth. Here you go.

AUDIENCE: Thanks for the information. For inference, do you send in text, or do you have to put it in a tensor and send that?

ROBBY NEALE: You can send text. So at inference time -- here, during training, we use this preprocessing function -- you can use that same preprocessing function. When you save the model, you give it a serving input function that does the preprocessing on your input. So if you send in full string sentences, you can use the same function, or a variation of it, in that input function, and it will process them.

AUDIENCE: Thank you.

ROBBY NEALE: Sure. Back over here.

AUDIENCE: Thank you very much. My question kind of relates to his question. What's the advantage of applying it with map versus having a layer that does it? Because you could, even with a Lambda layer, or with the new RFC for preprocessing layers, have a layer that does it.

ROBBY NEALE: Oh, a layer that does the preprocessing?

AUDIENCE: Yeah -- just apply that function, and then it's saved as part of the model.

ROBBY NEALE: Yeah. You could certainly do some of this with layers. Actually, we're looking at which layers we should provide, and someone on our team is helping Keras build out their preprocessing layers, which cover the basics. Where there's added functionality that we find people need, we'll supply it in our TensorFlow Text library. So it's really up to you, as the person building the model, how you want to apply those.

AUDIENCE: Thank you for the talk.

ROBBY NEALE: Sure.

AUDIENCE: Two quick questions. The first one is: do the tokenizers you provide also have a decoding function, so you can go from the tokens -- from the integers -- back to the sequence of text?

ROBBY NEALE: From the integers -- yeah, there's tf.strings.unicode_decode. Is that what you're talking about? From codepoints, or--

AUDIENCE: I'm talking about, for instance, if you decode them -- say we have a BERT vocabulary, then you have all these additional characters in there, and you want to concatenate them back into a sequence of proper text, right?
ROBBY NEALE: Yeah. So I think what you're asking is: if you send your word through a BERT model and get a vector representation, can you translate that vector representation back to text?

AUDIENCE: Yeah, because that's the case when you, for instance, generate text. Then you may want to map the sequence of generated tokens back into a string of text.

ROBBY NEALE: Yeah. I mean, this is more along the lines of an encoder-decoder model -- there are models that do this. It's not something we provide inside the library.

AUDIENCE: Well, you can take it -- fine. It's a slightly different question. The second question is: why is this not in core TensorFlow?

ROBBY NEALE: Yeah. So with 2.0, you might know that TensorFlow has gotten rid of contrib -- it had become unwieldy for the core team. It's too much to handle, the tests run too long, and it's too much for one team to maintain. So I think we'll see more modules like TensorFlow Text that are focused on one particular area. As a team, we want to make things easier across the board, so a lot of the work we've done on RaggedTensors and some of the string ops is actually in core TensorFlow. But some of the things that are outside that scope, like ngrams and tokenization, live in a separate module.

AUDIENCE: Thank you.

ROBBY NEALE: Sure.

AUDIENCE: Hi.

ROBBY NEALE: Go ahead.

AUDIENCE: Since TF.Text can be incorporated into the TensorFlow graph, is it intended that you actually build the model with this preprocessing step inside it? And if so, are there performance implications in TensorFlow Serving? Has that been measured?

ROBBY NEALE: Yeah, there's definitely some performance cost. It's done at the input level -- I'm just going to skip ahead to right here. This is actually a problem that tf.data is looking at: consuming this data and then parallelizing -- I said this wrong earlier today too -- parallelizing these input functions. So if your model is on GPUs or TPUs, the input is parallelized and you're feeding in as much data as possible. So it is something you might worry about and look at, but it's also something a lot of other people are looking at.

AUDIENCE: OK. And then, I guess, if it's part of the TensorFlow graph, in TensorFlow Serving, how are the nodes allocated and computed? Is the preprocessing on the CPU, or--

ROBBY NEALE: Yeah. Most of this is done on the CPU.

AUDIENCE: OK.

ROBBY NEALE: I'd say all of it, yeah.

AUDIENCE: Can I just quickly ask, are you compatible with TF 2? Because I just pip installed tensorflow_text, and it uninstalled TensorFlow 2 and installed 1.14.

ROBBY NEALE: So like I said, the versions need to match. If you just do pip install tensorflow_text==2.0.0 -- I think the reason it did that is that the 2.0 version is actually still a release candidate. So just install the RC, 2.0.0rc0, and it'll reinstall TensorFlow 2 for you.

AUDIENCE: OK.

ROBBY NEALE: So, yeah. Sure. Last question.

AUDIENCE: Yeah. First of all, I'd like to say that this is really cool. Second: does TF.Text integrate with other NLP libraries, such as spaCy or anything in that area? Just out of curiosity.

ROBBY NEALE: No, our focus is really just on TensorFlow right now.