Vectoring Words (Word Embeddings) - Computerphile

  • If we're moving from 'cat', which is a similar thing...

  • So we go away from 'cat', towards 'dog', and we've gotta go beyond in that direction.

  • Yes.

  • So the first result is dogs, which is kind of a nonsense result.

  • The second is pit bull.

  • So that's like the doggiest of dogs, right?

  • The least cat-like dog. That feels right.

  • Yeah.

  • Yeah.

  • What if you go the other way?

  • What?

  • The most cat-like cat, the most un-dog-like. Let's find out.

  • It's gonna be 'kitten', right?

  • It's gotta be 'cats'.

  • 'Feline', 'kitten'.

  • It's not really giving us anything much to work with.

  • I thought I would talk a little bit about word embeddings, word2vec, and word embeddings in general. The way I was introduced to word embeddings, the sort of context that I'm most familiar with them in, is: how do you represent a word to a neural network?

  • Well, it's a set of characters, isn't it?

  • I mean, does it need to be more than the set of characters that make it up?

  • Right, so you could do that.

  • But do you remember the thing we were talking about before, in language models?

  • You have a problem of how far back you can look.

  • I would much rather be able to look back 50 words than 50 characters.

  • And if you're training a character-based model, a lot of the capacity of your network is gonna be used up...

  • ...just learning what combinations of characters count as valid words, right?

  • What combinations of characters are words.

  • And so if you're trying to learn something more complicated than that, you're spending a lot of your training time just learning what words are.

  • And a lot of your network capacity is being used for that as well.

  • But this isn't a hard problem.

  • We know what the words are, right?

  • You can give the thing a dictionary, and that kind of gives it a jump start.

  • The point is, neural networks...

  • They view things as a vector of real numbers, or a vector of floats, which are a subset of the real numbers. And so if you think about something like an image, representing an image in this way is fairly straightforward.

  • You just take all of the pixels and put them in a long row.

  • If they're black, then it's zero.

  • And if they're white.

  • Then it's one, and you just have greyscale in between.

  • It's fairly straightforward.

  • And so then you end up with a vector.

  • That represents that image.

  • It's a reasonably good representation.

  • It sort of reflects some elements of the structure of what you're actually talking about.

  • So if you take the same image and make it a little bit brighter, for example, that is just making that vector a bit longer, right, or a point in that configuration space that's a bit further from the origin. You could make it darker by moving it closer to the origin, by reducing the length of the vector.

  • If you take an image and you apply a small amount of noise to it, that represents just jiggling that vector around slightly in that configuration space.

  • So you've got a sense in which two vectors that are close to each other are actually kind of similar images, and some of the directions in the vector space are actually meaningful in terms of something that would make sense for images. And the same is true with numbers and whatever else.
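
As a concrete sketch of that idea (using numpy and a made-up 2×2 greyscale image, not anything from the video): flattening the pixels gives the vector, scaling it changes the brightness, and adding a little random noise jiggles the point slightly in that configuration space.

```python
import numpy as np

# A made-up 2x2 greyscale image with values in [0, 1] (0 = black, 1 = white).
image = np.array([[0.2, 0.8],
                  [0.5, 0.1]])

v = image.flatten()                    # the image as a vector: [0.2, 0.8, 0.5, 0.1]

brighter = np.clip(v * 1.2, 0.0, 1.0)  # brighter ~ a longer vector, further from the origin
darker   = v * 0.8                     # darker  ~ shorter, closer to the origin
noisy    = v + np.random.normal(scale=0.01, size=v.shape)  # noise ~ a small jiggle nearby

print(np.linalg.norm(v), np.linalg.norm(brighter), np.linalg.norm(darker))
```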

  • And this is very useful when you're training, because it allows you to say: if your neural network is trying to predict a number, and the value you're looking for is 10 and it gives you 9...

  • You can say, no, but that's close.

  • And if it gave you 7,000, you can say no, and it's not close.

  • And that gives more information that allows the system to learn.

  • And in the same way you can say, yeah, that's almost the image that I want. Whereas if you give the thing a dictionary of words, say you've got 10,000 words, the usual way of representing this is with a one-hot vector: if you have 10,000 words, you have a vector that's 10,000 long, 10,000 dimensions, and all of the values are zero apart from one of them, which is one. So, like, the first word in the dictionary...

  • If it's, say, 'a', then that's represented by a one and then the rest of the 10,000 are zeroes, and then the second one is a zero, then a one, and so on. But there you're not giving any of those clues.

  • If the thing is looking for one word and it gets a different word, all you can say is, yeah, that's the correct one, or no, that's not the correct one.
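
A minimal sketch of that one-hot representation, assuming a hypothetical 10,000-word vocabulary (only the first few dictionary entries are spelled out here):

```python
import numpy as np

# Hypothetical tiny start of a 10,000-word dictionary.
vocab = ["a", "aardvark", "abacus"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str, vocab_size: int = 10_000) -> np.ndarray:
    """All zeros except a single 1 at the word's position in the dictionary."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("a")[:5])         # [1. 0. 0. 0. 0.]
print(one_hot("aardvark")[:5])  # [0. 1. 0. 0. 0.]
```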

  • Something that you might try, but you shouldn't, because it's a stupid idea: rather than giving it as a one-hot vector, you could just give it as a number.

  • But then you've got this implication that two words that are next to each other in the dictionary are similar, and that's not really true, right?

  • Like if you have a language model and you're trying to predict the next word, and it's saying 'I love playing with my pet ___', the word you're looking for is 'cat' and the word it gives you is 'car'.

  • Lexicographically they're pretty similar, but you don't want to be saying to your network, you know, close...

  • ...that was very nearly right, because it's not very nearly right.

  • It's a nonsense prediction.

  • But then, if it said, like, 'dog', you should be able to say, no...

  • ...but that's close, right? Because that is a plausible completion for that sentence.

  • And the reason that makes sense is that 'cat' and 'dog' are similar words.

  • What does it mean for a word to be similar to another word?

  • And so the assumption that word embeddings use is that two words are similar if they are often used in similar contexts.

  • So if you look at all of the instances of the word 'cat' in a giant corpus of text, and all of the instances of the word 'dog', they're gonna be surrounded by words like 'pet', and words like 'feed', and words like 'play'.

  • And, you know, that kind of thing: 'cute', and such, right.

  • So that gives some indication that these are similar words. The challenge that word embeddings are trying to solve is: how do you represent words as vectors?

  • Such that two similar words get two similar vectors, and possibly so that directions have some meaning as well.

  • Because then that should allow our networks to better understand what we're talking about in the text.
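
One common way of making 'similar vectors' precise is cosine similarity, which looks at the angle between two vectors rather than their lengths; a small sketch with made-up 3-dimensional 'embeddings' (real ones are far longer):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means pointing the same way, 0.0 means orthogonal (unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors purely for illustration.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.4])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, dog))  # high: appear in similar contexts
print(cosine_similarity(cat, car))  # lower: appear in different contexts
```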

  • So the thing people realised was: if you have a language model that's able to get good performance at, like, predicting the next word in a sentence...

  • And the architecture of that model is such that it doesn't have that many neurons in its hidden layers...

  • It has to be compressing that information down efficiently.

  • So you've got the inputs to your network.

  • Let's say, for the sake of simplicity, your language model is just taking a word and trying to guess the next word.

  • So we only have to deal with having one word as the input.

  • So our input is this very tall thing, right?

  • 10,000 tall, and these then feed into a hidden layer, which is much smaller.

  • I mean more than five.

  • But it might be like a few hundred.

  • Maybe, let's say 300.

  • And these are sort of the connections: all of these are connected to all of these, and it feeds in, and then coming out the other end you're back out to 10,000 again, right?

  • Because your output... it's gonna make one of these high.

  • You do something like softmax to turn that into a probability distribution.

  • So you give it a word from the dictionary.

  • It then does something, and what comes out...

  • ...the other end is a probability distribution, where you can just look at the highest value on the output, and that's what it thinks the next word will be. And the higher that value is, the more confident it is.

  • But the point is, you're going from 10,000 to 300 back out to 10,000.

  • So this 300 has to be...

  • If this is doing well at its task, this 300 has to be encoding, sort of compressing, information about the word, because the information is passing through, and it's going through this thing that's only 300 wide.

  • So in order to be good at this task, it has to be doing this.
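
A minimal sketch of that bottleneck shape, assuming a 10,000-word vocabulary and a 300-unit hidden layer (plain numpy with untrained random weights, not the original word2vec code):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 300

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(vocab_size, hidden_size))   # 10,000 -> 300
W_out = rng.normal(scale=0.01, size=(hidden_size, vocab_size))  # 300 -> 10,000

def predict_next_word_distribution(word_index: int) -> np.ndarray:
    """Forward pass for a single one-hot input word."""
    hidden = W_in[word_index]            # multiplying by a one-hot vector just selects a row
    logits = hidden @ W_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = predict_next_word_distribution(42)
print(probs.shape, probs.sum())          # (10000,) 1.0
```

After training, that 300-number row `W_in[word_index]` is the compressed representation of the word, and it is this row that gets pulled out and used as the word's embedding.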

  • So then they were thinking, Well, how do we pull that knowledge out?

  • It's kind of like an egg drop competition.

  • This is where you have to devise some method of safely getting an egg to the ground, right?

  • It's not like the teachers actually want to get an egg safely to the ground, right? But they've chosen the task such that, if you can do well at it, you have to have learned some things about physics, and things about engineering, and probably teamwork.

  • And yes, right, Right.

  • Exactly.

  • So it's the friends you make along the way.

  • So the way that they build this is, rather than trying to predict the next word...

  • Although that will work; that will actually give you word embeddings.

  • But they're not that good, because they're only based on the immediately adjacent word. Instead, you look sort of around the word.

  • So you give it a word, and then you sample from the neighbourhood of that word, randomly.

  • Another word.

  • And you train the network to predict that.

  • So the idea is that at the end, when this thing is fully trained, you give it any word and it's gonna give you a probability distribution over all of the words in your dictionary, which is like: how likely is each of these words to show up within five words of this first word?

  • Or within 10, or something like that.

  • If this system can get really good at this task, then the weights of this hidden layer in the middle have to encode something meaningful about that input word.

  • And so if you imagine the word 'cat' comes in: in order to do well, the probability distribution of surrounding words is gonna end up looking pretty similar to the output that you would want for the word 'dog'.

  • So it's gonna have to put those two words close together if it wants to do well at this task.
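
A minimal sketch of how those (word, context-word) training pairs could be sampled, assuming a toy tokenised corpus and a five-word window (an illustration of the idea, not the original training code):

```python
import random

# Hypothetical toy corpus; the real thing is trained on billions of words.
corpus = "i love playing with my pet cat and my pet dog in the garden".split()
window = 5

def sample_training_pair(tokens, window_size):
    """Pick a centre word, then a random context word within the window around it."""
    centre = random.randrange(len(tokens))
    lo = max(0, centre - window_size)
    hi = min(len(tokens), centre + window_size + 1)
    context = random.choice([i for i in range(lo, hi) if i != centre])
    return tokens[centre], tokens[context]

for _ in range(5):
    print(sample_training_pair(corpus, window))  # e.g. ('pet', 'cat'), ('my', 'dog'), ...
```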

  • And that's literally all you do.

  • It's absurdly simple, right?

  • But if you run it on a large enough dataset, and give it enough compute to actually perform really well, it ends up giving you, for each word...

  • ...a vector...

  • ...that's of length however many units you have in your hidden layer, for which the nearness of those vectors expresses something meaningful about how similar the contexts are that those words appear in. And our assumption is that words that appear in similar contexts are similar words, and it's slightly surprising how well that works and how much information it's able to extract.

  • So it ends up being a little bit similar, actually, to the way that the generative adversarial network does things, where we're training it to produce good images from random noise.

  • And in the process of doing that, it creates this mapping from the latent space to images, where doing basic arithmetic, like just adding and subtracting vectors in the latent space, actually produces meaningful changes in the image.

  • So what you end up with is that same principle, but for words.

  • So if you take, for example, the vector... and it's required by law that all explanations of word embeddings use the same example to start with.

  • So, if you take the vector for 'king', subtract the vector for 'man' and add the vector for 'woman', you get another vector out.

  • And if you find the nearest point in your word embeddings to that vector, it's the word 'queen'.

  • And so there's a whole giant swathe of ways that ideas about gender are encoded in the language, which are all kind of captured by this vector, which we won't get into.

  • But it's interesting to explore.

  • I have it running, and we can play around with some of these vectors and see where they end up.

  • So I have this running in Google Colab, which is very handy.

  • I'm using word embeddings that were trained with the word2vec algorithm on Google News.

  • Each word is mapped to 300 numbers.
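
A sketch of how you might load a comparable set of pretrained vectors in Python, assuming the `word2vec-google-news-300` model available through gensim's downloader (which appears to match the 300-dimensional Google News word2vec model described here):

```python
import gensim.downloader as api

# Downloads the pretrained 300-dimensional Google News word2vec vectors on first use (~1.6 GB).
model = api.load("word2vec-google-news-300")

print(model["cat"].shape)  # (300,) -- each word is mapped to 300 numbers
```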

  • Let's check whether what we've got satisfies our first condition.

  • We want 'dog' and 'cat' to be relatively close to each other, and we want 'cat' to be further away from 'car' than it is from 'dog', right?

  • We can just measure the distance between these different vectors.

  • I believe you just do model dot distance... the distance between 'car' and 'cat'.

  • Okay, 0.784. And then the distance between, let's say, 'dog' and 'cat': 0.23. Right, 'dog' and 'cat' are closer to each other.
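
Continuing with the model loaded above, that check looks roughly like this (gensim's `distance` is 1 minus cosine similarity, so smaller means more similar):

```python
print(model.distance("car", "cat"))  # ~0.78 in the demo above
print(model.distance("dog", "cat"))  # ~0.23 -- noticeably closer
```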

  • This is a good start, right?

  • And in fact, we can... let's find all of the words that are closest to 'cat', for example.

  • Okay, so the most similar word to 'cat' is 'cats', which makes sense, followed by 'dog', 'kitten', 'feline', 'beagle', 'puppy', 'pup', 'pet', 'felines' and 'chihuahua'.

  • Right, so this is already useful.

  • It's already handy that you can throw any word at this and it will give you a list of the words that are similar.

  • Whereas if I put in 'car', I get 'vehicle', 'cars'...

  • ...'SUV', 'minivan'...

  • ...'truck'.
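
In gensim that nearest-neighbour query is the `most_similar` method; a sketch matching the demo above (exact lists depend on the pretrained vectors):

```python
print(model.most_similar("cat", topn=10))  # cats, dog, kitten, feline, ...
print(model.most_similar("car", topn=5))   # vehicle, cars, SUV, minivan, ...
```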

  • Right.

  • So this is working.

  • The question of directions is pretty interesting.

  • So, yes, let's do the classic example.

  • Which is this:

  • If you take the vector for 'king', subtract the vector for 'man'...

  • ...add the vector for 'woman'.

  • What you get, somewhat predictably, is 'queen'.

  • If you put in 'boy' here, you get 'girl'.

  • If you put in 'father', you get 'mother'.

  • Yeah, and if you put in 'shirt', you get 'blouse'.
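
With the same loaded model, these analogy queries can be written with `most_similar`, which adds the `positive` vectors and subtracts the `negative` ones before finding the nearest word (outputs depend on the pretrained vectors):

```python
# king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same gender direction applied to other words.
for word in ["boy", "father", "shirt"]:
    print(word, "->", model.most_similar(positive=[word, "woman"], negative=["man"], topn=1))
```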

  • So this is reflecting something about gender.

  • That's in the dataset that it's using.

  • This reminds me a little bit of the unicorn thing, where, you know, the transformer appeared to have all sorts of knowledge about the world because of language, right?

  • Right.

  • But the thing that I like about this is that that transformer is working with 1.5 billion parameters, and here we've literally just taken each word and given it 300 numbers.

  • You know, if I go from 'London' and then subtract 'England' and then add, I don't know, 'Japan'...

  • We'd hope for Tokyo.

  • We hope for Tokyo, and we get Tokyo.

  • We got Tokyo twice.

  • Weirdly. Tokyo!

  • Tokyo!

  • Why?

  • Oh!

  • Oh, sorry.

  • No, we don't.

  • We get 'Tokyo' and 'Toyko', a typo, I guess.

  • And so, yeah, 'USA': 'New York'.

  • Huh, interesting.

  • Maybe it's thinking largest city of... Right, right?

  • Like the exact relationship here isn't clear.

  • We haven't specified that.

  • What does it give us for Australia?

  • I bet it's Sydney.

  • Sydney, Sydney, Melbourne!

  • So, yeah, it's not doing capital.

  • It's just the largest city, right?

  • But that's cool.

  • It's cool that we can extract the largest city like this, and it's completely unsupervised.

  • It was just given a huge number of news articles, I suppose, and it's pulled out that there's this relationship, and you can follow it for different things.

  • You can take the vector from 'pig' to 'oink', right?

  • And then, like, you put 'cow' in there, that's 'moo'.

  • You put 'cat' in there and you get 'meowing'.

  • You put 'dog' in there, you get 'barks'. Right, close enough for me?
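
That 'sound an animal makes' direction can be sketched with the same kind of analogy query as before, taking the oink-minus-pig offset and adding it to other animals (again, exact outputs depend on the pretrained vectors):

```python
# Apply the pig -> oink direction to other animals.
for animal in ["cow", "cat", "dog"]:
    print(animal, "->", model.most_similar(positive=[animal, "oink"], negative=["pig"], topn=1))
```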

  • Yeah, yeah.

  • But then it gets surreal.

  • You put 'Santa' in there.

  • Ho, ho, ho!

  • Right.

  • What does the fox say?

  • Good.

  • This'll be... 'Phoebe'.

  • What?

  • So it doesn't know?

  • Basically, although the second thing is 'chittering'. Do foxes chitter? Gabble? I don't know.

  • Ring-ding-ding...

  • ...ding-ding-ding-ding-dingeringeding!

  • Not in this dataset.
