12.1: What is word2vec? - Programming with Text

  • (bell rings)

  • - Hello, welcome to a new session from, I don't know,

  • is it the Machine Learning Course,

  • is it the Programming with Text Course, I don't know?

  • I'm just here.

  • I'm just a person who's here.

  • And this session which will be a whole bunch of videos,

  • is about a topic, word2vec.

  • (bell rings)

  • I'm ringing the bell way too much.

  • So, first of all, I want to mention something very important.

  • I've known about word2vec.

  • And I've used it in projects for a little while,

  • but I don't think I ever really understood it.

  • (laughs)

  • And I don't even know that I really do understand it.

  • But I definitely improved my understanding of it vastly,

  • after reading this amazing tutorial by Allison Parrish.

  • It's posted as a Gist on GitHub.

  • It's a Python Notebook, Understanding Word Vectors,

  • by Allison Parrish.

  • You know, honestly, if I'm being truthful,

  • you should just stop this video right now

  • and read this instead.

  • But, you know, some people seem

  • to like listening to me prattle on.

  • Which is fine, you could keep watching if you so choose.

  • Read this after then, at the very least.

  • And so this tutorial is released under

  • a Creative Commons BY 4.0 license.

  • The code itself is under the Creative Commons Zero (CC0) license.

  • So you can re-use this material,

  • which is what I'm doing right now.

  • I don't usually do this.

  • I mean my stuff is always based on other people's stuff,

  • but this first video I'm really going to like,

  • talk through what's in this tutorial in my own words.

  • But if you do the same, please reference it

  • with attribution according to the license.

  • Okay, so I also want to mention

  • that Allison Parrish has a wonderful talk,

  • it's on YouTube, I will link to it,

  • called Experimental Creative Writing

  • with the Vectorized Word,

  • from the Strange Loop conference.

  • So I also encourage you to take a look

  • at that as inspiration and background

  • for what it is I want to show you.

  • My end goal with this tutorial is to get to the point

  • where I have a p5.js JavaScript sketch in the browser,

  • where I can do stuff with word2vec.

  • What is word2vec?

  • The point of this video that you're watching right now,

  • which I'm taking a very long time to start,

  • is just to answer the question what is word2vec?

  • By the end of it, I want to use word2vec in the projects

  • to make weird stuff happen with text on a web page.

  • All right.

  • How are we feeling?

  • So, all right.

  • So let me come over here for a second.

  • Because I've written word2vec up here,

  • and that's going to help me.

  • The idea of word2vec, and models like it:

  • this is a machine learning process,

  • similar to other things that I've done

  • that looked at, like, classification.

  • Is this image a cat or a dog?

  • Or a regression analysis.

  • Can you predict the price of a house

  • based on certain properties of that house?

  • These are classic machine learning examples.

  • Word2vec is a particular machine learning model

  • that produces something called a word embedding.

  • Now, that's a very, very fancy term.

  • And what it means is that any given word,

  • like apple, can be associated

  • with numbers, a vector.

  • We can basically, somehow, come up

  • with this sort of numeric, mathematical

  • essence of this word, as some array of numbers.

  • Like 0.7 and 1.2

  • and -0.345, et cetera, et cetera.

  • And there's going to be some amount of numbers in here.
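
To make that concrete, here is a minimal sketch in JavaScript of what a word embedding boils down to: a lookup from a word to an array of numbers. The vectors below are invented for illustration only; a real word2vec model learns them from training data.

```javascript
// A word embedding, at its simplest: a lookup table mapping each
// word to an array of numbers. These particular vectors are made
// up for illustration; real word2vec vectors are learned from data.
const embeddings = {
  apple:  [0.7, 1.2, -0.345],
  purple: [0.3, -0.8, 0.9],
  plum:   [0.95, 0.4, 0.5]
};

console.log(embeddings['apple']); // [0.7, 1.2, -0.345]
```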

  • This seems like a crazy thing.

  • Why would I ever want to have a word

  • associated with an array of numbers?

  • Well one of the things that one can do

  • with arrays of numbers is math.

  • Linear algebra, multiplying,

  • subtracting, averaging, adding.

  • So, we know we can do that with arrays of numbers.
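
For example, element-wise arithmetic on two of these arrays might look like this, a minimal sketch assuming both vectors have the same length:

```javascript
// Element-wise math on arrays of numbers (assumes equal lengths).
function addVectors(a, b) {
  return a.map((value, i) => value + b[i]);
}

function subtractVectors(a, b) {
  return a.map((value, i) => value - b[i]);
}

function averageVectors(a, b) {
  return a.map((value, i) => (value + b[i]) / 2);
}

console.log(addVectors([1, 2, 3], [4, 5, 6])); // [5, 7, 9]
```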

  • And this is the kind of thing that happens

  • in lots of my other tutorials with programming graphics,

  • and pixel processing and machine learning.

  • But one thing we wouldn't know how to do

  • is how would we say, you know,

  • apple plus, I was going to say plus orange,

  • but that could be anything.

  • I was trying to come up with a good example.

  • This is what happens when you don't plan

  • these tutorials in advance.

  • I'll come up with an example on the fly.

  • Apple plus purple,

  • could this equal plum, maybe, right?

  • Like, in other words, like, I'm trying to come up

  • with some like pseudo math.

  • Like, let's take these two words and add them together.

  • Like cat plus cute, maybe that equals kitten.

  • And can I take, like, we're not talking about

  • concatenation, "applepurple."

  • We're talking about apple plus purple.

  • Could I get that sort of mathematical essence

  • of these words, add them together and get a new word?

  • Well, the theory, the prompt, the idea here,

  • the argument that I am making to you,

  • is that word2vec is a mechanism

  • by which you can do stuff like this,

  • right there in your code.

  • If I could quantify the word apple

  • as a series of numbers and I could quantify

  • the word purple as a series of numbers,

  • then couldn't I just add all those numbers together?

  • I would get a new series of numbers.

  • And then I might look and find which word

  • has a set of numbers that is closest

  • to this set of numbers.

  • How could I find the similarity?

  • I could calculate a similarity score

  • between any two sets of numbers.

  • I could find the word that is the most similar

  • to this plus this and maybe it would be plum.
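
Here is a sketch of that whole idea: add the two vectors, then scan an embeddings table for the word whose vector is most similar to the sum. Cosine similarity is one common similarity score; Euclidean distance would also work. The tiny table and the result are hypothetical, not what a real trained model would produce.

```javascript
// "apple plus purple": add two word vectors, then find the word
// whose vector is most similar to the result. The tiny table here
// is hypothetical; a real model has vectors for thousands of words.
const embeddings = {
  apple:  [0.7, 1.2, -0.345],
  purple: [0.3, -0.8, 0.9],
  plum:   [0.95, 0.4, 0.5],
  cat:    [-1.1, 0.2, 0.6]
};

function addVectors(a, b) {
  return a.map((value, i) => value + b[i]);
}

// Cosine similarity: 1 means pointing the same way, -1 opposite.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Find the most similar word, skipping the input words themselves.
function nearestWord(vec, table, exclude = []) {
  let best = null;
  let bestScore = -Infinity;
  for (const word of Object.keys(table)) {
    if (exclude.includes(word)) continue;
    const score = cosineSimilarity(vec, table[word]);
    if (score > bestScore) {
      bestScore = score;
      best = word;
    }
  }
  return best;
}

const sum = addVectors(embeddings.apple, embeddings.purple);
console.log(nearestWord(sum, embeddings, ['apple', 'purple'])); // 'plum' here
```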

  • Why would it be plum?

  • Is that magic?

  • Is it because of the data that the word2vec model

  • was trained on?

  • Well, yes, it's the latter.

  • But, and so I want to get to all of that.

  • Okay, this is my sort of like zoomed out view

  • of why we're doing this.

  • Let's come over and look at what Allison

  • has in her particular tutorial here.

  • Which is a really nice example.

  • If I look at this, we can say like,

  • well imagine like a really simple case, right?

  • I was sort of saying, over here, each word

  • gets a list of maybe 100 numbers, maybe it's 300 numbers,

  • maybe it's 1,000 numbers.

  • This is up to us to sort of figure out and decide

  • based on what we're trying to do.

  • But what if we simplify that?

  • And here's Allison's example.

  • Where each word gets essentially two numbers.

  • And those numbers are data properties of that word.

  • Like a cuteness score from zero to 100,

  • and a size from zero to 100.

  • So you could say a kitten is (95, 15).

  • A hamster is (80, 8), right?

  • There are these numbers where, sort of,

  • the label is tied to a set of data properties.

  • So if that's the case, then we can look,

  • we could graph all of those and we could say something like,

  • oh, you know, like a horse and a dolphin

  • are kind of like similar in terms of like size and cuteness.

  • And then we could start to do things,

  • but actually, like, we could do a mathematical analysis.

  • Like what is the actual Euclidean distance.

  • Euclidean distance means the number of,

  • well, in this case, pixels or units,

  • between these two words right here.

  • These are very similar because they're

  • physically close to each other.
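
In code, that two-dimensional version might look like the sketch below. The kitten and hamster scores come from the numbers mentioned above; the other animals' scores are invented for illustration, roughly in the spirit of Allison's example.

```javascript
// Each animal word gets two numbers: [cuteness, size], both 0-100.
// Kitten and hamster match the values mentioned above; the other
// scores are made up for illustration.
const animals = {
  kitten:    [95, 15],
  hamster:   [80, 8],
  tarantula: [8, 3],
  chicken:   [25, 30],
  horse:     [50, 75],
  dolphin:   [60, 60]
};

// Euclidean distance: how far apart two words sit in this space.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sum);
}

console.log(euclideanDistance(animals.horse, animals.dolphin));    // ~18, similar
console.log(euclideanDistance(animals.kitten, animals.tarantula)); // ~88, far apart
```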

  • And we can also do things, you can think of those

  • as vectors, and this is a nice demonstration of this idea.

  • This is why we talk about them as vectors, right?

  • I have a whole set of tutorials about vectors,

  • describing them as points in space.

  • So, for example, a vector, a velocity vector,

  • if I have a particle in a particle system,

  • and I want it to go from here to here,

  • this is its velocity.

  • Its change in location.

  • In essence, this is basically what I'm doing

  • with an operation like this.

  • For example what if I said,

  • okay, well apple is over here.

  • And then I'm going to add purple to it.

  • I'm going to move by purple's numbers,

  • and over here I now find plum.

  • So when we look at this in two dimensions,

  • it kind of makes sense, we can sort of, like,

  • our brains can understand that.

  • Two dimensions is like the easiest dimension.

  • I mean I actually find two dimensions

  • to be easier than one dimension.

  • One dimension is weird sometimes.

  • So, what Allison is showing here

  • is by moving from let's say,

  • one word to another word physically in space,

  • we can establish this idea of word relationships.

  • Chicken is to kitten as tarantula is to hamster.
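
That analogy can be written as vector math: take the offset from chicken to kitten, apply it to tarantula, and look for the nearest remaining word. Here is a sketch reusing the hypothetical animals table and the euclideanDistance function from the sketch above.

```javascript
// "a is to b as c is to ?": apply the offset from a to b onto c,
// then find the nearest remaining word. Reuses the animals table
// and euclideanDistance from the earlier sketch.
function analogy(a, b, c, table) {
  const target = table[b].map((v, i) => v - table[a][i] + table[c][i]);
  let best = null;
  let bestDist = Infinity;
  for (const word of Object.keys(table)) {
    if (word === a || word === b || word === c) continue;
    const d = euclideanDistance(target, table[word]);
    if (d < bestDist) {
      bestDist = d;
      best = word;
    }
  }
  return best;
}

console.log(analogy('chicken', 'kitten', 'tarantula', animals)); // 'hamster'
```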

  • Now this is all very arbitrary,

  • with like hard-coded word vectors.

  • So, but this is just for demonstration purposes

  • in two dimensions so that our brains

  • can kind of process it.

  • Ultimately, if we have a lot more information, somehow,

  • about all of these words in higher dimensional space,

  • in vectors that have 100 dimensions, 100 numbers,

  • we can't visualize that so easily.

  • There are interesting techniques

  • for dimensionality reduction,

  • reducing the dimensionality so that we could then draw,

  • like, word clusters and stuff.

  • like word clusters and stuff.

  • And maybe I'll get to that later.

  • But what I'm trying to say here

  • is that we can establish sophisticated

  • complex relationships between words

  • in higher dimensional space.

  • But in order to do that it's useful

  • to look at a single example that ties words

  • to numbers in a low dimensional space,

  • that we can either visualize or if we like,

  • put into our brains.

  • And so, I've kind of described to you

  • what word2vec is.

  • What the model looks like when it's complete.

  • I haven't talked at all about the training process, right?

  • The animals example is hard-coded.

  • I'm going to show you.

  • I'm going to do a port of one of Allison's examples

  • of words associated with colors associated

  • with numbers, right?

  • The word red is (255, 0, 0), that's a word to a vector.

  • And that's going to be from a dataset.
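
As a preview, that color example might be sketched like this: each color name maps to a three-number RGB vector, and the same nearest-neighbor idea applies. The handful of colors here is a stand-in for the full dataset of named colors that the port would load.

```javascript
// Color names as word vectors: each word maps to its [r, g, b]
// values. A tiny stand-in for a real dataset of named colors.
const colors = {
  red:    [255, 0, 0],
  green:  [0, 255, 0],
  blue:   [0, 0, 255],
  purple: [128, 0, 128],
  pink:   [255, 192, 203]
};

console.log(colors['red']); // [255, 0, 0] -- the word red as a vector
```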

  • And then the third thing that I'm going to do

  • is look at what is traditionally thought

  • of as word2vec.

  • These higher dimensional large dictionaries

  • of words and their associated vectors.

  • Those word embeddings.

  • So that's going to be the journey here.

  • I don't know how many videos it's going to be.

  • Three, four, five, 471.

  • Something like that.

  • And then at some point I'll also try

  • to do some projects with that.

  • So in the next video I'm going to do a port

  • of Allison's project; you can find

  • all of the code in Python in that tutorial,

  • which is linked in the description.

  • And I'm going to do a JavaScript port of it.

  • Okay, so I'll see you there.

  • Maybe, maybe not.

  • Go read that page, it's excellent.

  • Okay, goodbye.

  • (upbeat music)

(bell rings)
