12.1: What is word2vec? - Programming with Text

(bell rings) - Hello, welcome to a new session from, I don't know, is it the Machine Learning course, is it the Programming with Text course? I don't know. I'm just here. I'm just a person who's here. And this session, which will be a whole bunch of videos, is about a topic: word2vec. (bell rings) I'm ringing the bell way too much.

So, first of all, I want to mention something very important. I've known about word2vec, and I've used it in projects for a little while, but I don't think I ever really understood it. (laughs) And I don't even know that I really do understand it now. But I definitely improved my understanding of it vastly after reading this amazing tutorial by Allison Parrish. It's posted as a gist on GitHub: a Python notebook, Understanding Word Vectors, by Allison Parrish. Honestly, if I'm being truthful, you should just stop this video right now and read that instead. But some people seem to like listening to me prattle on, which is fine; you can keep watching if you so choose. Read it afterwards, at the very least.

This tutorial is released under a Creative Commons BY 4.0 license, and the code itself under a Creative Commons Zero license. So you can reuse this material, which is what I'm doing right now. I don't usually do this. I mean, my stuff is always based on other people's stuff, but in this first video I'm really going to talk through what's in that tutorial in my own words. If you do the same, please reference it with attribution according to the license. I also want to mention that Allison Parrish has a wonderful talk on YouTube, which I will link to, called Experimental Creative Writing with the Vectorized Word, from the Strange Loop conference. I encourage you to take a look at that as inspiration and background for what I want to show you.

My end goal with this tutorial is to get to the point where I have a p5.js sketch in the browser where I can do stuff with word2vec. What is word2vec? The point of this video that you're watching right now, which I'm taking a very long time to start, is just to answer that question. By the end of it, I want to use word2vec in projects to make weird stuff happen with text on a web page. All right. How are we feeling?

So let me come over here for a second, because I've written word2vec up here, and that's going to help me. Word2vec is a machine learning process, similar to other things that I've done, like classification (is this image a cat or a dog?) or regression analysis (can you predict the price of this house based on certain properties of that house?). These are classic machine learning examples. Word2vec is a particular machine learning model that produces something called a word embedding. Now, that's a very, very fancy term, and what it means is that any given word, like apple, can be associated with numbers: a vector. We can somehow come up with a sort of numeric, mathematical essence of this word, as some array of numbers, like 0.7 and 1.2 and -0.345, et cetera, et cetera. And there's going to be some amount of numbers in here. This seems like a crazy thing. Why would I ever want to have a word associated with an array of numbers? Well, one of the things that one can do with arrays of numbers is math: linear algebra, multiplying, subtracting, averaging, adding.
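To make that concrete, here's a minimal JavaScript sketch of the idea. The vectors are invented for illustration (a real word2vec model learns these values from data): a word maps to a plain array of numbers, and "adding two words" just means adding their arrays element-wise.

```javascript
// A toy sketch of the core idea: each word maps to an array of numbers
// (a vector), and arrays of numbers support math. These values are
// made up for illustration; a real word2vec model learns them from data.
const embeddings = {
  apple:  [0.7, 1.2, -0.345],
  purple: [0.1, -0.4, 0.9],
};

// Element-wise addition of two vectors of the same length.
function addVectors(a, b) {
  return a.map((value, i) => value + b[i]);
}

const result = addVectors(embeddings.apple, embeddings.purple);
console.log(result); // roughly [0.8, 0.8, 0.555] -- a new point in "word space"
```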
So, we know we can do that with arrays of numbers, and that's the kind of thing that happens in lots of my other tutorials with programming graphics, pixel processing, and machine learning. But one thing we wouldn't know how to do is, how would we say apple plus... I was going to say plus orange, but that could be... I was trying to come up with a good example. This is what happens when you don't plan these tutorials in advance; I have to come up with an example on the fly. Apple plus purple, could this equal plum, maybe? In other words, I'm trying to come up with some pseudo-math: let's take these two words and add them together. Like cat plus cute, maybe that equals kitten. And we're not talking about concatenation, apple-purple. We're taking apple plus purple. Could I get that sort of mathematical essence of these words, add them together, and get a new word?

Well, the theory, the prompt, the idea here, the argument that I am making to you, is that word2vec is a mechanism by which you can do stuff like this, right there in your code. If I could quantify the word apple as a series of numbers, and I could quantify the word purple as a series of numbers, then couldn't I just add all those numbers together? I would get a new series of numbers. And then I might look and find which word has a set of numbers that is closest to this set of numbers. How could I find the similarity? I could calculate a similarity score between any two sets of numbers. I could find the word that is the most similar to this plus this, and maybe it would be plum. Why would it be plum? Is that magic? Is it because of what data that word2vec model was trained on? Well, yes, it's the latter. And I want to get to all of that.

Okay, that's my sort of zoomed-out view of why we're doing this. Let's come over and look at what Allison has in her tutorial, which is a really nice example. Imagine a really simple case. I was sort of saying, over here, that each word gets a list of maybe 100 numbers, maybe it's 300 numbers, maybe it's 1,000 numbers. That's up to us to figure out and decide based on what we're trying to do. But what if we simplify that? Here's Allison's example, where each word gets essentially two numbers, and those numbers are data properties of that word: a cuteness score from zero to 100, and a size from zero to 100. So you could say a kitten is 95, 15. A hamster is 80, 8. The label is tied to a set of data properties.

If that's the case, then we could graph all of those, and we could say something like, oh, a horse and a dolphin are kind of similar in terms of size and cuteness. And we could actually do a mathematical analysis: what is the actual Euclidean distance, meaning the number of, well, in this case pixels or units, between these two words right here? These are very similar because they're physically close to each other. This is a nice demonstration of why we talk about these as vectors. I have a whole set of tutorials about vectors, describing points in space.
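Here's a rough JavaScript sketch of that two-number setup. Only the kitten and hamster values come from the example above; the other animals' numbers are invented stand-ins so there's something to compare against. Euclidean distance between two vectors then acts as a similarity measure: smaller distance, more similar words.

```javascript
// Each animal is a [cuteness, size] vector on a 0-100 scale.
const animals = {
  kitten:    [95, 15], // values quoted in the example above
  hamster:   [80, 8],  // values quoted in the example above
  chicken:   [25, 10], // invented stand-in
  tarantula: [8, 3],   // invented stand-in
  horse:     [50, 78], // invented stand-in
  dolphin:   [60, 75], // invented stand-in
};

// Euclidean distance between two vectors: smaller means more similar.
function distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sum);
}

// Find the word whose vector is closest to a given vector.
function closestWord(vector, lexicon, exclude = []) {
  let best = null;
  let bestDistance = Infinity;
  for (const [word, v] of Object.entries(lexicon)) {
    if (exclude.includes(word)) continue;
    const d = distance(vector, v);
    if (d < bestDistance) {
      bestDistance = d;
      best = word;
    }
  }
  return best;
}

console.log(distance(animals.horse, animals.dolphin)); // ~10.4: close together
console.log(closestWord([90, 10], animals));           // "kitten"
```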
So, for example, a velocity vector: if I have a particle in a particle system and I want it to go from here to here, this is its velocity, its change in location. In essence, that's basically what I'm doing with an operation like this. For example, what if I said, okay, apple is over here, and then I'm going to add purple to it. I'm going to move by purple's numbers, and over here I now find plum. When we look at this in two dimensions, our brains can sort of understand it. Two dimensions is like the easiest dimension. I actually find two dimensions to be easier than one dimension; one dimension is weird sometimes.

So what Allison is showing here is that by moving from, let's say, one word to another word physically in space, we can establish this idea of word relationships: chicken is to kitten as tarantula is to hamster. Now, this is all very arbitrary, with hard-coded word vectors, but it's just for demonstration purposes in two dimensions so that our brains can kind of process it. Ultimately, if we have a lot more information, somehow, about all of these words in higher-dimensional space, in vectors that have 100 dimensions, 100 numbers, we can't visualize that so easily. There are interesting techniques for dimensionality reduction that would let us draw word clusters and stuff, and maybe I'll get to that later. But what I'm trying to say here is that we can establish sophisticated, complex relationships between words in higher-dimensional space. In order to do that, though, it's useful to look at a single example that ties words to numbers in a low-dimensional space that we can visualize or, if we like, put into our brains.

And so, I've kind of described to you what word2vec is, what the model looks like when it's complete. I haven't looked at all at the training process; the animals example is hard-coded. Next, I'm going to show you a port of one of Allison's examples of words associated with colors associated with numbers: the word red is 255, 0, 0. That's a word to a vector, and that's going to come from a dataset. And then the third thing that I'm going to do is look at what is traditionally thought of as word2vec: these higher-dimensional, large dictionaries of words and their associated vectors, those word embeddings. So that's going to be the journey here. I don't know how many videos it's going to be. Three, four, five, 471, something like that. And at some point I'll try to also do some projects with that.

So in the next video I'm going to do a JavaScript port of Allison's project; you can find all of the code in Python in that tutorial, which is linked in the description. Okay, so I'll see you there. Maybe, maybe not. Go read that page, it's excellent. Okay, goodbye. (upbeat music)
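One last sketch: the analogy arithmetic described above (chicken is to kitten as tarantula is to hamster), expressed as vector math. It assumes the toy `animals` lexicon and the `closestWord()` helper from the previous sketch; real word2vec does the same thing with learned, higher-dimensional vectors rather than hand-picked ones.

```javascript
// "chicken is to kitten as tarantula is to ___?"
// Assumes `animals` and closestWord() from the sketch above.
function analogy(a, b, c, lexicon) {
  // The relationship "a is to b" is the vector that moves from a to b...
  const offset = lexicon[b].map((v, i) => v - lexicon[a][i]);
  // ...and the analogy applies that same movement starting at c.
  const target = lexicon[c].map((v, i) => v + offset[i]);
  // The nearest remaining word in the lexicon is the answer.
  return closestWord(target, lexicon, [a, b, c]);
}

console.log(analogy('chicken', 'kitten', 'tarantula', animals)); // "hamster"
```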