Katrin Erk>> It's so incredibly hard to get computers to process language, which is ridiculous
because little children can speak. Little kids learn language
by being in the middle of the world. So, they have all these
sights and sounds, and they have language to go along with that.
The way that computers learn language nowadays is
more like if you sat a baby down with
a huge pile of The New York Times and said, "OK, here you go, now learn language."
One problem with human language is that it's horribly
ambiguous. Take, for example, the word "run." You can
run a race, run a company, or run your car into a bog.
Traditionally what people have done is try to get the computer to
pick up on patterns using a dictionary. So,
here is this clear list of senses: run has, say, 20 senses,
and here's the list. And then,
if you had one occurrence like "he ran the company" you would have to say,
"OK, that is sense #12 and not sense #11 and not sense #13."
Trouble is, all of those meanings are somewhat related
but they're still different, because you draw different conclusions.
Now, what's the poor computer to do? So, I'm thinking
maybe we can't distinguish senses as clearly as saying, here's where one
sense begins and here's where the next sense stops. What I'm doing
is to represent each time you use a word like run
with the context in which it appears. So, all of this context
you can put into numbers and then represent such a context as a point
in a high dimensional space. And in order to represent
words as these points in high dimensional space I need a whole
lot of data. I need to have all that context, and I need to have it in a form
that I can count, which means 100 million words is good,
a billion is better, 2 billion or 3 billion words, yeah,
then you can actually get decent models. But if you want to
compute with that amount of data, and you do this on a single desktop machine,
then you'd better be prepared to wait for a long time, and I did that for a while.
I'd start an experiment, wait three weeks to see how it came out,
that's painful, really painful.
With TACC I can do the same things in a couple of hours. So, we need supercomputing because
all of natural language processing these days, and in particular anything to do
with word meanings, needs a lot of data, and I mean a lot of data.
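A minimal sketch of the idea of putting all that context into numbers and treating each occurrence of a word as a point in a high-dimensional space, assuming a toy three-sentence corpus and simple co-occurrence counting in a fixed window (the corpus, window size, and function names are illustrative, not the actual models or data described in the talk):

```python
from collections import Counter

# Toy corpus; real models need hundreds of millions to billions of words.
corpus = [
    "she ran the race in record time",
    "he ran the company for ten years",
    "they ran the car into a bog",
]

def context_vector(target: str, sentence: str, window: int = 3) -> Counter:
    """Count the words within `window` positions of each occurrence of `target`."""
    words = sentence.split()
    counts = Counter()
    for i, w in enumerate(words):
        if w == target:
            counts.update(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
    return counts

# One point per occurrence of "ran": each Counter is a sparse vector in a space
# whose dimensions are the context words observed in the corpus.
for sentence in corpus:
    print(sentence, "->", dict(context_vector("ran", sentence)))
```

Each occurrence gets its own point in that space, so related but distinct uses of a word can sit nearer or farther from each other without ever being forced into a single numbered sense, and the counts only become reliable once the corpus is very large.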