Are you surprised at the advances
that have come in the last several years?
Oh, yes, definitely. I didn’t imagine
it would become this impressive.
What’s strange to me
is that we create these models,
but we don’t really understand
how the knowledge is encoded.
To see what’s in there,
it’s almost like a black box,
even though we can see the innards.
So in understanding why it does so well,
or so poorly, we’re still pretty naive.
One thing I’m really excited about
is our lack of understanding
of both types of intelligence,
artificial and human.
It really opens up new intellectual problems.
There’s something odd
about how these large language models,
that we often call LLMs,
acquire knowledge in such an opaque way.
They can perform some tests extremely well,
while surprising us
with silly mistakes somewhere else.
It’s been interesting that,
even when it makes mistakes,
sometimes if you just
change the prompt a little bit,
then all of a sudden it gets it right.
So even that boundary is somewhat fuzzy,
as people play around.
Totally.
Quote-unquote "prompt engineering"
became a bit of a black art
where some people say that you have to
really motivate the transformers
in the way that you motivate humans.
One custom instruction that I found online
was supposed to be about
how you first tell LLMs
“you are brilliant at reasoning,
you really think carefully,”
then somehow the performance is better,
which is quite fascinating.
But I find two very divisive reactions
to the different results that you can get
from prompt engineering.
On one side, there are people
who tend to focus primarily
on the success case.
So long as there is one answer
that is correct, it means
the transformers, or LLMs,
do know the correct answer;
it’s your fault that
you didn’t ask nicely enough.
Whereas on the other side,
there are people who tend to focus
a lot more on the failure cases
and conclude that therefore nothing works.
Both are some sort of extremes.
The answer may be
somewhere in between,
but this does reveal
surprising aspects of this thing. Why?
Why does it make
these kinds of mistakes at all?
We saw a dramatic improvement
from the models the size of GPT-3
going up to the size of GPT-4.
I thought of 3 as kind of a funny toy,
almost like a random sentence generator
that I wrote 30 years ago.
It was better than that,
but I didn’t see it as that useful.
I was shocked that GPT-4,
used in the right way,
can be pretty powerful.
If we go up in scale,
say another factor of 10 or 20 above GPT-4,
will that be a dramatic improvement,
or a very modest improvement?
I guess it’s pretty unclear.
Good question, Bill.
I honestly don’t know
what to think about it.
There’s uncertainty,
is what I’m trying to say.
I feel there’s a high chance
that we’ll be surprised again
by an increase in capabilities.
And then we will also be really surprised
by some strange failure modes.
More and more, I suspect that
the evaluation will become harder,
because people tend to have a bias
towards believing the success case.
We do have cognitive biases in the way that
we interact with these machines.
They are more likely to be adapted
to those familiar cases,
but then when you really start trusting them,
they might betray you
with unexpected failures.
Interesting time, really.
One domain where it’s not as good,
which is almost counterintuitive, is mathematics.
You almost have to laugh that
something like a simple Sudoku puzzle
is one of the things that it can’t figure out,
whereas even humans can do that.
Yes, it’s like reasoning in general,
which humans are capable of,
but which models like ChatGPT
are not as reliable at right now.
The reaction to that
in the current scientific community
is a bit divisive.
On one hand, people might believe
that with more scale,
the problems will all go away.
Then there’s the other camp
who tend to believe that, wait a minute,
there’s a fundamental limit to it,
and there should be better, different ways
of doing it that are much more efficient.
I tend to believe the latter.
Anything that requires symbolic reasoning
can be a little bit brittle.
Anything that requires
factual knowledge can be brittle.
It’s not a surprise when you actually look at
the simple equation that we optimize
for training these large language models
because, really, there’s no reason why
suddenly such capability should pop out.
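The “simple equation” referred to here is presumably the standard next-token prediction objective, in which the model is trained to minimize the cross-entropy of each token given the preceding context:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
$$

Nothing in that objective explicitly rewards symbolic reasoning or factual consistency, which is the point being made.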
I wonder if the future architecture may have
more of a self-understanding
of reusing knowledge in a much richer way
than just this forward-chaining
set of multiplications.
Yes, right now the transformers, like GPT-4,
can look at such a large amount of context.
They’re able to remember so many words
as they were spoken just now.
Whereas humans, you and I,
we both have a very small working memory.
The moment we hear
new sentences from each other,
we kind of forget exactly
what you said earlier,
but we remember the gist of it.
We have this amazing capability
of abstracting away instantaneously
and have such a small working memory,
whereas right now GPT-4
has enormous working memory,
so much bigger than ours.
But I think that’s actually the bottleneck,
in some sense,
hurting the way that it’s learning,
because it’s just relying on surface-level patterns,
as opposed to trying to abstract away
the true concepts underneath the text.
Subscribe to “Unconfuse Me” wherever you listen to podcasts.