自下而上的解析 - Computerphile (Parsing Bottom Up - Computerphile)

字幕列表影片播放

Last time you [Sean] did a really good animation about top-down parsing. We had
these sentences in this totally artificial grammar, which has only got
about 30 words in it, all in all. But we started here with is
and, in the top-down approach, you basically say: "OK, I'll push those on
the stack in reverse order: " And, starting at the top, of the
stack, at the left, you say: "Well, what can a be? So, you do these things on
your stack, top down, one at a time, and you start off at the root of the tree
and you develop the component parts, left to right, one at a time. It is a lot
easier, totally by hand, to write a top-down parser rather than a bottom-up one.
But what I shall be getting to, later on and in fact I'm starting right now, is to
say: "Well what happens if instead of starting up at the root and developing
the leaves of the tree, as it were, you start right down at the bottom, with the
text string that you know is correct, and try and work upwards you know can you
work back upwards, from the leaves to the root?"
So, in fact let's write that test sentence below here: the robot stroked two furry ..."
And actually, since layout doesn't matter too much at the moment I'll squeeze the
word 'dice' in at the right there. Now, be clear, we're talking about
bottom-up parsing now. In bottom-up parsing you start with the string that
you think is correct and you say: "Starting with a string can I look into the
rules and see how to work up the tree, not down the tree?" And, yes, so therefore
you're looking at possible matches on the right-hand side [of the rules] for components of
this string, reading from left to right. OK 'the'. How many ways are there you
can match the string 'the' against one of these classifications here? Well the
text - the string 'the' - is the right-hand side possibility of an icle. That's
one way to do it. Oh! and then look up here, right at the top of the grammar, if
you have an icle followed by a , like 'the' followed by something else,
that could be a ect, and that's looking good because right up at the top of the
grammar we want to end up with ect . Looking just at "the robot",
and looking at the grammar right-hand sides, I could do it by saying: "Well it's
the subject of the sentence, it's at the left-hand side, and if I go icle
I get 'the' and I get 'robot'. But what I've done, just to act as a talking point
and it illustrates a lot of things here, I've given you a shortcut. If you want to,
you can just do "the robot", with no further interior analysis at all. It's an
allowed phrase; it's the ect. Now I have to say that, as we develop this
story, we will get into bottom-up parsing. Because one of the tools were going to
use called 'yacc' basically produces bottom-up parsers for you, not top-down ones.
And it's a yacc behaviour symptom that it
loves, when you're trying to match text strings,
it likes to match the longest one that it can [legally] manage. So it is going to seize on
"the robot", all as one phrase, as being a wonderful long solution. Why doesn't it
use 'the' and then wait patiently for ? That's not the bottom up way.
If you can see a longer ... Oh! and this thing by the way, that you're looking at, is
built upon a stack, of course, it's called the 'handle'. I get a longer handle by
going for this option here and getting "the robot", all in one go. OK "the robot"
then, and you've got to get used to reading from right to left now,
in bottom-up parsing, "the robot" all as one phrase is an example of a ect. So, we
can now say: "OK, 'the robot' is my ect". Now that act - of picking up a substring
from your sentence and going upwards, and making it more abstract if you like - it's
called 'reducing', in bottom-up parsing. So, looking for a longer and longer and
longer string, to get your handle, that's called 'shifting', because you're shifting
characters, one after another, and making the string longer and longer and saying
"... can I go any further?" That's shifting but when you say: "Ooh! that's a nice long
string - and it matches" and then you go up and say: "Oh! that's my subject", that is
called 'reduction' because you're going to something simpler further up the tree.
So, you can tick that off as being done bottom-up. [The] next thing is, you see this
string of characters called 'stroked' and, once again, it's right-hand-side driven.
What is there, on the right hand side, and which rule is it, that could possibly
match 'stroked'? You see in here, against , 'bit', 'kicked' or 'stroked'. Those three
strings are your possibilities. So, that's fine. Going right to left you say 'stroked'
is an example of a . So we've got our there. Now,again,
I've cheated but it's wonderful fun! I have not analyzed "two furry dice" into
adjectives and nouns or anything like that. I've just put it in as a interesting
short-cut to have there. And it is an example of what I would call an object-phrase
Some of you, who are really good English linguists, may want to go on about
my lack of understanding about what a direct and indirect object are - not to
mention 'predicates' and so on. But please forgive me. I regard it as being a phrase
in an object position. So, I'm saying there's a quick match here and bottom-up
I love this: "two furry dice" is a great long handle. Oh! and if I match it
there, what's the left-hand side it corresponds
to? [Answer is] . OK then, we've won! We have worked bottom-up to having ect ect
on our stack starting with the string. And, what's more, we've exhausted the
[input] string now. It's the end of it. There's a sort of full stop after that.
There we are then. We've got top-down which tends to be
more - how shall we say? Eager?- you know a top-down parse would very probably leap on the
word 'the', and not bother to go any further because it's found a quick match for it,
whereas bottom-up is the other way round. It's basically saying: "I want the longest
possible handle". Even at this stage in the late 50s and early 60s there
was a sneaking suspicion, coming around, that actually bottom-up parsing was a
little bit more powerful than top-down. I'm going to put out a set of notes for
this so that you can look up for yourself. Just examples of why it [i.e. bottom up] is more
powerful. But roughly speaking I think you can sense that because you've not
only got something you're looking for but you've [also] got a handle that you've
already accumulated, it's like gathering more contextual information - going
bottom-up. But, on the other hand, handling the stack and working out what's
happening is a darn sight more complicated - if you do it by hand - coming bottom up.
Rather than doing it all by hand, why not me and you lot [together]. It's a good way
to learn 'lex' and 'yacc' In other words don't write the C directly yourself. Get
a software tool, like these two, to do it for you. So, that's exactly what
we're going to do. I've got the program 'putty' that does 'ssh' connected here.
I'm talking to my other Linux machine in the other room, where I have got set up a
parser - complete parser: front-end lexical analyzer [then] syntax analysis - for this
'furry grammar' and all legal sentences in it. And I know, first of all, you will want me
to call up the program that implements this and the test sentence first of all is:
[In unison] "the robot stroked two furry dice". Here we go!
So, "furry". It's hanging there, it's waiting for us to give a correct furry sentence [reads from screen] "... dice ... return"
Look at that! It's happy with it! I'm giving it in subject-verb-object
order and I have numbered those rules, in the grammar as I showed you earlier, and
I now have, as it were, a map of how it has parsed it. Rule 3?
Now that is the one that effectively says I can do "the robot" all as one
phrase. It has chosen not to go for 'the' and 'robot' as icle and ,
as separate entities. It might well have done that, had I gone top-down, but because this yacc-confected
parser system goes bottom-up it's gone for the longest possible
handle at that stage and it's matched it. Rule 4: the middle piece is matched
'stroked' as the verb and finally it has spotted, right at the very end, that I put
in another sneaky short-cut to "two furry dice" is Rule 6. And that is my parse.
So, should we try one more just to make sure? Go on Sean, tell me which one to try next.
>> Sean: Try "the woman bit the dog" >> DFB: Yep, "the woman bit the dog"
There you are look - Rule 2 for "the woman" now. Rule 2 not rule 3. if it has followed Rule 2 it's
gone down the icle route, which means it knows that's the only way to
match "the woman". There is no shortcut way. OK?
Rule 4 - a rule again; it chose 'bit' Rule 5: now this time, again, there is no
short-cut to "two furry dice" at Rule 6 You've got to go the long way around and
following Rule 5, you break it down into icle again: 'the' and 'dog'
So, there we are. You could say well you've written a compiler for the
'furry' language, with the help of lex and yacc. We could go into details of that
later, if need be, but not now. It's fair enough but it's not doing anything really is it?
What more shall we do with this, now we've written it this far then Sean?
You tell me ?! >> Sean: Well I think next time we need to come up with an action, it needs to do something ....
>> DFB: We need to transform that grammar in some way.
Those of you who, in the previous video, actually bothered to look at the EXTRA BITS
may have had a sneak preview as to what we're going to do, as our much more
interesting actions now we have recognized the innate structure.
So, remember - always watch the EXTRA BITS !