功能解析 - Computerphile (Functional Parsing - Computerphile)

字幕列表影片播放

Today we're going to do some more live coding, and we're going to talk
about something which is quite close to my heart, because I wrote some of the
early papers on it many years ago, and that's something called functional
parsing, or combinator parsing. Before we get started with actual live coding, I
want to begin with the question. And the question is: what actually is a parser?
So for me a parser is a program which takes a string of characters as its input, and
as its output produces some form of tree. And the idea is that the tree makes
explicit the structure in the input string. So that's a bit of a mouthful, so
let's have a simple example to explain what's going on with this. So what we've
got here is we've got a string of five characters -- 2 plus 3 times 4 -- and
that's our input string. But when we look at this we know that that's not just a
random sequence of five characters, it's actually got some structure to it. So in
particular, we've got three numbers here -- 2, 3 and 4 -- and we've got two
arithmetic operators -- we've got the plus and the times -- and one of the things we
learn at school is that the times happens before the plus. So you really do 3 times 4
here, and then you add the 2 on at the end. So that's our input string.
So what a parser does, is it takes a string of characters -- like that -- and it
tries to recognize the structure in the form of a tree. And the tree we would get
from a parser is something like this here. And you see we've got some leaves
in the tree -- the leaves are the numbers 2, 3 and 4 -- and the nodes in the
tree are the two arithmetic operators -- that's the plus symbol and the times
symbol -- and you can see the structure of the tree reflects the fact that we do
the multiplication. First we multiply 3 by 4, and then we do the
addition second. So that's the basic idea of what a parser is: it takes as input a
string of characters, and as output it produces a tree. So it's really a
function -- it's taking as input a string and as output as a tree. So we've seen
what a parser is. It's a function that takes a string as an input and produces
a tree as an output. So now we want to think how can we actually implement this:
how can we implement the idea of a parser? And I'm going to do this in Haskell
today, but it doesn't matter if you don't know anything about Haskell because
I'll be explaining everything as I go along. And actually, nothing I'm going
to show you today is specific to a language like Haskell.
You can do it in any general-purpose programming language. So if you do a web
search for a 'combinator parsing' or 'functional parsing', and then whichever
language you're interested in, you'll find the same kind of stuff which I'm
going to show you today. So it's not specific to the Haskell setting. So
what we're going to do then first, is we're going to define precisely, in our
programming language, what it means to be a parser. We're going to define a new
type called Parser, and it's simply going to be a function which takes a string as
an input and produces a tree as an output. So it captures the very simple
idea of what we are wanting to do -- a parser is a function that takes a string
as an input, and gives the tree as an output, and the arrow here just means
that we have a function from one thing into the other. So this captures the
basic idea of what a parser is. But unfortunately, it's not sufficient to
program with. We need to refine this a little bit to actually write programs
with this. So the first little refinement we're going to make is that a parser
might not consume all of its input. So for example, if we're trying to parse a
number, like 2, we maybe we find the 2, and then maybe we've got some more stuff
left in the input string that we need to parse. And if we want to chain parsers
together, we're going to need to have access to the remaining input string
that we didn't manage to consume. So the first little refinement that we'll make
is rather than just returning a single tree, we're actually going to return a
pair now. We're going to return two things -- we're going to return a tree as
before, and we're also going to return the unconsumed part of the input string.
So that's the first refinement we've made. The second refinement we're going
to make is that a parser may not always succeed. We may be trying to parse a
number, and we don't find the number, we find something else. So we need to have a
way of representing that a parser can fail. So the way we're going to do that, is
we're actually going to make a parser return a list of results, rather than a
single result. So lists in Haskell are denoted using square brackets. So this
simply means that rather than returning one pair of results, we could return zero,
or one, or two, or as many as we like. And the idea is going to be if our parser can't
parse, so it doesn't succeed, we'll return an empty list of results. And if it does
succeed, we'll return a list with one pair -- we'll return a tree,
that represents the structure of the input string, and we'll return the
unconsumed part of the input. But because we're working with lists here,
we could actually be more flexible. We could return two, three, or four, or five,
or as many as we like parses. And this is actually quite a good flexibility,
because for some languages the input string may be ambiguous -- maybe we're
trying to parse English, and English sentences don't always have one parse,
they can be interpreted in many ways. So this type here is giving us a
flexibility to return many results if we wanted. We're not actually going to use
that flexibility today, but it's nice to actually have it. So I haven't told you
what the tree data type is here. And that's because I'm actually going to get
rid of that now. Sometimes you may want to return a number, or a program, or some
kind of other structure. So we're going to replace that specific type of trees
by some arbitrary type 'a'. And I'll make this a parameter of my type. This is our
final type, which we're going to work with today. What we're saying is that a
parser whose results have type 'a' is simply a function which takes a string
as an input, and then it gives a list of results. And each result is a pair
comprising a single value of type 'a' -- maybe a tree, maybe a number, maybe
something else -- and then an unconsumed part of the input string. Okay, so this is
our final type. And if you look in any of the articles, or books, about these kind
of parser combinators, or functional parsers, you'll find a type very
similar to this there. It's quite a mouthful again, so let's think about how
we could understand this in a simpler way. And actually, we can write a little
rhyme to understand what's going on with this type. So let me write the rhyme out
for you as a comment. So what we can say is: a parser for things, is a function
from strings, to lists of pairs, of things and strings. This is a little Dr. Seuss
rhyme to tell you what a parser is, or what a functional parser is. And that's
actually how I remember this type. So that's our basic type now. We've seen so
far, a parser is basically a function from strings to trees, but in order to
actually program with these kind of things, we need to refine the type a
little bit. And this is the type we'll be working with today. But you don't need to
worry about the details of it. Just basically think it's a function from
strings into trees, or some other kind of structure. What we're going to do now,
is we're going to load up the parsing library. So I'll start up
the compiler. This is a library, which contains a whole bunch of parsing stuff,
which allows us to program with parsers of this form. And this is a parsing library I
wrote myself. I'll see if I can get Sean to upload it as part of the video.
A parsing library comes with any programming language you can think off.
And again, if you just search for parser combinators, functional parsing,
whichever language you like, you'll find the library, which gives you all sorts of
basic ways of building parsers. And that's what I'm going to show you now.
All of these libraries work in the same way. So you have some basic primitives, or
basic building blocks, for parsers. And then you have a way of combining parsers
to build bigger parsers. So it's like a kind of Lego kit, or a construction kit.
You have some basic bricks that you can do things with. And then you can put
those bricks, or components, or primitives, together in all sorts of different ways.
So I'm going to show you a few of the primitives, and a few of the combining
forms. And then we'll do an example. The first primitive I want to show you is
very simple. It's just a way of parsing a single digit. So what the digit parser
does, is it takes a string of characters, and it tries to consume a single numeric
digit off the start of that string. And you might think, well, what does 'parse' do
here? What parse does, is it takes a parser, which in this case is just digit,
and it takes an input string to that parser, and it just applies one to the
other. So of course, this parser here is going to succeed, because we do have a
digit at the start of the input string. So we get exactly the expected result.
We get a list with one pair. And the first thing in the pair is the actual digit. We
get the character '1'. And the second thing in the pair is the unconsumed part of
the input. And that's something we could then try and parse subsequently with
another parser. We can test, does this thing fail properly? So, if I give it an
input string that doesn't have a single digit at the start, then it's going to
fail. So if I give it the input string "abc", there's no digit at the beginning, so
we're just going to get the empty list of results. So I'll show you one more
quick primitive. If I parse a single character, say an 'a', from that string, then
that will do the right thing. If I parse single character 'a', and I didn't have an
'a' at the beginning, then it will fail. So we've seen two basic parsing
primitives here -- we've seen a way of parsing a digit,
a way of parsing a specific character, and we've seen that these things can
succeed or fail. And in the library, or in any of these libraries, there'll be a
bunch of these basic building blocks, or basic bricks, or primitives, that you can
use to build up your parsers. Where things get more interesting is when you
think how do you combine these kind of things, how do you use these basic bricks
to build actually useful parsers? Let me show you an example of this. So there's a
parsing combining form called 'some'. And what it does is it takes a parser as its
input, and it tries to apply it one or more times, as many times as possible. So
if we're trying to parse 'some' digits, what we're trying to do is consume one
digit, then two, then three, and as many as we can until we don't find any more
digits. So if we apply the 'some digit' parser to the string "123" then
it will do the expected thing. It will consume all three of the digits, and then
we'll get the empty string left here. So that's 'some', it gives us a form of
repetition. And we also have a very simple way of making a choice as well.
So if I want to make a choice here, between a digit, and a letter. And let me
parse the string "abc123". So, what we've got here, is this funny symbol
here -- with the three symbols -- that's a choice operator. It says do that, or do
that. So if I try to parse a digit, or a letter, what it's going to try first is
it will take the first character in the input string, and say is it a digit? If so,
I'll parse it. And if it's not a digit, then I'll go over to the other side, and
say is it a letter? And I will try and parse that. You can see what's happening
with this particular example here -- if I look at the first character, it's not a
digit, it's a letter. So when I apply the digit parser it would fail, and then the
'or' operator, or the choice operator here, will go over to the other side and say,
well, is it a letter instead? And of course it is, so we can parse the single
'a' off the front here, and then we get everything else as unconsumed.
And of course, if we wanted to be a bit more clever we could combine some of
these things. So I could say, some digit or letter -- get my brackets right --
"abc123", and then that will parse everything. Because all I'm doing here is
I'm repeating, or iterating, the choice between either parsing a single digit, or
a single letter. And I've got a string of digits and letters here,
so I can parse the whole thing. So I've consumed them all. And then I get nothing
left at the end. So what we've seen so far, is some basic building blocks, and
we've seen a couple of combining forms -- we've seen a way of doing
repetition, which is 'some', and we've seen a way of making a choice, which is the
funny operator in the middle there. What I haven't shown you so far, is how to do
some form of sequencing. And this is the most common thing you typically want to
do with parsers. You want to say do this, and then do that, and maybe do that as
well. You want to sequence things together. So I'll actually show you a bit
of the parsing library here. So here's the parsing library. And I don't want to
go through all the details of this, but one thing I want to know, is it's quite
short. If I kind of scroll down here, I think it's about four and a half screen
fulls. And I've got quite a big font here, and this is actually already quite a
sophisticated parsing library. So it shows you the power of this method, that you
don't need hundreds of lines of code to write parsers -- four and a half screen
fulls is a library which is fully fledged and you can
basically implement any parser that you like using this. So I'll
show you a couple of examples of sequencing. The first example I want to
show you is a parser for natural numbers. So what's a natural number? It's just a
non-negative integer, like 0, 1, 2, 3, or 10, or something like that. So you think how
do you parse a natural number? Well I'm going to use the sequencing notation for
parsers, which is to do notation. And the do notation is very simple -- you write
the word 'do', and then you have a whole bunch of parsers one after the other, and
it just runs them each in sequence. So the first thing here, is we're going to
parse 'some digits'. Because that's the basics of what a natural number is -- it's
just some digits. And if that succeeds, I'm going to call all those digits 'xs'.
So 'xs' is just going to be a list of all the digits. And then what I'm going to do
here to be a bit more flexible, probably when I parse a number I don't want a
string of characters back, I actually want the number. So I'm going to pass the
string in to a little function called 'read', which just converts the string into
a number. And then I'm going to simply return it. So the basic idea here is
we're sequencing two things together -- we're reading some digits, or parsing
some digits, and then we're translating those into a normal number, and then
we're returning it. And we're sequencing those things together using the 'do'
notation here. So just one more little example, because we'll use this in a
minute. Here's how you could parse an integer. So an integer is either a
negative number, or a positive number. So there's a
choice there. So we're going to use the choice operator. So here's the 'or'
operator, which we've seen a few minutes ago. And the two parts here just say -- have
we got a negative number, or have we got a positive number? So the parser for a
negative number, we use the 'do' notation, because we need to do three things.
So the three things are here. So if we're trying to parse a negative number, the
first thing we do is we parse a minus sign. So we're using the 'char' primitive
that we saw previously -- that will parse a minus sign. Then we're going to parse a
number, and call it 'n'. And then we need to remember that we need to make it
negative, so we negate it and then we return it. So again here, we're just using
the simple idea of sequencing three parsers, one after the other. And then the
'or' here says, or we can just have a simple natural number. Okay so this
illustrates the idea of sequencing. And if you've seen some of my previous
videos, you may recognize the 'do' notation here. And this is because parsers form an
example of what's known as a 'monad'. And in fact, for me, parsers being monadic is
one of the key ways to understand what a monad is. So if you've seen the monad
video, or even if you haven't seen it, maybe you have a look back at that,
and if you find it interesting. maybe look up some of the work which people
have done on monadic parsing. And it's a really good way to get a very good feel
for what's going on with both these kind of parsers, and monads as well. We've seen
that parsers are basically functions -- they take a string as an input, and they
produce essentially a tree as an output. We've seen some basic primitives for
consuming single digits, and single characters, and things like that. And
we've seen some basic combining forms -- we can have repetition, with 'some', we can
have choice, and we can have sequencing. So we've got our basic kind of building
blocks for making larger parsers. So let's wrap this up now by doing a little
example. And the example I want to do is to build a really simple parser for the
kind of expressions, or arithmetic expressions, that we saw back at the
start. So things like two plus three times four. So what I'm doing here is I'm
writing a Haskell program, which is going to implement this parser. What I've got
in the first line here is simply importing the parsing library, which
we've just seen -- it's just four and a half pages of code -- it's very
straightforward. And what we've got here in the comments, is a simple way of
writing down what the syntax, or form, or structure, of expressions are.
And this is what's known as a 'grammar'. But it doesn't matter if you
don't know what a grammar is, because the basic idea is very simple here. The first
line says an expression can be one of two things, so this means 'or' here, in
grammars. So an expression can either be a term plus an expression, or it can be a
term. And then in turn, a term could either be a factor times a term, or a
factor. And finally, a factor can be a bracketed expression, or an integer. So
there's three simple rules here, which explain the form, or structure, of what a
simple arithmetic expression can be. There's actually quite a lot of things going on
here, but the only key thing I want to point out is we've got three rules,
because there's three different levels of priority in an expression. So the
highest level of priority is brackets. So that's one thing you learn in school, you
learn that you do the brackets first, that's the highest priority thing.
The middle level of priority here is multiplication, so that's thing in the
middle rule. And the lowest level of priority is you do addition, and that's
sitting at the top rule. And again, this priority order is something you learn at
school -- you do brackets first, then you do multiplication second, and then you do
things like addition last. And these three rules are just making that precise.
And again, if you want to know more about grammars, you can search on that,
and you'll find a lot of information about that online.
So what we want to do now, is take this little grammar, and implement it as an
actual parser. And this is very straightforward to do, because we're
using this functional parsing idea. Essentially, we just take those three
grammatical rules, and we just implement them using the combining forms, and the
primitives that we've seen. So it's a very straightforward translation.
So the first one, is we want to say an expression can either be a term plus an
expression, or a term. So let's do the first part of that. So we've got term
plus expression. So what we're going to do is parse a term, and if that succeeds
call it 'x'. Then we're going to parse a '+' character, then we're going to parse
an expression and call it 'y', and then we're going to return x plus y. So there's
four things going on in sequence here. We're first of all parsing a term.
Then we're parsing the '+' character. Then we're parsing an expression, and
we're getting the values x and y -- these will be numbers. And then we're going to
simply add those two numbers together. So you can see here we're actually doing
more than just parsing -- we're actually evaluating the expression as well. And
that's one of the advantages of this approach to parsing, that it's not just
about building a tree, you can actually process things as you're going along. And
we're actually processing them here by doing complete evaluation. So x and y
will be numbers which, result from this term and this expression, and then we just
add them together here, and return the result. Then the last part of parsing an
expression is we can either be a term plus an expression, or we could be a term.
So we just use the choice operator, and we get term. And these five lines here
are our full parser for expressions. So we had this one rule up here -- an
expression could be a term plus an expression, or a term, and we just
translate it directly into our parsing notation. And again the key observation
here is that the grammatical rule here looks basically the same as the parser.
So let's look at the second rule, and in fact it's pretty much the same as the
first rule, except a few symbols are changed. So let me just copy it. And I can
just change it. So I can say a term can be a factor times a term, and then I can
do a multiply there, or I get a factor. Okay, so in just a few key presses, I've
managed to implement my parser for terms. And again the point to note here is that
the grammatical rule here looks basically exactly the same as the parser.
Okay, I've just got a few more symbols in here, because I'm actually writing a
program to do parsing, or actually evaluation, but it's the same basic
structure. And then just to wrap things up, I can implement what a factor is.
So a factor is either a bracketed expression, or it's an integer. So let me
write the parser for that. So a bracketed expression, I just parse a character, and
then I parse as an expression and call it 'x', and then I parse a closed bracket, and
then I return the 'x'. Or, I can parse an integer. And again, if you look at the
structure of the rule up here -- a factor is a bracketed expression or an integer.
I've got exactly the same thing down here -- here I'm parsing a bracketed
expression, and here I'm parsing an integer. And this is actually our entire
parser. We've got three lines up here, and we've just got kind of ten or fifteen
lines down here, and this is actually a complete parser, and evaluator,
for arithmetic expressions. And again, the beauty of this approach is
the parser looks basically the same as the grammar. So let's try and see if this
works, and hopefully I haven't made any mistakes. So let's load it into the
system. Okay that's great, no errors. And we can see that we've loaded two files
in now -- we've loaded the parsing file in, which is about four and a half pages
of definitions, and we've loaded the example program in now. So now we can check,
does our parser actually do what we want. So let's try out our parser with the
little example that we had at the start, 2 plus 3 times 4. So we're going
to parse an expression, and the expression is 2 plus 3 times 4.
And we press return, and we hope we get the right result, and we do. So remember
from school, you do the multiply first, you do the 3 times 4, so you get 12,
and then you add the 2 on at the end, and you get 14. So we've managed to
get the result 14 here, and there's no portion of the input string left. So
we've got a successful result -- we've got a list, and we've got one result value,
we've got 14, and we've managed to consume the whole thing. Or we could check,
does this actually work with more sophisticated examples? So let's try
putting some brackets in, and let's put brackets around the 2 plus 3, so we get 2
plus 3 times 4. So we hope then that we do the addition first, and we get 5, and
then times by 4 to get 20. Yes, and it works. We can try more sophisticated
examples. So let's do something like 2 plus 7 times 10 plus 8 times 20, and if
I've got the brackets right, yes then it works fine. We can also check what
happens if you give it something which doesn't parse. So suppose I do something
like, I parse expression, so 2 plus 3 times, and I forget to write the 4 at the
end. What's going to happen? Well, the parser will still manage to
succeed, because it will manage to parse the 2 plus 3, and we'll get the 5 out,
but it doesn't know what to do with this symbol sitting on its own. So you get
that back as an unconsumed part of the input. And again, we can try another
example. Suppose I forget to close the brackets, so I do something like 2 plus 3,
and I forget to close the brackets, then it won't know what to do with that at
all, and we'll just get the empty string. That's basically it -- this is the idea of
functional parsing, or combinator parsing. The idea is very simple --
parsers are basically functions, you define a library with some
basic building blocks, or primitives, some combining forms that let you put these
things together, and then you can end up writing parsers as we've seen that look
very similar to the grammars that you write to describe languages.