In what we've done so far we've ranged over quite an area, because just the act
of using the T-diagrams has forced us to address lots of interesting and
profound questions, over the years, about how compilers actually work. So we've
looked at the fact that we think of our programs being written in a high-level
language and the brain goes blurry. We neglect to think, all the time, that
although they were written in C they don't execute directly. You have to
compile the C into binary. So, really, your beautiful program runs on a suitable
architecture in a suitable binary. It has an input, it has an output. Then we
started looking at more advanced things about ... saying: "Well if that's how a C
compiler works why not write another C compiler, in C, and then compile it with
the old compiler?" And I think we just about emerged from that with brain intact.
If you want to go back and revise some of this stuff before catching up
with what we're going to do today I would recommend the one [video] which we'll put a
link out to: 'Self-compiling Compilers'. There's another one we put out
also, on the UNCOL problem which was this business about: "Is there a common
universal intermediate code?" There's more of a consensus now than there used to be.
And the LLVM system is a good example. It won the ACM Software System Award - not quite
for 'best intermediate code ever' - but you know ... The reason, I think, that it became
less of a problem - not totally solvable but less of a problem - thinking
about this the other night, is that actually, over the years, certain things
about computer architectures have converged and moved together to
consensus. And principal among them, it seems to me, is the idea that your unit of discourse
isn't just the bit, it's the byte. And the idea of having, y'know, ASCII
characters fit into an 8-bit byte. The idea of being able to 'glue two bytes together' to
make a 16-bit entity, or four to make a 32-bit entity.
That has become more and more prevalent. The reason why, in some of the
other videos, I've said: "Oh! you know, there were such massive differences between
machines" - well, there were! In the 1970s. I can quote [this] to you as a fact, because there I was
at the University of Nottingham working on a machine with 24-bit words. Not byte
addressed at all. What did you put inside your 24-bit words if you were mad keen on
characters? Four 6-bit characters. How did you dig them out from that word?
With great difficulty(!) The word-addressed machine would give you the whole word.
It was your responsibility with bit-shift operations and a garden spade (more or
less!) to dig out four 6-bit characters. Non-standard, not 8-bit characters.
And everybody said: "Ah well, it's all right for IBM but it's so much more *expensive* building
byte-addressed machines!" But in the end it prevailed. And I do believe it's things
like that - the fundamental way you address memory, and being able to, sort of,
put units together, preferably in powers of two not just multiples of two - that drove the convergence.
So, right at the beginning I had a 24-bit word; Atlas had a 48-bit word;
DEC 10s, I think, had a 36-bit word. And Seymour Cray, CDC, had a 60 bit word. All of those
are multiples of 2, but I don't think any of them, that I've just quoted, are
powers of 2. And it really mattered, it turned out, to have a power-of-two basic
unit - bigger than a bit - the 8-bit byte. So, anyway, I used that to introduce the
idea of intermediate codes but we're now I think - in tying things up - in a
situation to revisit that and say: "Intermediate codes really are useful" -
and they come into their own, if you like, with the idea of wanting to port a compiler from
one architecture, perhaps to a very very different architecture. What I call the
'semantic gap' between your high-level language, and your program eventually
that runs as binary, is HUGE! It's not so huge in C - you can
feel the assembler and the binary poking up through the C. But you start
trying to do a Haskell interpreter, or compiler, and you'll soon discover that
this thing running down here is - so to speak - miles away from the abstract stuff
you were writing up at the top. So, everybody started saying don't we really
need intermediate codes to help us bridge the gap?
Hence Z-code, and bytecode for Java. All these kinds of things were discussed
more and more. And I think I, at one stage, said: "Don't imagine it's always
fairly close to the hardware". You could end up in a situation where C is your
intermediate code. Bjarne Stroustrup got a C++ compiler started by - in
its early days - just making it produce C, which you know you can cope with.
What I want to have a look at, today, is this whole business of: "How do intermediate
codes help you port a compiler from one architecture to another?" And you've got
to remember that, in the worst case, those machines and architectures could be very
very different indeed. I did an example, I think, of running a C compiler on a PDP-11
in PDP-11 binary, whose cross-compilation effect was to spit out
Z80 binary, which is very very different. How can you cope, then, with
cross-compilation? And how does cross-compilation lead you on to being
able to think about porting your compiler? Not just producing code for a
foreign machine but, in a way, to mount an invasion of the foreign machine
and to say: "I'm not just going to push boatloads of code over; I'm going to set up
a bridgehead. We're going to land, and I'm going to set up a lot of my software tools
on the far machine - not just fling raw binary at it."
So let's start to discover a bit more about this then. But in order to get into
the details I have, at great personal mental expense, made myself something
like 40 or 50 T-diagram blanks and let us hope that those prove sufficient
for the task. I just wanted to talk you through this, first of all, as being the
basis for what we're going to discuss and then I'll put it to one side. But
I'll bring it back if I need to refer to it again. We are getting in to a land of
intermediate codes. I've glued four things to the page. What are they?
Well, these two, at the top left, are what I will call Source Texts for your cross compilation /
compiler porting efforts. What you're saying is: from now on we
don't compile directly from H, the high-level language, to produce binary, B,
in the code generator. We do it in two steps. We have an H compiler written in H
producing I, the intermediate code, which I see is not on this list. Let's add it:
"I = Intermediate Code". Now, on the left at the top are the source texts for
doing this. But, if you consult the previous things we have done on
compilation you will understand that it's not directly those that we can use
because we can't directly execute H. We have to put these through an H compiler.
And we end up with the binary executable versions of them. Now, a little bit of
extra notation here. If you see at the bottom, for this executable B', I
use a single dash (prime) to mean: "The computer I currently possess, my old computer, the
one I do all my work on". If you see things like B'' that means the
new machine that I'm trying to port things to. So, we'll eventually get to
that stage of getting stuff across and you'll see B'' is appearing.
Inevitably, if you go this route, your compilation down to binary is now a
two-stage process. It may be hidden from you but it has
to be there. The other half, you see, is [that] if you're only going halfway, to intermediate
code, the other half of the journey is to go from intermediate code down to
runnable binary for the whole thing. There's your intermediate code
interpreter, or compiler, written in B', running on B', producing B'. So, I've
numbered these 1, 2, 3 and 4 and eventually I'll come back and say:
"We've now created a new number 3", or something like that. One more bit of
notation: whereas in previous episodes, when I talked about recoding a C compiler
to make better binary come out, I did that by calling it 'subscript
B for better'; I've decided now to use an asterisk. If you see an asterisk
somewhere it means it's a 'better' version of what went before. And I is
'intermediate code'. So, let's put that to one side to be dragged back as and when
we need it. When you write a program you have in mind a certain input, it executes
and it produces a certain form of output. And you're very happy. And it all works
beautifully. Rather than writing 'C', down here, as I have been doing all along, I'll
try and generalize it a bit, say to some high-level language (H) that you're
confident with and have used for ages. So this, if
you like, is 'Userprog'; here you are. But all the while you know that H can't execute directly, so you
rely on the fact that, bootstrapped up over several generations, we just happen
to have an H compiler that runs in B' and produces B'. By slotting that into
there [uses T-diag template] - you remember, there's an implicit arrow there. The H feeds into that one there
and the transformation done in this compiler is to take the H and convert
into B', with a compiler that is now an executable binary itself. So, the net
result of all that, which we'll show up here [draws arrows] - what that makes, once
you've compiled it, is of course something that has your treasured input and output
but is now in B', running on B'. So, that is the binary executable
version of your program. What happens if your input and output was H itself?
Can you write a compiler in itself? Of course you can! It's what all this is about. You want
to produce an H compiler. We'll start off by writing an H compiler in H. So we'll
put these two [T-shapes] back to one side now. What happens if I were to say: "Well, I've
written an H compiler in H and it produces B'*"? What I'm saying is:
"I am fed up with plain old B' because it's slow and it's inefficient.
And it was wonderful when I first did it and actually this thing up here has been
bootstrapped up through being written in assembler and heaven knows what".
See previous episodes if that sentence doesn't make any sense to you. But now we
have got a situation [where] you want to improve the quality of your binary so here's a
bit of revision. What you do is you say: "OK, I'll write a better H compiler - better
in the sense of 'better-quality binary'". I feed it to the old compiler that we've
got working already, which takes in H, runs on B', produces B'.
And what does that end you up with? Answer: an H compiler producing better binary. It's running on
old binary that's not super quality, but it's OK - it doesn't crash. It's a bit slow.
Just to end this little exercise off, before we get on to genuinely porting compilers:
You've got that [points at T-diagram] and you naturally then say: "Well why not feed it
to itself again?" And if you get another version of that, feed one into the other,
you can end up with H written in better binary, producing better binary. And if
you look up [the video] "Self-compiling Compilers" that's exactly what we do. So it's all
very well doing this simple-minded stuff. We take one great flying leap in our [previous]
compilers and require the code generator to be able to get very low-level, very
specific, and yet be very, very tight and wonderful, for all sorts
of different binaries for different machines: B', B'' etc.
That's hard. So we've decided that intermediate codes might be the answer.
So how do we cope then, using intermediate codes just with this
business of improving your code? How would you do it? Well it has to be a
two-stage process, no question about that. It has to be. So, this time I will do an H
compiler, written in H. And this time I am going to make it produce better
intermediate code than ever before. So you see ... think of it this way. When you
upgrade a compiler you've now got two halves to upgrade. Do you want to upgrade
the H-to-intermediate-code piece, the so-called front end? Or do you want to
upgrade the intermediate code interpreter, going down to binary, the
back-end? It's a mix and match. You can do either, or both, or in whatever
order you like. But you've got to remember it is a two-stage process.
Fortunately for me I have, already working, an old compiler for H, running on
B' - original machine binary - and producing
ordinary intermediate code - not 'super' (I*) intermediate code. I've now written a better
version of it myself, and I'm upgrading this front end. So we've got H written in
H producing I* - better-quality intermediate code, in some sense, than went before.
But the old version of the compiler I've been using for months now
just has H running on B' producing ordinary intermediate code - not optimized.
But it's good enough for compiling this one. The only thing that's different from
what we've got before is that we don't directly end up producing binary as the output.
We produce intermediate code as the output. So this first stage, then, will
get me the following: the H feeds into [H running on B', producing I], which means that
this thing takes in H, produces I*, and
is running on I. OK, fine. So we've kind of 'recompiled the compiler'
but we haven't gone far enough yet. And this is the ball and chain
around your ankle when you go for intermediate codes: you've got a
better thing, but it is reliant on running - on being able to cope with - the fact
that there's an intermediate code stage. OK, that doesn't put us off. What I've said,
all along, is that these are my key source texts, these are my executables,
and what I'm pointing out now, is, if you use intermediate codes you don't just
need a source-end translator, you need a back-end translator. You need a thing
that says, you know: "My system produces intermediate code but I am the back
end that takes in intermediate code and produces real binary that really will run."
So, I want one of those now [points at T-shape (3)] to put into my diagram. This is where we're at so far.
Let me gobble up yet another T-diagram
and say: if I put in here what everybody ought to have available to them,
which is an I-to-B' translator written in B', I can now slot that in
there, like that. And just look what happens. You've got your first-stage
compilation. It's doing wonderful things but it's executing intermediate code.
And maybe I have a test interpreter that sort of shows you: "Well, that's kind of working."
But, in the end, you want faster, more efficient, code. So you decide you
will compile your intermediate code into proper binary. And this kind of choice
between "Do I interpret it slowly, to see if it's working?" and "Do I, in the end,
commit to compiling it?" is the sort of thing available in stuff like Java bytecode and so on.
Start off by interpreting - feels a bit slow but it's looking good,
let's compile it. Now watch what happens as you trace through here. I goes in here,
this is the intermediate code compiler, if you like, now producing genuine binary.
That shoots through there and what you end up with
is H producing I*, i.e. better-quality intermediate code, running on B'.
Right, well we're getting close. There's one final thing, though, that you need to do.
What we've got ourselves now is a situation where we've got a compiler
that takes in statements in H. It is running on B' binary, ordinary
unimproved binary - but it does produce better intermediate code. We're now going
to pull the same trick that we've done before in Self-compiling Compilers.
The only difference here, though, is we've got this I to cope with this time. But the
principles of what we're doing are just the same. If I bring this one [T shape] down to
stop my having to write out another one, we've now made a binary out of that, feed
it to itself. Originally I started with H-to-I*, written in H. I compiled it all the way
through to intermediate code, as best I could. It's good quality intermediate
code. Now take that thing, your executable, feed the original thing to itself and
look what happens! The H goes in there and produces I*. So, you end up with H [producing]
I* [written in] I*. Final stage coming up, I promise. So that, if you like, when you're
through its machinations produces that. One final stage - and I promise the
torture will end - right! Now you might say: "There - no big deal. I can write an
intermediate code interpreter and check out rough-and-ready. It'll be slow but [we can]
check out whether everything works". But then somebody in the end says: "No - come on
let's compile it - it might go a lot faster." So if you remember item number 4
in our original toolkit was a thing that takes I and turns it into binary. So I'm
going to put one of those in place now for that all the way through. What will that
produce you? Don't forget we're starting H producing good quality intermediate
code so that is what it produces but what about here? If you trace through,
better quality intermediate code should hopefully give you better quality binary
when you compile it. So I can put down here that this is basically - if there
aren't too many suffixes and superscripts there - producing you an
H-to-I* compiler, but running on better-quality binary. So, one way or another,
this is the new 3 because, referring back to our original map, what I've said is:
how can I improve that? And I have managed to improve it. I've got
better-quality intermediate code and better-quality binary. It is improved, but it has
been a fairly messy two-stage process that's hidden from you. But the idea of
that is to give you the notion that there is a penalty to pay:
you have to have a two-stage process. So we've expended a lot of brain cells, I
think between us now discovering how we can improve our compiler, which we did do
before in the 'Self-compiling Compilers' episode, but this time it's different.
How can you improve your compiler when there's an intermediate code involved?
And we've done it! And we've seen exactly how we can do it. We feel weary - or at
least I do - but that is it. And you might say: "Why bother with intermediate codes? It's just
producing more stages that we have to go through." Well, the answer, I think, is
that it does make life easier for you if you say: "OK, instead of improving
ourselves within an intermediate-code context, how about we say: 'I don't
want better binary out of it - I want different binary, for another machine'!"
So, we will find that the diagrams I've just drawn - with some subtle adaptation,
and me getting all tangled up, probably - can be reused for producing binary for
a completely new architecture, which we will be calling B''.
What we're doing in this next one is just changing the
rules slightly. Instead of B'* - improved binary - we're moving from B' to B''.