In what we've done so far we've ranged over quite an area, because just the act of using the T-diagrams has forced us to address lots of interesting and profound questions, over the years, about how compilers actually work. So we've looked at the fact that we think of our programs being written in a high-level language and the brain goes blurry. We neglect to think, all the time, that although they were written in C it doesn't execute directly. You have to compile the C into binary. So, really, your beautiful program runs on a suitable architecture in a suitable binary. It has an input, it has an output. Then we started looking at more advanced things, saying: "Well, if that's how a C compiler works, why not write another C compiler, in C, and then compile it with the old compiler?" And I think we just about emerged from that with brain intact. If you want to go back and revise some of this stuff before catching up with what we're going to do today, I would recommend the one [video] which we'll put a link out to: "Self-compiling Compilers". There's another one we put out also, on the UNCOL problem, which was this business about: "Is there a common universal intermediate code?" There's more of a consensus now than there used to be, and the LLVM system is a good example. It won the ACM Software System Award - not quite for 'best intermediate code ever' - but you know ... The reason, I think, that it became less of a problem - not totally solvable, but less of a problem - thinking about this the other night, is that actually, over the years, certain things about computer architectures have converged and moved together to a consensus. And principally, it seems to me, it's the idea that your unit of discourse isn't just the bit, it's the byte. The idea of having, y'know, 8-bit ASCII characters fit into a byte; the idea of being able to 'glue two bytes together' to make a 16-bit entity, or four to make a 32-bit entity - that has become more and more and more prevalent. The reason why, in some of the other videos, I've said: "Oh! you know, there were such massive differences between machines" - there were! In the 1970s. I can quote [this] to you as a fact, because there I was at the University of Nottingham working on a machine with 24-bit words. Not byte-addressed at all. What did you put inside your 24-bit words if you were mad keen on characters? Four 6-bit characters. How did you dig them out from that word? With great difficulty(!) The word-addressed machine would give you the whole word. It was your responsibility, with bit-shift operations and a garden spade (more or less!), to dig out four 6-bit characters. Non-standard, not 8-bit characters. And everybody said: "Ah well, it's all right for IBM but it's so much more *expensive* building byte-addressed machines!" But in the end it prevailed. And I do believe that things like that - the fundamental way you address memory, and being able to, sort of, put units together, preferably in powers of two, not just multiples of two - made all the difference. So, right at the beginning I had a 24-bit word; Atlas had a 48-bit word; DEC 10s, I think, had a 36-bit word. And Seymour Cray, at CDC, had a 60-bit word. All of those are multiples of 2, but I don't think any of them, that I've just quoted, are powers of 2. And it really mattered, it turned out, to have a power-of-two basic unit - bigger than a bit - the 8-bit byte.
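As an aside, for anyone who wants to see what that 'garden spade' work looks like in practice, here is a minimal sketch in C of shifting and masking four 6-bit characters out of one 24-bit word. The word value and the encoding are made up purely for illustration; real word-addressed machines of the era did this with their own shift and mask instructions.

```c
#include <stdio.h>
#include <stdint.h>

/* Dig four 6-bit characters out of a single 24-bit word by shifting
   and masking - the hand work a word-addressed machine left to you. */
int main(void)
{
    uint32_t word = 0x2D9A37;   /* a pretend 24-bit word holding four 6-bit characters */

    for (int i = 3; i >= 0; i--) {
        unsigned c = (word >> (6 * i)) & 0x3F;   /* shift one field down, mask off 6 bits */
        printf("character %d = %u\n", 3 - i, c);
    }
    return 0;
}
```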
So, anyway, I used that to introduce the idea of intermediate codes, but we're now, I think - in tying things up - in a situation to revisit that and say: "Intermediate codes really are useful" - and they come into their own, if you like, with the idea of wanting to port a compiler from one architecture to, perhaps, a very very different architecture. What I call the 'semantic gap' between your high-level language and the program that eventually runs as binary is HUGE! It's not so huge in C - you can feel the assembler and the binary poking up through the C - but you start trying to do a Haskell interpreter, or compiler, and you'll soon discover that this thing running down here is - so to speak - miles away from the abstract stuff you were writing up at the top. So, everybody started saying: don't we really need intermediate codes to help us bridge the gap? Hence Z-code, bytecode for Java - all these kinds of things became discussed more and more and more. And I think I, at one stage, said: "Don't imagine it's always fairly close to the hardware". You could end up in a situation where C is your intermediate code. Bjarne Stroustrup got a C++ compiler started - in its early days - just by making it produce C, which you know you can cope with. What I want to have a look at, today, is this whole business of: "How do intermediate codes help you port a compiler from one architecture to another?" And you've got to remember that, in the worst case, those machines and architectures could be very very different indeed. I did an example, I think, of running a C compiler on a PDP-11, in PDP-11 binary, whose cross-compilation effect was to spit out Z80 binary, which is very very different. How can you cope, then, with cross-compilation? And how does cross-compilation lead you on to being able to think about porting your compiler? Not just producing code for a foreign machine but, in a way, to mount an invasion of the foreign machine and to say: "I'm not just going to push boatloads of code over, I'm going to set up a bridgehead. We're going to land and I'm going to set up a lot of my software tools on the far machine - not just fling raw binary at it." So let's start to discover a bit more about this, then. But in order to get into the details I have, at great personal mental expense, made myself something like 40 or 50 T-diagram blanks, and let us hope that those prove sufficient for the task. I just wanted to talk you through this, first of all, as being the basis for what we're going to discuss, and then I'll put it to one side. But I'll bring it back if I need to refer to it again. We are getting into a land of intermediate codes. I've glued four things to the page. What are they? Well, these two, at the top left, are what I will call Source Texts for your cross-compilation / compiler-porting efforts. What you're saying is: from now on we don't compile directly from H, the high-level language, to produce binary, B, in the code generator. We do it in two steps. We have an H compiler, written in H, producing I, the intermediate code, which I see is not on this list. Let's add it: "I = Intermediate Code". Now, on the left at the top are the source texts for doing this. But, if you consult the previous things we have done on compilation, you will understand that it's not directly those that we can use, because we can't directly execute H. We have to put these through an H compiler, and we end up with the binary executable versions of them. Now, a little bit of extra notation here.
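To make 'intermediate code' a little more concrete before the T-diagrams take over, here is a hypothetical example - not from the video - of what a simple stack-based intermediate code for the statement x = a + b * c might look like: the halfway-house text a front end could emit instead of real binary. The opcode names are invented for illustration.

```c
#include <stdio.h>

/* A front end for H might emit textual instructions like these for
   x = a + b * c, leaving a separate back end (or interpreter) to deal
   with real machine binary later. */
int main(void)
{
    const char *intermediate[] = {
        "PUSH a",    /* load variable a                 */
        "PUSH b",    /* load variable b                 */
        "PUSH c",    /* load variable c                 */
        "MUL",       /* multiply the top two: b * c     */
        "ADD",       /* add: a + (b * c)                */
        "STORE x"    /* assign the result to x          */
    };
    for (size_t i = 0; i < sizeof intermediate / sizeof *intermediate; i++)
        printf("%s\n", intermediate[i]);
    return 0;
}
```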
If you see, at the bottom, for this executable B', I use a single dash (prime) to mean: "The computer I currently possess, my old computer, the one I do all my work on". If you see things like B'', that means the new machine that I'm trying to port things to. So, we'll eventually get to that stage of getting stuff across, and you'll see B'' appearing. Inevitably, if you go this route, your compilation down to binary is now a two-stage process. It may be hidden from you but it has to be there. The other half, you see, is [that] if you're only going half way, to intermediate code, the other half of the journey is to go from intermediate code down to runnable binary for the whole thing. There's your intermediate code interpreter, or compiler, written in B', running on B', producing B'. So, I've numbered these 1, 2, 3 and 4, and eventually I'll come back and say: "We've now created a new number 3", or something like that. What I'm going to do is this. Whereas in previous episodes I've talked about recoding a C compiler to make better binary come out - I did that previously by calling it 'subscript B for better' - I've decided now to use an asterisk. If you see an asterisk somewhere it means it's a 'better' version of what went before. And I is 'intermediate code'. So, let's put that to one side, to be dragged back as and when we need it. When you write a program you have in mind a certain input, it executes, and it produces a certain form of output. And you're very happy, and it all works beautifully. Rather than writing 'C', down here, as I have been doing all along, I'll try and generalize it a bit, say to some high-level language (H) that you're confident with and have used for ages. But all the while you know ... So this is, if you like, 'Userprog'. Here you are. You know that H can't execute directly, so you rely on the fact that, bootstrapped up over several generations, we just happen to have an H compiler that runs on B' and produces B'. By slotting that into there [uses T-diag template] - you remember, there's an implicit arrow there - the H feeds into that one there, and the transformation done in this compiler is to take the H and convert it into B', with a compiler that is now an executable binary itself. So, the net result of all that, which we'll show up here - arrows - is that what that makes, once you've compiled, is of course something that has your treasured input and output but is now in B', running on B'. So, that is the binary executable version of your program. What happens if your input and output was H itself? Can you write a compiler in itself? Of course you can! It's what all this is about. You want to produce an H compiler. We'll start off by writing an H compiler in H. So we'll put these two [T-shapes] back to one side now. What happens if I were to say: "Well, I've written an H compiler in H and it produces B'*"? What I'm saying is: "I am fed up with plain old B', because it's slow and it's inefficient. And it was wonderful when I first did it, and actually this thing up here has been bootstrapped up through being written in assembler and heaven knows what". See previous episodes if that sentence doesn't make any sense to you. But now we have got a situation [where] you want to improve the quality of your binary, so here's a bit of revision. What you do is you say: "OK, I'll write a better H compiler - better in the sense of 'better-quality binary'". I feed it to the old compiler that we've got working already, which takes in H, runs on B', produces B'.
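For readers who like to see the T-diagram plumbing written down, here is a rough sketch - my own modelling, not anything from the video - of the rule being used above: feeding a compiler's source text through an executable compiler keeps its 'from' and 'to' languages but changes what it is written in. The struct, field and function names are invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* A T-diagram as a triple: the language a compiler reads, the language
   it writes, and the language it is itself written in. */
typedef struct {
    const char *from;   /* language the compiler reads            */
    const char *to;     /* language the compiler writes           */
    const char *in;     /* language the compiler is written in    */
} TDiagram;

/* Run 'tool' on the source text of 'prog': the result does the same
   translation job as 'prog', but is now written in whatever 'tool'
   produces. */
TDiagram compileWith(TDiagram prog, TDiagram tool)
{
    TDiagram result = prog;
    if (strcmp(prog.in, tool.from) != 0)
        fprintf(stderr, "mismatch: program written in %s, tool reads %s\n",
                prog.in, tool.from);
    result.in = tool.to;
    return result;
}

int main(void)
{
    TDiagram betterInH = { "H", "B'*", "H"  };   /* better H compiler, written in H   */
    TDiagram oldH      = { "H", "B'",  "B'" };   /* existing executable H compiler    */

    TDiagram stage1 = compileWith(betterInH, oldH);
    printf("new compiler: from %s to %s, written in %s\n",
           stage1.from, stage1.to, stage1.in);   /* from H to B'*, written in B' */
    return 0;
}
```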
And what does that end you up with? Answer: a compiler that takes in H and produces binary. It's running on old binary that's not super quality, but it's OK - it doesn't crash. It's a bit slow. Just to end this little exercise off, before we get on to genuinely porting compilers: you've got that [points at T-diagram] and you naturally then say: "Well, why not feed it to itself again?" And if you get another version of that, and feed one into the other, you can end up with H, written in better binary, producing better binary. And if you look up [the video] "Self-compiling Compilers", that's exactly what we do. So it's all very well doing this simple-minded stuff. We take one great flying leap in our [previous] compilers and require the code generator to be able to get very low-level, very specific, and yet be very very tight and wonderful for all sorts of different binaries for different machines - B', B'' etc. That's hard. So we've decided that intermediate codes might be the answer. So how do we cope, then, using intermediate codes, just with this business of improving your code? How would you do it? Well, it has to be a two-stage process, no question about that. It has to be. So, this time I will do an H compiler, written in H. And this time I am going to make it produce better intermediate code than ever before. So you see ... think of it this way. When you upgrade a compiler you've now got two halves to upgrade. Do you want to upgrade the H-to-intermediate-code piece, the front end, so-called? Or do you want to upgrade the intermediate-code interpreter, going down to binary - the back end? It's a mix and match. You can do either, or both, in whatever order you like. But you've got to remember it is a two-stage process. Fortunately for me I have, already working, an old compiler for H, running on B' - original machine binary - and producing ordinary intermediate code. Not 'super' (I*) intermediate code. I've written a better version of myself now and I'm upgrading this front end. So we've got H, written in H, producing I* - better-quality intermediate code, in some sense, than went before. But the old version of the compiler, that I've been using for months now, just has H, running on B', producing ordinary intermediate code - not optimized. But it's good enough for compiling this one. The only thing that's different from what we've got before is that we don't directly end up producing binary as the output. We produce intermediate code as the output. So this first stage, then, will get me the following: H feeds into [the compiler that takes] H, running on B', producing I, which means that this thing takes in H, produces I*, and is running on I. OK, fine. So we've kind of 'recompiled the compiler', but we haven't gone far enough yet. And this is the ball and chain around your ankle when you go for intermediate codes: you've got a better thing, but it is reliant on being able to cope with the fact that there's an intermediate-code stage. OK, that doesn't put us off. What I've said, all along, is that these are my key source texts, these are my executables, and what I'm pointing out now is: if you use intermediate codes, you don't just need a source-end translator, you need a back-end translator. You need a thing that says, you know: "My system produces intermediate code, but I am the back end that takes in intermediate code and produces real binary that really will run." So, I want one of those now [points at T-shape (3)] to put into my diagram. This is where we're at so far.
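The 'two halves to upgrade' point is worth pinning down: once an intermediate code sits in the middle, a compiler is a front end glued to a back end, and either half can be swapped on its own. Here is a minimal sketch of that shape in C; the function names and the label strings are purely illustrative stand-ins, not anything from the video.

```c
#include <stdio.h>

typedef const char *(*FrontEnd)(const char *source);       /* H -> intermediate code */
typedef const char *(*BackEnd)(const char *intermediate);  /* intermediate code -> binary */

/* Stub front ends and back ends: they just report which flavour of
   output they would produce. */
static const char *plainFront(const char *src)  { (void)src; return "I";   }
static const char *betterFront(const char *src) { (void)src; return "I*";  }
static const char *plainBack(const char *ir)    { (void)ir;  return "B'";  }
static const char *betterBack(const char *ir)   { (void)ir;  return "B'*"; }

/* Compilation is now always two stages: front end, then back end. */
static void compile(const char *src, FrontEnd fe, BackEnd be)
{
    const char *ir  = fe(src);
    const char *bin = be(ir);
    printf("%s -> %s -> %s\n", src, ir, bin);
}

int main(void)
{
    compile("H", plainFront,  plainBack);    /* H -> I  -> B'                  */
    compile("H", betterFront, plainBack);    /* upgrade the front end only     */
    compile("H", plainFront,  betterBack);   /* upgrade the back end only      */
    return 0;
}
```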
Let me gobble up yet another T-diagram and say: "If I put in here what everybody ought to have available to them, which is an I-to-B' compiler written in B', I can now slot that in there, like that." And just look what happens. You've got your first-stage compilation. It's doing wonderful things, but it's executing intermediate code. And maybe I have a test interpreter, sort of, to show you: "Well, that's kind of working." But, in the end, you want faster, more efficient code. So you decide you will compile your intermediate code into proper binary. And this kind of choice between "Do I interpret it slowly and see if it's working?" versus "Do I, in the end, commit to compiling it?" is the sort of thing available in stuff like Java bytecode and so on. Start off by interpreting - it feels a bit slow, but it's looking good - let's compile it. Now watch what happens as you trace through here. I goes in here - this is the intermediate code compiler, if you like, now producing genuine binary. That shoots through there, and what you end up with is H producing I* - i.e. better-quality intermediate code - running on B'. Right, well, we're getting close. There's one final thing, though, that you need to do. What we've got ourselves now is a situation where we've got a compiler that takes in statements in H. It is running on B' binary - ordinary, unimproved binary - but it does produce better intermediate code. We're now going to pull the same trick that we've done before in Self-compiling Compilers. The only difference here, though, is we've got this I to cope with this time. But the principles of what we're doing are just the same. If I bring this one [T shape] down, to stop my having to write out another one: we've now made a binary out of that; feed it to itself. Originally I started with H to I*, written in H. I compiled it all the way through to intermediate code, as best I could. It's good-quality intermediate code. Now take that thing, your executable, feed the original thing to itself, and look what happens! The H goes in there and produces I*. So, you end up with H [producing] I*, [written in] I*. Final stage coming up, I promise. So that, if you like, when you're through its machinations, produces that. One final stage - and I promise the torture will end - right! Now you might say: "There - no big deal. I can write an intermediate code interpreter and check it out, rough and ready. It'll be slow, but [we can] check out whether everything works". But then somebody in the end says: "No - come on, let's compile it - it might go a lot faster." So, if you remember, item number 4 in our original toolkit was a thing that takes I and turns it into binary. So I'm going to put one of those in place now, and trace that all the way through. What will that produce you? Don't forget we're starting with H producing good-quality intermediate code, so that is what it produces - but what about here? If you trace through, better-quality intermediate code should hopefully give you better-quality binary when you compile it. So I can put down here that this is basically - if there aren't too many suffixes and superscripts there - producing you an H-to-I* compiler, but running on better-quality binary. So, one way or another, this is the new 3 because, referring back to our original map, what I've said is: how can I improve that? And I have managed to improve it. I've got better-quality intermediate code and better-quality binary. It is improved, but it has been a fairly messy two-stage process that's hidden from you.
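On that 'interpret it first, see if it works, compile it later' choice: here is a minimal sketch of the interpret-it half - a tiny interpreter for the made-up stack code sketched earlier, with the variables replaced by constant values to keep it self-contained. None of this is the video's notation; it is just an illustration of what 'an intermediate code interpreter' means.

```c
#include <stdio.h>

enum Op { PUSH, MUL, ADD, HALT };

typedef struct { enum Op op; int operand; } Instr;

/* Fetch-decode-execute loop over the intermediate code: slow but simple,
   and handy for checking that the front end's output makes sense before
   committing to a real binary-producing back end. */
int run(const Instr *code)
{
    int stack[16];
    int sp = 0;                               /* stack pointer */
    for (int pc = 0; ; pc++) {
        switch (code[pc].op) {
        case PUSH: stack[sp++] = code[pc].operand; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case HALT: return stack[sp - 1];
        }
    }
}

int main(void)
{
    /* a + b * c with a = 2, b = 3, c = 4 */
    Instr program[] = { {PUSH, 2}, {PUSH, 3}, {PUSH, 4}, {MUL, 0}, {ADD, 0}, {HALT, 0} };
    printf("result = %d\n", run(program));    /* prints 14 */
    return 0;
}
```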
But the idea of that is to give you just the notion that there is a penalty to pay: you have to have a two-stage process. So we've expended a lot of brain cells, I think, between us now, discovering how we can improve our compiler - which we did do before, in the Self-compiling Compilers episode - but this time it's different. How can you improve your compiler when there's an intermediate code involved? And we've done it! And we've seen exactly how we can do it. We feel weary - or at least I do - but that is it. And you might say: "Why bother with intermediate codes? It's just producing more stages that we have to go through." Well, the answer is, I think, that it does make life easier for you if you say: "OK, instead of improving ourselves now, within an intermediate-code context, how about we say: 'I don't want better binary out of it - I want different binary, for another machine'!" So, we will find that the diagrams that I've just drawn - with some subtle adaptation, and me getting all tangled up, probably - can be adapted for producing binary for a completely new architecture, which we will be calling B''. What we're going to find in this next one is that we're just changing the rules slightly. Instead of B'* - improved binary - we're moving from B' to B''.