The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Hey, everybody. It's my pleasure once again to welcome TB Schardl, who is the author of your Tapir compiler, to talk about the Cilk runtime system.

TAO SCHARDL: Thanks, Charles. Can anyone hear me in the back? Seem good? OK. Thanks for the introduction. Today I'll be talking about the Cilk runtime system. This is pretty exciting for me. This is a lecture that's not about compilers. I get to talk about something a little different for once. It should be a fun lecture.

Recently, as I understand it, you've been looking at storage allocation, both in the serial case as well as the parallel case. And you've already done Cilk programming for a while at this point. This lecture, honestly, is a bit of a non sequitur in terms of the overall flow of the course. And it's also an advanced topic. The Cilk runtime system is a pretty complicated piece of software. But nevertheless, I believe you should have enough background to at least start to understand and appreciate some of the aspects of the design of the Cilk runtime system. So that's why we're talking about it today.

Just to quickly recall something that you're all, I'm sure, intimately familiar with by this point, what's Cilk programming all about? Well, Cilk is a parallel programming language that allows you to make your software run faster using parallel processors. And to use Cilk, it's pretty straightforward. You may start with some serial code that runs in some running time-- we'll denote that as Ts for certain parts of the lecture. If you want it to run in parallel using Cilk, you just insert Cilk keywords in choice locations.
For example, you can parallelize the outer loop in this matrix multiply kernel, and that will let your code run in time Tp on P processors. And ideally, Tp should be less than Ts.

Now, just adding keywords is all you need to do to tell Cilk to execute the computation in parallel. What does Cilk do in light of those keywords? At a very high level, Cilk, and specifically its runtime system, takes care of the task of scheduling and load balancing the computation on the parallel processors and on the multicore system in general. So after you've denoted logical parallelism in the program using cilk_spawn, cilk_sync, and cilk_for, the Cilk scheduler maps that computation onto the processors. And it does so dynamically at runtime, based on whatever processing resources happen to be available. And Cilk uses a randomized work-stealing scheduler, which guarantees that that mapping is efficient and the execution runs efficiently.

Now, you've all been using the Cilk platform for a while. In its basic usage, you write some Cilk code, possibly by parallelizing ordinary serial code, you feed that to a compiler, you get a binary, you run the binary with some particular input on a multicore system, and you get parallel performance. Today, we're going to look at how exactly does Cilk work? What's the magic that goes on, hidden by the boxes on this diagram?

And the very first thing to note is that this picture is a bit of a simplification. The first simplification that we're going to break is that it's not really just Cilk source and the Cilk compiler. There's also a runtime system library, libcilkrts.so, in case you've seen that file or messages about that file on your system. And really it's the compiler and the runtime library that work together to implement Cilk's runtime system, to do the work stealing and do the efficient scheduling and load balancing.
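The matrix-multiply kernel mentioned above appears only on the slide, not in the transcript, but it might look roughly like the following sketch. The matrix size and names are assumptions, and the cilk_for keyword is shown in a comment so the sketch builds as plain C; with the OpenCilk toolchain you would uncomment it to parallelize the outer loop.

```c
#include <assert.h>

#define N 4

// Square matrix-multiply kernel: C = A * B.
// In Cilk, replacing the outer `for` with `cilk_for` parallelizes
// the outer loop; it's left as a comment so this builds as plain C.
void mat_mul(double A[N][N], double B[N][N], double C[N][N]) {
    /* cilk_for */ for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Because the iterations of the outer loop write to disjoint rows of C, they are independent, which is what makes this a legal place to insert cilk_for.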
Now, we might suspect that if you just take a look at the code that you get when you compile a Cilk program, that might tell you something about how Cilk works. Here's C pseudocode for the result when you compile a simple piece of Cilk code. It's a bit complicated. I think that's fair to say. There's a lot going on here. There was one function in the original program; now there are two. There are some new variables, there are some calls to functions that look a little bit strange. There's a lot going on in the compiled result. This isn't exactly easy to interpret or understand, and this doesn't even bring the runtime system library into the picture. The runtime system library, you can find the source code online. It's a little less than 20,000 lines of code. It's also kind of complicated.

So rather than dive into the code directly, what we're going to do today is take a top-down approach to understanding how the Cilk runtime system works, and some of its design considerations. We're going to start by talking about some of the required functionality that we need out of the Cilk runtime system, as well as some performance considerations for how the runtime system should work. And then we'll take a look at how the worker deques in Cilk get implemented, how spawning actually works, how stealing a computation works, and how synchronization works within Cilk. Does that all sound good? Any questions so far? This should all be review, more or less.

OK, so let's talk a little bit about required functionality. You've seen this picture before, I hope. This picture illustrates the execution model of a Cilk program. Here we have everyone's favorite exponential-time Fibonacci routine, parallelized using Cilk. This is not an efficient way to compute Fibonacci numbers, but it's a nice didactic example for understanding parallel computation, especially the Cilk model.
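The Fibonacci routine on the slide is the classic parallel fib. Here is a sketch of it, with the Cilk keywords shown in comments so it builds as plain C; compiled this way it is exactly the serial elision the lecture keeps referring back to.

```c
#include <assert.h>

// Serial elision of the parallel Fibonacci routine from the lecture.
// In Cilk, the commented keywords make the first recursive call a
// spawn (its continuation may run in parallel) and join both
// children at the sync before the addition.
long fib(long n) {
    if (n < 2) return n;
    long x = /* cilk_spawn */ fib(n - 1);
    long y = fib(n - 2);
    /* cilk_sync; */
    return x + y;
}
```

Note that the serial elision computes the same result as the parallel version; the keywords only expose logical parallelism, they don't change the semantics.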
And as we saw many lectures ago, when you run this program on a given input, the execution of the program can be modeled as a computation dag. And this computation dag unfolds dynamically as the program executes. But I want to stop and take a hard look at exactly what that dynamic execution looks like when we've got parallel processors and work stealing all coming into play.

So we'll stick with this Fibonacci routine, and we'll imagine we've just got one processor on the system, to start. And we're just going to use this one processor to execute fib(4). And it's going to take some time to do it, just to make the story interesting. So we start executing this computation, and that one processor is just going to execute the Fibonacci routine from the beginning up to the cilk_spawn statement, as if it's ordinary serial code, because it is ordinary serial code. At this point the processor hits the cilk_spawn statement. What happens now? Anyone remember? What happens to the dag?

AUDIENCE: It branches down [INAUDIBLE]

TAO SCHARDL: It branches downward and spawns another process, more or less. The way we model that-- the cilk_spawn is of the routine fib of n minus 1. In this case, that'll be fib(3). And so, like an ordinary function call, we're going to get a brand new frame for fib(3). And that's going to have some strand that's available to execute. But the spawn is not your typical function call. It actually allows some other computation to run in parallel. And so the way we model that in this picture is that we get a new frame for fib(3). There's a strand available to execute there. And the continuation, the green strand, is now available in the frame of fib(4). But no one's necessarily executing it. It's just kind of faded in the picture. So once the spawn has occurred, what's the processor going to do? The processor is actually going to dive in and start executing fib(3), as if it were an ordinary function call.
Yes, there's a strand available within the frame of fib(4), but the processor isn't going to worry about that strand. It's just going to say, oh, fib(4) calls fib(3), and start computing fib(3). Sound good? And so the processor dives down from pink strand to pink strand. The instruction pointer for the processor returns to the beginning of the fib routine, because we're now calling fib once again. And this process repeats. It executes the pink strand up until the cilk_spawn, just like ordinary serial code. The spawn occurs-- and we've already seen this picture before-- the spawn allows another strand to execute in parallel. But it also creates a frame for fib(2). And the processor dives into fib(2), resetting the instruction pointer to the beginning of fib. P1 executes up to the spawn. Once again, we get another strand to execute, as well as an invocation of fib(1). The processor dives even further. So that's fine. This is just the processor doing more or less ordinary serial execution of this fib routine, while also allowing some strands to be executed in parallel. This is the one-processor situation; looks pretty good so far. Right, and in the fib(1) case, it doesn't make it as far through the pink strand because, in fact, we hit the base case.

But now let's bring in some more processors. Suppose that another processor finally shows up, says I'm bored, I want to do some work, and decides to steal some computation. It's going to discover the green strand in the frame of fib(4), and P2 is just going to jump in there and start executing that strand. And if we think really hard about what this means, P2 is another processor on the system. It has its own set of registers. It has its own instruction pointer. And so what Cilk somehow allows to happen is for P2 to just jump right into the middle of this fib(4) routine, which is already executing. It just sets the instruction pointer to point at that green instruction, at the call to fib of n minus 2.
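The stealing just described is driven by the per-worker deques that the lecture covers later. As a rough, single-threaded sketch only of the push/pop/steal discipline: the owning worker pushes and pops continuations at the tail (LIFO, mirroring serial execution order), while a thief takes from the head. The real Cilk deque handles concurrent access with the THE protocol and memory fences; none of that synchronization appears here.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64

// Simplified worker deque holding opaque pointers to suspended
// frames (continuations). head: where thieves steal (oldest work);
// tail: where the owning worker pushes and pops (youngest work).
typedef struct {
    void *items[DEQUE_CAP];
    size_t head;
    size_t tail;
} deque_t;

void deque_init(deque_t *d) { d->head = d->tail = 0; }

// Owner pushes a continuation when it executes a spawn.
void push(deque_t *d, void *frame) { d->items[d->tail++] = frame; }

// Owner pops its own youngest work, like returning from a call.
void *pop(deque_t *d) {
    if (d->tail == d->head) return NULL;  // deque empty
    return d->items[--d->tail];
}

// Thief steals the oldest work, which tends to be the largest
// remaining chunk of the computation dag.
void *steal(deque_t *d) {
    if (d->head == d->tail) return NULL;  // nothing to steal
    return d->items[d->head++];
}
```

In the fib(4) walkthrough, the continuation in fib(4)'s frame sits at the head of P1's deque, which is exactly why a bored P2 finds that green strand first.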
And it's just going to pick up where processor 1 left off when it executed up to this point in fib(4), somehow. In this case, it executes fib of n minus 2. That calls fib(2), creates a new strand-- it's just an ordinary function call. It's going to descend into that new frame. It's going to return to the beginning of fib. All that's well and good.

Another processor might come along and steal another piece of the computation. It steals another green strand, and so once again, this processor needs to jump into the middle of an executing function. Its instruction pointer is just going to point at this call to fib of n minus 2. Somehow, it's going to have the state of this executing function available, despite having independent registers. And it needs to just start from this location, with all the parameters set appropriately, and start executing this function as if it's an ordinary function. It calls fib of 3 minus 2, which is fib(1). And now these processors might start executing in parallel.
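One way to picture "jumping into the middle of an executing function" is that a stolen continuation carries just enough saved state for the thief to resume at the instruction after the spawn. The sketch below is purely illustrative, not the runtime's actual frame layout: a hypothetical frame records the suspended fib's argument, the thief runs the continuation (the call to fib of n minus 2), and the join at the sync adds in the spawned child's result.

```c
#include <assert.h>

// Hypothetical record of a suspended fib(n) invocation: the thief
// needs n to resume at the continuation, and eventually needs x,
// the spawned fib(n - 1)'s result, to finish the sync.
typedef struct {
    long n;  // argument of the suspended fib invocation
    long x;  // result of the spawned fib(n - 1), filled in by the child
} fib_frame_t;

// Plain serial Fibonacci, standing in for whatever work the strands do.
long fib_serial(long n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// What a thief conceptually executes after stealing the green strand:
// start at the continuation, compute fib(n - 2), then combine with
// the spawned child's result at the cilk_sync.
long resume_continuation(fib_frame_t *f) {
    long y = fib_serial(f->n - 2);
    return f->x + y;
}
```

The real runtime doesn't copy arguments around like this; the compiled code and libcilkrts arrange for the continuation to see the original frame's state, which is part of what the compiled pseudocode shown earlier is doing.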