
  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high-quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: Good afternoon, everyone.

  • So let's get started.

  • So today, we're going to be talking

  • about races and parallelism.

  • And you'll be doing a lot of parallel programming

  • for the next homework assignment and project.

  • One thing I want to point out is that it's important

  • to meet with your MITPOSSE as soon as possible,

  • if you haven't done so already, since that's

  • going to be part of the evaluation for the Project 1

  • grade.

  • And if you have trouble reaching your MITPOSSE members,

  • please contact your TA and also make a post on Piazza

  • as soon as possible.

  • So as a reminder, let's look at the basics of Cilk.

  • So we have cilk_spawn and cilk_sync statements.

  • In Cilk, this was the code that we

  • saw in the last lecture, which computes the nth Fibonacci

  • number.

  • So when we say cilk_spawn, it means

  • that the named child function, the function right

  • after the cilk_spawn keyword, can execute in parallel

  • with the parent caller.

  • So it says that fib of n minus 1 can

  • execute in parallel with the fib function that called it.

  • And then cilk_sync says that control cannot pass this point

  • until all of the spawned children have returned.

  • So this is going to wait for fib of n minus 1

  • to finish before it goes on and returns the sum of x and y.

  • And recall that the Cilk keywords grant permission

  • for parallel execution, but they don't actually

  • force parallel execution.

  • So this code here says that we can execute fib of n minus 1

  • in parallel with this parent caller,

  • but it doesn't say that we necessarily have

  • to execute them in parallel.

  • And it's up to the runtime system

  • to decide whether these different functions will

  • be executed in parallel.

  • We'll talk more about the runtime system today.
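
For reference, a minimal sketch of the Fibonacci code being described, assuming the standard Cilk header and a 64-bit integer type (details not spelled out in the transcript):

    #include <stdint.h>
    #include <cilk/cilk.h>

    int64_t fib(int64_t n) {
        if (n < 2) return n;                // base case
        int64_t x = cilk_spawn fib(n - 1);  // spawned child may run in parallel with the parent
        int64_t y = fib(n - 2);             // parent continues with the other subproblem
        cilk_sync;                          // wait for the spawned child before using x
        return x + y;
    }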

  • And also, we talked about this example,

  • where we wanted to do an in-place matrix transpose.

  • And this used the cilk_for keyword.

  • And this says that we can execute

  • the iterations of this cilk_for loop in parallel.

  • And again, this says that the runtime system

  • is allowed to schedule these iterations in parallel,

  • but doesn't necessarily say that they

  • have to execute in parallel.

  • And under the hood, cilk_for statements

  • are translated into nested cilk_spawn and cilk_sync calls.

  • So the compiler is going to divide the iteration

  • space in half, do a cilk_spawn on one of the two halves,

  • call the other half, and then this

  • is done recursively until we reach

  • a certain size for the number of iterations

  • in a loop, at which point it just

  • creates a single task for that.
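
A sketch of that divide-and-conquer lowering, with an illustrative grain size and helper name (the compiler's actual generated code will differ):

    #include <stdint.h>
    #include <cilk/cilk.h>

    // Recursively split the iteration range [lo, hi) in half, spawning one half
    // and calling the other, until the range is small enough to run serially.
    static void loop_helper(int64_t lo, int64_t hi, void (*body)(int64_t)) {
        const int64_t GRAIN = 2048;              // assumed cutoff; tunable in practice
        if (hi - lo <= GRAIN) {
            for (int64_t i = lo; i < hi; ++i) body(i);
            return;
        }
        int64_t mid = lo + (hi - lo) / 2;
        cilk_spawn loop_helper(lo, mid, body);   // one half may run in parallel
        loop_helper(mid, hi, body);              // the other half runs in the parent
        cilk_sync;                               // join before returning
    }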

  • So any questions on the Cilk constructs?

  • Yes?

  • AUDIENCE: Is Cilk smart enough to recognize issues

  • with reading and writing for matrix transpose?

  • JULIAN SHUN: So it's actually not

  • going to figure out whether the iterations are

  • independent for you.

  • The programmer actually has to reason about that.

  • But Cilk does have a nice tool, which we'll talk about,

  • that will tell you which places your code might possibly

  • be reading and writing the same memory location,

  • and that allows you to localize any possible race

  • bugs in your code.

  • So we'll actually talk about races.

  • But if you just compile this code,

  • Cilk isn't going to know whether the iterations are independent.

  • So determinacy races-- so race conditions

  • are the bane of concurrency.

  • So you don't want to have race conditions in your code.

  • And there are these two famous race bugs that cause disaster.

  • So there is this Therac-25 radiation therapy machine,

  • and there was a race condition in the software.

  • And this led to three people being killed

  • and many more being seriously injured.

  • The North American blackout of 2003

  • was also caused by a race bug in the software,

  • and this left 50 million people without power.

  • So these are very bad.

  • And they're notoriously difficult to discover

  • by conventional testing.

  • So race bugs aren't going to appear every time

  • you execute your program.

  • And in fact, the hardest ones to find, which cause these events,

  • are actually very rare events.

  • So most of the times when you run your program,

  • you're not going to see the race bug.

  • Only very rarely will you see it.

  • So this makes it very hard to find these race bugs.

  • And furthermore, when you see a race bug,

  • it doesn't necessarily always happen

  • in the same place in your code.

  • So that makes it even harder.

  • So what is a race?

  • So a determinacy race is one of the most basic forms of races.

  • And a determinacy race occurs when

  • two logically parallel instructions

  • access the same memory location, and at least one

  • of these instructions performs a write to that location.

  • So let's look at a simple example.

  • So in this code here, I'm first setting x equal to 0.

  • And then I have a cilk_for loop with two iterations,

  • and each of the two iterations are

  • incrementing this variable x.

  • And then at the end, I'm going to assert that x is equal to 2.

  • So there's actually a race in this program here.
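
The racy program being described, written out as a sketch (the header names are assumed):

    #include <assert.h>
    #include <cilk/cilk.h>

    int main(void) {
        int x = 0;
        cilk_for (int i = 0; i < 2; ++i) {
            x++;            // determinacy race: both iterations read and write x
        }
        assert(x == 2);     // may fail, since x++ is not atomic
        return 0;
    }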

  • So in order to understand where the race occurs,

  • let's look at the execution graph here.

  • So I'm going to label each of these statements with a letter.

  • The first statement, a, is just setting x equal to 0.

  • And then after that, we're actually

  • going to have two parallel paths, because we

  • have two iterations of this cilk_for loop, which

  • can execute in parallel.

  • And each of these paths are going to increment x by 1.

  • And then finally, we're going to assert that x is equal to 2

  • at the end.

  • And this sort of graph is known as a dependency graph.

  • It tells you what instructions have

  • to finish before you execute the next instruction.

  • So here it says that B and C must

  • wait for A to execute before they proceed,

  • but B and C can actually happen in parallel, because there

  • is no dependency among them.

  • And then D has to happen after B and C finish.

  • So to understand why there's a race bug here,

  • we actually need to take a closer look

  • at this dependency graph.

  • So let's take a closer look.

  • So when you run this code, x plus plus

  • is actually going to be translated into three steps.

  • So first, we're going to load the value

  • of x into some processor's register, r1.

  • And then we're going to increment r1,

  • and then we're going to set x equal to the result of r1.

  • And the same thing for r2.

  • We're going to load x into register r2, increment r2,

  • and then set x equal to r2.

  • So here, we have a race, because both of these stores,

  • x equals r1 and x equals r2,

  • are actually writing to the same memory location.

  • So let's look at one possible execution of this computation

  • graph.

  • And we're going to keep track of the values of x, r1 and r2.

  • So the first instruction we're going to execute

  • is x equal to 0.

  • So we just set x equal to 0, and everything's good so far.

  • And then next, we can actually pick one of two instructions

  • to execute, because both of these two instructions

  • have their predecessors satisfied already.

  • Their predecessors have already executed.

  • So let's say I pick r1 equal to x to execute.

  • And this is going to place the value 0 into register r1.

  • Now I'm going to increment r1, so this

  • changes the value in r1 to 1.

  • Then now, let's say I execute r2 equal to x.

  • So that's going to read x, which has a value of 0.

  • It's going to place the value of 0 into r2.

  • It's going to increment r2.

  • That's going to change that value to 1.

  • And then now, let's say I write r2 back to x.

  • So I'm going to place a value of 1 into x.

  • Then now, when I execute this instruction, x equals r1,

  • it's also placing a value of 1 into x.

  • And then finally, when I do the assertion,

  • this value here is not equal to 2, and that's wrong.

  • Because if you executed this sequentially,

  • you would get a value of 2 here.

  • And the reason-- as I said, the reason why this occurs

  • is because we have multiple writes

  • to the same shared memory location, which

  • could execute in parallel.

  • And one of the nasty things about this example

  • here is that the race bug doesn't necessarily always

  • occur.

  • So does anyone see why this race bug doesn't necessarily

  • always show up?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Right.

  • So the answer is because if one of these two branches

  • executes all three of its instructions

  • before we start the other one, then the final result in x

  • is going to be 2, which is correct.

  • So if I executed these instructions

  • in order of 1, 2, 3, 7, 4, 5, 6, and then, finally, 8, the value

  • is going to be 2 in x.

  • So the race bug here doesn't necessarily always occur.

  • And this is one thing that makes these bugs hard to find.

  • So any questions?

  • So there are two different types of determinacy races.

  • And they're shown in this table here.

  • So let's suppose that instruction A and instruction

  • B both access some location x, and suppose A is parallel to B.

  • So both of the instructions can execute in parallel.

  • So if A and B are just reading that location,

  • then that's fine.

  • You don't actually have a race here.

  • But if one of the two instructions

  • is writing to that location, whereas the other one is

  • reading from that location, then you

  • have what's called a read race.

  • And the program might have a non-deterministic result

  • when you have a read race, because the final answer might

  • depend on whether A read the value before B

  • updated it, or whether A read the value

  • after B updated it.

  • So the order of the execution of A and B

  • can affect the final result that you see.

  • And finally, if both A and B write

  • to the same shared location, then you have a write race.

  • And again, this will cause non-deterministic behavior

  • in your program, because the final answer could depend on

  • whether A did the write first or B did the write first.

  • And we say that two sections of code

  • are independent if there are no determinacy races between them.

  • So the two pieces of code can't have a shared location,

  • where one computation writes to it

  • and another computation reads from it,

  • or where both computations write to that location.

  • Any questions on the definition?

  • So races are really bad, and you should avoid

  • having races in your program.

  • So here are some tips on how to avoid races.

  • So I can tell you not to write races in your program,

  • and you know that races are bad, but sometimes,

  • when you're writing code, you just

  • have races in your program, and you can't help it.

  • But here are some tips on how you can avoid races.

  • So first, the iterations of a cilk_for loop

  • should be independent.

  • So you should make sure that the different iterations

  • of a cilk_for loop aren't writing to the same memory

  • location.

  • Secondly, between a cilk_spawn statement

  • and a corresponding cilk_sync, the code of the spawn child

  • should be independent of the code of the parent.

  • And this includes code that's executed by additional children

  • spawned or called by the spawned child.

  • So you should make sure that these pieces of code

  • are independent-- there's no read

  • or write races between them.

  • One thing to note is that the arguments to a spawn function

  • are evaluated in the parent before the spawn actually

  • occurs.

  • So you can't get a race in the argument evaluation,

  • because the parent is going to evaluate these arguments.

  • And there's only one thread that's doing this,

  • so it's fine.

  • And another thing to note is that the machine word

  • size matters.

  • So you need to watch out for races

  • when you're reading and writing to packed data structures.

  • So here's an example.

  • I have a struct x with two chars, a and b.

  • And updating x.a and x.b may possibly cause a race.

  • And this is a nasty race, because it

  • depends on the compiler optimization level.

  • Fortunately, this is safe on the Intel machines

  • that we're using in this class.

  • You can't get a race in this example.

  • But there are other architectures

  • that might have a race when you're updating the two

  • variables a and b in this case.

  • So with the Intel machines that we're

  • using, if you're using standard data types like chars, shorts,

  • ints, and longs inside a struct, you won't get races.

  • But if you're using non-standard types--

  • for example, you're using the C bit-field facilities,

  • and the sizes of the fields are not one of the standard sizes,

  • then you could possibly get a race.

  • In particular, if you're updating individual bits

  • inside a word in parallel, then you might see a race there.

  • So you need to be careful.
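
A sketch of the packed-struct scenario, with the parallel updates written as spawned helper functions (the helpers are illustrative; the lecture only shows the struct):

    #include <cilk/cilk.h>

    struct x_t {
        char a;        // with standard types like char, parallel updates to
        char b;        // different fields are safe on the class's Intel machines
    };

    struct x_t x;

    static void set_a(void) { x.a = 1; }
    static void set_b(void) { x.b = 2; }

    void update_fields(void) {
        cilk_spawn set_a();   // may run in parallel with set_b()
        set_b();
        cilk_sync;            // with non-standard-width bit-fields, this pattern could race
    }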

  • Questions?

  • So fortunately, the Cilk platform

  • has a very nice tool called the--

  • yes, question?

  • AUDIENCE: [INAUDIBLE] was going to ask, what causes that race?

  • JULIAN SHUN: Because the architecture might actually

  • be updating this struct at the granularity of more

  • than 1 byte.

  • So if you're updating single bytes inside this larger word,

  • then that might cause a race.

  • But fortunately, this doesn't happen on Intel machines.

  • So the Cilksan race detector--

  • if you compile your code using this flag,

  • -fsanitize=cilk, then

  • it's going to generate a Cilksan-instrumented program.

  • And then if an ostensibly deterministic Cilk program

  • run on a given input could possibly behave any differently

  • than its serial elision, then Cilksan

  • is going to guarantee to report and localize

  • the offending race.

  • So Cilksan is going to tell you which memory location there

  • might be a race on and which of the instructions

  • were involved in this race.

  • So Cilksan employs a regression test methodology

  • where the programmer provides it different test inputs.

  • And for each test input, if there could possibly

  • be a race in the program, then it will report these races.

  • And it identifies the file names, the lines,

  • the variables involved in the races,

  • including the stack traces.

  • So it's very helpful when you're trying to debug your code

  • and find out where there's a race in your program.

  • One thing to note is that you should

  • ensure that all of your program files are instrumented.

  • Because if you only instrument some of your files and not

  • the other ones, then you'll possibly miss out

  • on some of these race bugs.

  • And one of the nice things about the Cilksan race detector

  • is that it's always going to report a race if there

  • is possibly a race, unlike many other race detectors, which

  • are best efforts.

  • So they might report a race some of the times

  • when the race actually occurs, but they don't necessarily

  • report a race all the time.

  • Because in some executions, the race doesn't occur.

  • But the Cilksan race detector is going

  • to always report the race, if there is potentially

  • a race in there.

  • Cilksan is your best friend.

  • So use this when you're debugging your homeworks

  • and projects.

  • Here's an example of the output that's generated by Cilksan.

  • So you can see that it's saying that there's a race detected

  • at this memory address here.

  • And the line of code that caused this race

  • is shown here, as well as the file name.

  • So this is a matrix multiplication example.

  • And then it also tells you how many races it detected.
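
As an illustration of the kind of bug Cilksan reports, here is a hypothetical matrix-multiply kernel with a race (the dimensions and names are made up); compiling it with the -fsanitize=cilk flag mentioned above and running it should produce a report on the racing accesses:

    #include <cilk/cilk.h>
    #define N 64

    double A[N][N], B[N][N], C[N][N];

    void matmul_racy(void) {
        cilk_for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                cilk_for (int k = 0; k < N; ++k)
                    C[i][j] += A[i][k] * B[k][j];   // parallel k iterations race on C[i][j]
    }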

  • So any questions on determinacy races?

  • So let's now talk about parallelism.

  • So what is parallelism?

  • Can we quantitatively define what parallelism is?

  • So what does it mean when somebody tells you

  • that their code is highly parallel?

  • So to have a formal definition of parallelism,

  • we first need to look at the Cilk execution model.

  • So this is a code that we saw before for Fibonacci.

  • Let's now look at what a call to fib of 4 looks like.

  • So here, I've color coded the different lines of code here

  • so that I can refer to them when I'm

  • drawing this computation graph.

  • So now, I'm going to draw this computation graph corresponding

  • to how the computation unfolds during execution.

  • So the first thing I'm going to do

  • is I'm going to call fib of 4.

  • And that's going to generate this magenta node

  • here corresponding to the call to fib of 4,

  • and that's going to represent this pink code here.

  • And this illustration is similar to the computation graphs

  • that you saw in the previous lecture,

  • but this is happening in parallel.

  • And I'm only labeling the argument here,

  • but you could actually also write the local variables

  • there.

  • But I didn't do it, because I want to fit everything

  • on this slide.

  • So what happens when you call fib of 4?

  • It's going to get to this cilk_spawn statement,

  • and then it's going to call fib of 3.

  • And when I get to a cilk_spawn statement, what I do

  • is I'm going to create another node that corresponds

  • to the child that I spawned.

  • So this is this magenta node here in this blue box.

  • And then I also have a continue edge

  • going to a green node that represents the computation

  • after the cilk_spawn statement.

  • So this green node here corresponds to the green line

  • of code in the code snippet.

  • Now I can unfold this computation graph

  • one more step.

  • So we see that fib 3 is going to call fib of 2,

  • so I created another node here.

  • And the green node here, which corresponds

  • to this green line of code-- it's

  • also going to make a function call.

  • It's going to call fib of 2.

  • And that's also going to create a new node.

  • So in general, when I do a spawn,

  • I'm going to have two outgoing edges out of a magenta node.

  • And when I do a call, I'm going to have one outgoing edge out

  • of a green node.

  • So this green node, the outgoing edge

  • corresponds to a function call.

  • And for this magenta node, its first outgoing edge

  • corresponds to spawn, and then its second outgoing edge

  • goes to the continuation strand.

  • So I can unfold this one more time.

  • And here, I see that I'm creating some more spawns

  • and calls to fib.

  • And if I do this one more time, I've

  • actually reached the base case.

  • Because once n is equal to 1 or 0,

  • I'm not going to make any more recursive calls.

  • And by the way, the color of these boxes that I used here

  • correspond to whether I called that function

  • or whether I spawned it.

  • So a box with white background corresponds to a function

  • that I called, whereas a box with blue background

  • corresponds to a function that I spawned.

  • So now I've gotten to the base case,

  • I need to now execute this blue statement, which

  • sums up x and y and returns the result to the parent caller.

  • So here I have a blue node.

  • So this is going to take the results of the two

  • recursive calls, sum them together.

  • And I have another blue node here.

  • And then it's going to pass its value

  • to the parent that called it.

  • So I'm going to pass this up to its parent,

  • and then I'm going to pass this one up as well.

  • And finally, I have a blue node at the top level, which

  • is going to compute my final result,

  • and that's going to be the output of the program.

  • So one thing to note is that this computation dag

  • unfolds dynamically during the execution.

  • So the runtime system isn't going

  • to create this graph at the beginning.

  • It's actually going to create this on the fly

  • as you run the program.

  • So this graph here unfolds dynamically.

  • And also, this graph here is processor-oblivious.

  • So nowhere in this computation dag

  • did I mention the number of processors

  • I had for the computation.

  • And similarly, in the code here, I never

  • mentioned the number of processors that I'm using.

  • So the runtime system is going to figure out

  • how to map these tasks to the number of processors

  • that you give to the computation dynamically at runtime.

  • So for example, I can run this on any number of processors.

  • If I run it on one processor, it's

  • just going to execute these tasks one after another.

  • In fact, it's going to execute them

  • in a depth-first order, which corresponds to what

  • the sequential algorithm would do.

  • So I'm going to start with fib of 4, go to fib of 3, fib of 2,

  • fib of 1, and go pop back up and then do fib of 0

  • and go back up and so on.

  • So if I use one processor, it's going

  • to create and execute this computation

  • dag in the depth-first manner.

  • And if I have more than one processor,

  • it's not necessarily going to follow a depth-first order,

  • because I could have multiple computations going on.

  • Any questions on this example?

  • I'm actually going to formally define some terms

  • on the next slide so that we can formalize the notion

  • of a computation dag.

  • So dag stands for directed acyclic graph,

  • and this is a directed acyclic graph.

  • So we call it a computation dag.

  • So a parallel instruction stream is

  • a dag G with vertices V and edges E.

  • And each vertex in this dag corresponds to a strand.

  • And a strand is a sequence of instructions

  • not containing a spawn, a sync, or a return from a spawn.

  • So the instructions inside a strand

  • are executed sequentially.

  • There's no parallelism within a strand.

  • We call the first strand the initial strand,

  • so this is the magenta node up here.

  • The last strand-- we call it the final strand.

  • And then everything else, we just call it a strand.

  • And then there are four types of edges.

  • So there are spawn edges, call edges, return edges,

  • or continue edges.

  • And a spawn edge corresponds to an edge to a function

  • that you spawned.

  • So these spawn edges are going to go to a magenta node.

  • A call edge corresponds to an edge that goes to a function

  • that you called.

  • So in this example, these are coming out of the green nodes

  • and going to a magenta node.

  • A return edge corresponds to an edge going back up

  • to the parent caller.

  • So here, it's going into one of these blue nodes.

  • And then finally, a continue edge is just the other edge

  • when you spawn a function.

  • So this is the edge that goes to the green node.

  • It's representing the computation

  • after you spawn something.

  • And notice that in this computation dag,

  • we never explicitly represented cilk_for,

  • because as I said before, cilk_fors

  • are converted to nested cilk_spawns

  • and cilk_sync statements.

  • So we don't actually need to explicitly represent cilk_fors

  • in the computation DAG.

  • Any questions on this definition?

  • So we're going to be using this computation

  • dag throughout this lecture to analyze how much parallelism

  • there is in a program.

  • So assuming that each of these strands executes in unit time--

  • this assumption isn't always true in practice.

  • In practice, strands will take different amounts of time.

  • But let's assume, for simplicity,

  • that each strand here takes unit time.

  • Does anyone want to guess what the parallelism

  • of this computation is?

  • So how parallel do you think this is?

  • What's the maximum speedup you might get on this computation?

  • AUDIENCE: 5.

  • JULIAN SHUN: 5.

  • Somebody said 5.

  • Any other guesses?

  • Who thinks this is going to be less than five?

  • A couple people.

  • Who thinks it's going to be more than five?

  • A couple of people.

  • Who thinks there's any parallelism

  • at all in this computation?

  • Yeah, seems like a lot of people think there is some parallelism

  • here.

  • So we're actually going to analyze how much parallelism

  • is in this computation.

  • So I'm not going to tell you the answer now,

  • but I'll tell you in a couple of slides.

  • First need to go over some terminology.

  • So whenever you start talking about parallelism,

  • somebody is almost always going to bring up Amdahl's Law.

  • And Amdahl's Law says that if 50% of your application

  • is parallel and the other 50% is serial,

  • then you can't get more than a factor of 2 speedup,

  • no matter how many processors you run the computation on.

  • Does anyone know why this is the case?

  • Yes?

  • AUDIENCE: Because you need it to execute for at least 50%

  • of the time in order to get through the serial portion.

  • JULIAN SHUN: Right.

  • So you have to spend at least 50%

  • of the time in the serial portion.

  • So in the best case, if I gave you

  • an infinite number of processors,

  • and you can reduce the parallel portion of your code

  • to 0 running time, you still have the 50% of the serial time

  • that you have to execute.

  • And therefore, the best speedup you can get is a factor of 2.

  • And in general, if a fraction alpha of an application

  • must be run serially, then the speedup can be at most 1

  • over alpha.

  • So if 1/3 of your program has to be executed sequentially,

  • then the speedup can be, at most, 3.

  • Because even if you reduce the parallel portion of your code

  • to a running time of 0, you still

  • have the sequential part of your code that you have to wait for.
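
In symbols, with serial fraction alpha and P processors, the usual statement of Amdahl's Law is:

    \text{speedup} \;=\; \frac{1}{\alpha + (1 - \alpha)/P} \;\le\; \frac{1}{\alpha},
    \qquad \alpha = \tfrac{1}{2} \Rightarrow \text{speedup} \le 2,
    \qquad \alpha = \tfrac{1}{3} \Rightarrow \text{speedup} \le 3.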

  • So let's try to quantify the parallelism in this computation

  • here.

  • So how many of these nodes have to be executed sequentially?

  • Yes?

  • AUDIENCE: 9 of them.

  • JULIAN SHUN: So it turns out to be less than 9.

  • Yes?

  • AUDIENCE: 7.

  • JULIAN SHUN: 7.

  • It turns out to be less than 7.

  • Yes?

  • AUDIENCE: 6.

  • JULIAN SHUN: So it turns out to be less than 6.

  • AUDIENCE: 4.

  • JULIAN SHUN: Turns out to be less than 4.

  • You're getting close.

  • AUDIENCE: 2.

  • JULIAN SHUN: 2.

  • So turns out to be more than 2.

  • AUDIENCE: 2.5.

  • JULIAN SHUN: What's left?

  • AUDIENCE: 3.

  • JULIAN SHUN: 3.

  • OK.

  • So 3 of these nodes have to be executed sequentially.

  • Because when you're executing these nodes,

  • there's nothing else that can happen in parallel.

  • For all of the remaining nodes, when you're executing them,

  • you can potentially be executing some

  • of the other nodes in parallel.

  • But for these three nodes that I've colored in yellow,

  • you have to execute those sequentially,

  • because there's nothing else that's going on in parallel.

  • So according to Amdahl's Law, this

  • says that the serial fraction of the program is 3 over 18.

  • So there's 18 nodes in this graph here.

  • So therefore, the serial fraction is 1 over 6,

  • and the speedup is upper bound by 1 over that, which is 6.

  • So Amdahl's Law tells us that the maximum speedup we can get

  • is 6.

  • Any questions on how I got this number here?

  • So it turns out that Amdahl's Law actually gives us

  • a pretty loose upper bound on the parallelism,

  • and it's not that useful in many practical cases.

  • So we're actually going to look at a better

  • definition of parallelism that will give us

  • a better upper bound on the maximum speedup we can get.

  • So we're going to define T sub P to be the execution time

  • of the program on P processors.

  • And T sub 1 is just the work.

  • So T sub 1 is if you executed this program on one processor,

  • how much stuff do you have to do?

  • And we define that to be the work.

  • Recall in lecture 2, we looked at many ways

  • to optimize the work.

  • This is the work term.

  • So in this example, the number of nodes

  • here is 18, so the work is just going to be 18.

  • We also define T of infinity to be the span.

  • The span is also called the critical path

  • length, or the computational depth, of the graph.

  • And this is equal to the longest directed path

  • you can find in this graph.

  • So in this example, the longest path is 9.

  • So one of the students answered 9 earlier,

  • and this is actually the span of this graph.

  • So there are 9 nodes along this path here,

  • and that's the longest one you can find.

  • And we call this T of infinity because that's actually

  • the execution time of this program

  • if you had an infinite number of processors.

  • So there are two laws that are going

  • to relate these quantities.

  • So the work law says that T sub P

  • is greater than or equal to T sub 1 divided by P.

  • So this says that the execution time on P processors

  • has to be greater than or equal to the work

  • of the program divided by the number of processors you have.

  • Does anyone see why the work law is true?

  • So the answer is that if you have P processors, on each time

  • step, you can do, at most, P work.

  • So if you multiply both sides by P,

  • you get P times T sub P is greater than or equal to T1.

  • If P times T sub P was less than T1, then

  • that means you're not done with the computation,

  • because you haven't done all the work yet.

  • So the work law says that T sub P

  • has to be greater than or equal to T1 over P.

  • Any questions on the work law?

  • So let's look at another law.

  • This is called the span law.

  • It says that T sub P has to be greater than or equal to T sub

  • infinity.

  • So the execution time on P processors

  • has to be at least execution time on an infinite number

  • of processors.

  • Anyone know why the span law has to be true?

  • So another way to see this is that if you

  • had an infinite number of processors,

  • you can actually simulate a P processor system.

  • You just use P of the processors and leave all

  • the remaining processors idle.

  • And that can't slow down your program.

  • So therefore, you have that T sub P

  • has to be greater than or equal to T sub infinity.

  • If you add more processors to it,

  • the running time can't go up.

  • Any questions?

  • So let's see how we can compose the work

  • and the span quantities of different computations.

  • So let's say I have two computations, A and B.

  • And let's say that A has to execute before B.

  • So everything in A has to be done

  • before I start the computation in B. Let's say

  • I know what the work of A and the work of B individually are.

  • What would be the work of A union B?

  • Yes?

  • AUDIENCE: I guess it would be T1 A plus T1 B.

  • JULIAN SHUN: Yeah.

  • So why is that?

  • AUDIENCE: Well, you have to execute sequentially.

  • So then you just take the time and [INAUDIBLE] execute A,

  • then it'll execute B after that.

  • JULIAN SHUN: Yeah.

  • So the work is just going to be the sum of the work of A

  • and the work of B. Because you have to do all of the work of A

  • and then do all of the work of B,

  • so you just add them together.

  • What about the span?

  • So let's say I know the span of A

  • and I know the span of B. What's the span of A union B?

  • So again, it's just a sum of the span of A and the span of B.

  • This is because I have to execute everything

  • in A before I start B. So I just sum together the spans.

  • So this is series composition.

  • What if I do parallel composition?

  • So let's say here, I'm executing the two

  • computations in parallel.

  • What's the work of A union B?

  • So it's not going to be the maximum.

  • Yes?

  • AUDIENCE: It should still be T1 of A plus T1 of B.

  • JULIAN SHUN: Yeah, so it's still going

  • to be the sum of T1 of A and T1 of B.

  • Because you still have the same amount of work

  • that you have to do.

  • It's just that you're doing it in parallel.

  • But the work is just the time if you had one processor.

  • So if you had one processor, you wouldn't

  • be executing these in parallel.

  • What about the span?

  • So if I know the span of A and the span of B,

  • what's the span of the parallel composition of the two?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so the span of A union B

  • is going to be the max of the span of A and the span of B,

  • because I'm going to be bottlenecked

  • by the slower of the two computations.

  • So I just take the one that has longer span,

  • and that gives me the overall span.
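
Summarizing the composition rules just described in formulas:

    \text{Series } (A \text{ then } B):\quad
      T_1(A \cup B) = T_1(A) + T_1(B), \qquad
      T_\infty(A \cup B) = T_\infty(A) + T_\infty(B).

    \text{Parallel } (A \parallel B):\quad
      T_1(A \cup B) = T_1(A) + T_1(B), \qquad
      T_\infty(A \cup B) = \max\{T_\infty(A),\, T_\infty(B)\}.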

  • Any questions?

  • So here's another definition.

  • So T1 divided by TP is the speedup on P processors.

  • If I have T1 divided by TP less than P, then

  • this means that I have sub-linear speedup.

  • I'm not making use of all the processors.

  • Because I'm using P processors, but I'm not

  • getting a speedup of P.

  • If T1 over TP is equal to P, then I'm

  • getting perfect linear speedup.

  • I'm making use of all of my processors.

  • I'm putting P times as many resources into my computation,

  • and it becomes P times faster.

  • So this is the good case.

  • And finally, if T1 over TP is greater than P,

  • we have something called superlinear speedup.

  • In our simple performance model, this

  • can't actually happen, because of the work law.

  • The work law says that TP has to be at least T1 divided by P.

  • So if you rearrange the terms, you'll

  • see that we get a contradiction in our model.

  • In practice, you might sometimes see that you have a superlinear

  • speedup, because when you're using more processors,

  • you might have access to more cache,

  • and that could improve the performance of your program.

  • But in general, you might see a little bit of superlinear

  • speedup, but not that much.

  • And in our simplified model, we're

  • just going to assume that you can't have a superlinear

  • speedup.

  • And getting perfect linear speedup is already very good.

  • So because the span law says that TP has to be at least T

  • infinity, the maximum possible speedup

  • is just going to be T1 divided by T infinity,

  • and that's the parallelism of your computation.

  • This is a maximum possible speedup you can get.

  • Another way to view this is that it's

  • equal to the average amount of work

  • that you have to do per step along the span.

  • So for every step along the span,

  • you're doing this much work.

  • And after all the steps, then you've done all of the work.

  • So what's the parallelism of this computation dag here?

  • AUDIENCE: 2.

  • JULIAN SHUN: 2.

  • Why is it 2?

  • AUDIENCE: T1 is 18 and T infinity is 9.

  • JULIAN SHUN: Yeah.

  • So T1 is 18.

  • There are 18 nodes in this graph.

  • T infinity is 9.

  • And the last time I checked, 18 divided by 9 is 2.

  • So the parallelism here is 2.

  • So now we can go back to our Fibonacci example,

  • and we can also analyze the work and the span of this

  • and compute the maximum parallelism.

  • So again, for simplicity, let's assume

  • that each of these strands takes unit time to execute.

  • Again, in practice, that's not necessarily true.

  • But for simplicity, let's just assume that.

  • So what's the work of this computation?

  • AUDIENCE: 17.

  • JULIAN SHUN: 17.

  • Right.

  • So the work is just the number of nodes

  • you have in this graph.

  • And you can just count that up, and you get 17.

  • What about the span?

  • Somebody said 8.

  • Yeah, so the span is 8.

  • And here's the longest path.

  • So this is the path that has 8 nodes in it,

  • and that's the longest one you can find here.

  • So therefore, the parallelism is just 17

  • divided by 8, which is 2.125.

  • And so for all of you who guessed that the parallelism

  • was 2, you were very close.

  • This tells us that using many more than two processors

  • can only yield us marginal performance gains.

  • Because the maximum speedup we can get is 2.125.

  • So we throw eight processors at this computation,

  • we're not going to get a speedup beyond 2.125.

  • So to figure out how much parallelism

  • is in your computation, you need to analyze

  • the work of your computation and the span of your computation

  • and then take the ratio between the two quantities.

  • But for large computations, it's actually pretty tedious

  • to analyze this by hand.

  • You don't want to draw these things out

  • by hand for a very large computation.

  • And fortunately, Cilk has a tool called the Cilkscale

  • Scalability Analyzer.

  • So this is integrated into the Tapir/LLVM compiler

  • that you'll be using for this course.

  • And Cilkscale uses compiler instrumentation

  • to analyze a serial execution of a program,

  • and it's going to generate the work and the span quantities

  • and then use those quantities to derive

  • upper bounds on the parallel speedup of your program.

  • So you'll have a chance to play around

  • with Cilkscale in homework 4.

  • So let's try to analyze the parallelism of quicksort.

  • And here, we're using a parallel quicksort algorithm.

  • The function quicksort here takes two inputs.

  • These are two pointers.

  • Left points to the beginning of the array that we want to sort.

  • Right points to one element after the end of the array.

  • And what we do is we first check if left is equal to right.

  • If so, then we just return, because there are no elements

  • to sort.

  • Otherwise, we're going to call this partition function.

  • The partition function is going to pick a random pivot--

  • so this is a randomized quicksort algorithm--

  • and then it's going to move everything that's

  • less than the pivot to the left part of the array

  • and everything that's greater than

  • or equal to the pivot to the right part of the array.

  • It's also going to return us a pointer to the pivot.

  • And then now we can execute two recursive calls.

  • So we do quicksort on the left side and quicksort

  • on the right side.

  • And this can happen in parallel.

  • So we use the cilk_spawn here to spawn off one of these calls

  • to quicksort in parallel.

  • And therefore, the two recursive calls are parallel.

  • And then finally, we sync up before we

  • return from the function.
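
A self-contained sketch of the parallel quicksort being described; the sequential partition here (random pivot, Lomuto-style) is one possible implementation, filled in for completeness since the lecture only says it picks a random pivot:

    #include <stdint.h>
    #include <stdlib.h>
    #include <cilk/cilk.h>

    // Sequential partition: pick a random pivot, move smaller elements to the
    // left and larger-or-equal elements to the right, return the pivot's position.
    static int64_t *partition(int64_t *left, int64_t *right) {
        int64_t *pivot = left + rand() % (right - left);
        int64_t pv = *pivot;
        *pivot = *(right - 1);
        *(right - 1) = pv;                       // park the pivot at the end
        int64_t *store = left;
        for (int64_t *p = left; p < right - 1; ++p) {
            if (*p < pv) {
                int64_t tmp = *p; *p = *store; *store = tmp;
                ++store;
            }
        }
        *(right - 1) = *store;
        *store = pv;                             // pivot now in its final position
        return store;
    }

    void quicksort(int64_t *left, int64_t *right) {
        if (left == right) return;               // empty range: nothing to sort
        int64_t *p = partition(left, right);
        cilk_spawn quicksort(left, p);           // left part may run in parallel
        quicksort(p + 1, right);                 // right part runs in the parent
        cilk_sync;                               // join before returning
    }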

  • So let's say we wanted to sort 1 million numbers

  • with this quicksort algorithm.

  • And let's also assume that the partition function here

  • is written sequentially, so you have

  • to go through all of the elements, one by one.

  • Can anyone guess what the parallelism

  • is in this computation?

  • AUDIENCE: 1 million.

  • JULIAN SHUN: So the guess was 1 million.

  • Any other guesses?

  • AUDIENCE: 50,000.

  • JULIAN SHUN: 50,000.

  • Any other guesses?

  • Yes?

  • AUDIENCE: 2.

  • JULIAN SHUN: 2.

  • It's a good guess.

  • AUDIENCE: Log 2 of a million.

  • JULIAN SHUN: Log base 2 of a million.

  • Any other guesses?

  • So log base 2 of a million, 2, 50,000, and 1 million.

  • Anyone think it's more than 1 million?

  • No.

  • So no takers on more than 1 million.

  • So if you run this program using Cilkscale,

  • it will generate a plot that looks like this.

  • And there are several lines on this plot.

  • So let's talk about what each of these lines mean.

  • So this purple line here is the speedup

  • that you observe in your computation

  • when you're running it.

  • And you can get that by taking the single processor

  • running time and dividing it by the running

  • time on P processors.

  • So this is the observed speedup.

  • That's the purple line.

  • The blue line here is the line that you get from the span law.

  • So this is T1 over T infinity.

  • And here, this gives us a bound of about 6 for the parallelism.

  • The green line is the bound from the work law.

  • So this is just a linear line with a slope of 1.

  • It says that on P processors, you

  • can't get more than a factor of P speedup.

  • So therefore, the maximum speedup you can get

  • has to be below the green line and below the blue line.

  • So you're in this lower right quadrant of the plot.

  • There's also this orange line, which

  • is the speedup you would get if you used a greedy scheduler.

  • We'll talk more about the greedy scheduler

  • later on in this lecture.

  • So this is the plot that you would get.

  • And we see here that the maximum speedup is about 5.

  • So for those of you who guessed 2 and log base 2 of a million,

  • you were the closest.

  • You can also generate a plot that

  • just tells you the execution time versus the number

  • of processors.

  • And you can get this quite easily

  • just by doing a simple transformation

  • from the previous plot.

  • So Cilkscale is going to give you these useful plots that you

  • can use to figure out how much parallelism is in your program.

  • And let's see why the parallelism here is so low.

  • So I said that we were going to execute this partition

  • function sequentially, and it turns out

  • that that's actually the bottleneck to the parallelism.

  • So the expected work of quicksort is order n log n.

  • So some of you might have seen this

  • in your previous algorithms courses.

  • If you haven't seen this yet, then you

  • can take a look at your favorite textbook, Introduction

  • to Algorithms.

  • It turns out that the parallel version of quicksort

  • also has an expected work bound of order n log n,

  • if you pick a random pivot.

  • So the analysis is similar.

  • The expected span bound turns out to be at least n.

  • And this is because on the first level of recursion,

  • we have to call this partition function, which

  • is going to go through the elements one by one.

  • So that already has a linear span.

  • And it turns out that the overall span is also order n,

  • because the span actually works out

  • to be a geometrically decreasing sequence and sums to order n.

  • And therefore, the maximum parallelism you can get

  • is order log n.

  • So you just take the work divided by the span.

  • So for the student who guessed that the parallelism is log

  • base 2 of n, that's very good.

  • Turns out that it's not exactly log base

  • 2 of n, because there are constants in these work

  • and span bounds, so it's on the order of log of n.

  • That's the parallelism.

  • And it turns out that order log n parallelism is not very high.

  • In general, you want the parallelism to be much higher,

  • something polynomial in n.

  • And in order to get more parallelism

  • in this algorithm, what you have to do

  • is you have to parallelize this partition

  • function, because right now I'm

  • just executing this sequentially.

  • But you can actually indeed write a parallel partition

  • function that takes linear work and order log n span.

  • And then this would give you an overall span bound of log

  • squared n.

  • And then if you take n log n divided by log squared n,

  • that gives you an overall parallelism of n

  • over log n, which is much higher than order log n here.
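
To summarize the analysis (constants omitted):

    \text{sequential partition:}\quad
      T_1 = \Theta(n \log n),\;\; T_\infty = \Theta(n)
      \;\Rightarrow\; \text{parallelism} = \Theta(\log n).

    \text{parallel partition } (\Theta(\log n)\text{ span per level}):\quad
      T_\infty = \Theta(\log^2 n)
      \;\Rightarrow\; \text{parallelism} = \Theta\!\left(\frac{n \log n}{\log^2 n}\right) = \Theta\!\left(\frac{n}{\log n}\right).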

  • And similarly, if you were to implement a merge sort,

  • you would also need to make sure that the merging routine is

  • implemented in parallel, if you want

  • to see significant speedup.

  • So not only do you have to execute the two recursive calls

  • in parallel, you also need to make sure

  • that the merging portion of the code is done in parallel.

  • Any questions on this example?

  • AUDIENCE: In the graph that you had, sometimes

  • when you got to higher processor numbers, it got jagged,

  • and so sometimes adding a processor was making it slower.

  • What are some reasons [INAUDIBLE]??

  • JULIAN SHUN: Yeah so I believe that's just due to noise,

  • because there's some noise going on in the machine.

  • So if you ran it enough times and took

  • the average or the median, it should be always going up,

  • or it shouldn't be decreasing, at least.

  • Yes?

  • AUDIENCE: So [INAUDIBLE] is also [INAUDIBLE]??

  • JULIAN SHUN: So at one level of recursion,

  • the partition function takes order log n span.

  • You can show that there are log n levels of recursion

  • in this quicksort algorithm.

  • I didn't go over the details of this analysis,

  • but you can show that.

  • And then therefore, the overall span

  • is going to be order log squared.

  • And I can show you on the board after class,

  • if you're interested, or I can give you a reference.

  • Other questions?

  • So it turns out that in addition to quicksort,

  • there are also many other interesting practical parallel

  • algorithms out there.

  • So here, I've listed a few of them.

  • And by practical, I mean that the Cilk program running

  • on one processor is competitive with the best

  • sequential program for that problem.

  • And so you can see that I've listed the work and the span

  • of merge sort here.

  • And if you implement the merge in parallel,

  • the span of the overall computation

  • would be log cubed n.

  • And n log n divided by log cubed n is n over log squared n.

  • That's the parallelism, which is pretty high.

  • And in general, all of these computations

  • have pretty high parallelism.

  • Another thing to note is that these algorithms are practical,

  • because their work bound is asymptotically

  • equal to the work of the corresponding sequential

  • algorithm.

  • That's known as a work-efficient parallel algorithm.

  • It's actually one of the goals of parallel algorithm design,

  • to come up with work-efficient parallel algorithms.

  • Because this means that even if you

  • have a small number of processors,

  • you can still be competitive with a sequential algorithm

  • running on one processor.

  • And in the next lecture, we actually

  • see some examples of these other algorithms,

  • and possibly even ones not listed on this slide,

  • and we'll go over the work and span analysis

  • and figure out the parallelism.

  • So now I want to move on to talk about some scheduling theory.

  • So I talked about these computation dags earlier,

  • analyzed the work and the span of them,

  • but I never talked about how these different strands are

  • actually mapped to processors at running time.

  • So let's talk a little bit about scheduling theory.

  • And it turns out that scheduling theory is actually

  • very general.

  • It's not just limited to parallel programming.

  • It's used all over the place in computer science, operations

  • research, and math.

  • So as a reminder, Cilk allows the program

  • to express potential parallelism in an application.

  • And a Cilk scheduler is going to map these strands

  • onto the processors that you have available dynamically

  • at runtime.

  • Cilk actually uses a distributed scheduler.

  • But since the theory of distributed schedulers

  • is a little bit complicated, we'll

  • actually explore the ideas of scheduling first

  • using a centralized scheduler.

  • And a centralized scheduler knows everything

  • about what's going on in the computation,

  • and it can use that to make a good decision.

  • So let's first look at what a centralized scheduler does,

  • and then I'll talk a little bit about the Cilk

  • distributed scheduler.

  • And we'll learn more about that in a future lecture as well.

  • So we're going to look at a greedy scheduler.

  • And an idea of a greedy scheduler

  • is to just do as much as possible

  • in every step of the computation.

  • So has anyone seen greedy algorithms before?

  • Right.

  • So many of you have seen greedy algorithms before.

  • So the idea is similar here.

  • We're just going to do as much as possible

  • at the current time step.

  • We're not going to think too much about the future.

  • So we're going to define a ready strand

  • to be a strand where all of its predecessors in the computation

  • dag have already executed.

  • So in this example here, let's say

  • I already executed all of these blue strands.

  • Then the ones shaded in yellow are

  • going to be my ready strands, because they

  • have all of their predecessors executed already.

  • And there are two types of steps in a greedy scheduler.

  • The first kind of step is called a complete step.

  • And in a complete step, we have at least P strands ready.

  • So if we had P equal to 3, then we have a complete step now,

  • because we have 5 strands ready, which is greater than 3.

  • So what are we going to do in a complete step?

  • What would a greedy scheduler do?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so a greedy scheduler would just

  • do as much as it can.

  • So it would just run any 3 of these, or any P in general.

  • So let's say I picked these 3 to run.

  • So it turns out that these are actually the worst 3 to run,

  • because they don't enable any new strands to be ready.

  • But I can pick those 3.

  • And then the incomplete step is one

  • where I have fewer than P strands ready.

  • So here, I have 2 strands ready, and I have 3 processors.

  • So what would I do in an incomplete step?

  • AUDIENCE: Just run through the strands that are ready.

  • JULIAN SHUN: Yeah, so just run all of them.

  • So here, I'm going to execute these two strands.

  • And then we're going to use complete steps

  • and incomplete steps to analyze the performance

  • of the greedy scheduler.

  • There's a famous theorem which was first

  • shown by Ron Graham in 1968 that says

  • that any greedy scheduler achieves

  • the following time bound--

  • T sub P is less than or equal to T1 over P plus T infinity.

  • And you might recognize the terms on the right hand side--

  • T1 is the work, and T infinity is the span

  • that we saw earlier.

  • And here's a simple proof for why this time bound holds.

  • So we can upper bound the number of complete steps

  • in the computation by T1 over P. And this

  • is because each complete step is going to perform P work.

  • So after T1 over P complete steps,

  • we'll have done all the work in our computation.

  • So that means that the number of complete steps

  • can be at most T1 over P.

  • So any questions on this?

  • So now, let's look at the number of incomplete steps

  • we can have.

  • So the number of incomplete steps we can have

  • is upper bounded by the span, or T infinity.

  • And the reason why is that if you look at the unexecuted dag

  • right before you execute an incomplete step,

  • and you measure the span of that unexecuted dag,

  • you'll see that once you execute an incomplete step,

  • it's going to reduce the span of that dag by 1.

  • So here, this is the span of our unexecuted dag

  • that contains just these seven nodes.

  • The span of this is 5.

  • And when we execute an incomplete step,

  • we're going to process all the roots of this unexecuted dag,

  • delete them from the dag, and therefore, we're

  • going to reduce the length of the longest path by 1.

  • So when we execute an incomplete step,

  • it decreases the span from 5 to 4.

  • And then the time bound up here, T sub P,

  • is just upper bounded by the sum of these two types of steps.

  • Because after you execute T1 over P complete steps

  • and T infinity incomplete steps, you

  • must have finished the entire computation.

  • So any questions?

  • A corollary of this theorem is that any greedy scheduler

  • achieves within a factor of 2 of the optimal running time.

  • So this is the optimal running time of a scheduler

  • that knows everything and can predict the future and so on.

  • So let's let TP star be the execution time produced

  • by an optimal scheduler.

  • We know that TP star has to be at least the max of T1

  • over P and T infinity.

  • This is due to the work and span laws.

  • So it has to be at least a max of these two terms.

  • Otherwise, we wouldn't have finished the computation.

  • So now we can take the inequality

  • we had before for the greedy scheduler bound--

  • so TP is less than or equal to T1 over P plus T infinity.

  • And this is upper bounded by 2 times the max of these two

  • terms.

  • So A plus B is upper bounded by 2 times the max of A and B.

  • And then now, the max of T1 over P and T

  • infinity is just upper bounded by TP star.

  • So we can substitute that in, and we

  • get that TP is upper bounded by 2 times

  • TP star, which is the running time of the optimal scheduler.

  • So the greedy scheduler achieves within a factor

  • of 2 of the optimal scheduler.
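
As a quick numeric check, using the earlier example dag with T1 = 18 and T infinity = 9, and, say, P = 3 processors:

    T_P \;\le\; \frac{T_1}{P} + T_\infty \;=\; \frac{18}{3} + 9 \;=\; 15,
    \qquad
    T_P^{*} \;\ge\; \max\!\left(\frac{T_1}{P},\, T_\infty\right) \;=\; \max(6, 9) \;=\; 9,

so the greedy bound of 15 is indeed within a factor of 2 of the optimal lower bound of 9.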

  • Here's another corollary.

  • This is a more interesting corollary.

  • It says that any greedy scheduler achieves

  • near-perfect linear speedup whenever T1 divided by T

  • infinity is greater than or equal to P.

  • To see why this is true--

  • if we have that T1 over T infinity

  • is much greater than P--

  • so the double arrows here mean that the left hand

  • side is much greater than the right hand side--

  • then this means that the span is much less than T1 over P.

  • And the greedy scheduling theorem gives us

  • that TP is less than or equal to T1 over P plus T infinity,

  • but T infinity is much less than T1 over P,

  • so the first term dominates, and we have

  • that TP is approximately equal to T1 over P.

  • And therefore, the speedup you get is T1 over P, which is P.

  • And this is linear speedup.

  • The quantity T1 divided by the product of P and T

  • infinity is known as the parallel slackness.

  • So this is basically measuring how much more

  • parallelism you have in a computation than the number

  • of processors you have.

  • And if parallel slackness is very high,

  • then this corollary is going to hold,

  • and you're going to see near-linear speedup.

  • As a rule of thumb, you usually want the parallel slackness

  • of your program to be at least 10.

  • Because if you have a parallel slackness of just 1,

  • you can't actually amortize the overheads of the scheduling

  • mechanism.

  • So therefore, you want the parallel slackness

  • to be at least 10 when you're programming in Cilk.
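
For instance, with the work and span numbers from the Fibonacci example earlier (T1 = 17, T infinity = 8) on P = 2 processors, the slackness is only

    \frac{T_1}{P \, T_\infty} \;=\; \frac{17}{2 \cdot 8} \;\approx\; 1.06,

far below the rule-of-thumb slackness of 10, so near-linear speedup should not be expected there.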

  • So that was the greedy scheduler.

  • Let's talk a little bit about the Cilk scheduler.

  • So Cilk uses a work-stealing scheduler,

  • and it achieves an expected running time

  • of TP equal to T1 over P plus order T infinity.

  • So instead of just summing the two terms,

  • we actually have a big O in front of the T infinity,

  • and this is used to account for the overheads of scheduling.

  • The greedy scheduler I presented earlier--

  • I didn't account for any of the overheads of scheduling.

  • I just assumed that it could figure out which of the tasks

  • to execute.

  • So this Cilk work-stealing scheduler

  • has this expected time provably, so you

  • can prove this using random variables and tail

  • bounds of distribution.

  • So Charles Leiserson has a paper that

  • talks about how to prove this.

  • And empirically, we usually see that TP is more like T1

  • over P plus T infinity.

  • So we usually don't see any big constant in front of the T

  • infinity term in practice.

  • And therefore, we can get near-perfect linear speedup,

  • as long as the number of processors is much less than T1

  • over T infinity, the maximum parallelism.

  • And as I said earlier, the instrumentation in Cilkscale

  • will allow you to measure the work and span

  • terms so that you can figure out how much parallelism

  • is in your program.

  • Any questions?

  • So let's talk a little bit about how the Cilk runtime

  • system works.

  • So in the Cilk runtime system, each worker or processor

  • maintains a work deque.

  • Deque is just short for double-ended queue.

  • It maintains a work deque of ready strands,

  • and it manipulates the bottom of the deque,

  • just as you would the call stack of a sequential program.

  • So here, I have four processors, and each one of them

  • has its own deque, and these deques hold frames on the stack--

  • function calls and spawns with their saved return addresses,

  • local variables, and so on.

  • So a processor can call a function,

  • and when it calls a function, it just

  • places that function's frame at the bottom of its stack.

  • You can also spawn things, so then it places a spawn frame

  • at the bottom of its stack.

  • And then these things can happen in parallel,

  • so multiple processors can be spawning and calling

  • things in parallel.

  • And you can also return from a spawn or a call.

  • So here, I'm going to return from a call.

  • Then I return from a spawn.

  • And at this point, I don't actually

  • have anything left to do for the second processor.

  • So what do I do now, when I'm left with nothing to do?

  • Yes?

  • AUDIENCE: Take a [INAUDIBLE].

  • JULIAN SHUN: Yeah, so the idea here

  • is to steal some work from another processor.

  • So when a worker runs out of work to do,

  • it's going to steal from the top of a random victim's deque.

  • So it's going to pick one of these processors at random.

  • It's going to roll some dice to determine who to steal from.

  • And let's say that it picked the third processor.

  • Now it's going to take all of the stuff

  • at the top of the deque up until the next spawn

  • and place it into its own deque.

  • And then now it has stuff to do again.

  • So now it can continue executing this code.

  • It can spawn stuff, call stuff, and so on.

  • So the idea is that whenever a worker runs out of work to do,

  • it's going to start stealing some work

  • from other processors.

  • But if it always has enough work to do, then it's happy,

  • and it doesn't need to steal things from other processors.
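
To make the deque discipline concrete, here is a deliberately simplified sketch of my own (not the actual Cilk runtime, which uses a lock-free owner/thief protocol and steals a whole strand rather than a single frame): the owner pushes and pops at the bottom of its deque, and an idle worker steals from the top of a random victim's deque.

```c
#include <pthread.h>
#include <stdlib.h>

// A "frame" stands in for a ready strand of work.
typedef struct { void (*run)(void *); void *arg; } frame_t;

// One work deque per worker: the owner works at the bottom,
// thieves steal from the top.  (Illustration only.)
typedef struct {
    frame_t *items[1024];
    int top;                 // index of the oldest (stealable) frame
    int bottom;              // index one past the newest frame
    pthread_mutex_t lock;    // initialize with pthread_mutex_init
} deque_t;

// Owner pushes a spawned or called frame at the bottom of its deque.
static void push_bottom(deque_t *d, frame_t *f) {
    pthread_mutex_lock(&d->lock);
    d->items[d->bottom++] = f;
    pthread_mutex_unlock(&d->lock);
}

// Owner pops from the bottom, just like a sequential call stack.
static frame_t *pop_bottom(deque_t *d) {
    frame_t *f = NULL;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) f = d->items[--d->bottom];
    pthread_mutex_unlock(&d->lock);
    return f;
}

// A thief removes work from the *top* of the victim's deque.
static frame_t *steal_top(deque_t *victim) {
    frame_t *f = NULL;
    pthread_mutex_lock(&victim->lock);
    if (victim->bottom > victim->top) f = victim->items[victim->top++];
    pthread_mutex_unlock(&victim->lock);
    return f;
}

// Worker loop: run local work; when the deque is empty, pick a
// random victim and try to steal.  (Termination detection omitted.)
static void worker_loop(deque_t *deques, int self, int nworkers) {
    for (;;) {
        frame_t *f = pop_bottom(&deques[self]);
        if (!f) f = steal_top(&deques[rand() % nworkers]);
        if (f) f->run(f->arg);
    }
}
```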

  • And this is why MIT gives us so much work to do,

  • so we don't have to steal work from other people.

  • So a famous theorem says that with sufficient parallelism,

  • workers steal very infrequently, and this gives us

  • near-linear speedup.

  • So with sufficient parallelism, the T1 over P term

  • in our running time bound is going to dominate the T infinity term,

  • and that gives us near-linear speedup.

  • Let me actually show you a pseudoproof of this theorem.

  • And I'm allowed to do a pseudoproof.

  • It's not actually a real proof, but a pseudoproof.

  • So I'm allowed to do this, because I'm not

  • the author of an algorithms textbook.

  • So here's a pseudoproof.

  • AUDIENCE: Yet.

  • JULIAN SHUN: Yet.

  • So a processor is either working or stealing at every time step.

  • And the total time that all processors spend working

  • is just T1, because that's the total work that you have to do.

  • And then when it's not doing work, it's stealing.

  • And each steal has a 1 over P chance

  • of reducing the span by 1, because one of the processors

  • is contributing to the longest path in the computation dag.

  • And there's a 1 over P chance that I'm

  • going to pick that processor and steal

  • some work from that processor and reduce

  • the span of my remaining computation by 1.

  • So in expectation, I need about P steal attempts before I

  • pick the processor that holds the critical path and reduce

  • the span by 1.

  • And since I'm going to do this T infinity times, the expected

  • cost of all steals is order P times T infinity.

  • And since there are P processors,

  • I'm going to divide the expected time by P,

  • so T1 plus O of P times T infinity divided by P,

  • and that's going to give me the bound--

  • T1 over P plus order T infinity.

  • So this pseudoproof here ignores issues with independence,

  • but it still gives you an intuition

  • of why we get this expected running time.
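
Written symbolically, the accounting in the pseudoproof is just the following (the expectation is over the random choices of victims):

```latex
\[
\mathrm{E}[T_P]
  \;\le\; \frac{\text{work} + \text{expected steal attempts}}{P}
  \;=\; \frac{T_1 + O(P \, T_\infty)}{P}
  \;=\; \frac{T_1}{P} + O(T_\infty).
\]
```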

  • If you want to actually see the full proof,

  • it's actually quite interesting.

  • It uses random variables and tail bounds of distributions.

  • And this is the paper that has this.

  • This is by Blumofe and Charles Leiserson.

  • So another thing I want to talk about

  • is that Cilk supports C's rules for pointers.

  • So a pointer to a stack space can be passed from a parent

  • to a child, but not from a child to a parent.

  • And this is the same as the stack rule for sequential C

  • programs.

  • So let's say I have this computation on the left here.

  • So A is going to spawn off B, and then it's

  • going to continue executing C. And then C

  • is going to spawn off D and execute E.

  • So we see on the right hand side the views of the stacks

  • for each of the tasks here.

  • So A sees its own stack.

  • B sees its own stack, but it also

  • sees A's stack, because A is its parent.

  • C will see its own stack, but again, it

  • sees A's stack, because A is its parent.

  • And then finally, D and E, they see the stack of C,

  • and they also see the stack of A.

  • So in general, a task can see the stack

  • of all of its ancestors in this computation graph.

  • And we call this a cactus stack, because it

  • sort of looks like a cactus, if you draw this upside down.

  • And Cilk's cactus stack supports multiple views

  • of the stacks in parallel, and this

  • is what makes the parallel calls to functions work in C.
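
Here is a minimal sketch of my own (not an example from the lecture) illustrating the pointer rule: the parent passes a pointer to its own stack space down to a spawned child, which is legal because the parent's frame outlives the child; the reverse direction is forbidden for the same reason it is in sequential C.

```c
#include <cilk/cilk.h>
#include <stdio.h>

// The child writes through a pointer into the parent's stack frame.
// This is legal: the parent's frame outlives the child, because the
// parent waits at cilk_sync before using the result.
static void child(int *out) {
    *out = 42;
}

int main(void) {
    int result = 0;                 // lives in the parent's stack frame
    cilk_spawn child(&result);      // parent -> child pointer: allowed
    cilk_sync;                      // wait for the child before reading
    printf("result = %d\n", result);

    // The reverse direction is not allowed: if the child returned a
    // pointer to one of its own locals, that frame could already be
    // popped (exactly as in sequential C), so the parent must never
    // keep a pointer into a child's stack space.
    return 0;
}
```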

  • We can also bound the stack space used by a Cilk program.

  • So let's let S sub 1 be the stack space required

  • by the serial execution of a Cilk program.

  • Then the stack space required by a P-processor execution

  • is going to be bounded by P times S1.

  • So SP is the stack space required

  • by a P-processor execution.

  • That's less than or equal to P times S1.

  • Here's a high-level proof of why this is true.

  • So it turns out that the work-stealing algorithm in Cilk

  • maintains what's called the busy leaves property.

  • And this says that each of the leaves that is still

  • active in the computation dag has a processor

  • working on it.

  • So in this example here, the vertices

  • shaded in blue and purple--

  • these are the ones that are in my remaining computation dag.

  • And all of the gray nodes have already been finished.

  • And here-- for each of the leaves

  • here, I have one processor on that leaf executing

  • the task associated with it.

  • So Cilk guarantees this busy leaves property.

  • And now, for each of these processors,

  • the amount of stack space it needs

  • is the stack space for its own task

  • plus the frames of everything above it in this computation dag.

  • And we can actually bound that by the stack space needed

  • by a single processor execution of the Cilk program, S1,

  • because S1 is just the maximum stack space we need,

  • which is basically the longest path in this graph.

  • And we do this for every processor.

  • So therefore, the upper bound on the stack space

  • required by P-processor execution is just P times S1.

  • And in general, this is a quite loose upper bound,

  • because you're not necessarily going

  • all the way down in this computation dag

  • every time.

  • Usually you'll be much higher in this computation dag.
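
To put a made-up number on the bound: if a serial run never needs more than S1 = 1 MB of stack, then a 48-worker execution needs at most

```latex
\[
S_P \;\le\; P \cdot S_1 \;=\; 48 \times 1\ \mathrm{MB} \;=\; 48\ \mathrm{MB}
\]
```

of stack in total, and in practice usually much less, for exactly the reason just given.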

  • So any questions?

  • Yes?

  • AUDIENCE: In practice, how much work is stolen?

  • JULIAN SHUN: In practice, if you have enough parallelism, then

  • you're not actually going to steal

  • that much in your algorithm.

  • So if you guarantee that there's a lot of parallelism,

  • then each processor is going to have a lot of its own work

  • to do, and it doesn't need to steal very frequently.

  • But if your parallelism is very low

  • compared to the number of processors--

  • if it's equal to the number of processors,

  • then you're going to spend a significant amount of time

  • stealing, and the overheads of the work-stealing algorithm

  • are going to show up in your running time.

  • AUDIENCE: So I meant in one steal--

  • like do you take half of the deque,

  • or do you take one element of the deque?

  • JULIAN SHUN: So the standard Cilk work-stealing scheduler

  • takes everything at the top of the deque up

  • until the next spawn.

  • So basically that's a strand.

  • So it takes that.

  • There are variants that take more than that,

  • but the Cilk work-stealing scheduler

  • that we'll be using in this class

  • just takes the top strand.

  • Any other questions?

  • So that's actually all I have for today.

  • If you have any additional questions,

  • you can come talk to us after class.

  • And remember to meet with your MITPOSSE mentors soon.
