

  • ANNOUNCER: The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: All right.

  • So today we're going to talk about cache

  • oblivious algorithms.

  • Who remembers what a cache oblivious algorithm is?

  • So what is a cache oblivious algorithm oblivious to?

  • Cache size.

  • So a cache oblivious algorithm is an algorithm

  • that automatically tunes for the cache size on your machine.

  • So it achieves good cache efficiency.

  • And the code doesn't need to have any knowledge of the cache

  • parameters of your machine.

  • In contrast, a cache-aware algorithm

  • would actually know the parameters of the cache sizes

  • on your machine.

  • And the code would actually put the size of the cache inside.

  • So today we're going to talk a lot more about cache oblivious

  • algorithms.

  • Last time we talked about one cache oblivious algorithm

  • that was for matrix multiplication.

  • And today we're going to talk about some other ones.

  • So first example I want to talk about

  • is simulation of heat diffusion.

  • So here's a famous equation known as the heat equation.

  • And this equation is in two dimensions.

  • And we want to simulate this function

  • u that has three parameters t, x, and y. t is the time step.

  • x and y are the x, y coordinates of the 2D space.

  • And we want to know the temperature for each x,

  • y coordinate for any point in time t.

  • And the 2D heat equation can be modeled

  • using a differential equation.

  • So how many of you have seen differential equations before?

  • OK.

  • So good, most of you.

  • So here I'm showing the equation in two dimensions.

  • But you can get similar equations

  • for higher dimensions.

  • So here it says the partial derivative of u with respect

  • to t is equal to alpha.

  • Alpha is what's called the thermal diffusivity.

  • It's a constant times the sum of the second partial derivative

  • of u with respect to x, and the second partial derivative of u

  • with respect to y.
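In symbols, the 2D heat equation being described is:

```latex
\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)
```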

  • So this is a pretty famous equation.

  • And you can see that we have a single derivative

  • on the left side and a double derivative on the right side.

  • And how do we actually write code

  • to simulate this 2D heat process?

  • So oftentimes in scientific computing,

  • people will come up with these differential equations just

  • to describe physical processes.

  • And then they want to come up with efficient code

  • to actually simulate the physical process.

  • OK.

  • So here's an example of a 2D heat diffusion.

  • So let's say we started out with a configuration on the left.

  • And here the color corresponds to a temperature.

  • So a brighter color means it's hotter.

  • Yellow is the hottest and blue is the coldest.

  • And on the left we just have 6.172,

  • which is the course number.

  • So if you didn't know that, you're

  • probably in the wrong class.

  • And then afterwards we're going to run it

  • for a couple time steps.

  • And then the heat is going to diffuse to the neighboring

  • regions of the 2D space.

  • So after you run it for a couple of steps,

  • you might get the configuration on the right

  • where the heat is more spread out now.

  • And oftentimes, you want to run this simulation

  • for a number of time steps until the distribution of heat

  • converges so it becomes stable and that

  • doesn't change by much anymore.

  • And then we stop the simulation.

  • So this is the 1D heat equation.

  • I showed you a 2D one earlier.

  • But we're actually going to generate code for the 1D heat

  • equation since it's simpler.

  • But all the ideas generalize to higher dimensions.

  • And here's the range of colors corresponding to temperature,

  • so the hottest colors on the left

  • and the coldest colors on the right.

  • And if you had a heat source that's

  • on the left hand side of this bar,

  • then this might possibly be a stable distribution.

  • So if you keep running the simulation,

  • you might get a stable distribution of heat

  • that looks like this.

  • OK.

  • So how do we actually write code to simulate this differential

  • equation?

  • So one commonly used method is known as finite difference

  • approximation.

  • So we're going to approximate the partial derivative of u

  • with respect to each of its coordinates.

  • So the partial derivative of u with respect to t

  • is approximately equal to u of t plus delta t

  • where delta t is some small value, and x, and then

  • minus u of tx.

  • And then that's all divided by delta t.

  • So how many of you have seen this approximation method

  • before from your calculus class?

  • OK, good.

  • So as you bring the value of delta t down to 0,

  • then the thing on the right hand side approaches

  • the true partial derivative.

  • So that's a partial derivative with respect to t.

  • We also need to get the partial derivative with respect to x.

  • And here I'm saying the partial derivative with respect

  • to x is approximately equal to ut of x plus delta

  • x over 2 minus ut of x minus delta x

  • over 2, all divided by delta x.
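In symbols, the two finite difference approximations just described are:

```latex
\frac{\partial u}{\partial t}(t, x) \approx \frac{u(t + \Delta t,\, x) - u(t,\, x)}{\Delta t},
\qquad
\frac{\partial u}{\partial x}(t, x) \approx \frac{u(t,\, x + \Delta x / 2) - u(t,\, x - \Delta x / 2)}{\Delta x}
```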

  • So notice here that instead of adding delta

  • x in the first term and not adding anything

  • in the second term, I'm actually adding delta x over 2

  • in the first term and subtracting delta x over 2

  • in the second term.

  • And it turns out that I can do this with the approximation

  • method.

  • And it still turns out to be valid as long as the two terms

  • that I'm putting in, their difference is delta x.

  • So here the difference is delta x.

  • And I can basically decide how to split up

  • this delta x term among the two things in the numerator.

  • And the reason why I chose delta x over 2 here

  • is because the math is just going to work out nicely.

  • And it's going to give us cleaner code.

  • This is just the first partial derivative with respect to x.

  • We actually need the second partial derivative

  • since the right hand side of this equation

  • has the second partial derivative.

  • So this is what the second partial derivative looks like.

  • So I just take the partial derivative with respect

  • to x of each of the terms in my numerator from the equation

  • above.

  • And then now I can actually plug in the value

  • of this partial derivative by applying the equation

  • above using the arguments t and x plus delta x

  • over 2, and similarly for the second term.

  • So for the first term when I plug it

  • into the equation for the partial derivative with respect

  • to x, I'm just going to get ut of x plus delta x minus utx.

  • And then for the second term, I'm

  • going to get ut of x minus delta x.

  • And then I subtract another factor of utx.

  • So that's why I'm subtracting 2 times utx in the numerator

  • here.

  • And then the partial derivative of each

  • of the things in the numerator also

  • has to be divided by this delta x term.

  • So on the denominator, I get delta x squared.

  • So now I have the second partial derivative with respect to x.

  • And I also have the first partial derivative with respect

  • to t.

  • So I can just plug them into my equation above.

  • So on the left hand side I just have this term here.

  • And I'm multiplying by this alpha constant.

  • And then this term just comes from here.

  • So this is what the 1D heat equation

  • reduces to using the finite difference approximation

  • method.
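In symbols, the second-derivative approximation and the discretized 1D heat equation it leads to are:

```latex
\frac{\partial^2 u}{\partial x^2}(t, x) \approx \frac{u(t,\, x + \Delta x) - 2\,u(t,\, x) + u(t,\, x - \Delta x)}{(\Delta x)^2},
\qquad
\frac{u(t + \Delta t,\, x) - u(t,\, x)}{\Delta t} = \alpha \cdot \frac{u(t,\, x + \Delta x) - 2\,u(t,\, x) + u(t,\, x - \Delta x)}{(\Delta x)^2}
```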

  • So any questions on this?

  • So how do we actually write code to simulate this equation here?

  • So we're going to use what's called a stencil computation.

  • And here I'm going to set delta x and delta t equal to 1,

  • just for simplicity.

  • But in general you can set them to whatever you want.

  • You can make them smaller to have

  • a more fine grained simulation.

  • So if I set delta x and delta t equal to 1,

  • then the denominators of these two terms just become 1

  • and I don't need to worry about them.

  • And then I'm going to represent my 2D

  • space using a 2D matrix where the horizontal axis represents

  • values of x, and the vertical axis represents values of t.

  • And I want to fill in all these entries

  • that have a black dot in it.

  • The ones with the orange dot, those are my boundaries.

  • So those are actually fixed throughout the computation.

  • So I'm not going to do any computation on those.

  • Those are just given to me as input.

  • So they could be heat sources if we're

  • doing the heat simulation.

  • And then now I can actually write code

  • to simulate this equation.

  • So if I want to compute u of t plus 1, x, I can just go up here.

  • And I see that that's equal to this thing over here.

  • And then I bring the negative utx term to the right.

  • So I get ut of x plus alpha times ut

  • x plus 1 minus 2 times utx plus ut x minus 1.
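In symbols, with delta t and delta x both set to 1, the update rule for the stencil is:

```latex
u(t + 1,\, x) = u(t,\, x) + \alpha \left( u(t,\, x + 1) - 2\,u(t,\, x) + u(t,\, x - 1) \right)
```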

  • As I said before, we just want to keep iterating this

  • until the temperature becomes stable.

  • So I'm going to proceed in time, and time is going up here.

  • And to compute one of these points--

  • so let's say this is ut plus 1x--

  • I need to know the value of utx, which is just

  • a thing below me in the matrix.

  • And I also need to know utx plus 1 and utx minus 1.

  • And those are just the things below me

  • and diagonal to either the left or the right side.

  • So each value here just depends on three other values.

  • And this is called a three point stencil.

  • This is the pattern that this equation is representing.

  • And in general, a stencil computation

  • is going to update each point in an array using a fixed pattern.

  • This is called a stencil.

  • So I'm going to do the same thing

  • for all of the other points.

  • And here, I'm going to compute all the values

  • of x for a given time step.

  • And then I move on to the next time step.

  • And then I keep doing this until the distribution

  • of temperatures becomes stable.

  • And then I'm done.

  • OK.

  • So these stencil computations are widely

  • used in scientific computing.

  • They're used for weather simulations, stock market

  • simulations, fluid dynamics, image processing, probability,

  • and so on.

  • So they're used all over the place in science.

  • So this is a very important concept to know.

  • So let's say I just ran the code as I showed you

  • in the animation.

  • So I completed one row at a time before I moved on

  • to the next row.

  • How would this code perform with respect to caching?

  • Yes?

  • AUDIENCE: I think if x is less than a third

  • of the cache size, [INAUDIBLE]

  • JULIAN SHUN: Yeah.

  • So if x is small, this would do pretty well.

  • But what if x is much larger than the size of your cache?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Yeah, you would do badly.

  • And why is that?

  • AUDIENCE: Because the whole [INAUDIBLE] second row,

  • [INAUDIBLE]

  • JULIAN SHUN: Yeah.

  • So it turns out that there could be some reuse here,

  • because when I compute the second row,

  • I'm actually using some values that

  • are computed in the first row.

  • But if the row is much larger than the cache size,

  • by the time I get to the second row,

  • the values that I loaded into cache from the first row

  • would have been evicted.

  • And therefore, I'm going to suffer a cache miss again

  • for the second row, for the values I need to load in,

  • even though they could have been used if we made our code have

  • good locality.

  • Another question I have is if we only

  • cared about the values of x at the most recent time step,

  • do we actually have to keep around this whole 2D matrix?

  • Or can we get by with less storage?

  • Yeah.

  • So how many rows would I have to keep around

  • if I only cared about the most recent time step?

  • Yeah?

  • AUDIENCE: Two.

  • JULIAN SHUN: Two.

  • And why is that?

  • AUDIENCE: So one for the previous time step.

  • One for the current time step.

  • [INAUDIBLE]

  • JULIAN SHUN: Right.

  • So I need to keep around two rows because I

  • need the row from the previous time step

  • in order to compute the values in the current time step.

  • And after the current time step, I

  • can just swap the roles of the two rows

  • that I'm keeping around, and then

  • reuse the previous row for the next one.

  • Yes?

  • AUDIENCE: Would you only need one and then

  • a constant amount of extra space,

  • like if you had three extra things,

  • you could probably do it with one.

  • JULIAN SHUN: So I need to know--

  • when I'm computing the second row,

  • I need to keep around all of these values

  • that I computed in the first row,

  • because these values get fed to one

  • of the computations in the second row.

  • So I need to actually keep all of them around.

  • AUDIENCE: I think if you iterate to the right,

  • then you have three that are this one and the next one.

  • Just three.

  • Then you can--

  • JULIAN SHUN: Oh, I see I see what you're saying.

  • Yeah.

  • So that's actually a good observation.

  • So you only need to keep a constant amount

  • more storage, because you'll just be overwriting the values

  • as you go through the row.

  • So if you keep around one row, some of the values

  • would be for the current time step, and some of them

  • would be from the previous time step.

  • So that's a good observation, yes.

  • OK.

  • So that code, as we saw, it wasn't very cache efficient.

  • You could make it cache efficient using tiling.

  • But we're going to go straight to the cache oblivious

  • algorithm because it's much cleaner.

  • So let's recall the ideal cache model.

  • We talked about this in the previous lecture.

  • So here we have a two level hierarchy.

  • We have a cache.

  • And then we have the main memory.

  • The cache has size of M bytes.

  • And a cache line is B bytes.

  • So you can keep around M over B cache lines in your cache.

  • And if something's in cache and you operate on it,

  • then it doesn't incur any cache misses.

  • But if you have to go to main memory

  • to load the cache line in, then you incur one cache miss.

  • The ideal cache model assumes that the cache

  • is fully associative.

  • So any cache line can go anywhere in the cache.

  • And it also assumes either an optimal omniscient replacement

  • policy or the LRU policy.

  • So the optimal omniscient replacement policy

  • knows the sequence of all future requests to memory.

  • And when it needs to evict something,

  • it's going to pick the thing that leads to the fewest cache

  • misses overall to evict.

  • The LRU policy just evicts the thing

  • that was least recently used.

  • But we saw from the previous lecture,

  • that in terms of asymptotic costs,

  • these two replacement policies will give you

  • cache misses within a constant factor of each other.

  • So you can use either one, depending on what's convenient.

  • And two performance measures that we

  • care about when we're analyzing an algorithm

  • in the ideal cache model are the work

  • and the number of cache misses.

  • So the work is just the total number of operations

  • that the algorithm incurs.

  • And serially, this is just the ordinary running time.

  • And the number of cache misses is the number of cache lines

  • you have to transfer between the cache and your main memory.

  • So let's assume that we're running an algorithm

  • or analyzing an algorithm in the ideal cache model,

  • and it runs serially.

  • What kinds of cache misses does the ideal cache model

  • not capture?

  • So remember, we talked about several types of cache misses.

  • And there's one type of cache miss

  • that this model doesn't capture when we're running serially.

  • So let's assume we're running this serially

  • without any parallelism here.

  • So sharing misses only come about when

  • you have parallelism.

  • Yes?

  • AUDIENCE: Conflict misses?

  • JULIAN SHUN: Yes.

  • So the answer is conflict misses.

  • And why is that?

  • Why does this model not capture it?

  • AUDIENCE: There's not a specific set

  • that could get replaced, since it's fully associative.

  • JULIAN SHUN: Yes.

  • So this is a fully associative cache.

  • So any cache line can go anywhere in the cache.

  • And you can only get conflict misses for set-associative

  • schemes where each cache line can only

  • be mapped to a particular set.

  • And if you have too many cache lines that

  • map to that particular set, then you're

  • going to keep evicting each other

  • even though the rest of the cache could have space.

  • And that's what's called a conflict miss.

  • The ideal cache model does capture capacity misses.

  • So therefore, it is still a very good model

  • to use at a high level when you're

  • designing efficient algorithms, because it

  • encourages you to optimize for spatial and temporal locality.

  • And once you have a good algorithm in the ideal cache

  • model then you can start dealing with conflict misses

  • using some of the strategies that we

  • talked about last time such as padding

  • or using temporary memory.

  • So any questions on this?

  • OK.

  • So this is the code that does the heat simulation

  • that we saw earlier.

  • So it's just two for loops, a nested for loop.

  • In the outer loop, we're looping over the time dimension.

  • In the inner loop, we're looping over the space dimension.

  • So we're computing all the values

  • of x before we move on to the next time step.

  • And then we're storing two rows here

  • and we're using this trick called the even-odd trick.

  • And here's how it works.

  • So to access the next row that we want to compute,

  • that we just do a t plus 1 mod 2.

  • And then to access the current row, it's just t mod 2.

  • So this is implicitly going to swap the roles of the two rows

  • that we're keeping around as we progress through time.

  • And then we're going to set u of t

  • plus 1 mod 2, x equal to kernel of u--

  • a pointer to u of t mod 2, x.

  • And this kernel function is defined up here.

  • And recall, that when we're actually

  • passing a pointer to this kernel function,

  • we can actually treat a pointer as the beginning of an array.

  • So we're using array notation up here

  • inside the kernel function.

  • So the array W is passed as input.

  • And then we need to return W of 0.

  • That's just the element at the current pointer

  • that we passed to kernel.

  • And then we add alpha times w of negative 1.

  • That's one element before the thing that we're pointing to,

  • minus 2 times W of 0 plus W of 1.

  • W of 1 is the next element that we're pointing to.
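To make the structure concrete, here is a minimal sketch of the looping code as described. The grid size N, the number of time steps T, and the value of alpha are illustrative assumptions, not the exact values from the course code.

```c
#define N 1000000   // number of spatial points (assumed for illustration)
#define T 100       // number of time steps (assumed for illustration)

static double u[2][N];            // two rows: previous and current time step
static const double alpha = 0.1;  // thermal diffusivity (assumed value)

// 3-point stencil kernel: w points at the current element, so w[-1] and w[1]
// are its left and right neighbors.
static inline double kernel(const double *w) {
  return w[0] + alpha * (w[-1] - 2.0 * w[0] + w[1]);
}

// Looping version: complete each time step before moving on to the next.
// The boundary values u[*][0] and u[*][N-1] are assumed to be initialized
// in both rows and stay fixed.
void heat_loop(void) {
  for (int t = 0; t < T; ++t) {                  // loop over the time dimension
    for (int x = 1; x < N - 1; ++x) {            // loop over space (boundaries fixed)
      u[(t + 1) % 2][x] = kernel(&u[t % 2][x]);  // even-odd trick swaps the two rows
    }
  }
}
```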

  • OK.

  • So let's look at the caching behavior of this code.

  • So we're going to analyze the cache complexity.

  • And we're going to assume the LRU replacement policy here,

  • because we can.

  • And as we said before, we're going

  • to loop through one entire row at a time

  • before we go onto the next row.

  • So the number of cache misses I get,

  • assuming that n is greater than M,

  • so that the row size is greater than the cache

  • size, the number of cache misses is theta of NT over B.

  • So how do I get this cache complexity around here?

  • So how many cache misses do I have

  • to incur for each row of this 2D space that I'm computing?

  • Yes?

  • AUDIENCE: N over B.

  • JULIAN SHUN: Right.

  • So I need N over B cache misses for each row.

  • And this is because I can load in B bytes at a time.

  • So I benefit from spatial locality there.

  • And then I have N elements I need to compute.

  • So it's theta of N over B per row.

  • And as we said before, when we get to the next row,

  • the stuff that we need from the previous row

  • have already been evicted from cache.

  • So I basically have to incur theta of N over B cache

  • misses for every row.

  • And the number of rows I'm going to compute is T.

  • So it's just theta of NT over B. Any questions on this analysis?

  • So how many of you think we can do better than this?

  • OK.

  • So one person.

  • Two, three.

  • OK.

  • So it turns out that we can do better than this.

  • You can actually do better with tiling,

  • but I'm not going to do the tiling version.

  • I want to do the cache oblivious version.

  • And the cache oblivious version is

  • going to work on trapezoidal regions in the 2D space.

  • And recall that a trapezoid has a top base and a bottom base.

  • And here the top base is at t1, the bottom base is at t0,

  • and the height is just t1 minus t0.

  • And the width of a trapezoid is just the width of it

  • at the midpoint between t1 and t0, so at t1 plus t0 over 2.

  • So we're going to compute all of the points

  • inside this trapezoid that satisfy

  • these inequalities here.

  • So t has to be greater than or equal to t0 and less than t1.

  • And then x is greater than or equal to x0 plus dx0

  • times t minus t0.

  • So dx0 is actually the inverse slope here.

  • And then it also has to be less than x1

  • plus dx1 times t minus t0.

  • So dx1 is the inverse slope on the other side.

  • And dx0 and dx1 have to be either negative 1, 0, or 1.

  • So negative 1 just corresponds to inverse slope of negative 1,

  • which is also a slope of negative 1.

  • If it's 1, then the inverse slope is 1, which is also a slope of 1.

  • And then if it's 0, then we just have a vertical line.

  • OK.

  • So the nice property of this trapezoid

  • is that we can actually compute everything inside the trapezoid

  • without looking outside the trapezoid.

  • So we can compute everything here independently

  • of any other trapezoids we might be generating.

  • And we're going to come up with a divide

  • and conquer approach to execute this code.

  • So the divide and conquer algorithm has a base case.

  • So our base case is going to be when

  • the height of the trapezoid is 1.

  • And when the height is 1, then we're

  • just going to compute all of the values using a simple loop.

  • And any order of computation inside this loop

  • is valid, since we have all the values

  • in the base of the trapezoid and we

  • can compute the values in the top of the trapezoid

  • in whatever order.

  • They don't depend on each other.

  • So that's a base case.

  • Any questions so far?

  • So here's one of the recursive cases.

  • It turns out that we're going to have

  • two different types of cuts.

  • The first cut is called a space cut.

  • So I'm going to do a space cut if the width of the trapezoid

  • is greater than or equal to twice the height.

  • So this means that the trapezoid is too wide.

  • And I'm going to cut it vertically.

  • More specifically, I'm going to cut it with a line,

  • with slope negative 1 going through the center

  • of the trapezoid.

  • And then I'm going to traverse the trapezoid on the left side

  • first.

  • And then after I'm done with that,

  • traverse the trapezoid on the right side.

  • So can I actually switch the order of this?

  • Can I compute the stuff on the right side

  • before I do this stuff on the left side?

  • No.

  • Why is that?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Yeah.

  • So there some points in the right trapezoid

  • that depend on the values from the left trapezoid.

  • And so for the left trapezoid, every point we want to compute,

  • we already have all of its points,

  • assuming that we get all the values of the base points.

  • But for the right hand side, some of the values

  • depend on values in the left trapezoid.

  • So we can't execute the right trapezoid

  • until we're done with the left trapezoid.

  • And this is the reason why I cut this trapezoid

  • with a slope of negative 1 instead

  • of using a vertical cut.

  • Because if I did a vertical cut then

  • inside both of the trapezoids, I would

  • have points that depend on the other trapezoid.

  • So this is one of the two cuts.

  • This is called a space cut.

  • And it happens when the trapezoid is too wide.

  • The other cut is the time cut, where I'm

  • going to cut with respect to the time dimension.

  • And this happens when the trapezoid is too tall,

  • so when the width is less than twice the height

  • of the trapezoid.

  • Then what I'm going to do is I'm just

  • going to cut it with a horizontal line

  • through the center.

  • And then I'm going to traverse the bottom trapezoid first.

  • And after I'm done with that, I can traverse a top trapezoid.

  • And again, the top trapezoid depends on some points

  • from the bottom trapezoid.

  • So I can't switch the order of those.

  • Any questions?

  • OK.

  • So let's now look at the code that

  • implements this recursive divide and conquer algorithm.

  • So here's the C code.

  • It takes as input t0 and t1.

  • These are the coordinates of the bottom

  • and the top of the trapezoid.

  • Then x0 is the left side

  • of the base of the trapezoid,

  • dx0 is the inverse slope on the left side, x1

  • is the right side of the base of the trapezoid,

  • and dx1 is the inverse slope on the right side.

  • So we're first going to compute the height of our trapezoid.

  • And we're going to let LT be the height.

  • And that's just t1 minus t0.

  • And if the height is 1, then we're

  • just going to use a for loop over all the points

  • in that height 1 trapezoid.

  • We're going to call this kernel function

  • that we defined before.

  • And otherwise, the height is greater than 1.

  • And we're going to check whether we should do a space cut

  • or time cut.

  • So we do a space cut if the trapezoid is too wide.

  • And this condition inside the if clause

  • is checking if the trapezoid is too wide.

  • And if so, then we're going to make two recursive calls

  • to trapezoid.

  • And we're going to cut it in the middle using

  • this slope of negative 1.

  • So you see the negative ones here in the recursive calls.

  • And otherwise, we'll do a time cut.

  • And for the time cut we just cut it in the middle.

  • So we cut it at the value of t that's equal to

  • t0 plus LT divided by 2.

  • And again, we have two recursive calls to trapezoid.
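Here is a hedged sketch of the recursive trapezoid routine as described. It follows the standard cache-oblivious stencil structure; the exact form of the width test and midpoint computation is a reconstruction rather than the course's exact code, and u and kernel are the same assumed definitions as in the looping sketch above.

```c
void trapezoid(int t0, int t1, int x0, int dx0, int x1, int dx1) {
  int lt = t1 - t0;                                  // height of the trapezoid
  if (lt == 1) {                                     // base case: height 1, just loop over x
    for (int x = x0; x < x1; ++x)
      u[(t0 + 1) % 2][x] = kernel(&u[t0 % 2][x]);
  } else if (lt > 1) {
    if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
      // space cut: trapezoid is too wide, so cut with a slope of -1 through the center
      int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
      trapezoid(t0, t1, x0, dx0, xm, -1);            // left trapezoid first
      trapezoid(t0, t1, xm, -1, x1, dx1);            // then the right trapezoid
    } else {
      // time cut: trapezoid is too tall, so cut horizontally through the middle
      int halflt = lt / 2;
      trapezoid(t0, t0 + halflt, x0, dx0, x1, dx1);  // bottom trapezoid first
      trapezoid(t0 + halflt, t1,
                x0 + dx0 * halflt, dx0,
                x1 + dx1 * halflt, dx1);             // then the top trapezoid
    }
  }
}
```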

  • OK.

  • So even though I'm only generating slopes of negative 1

  • in this recursive call, this code

  • is going to work even if I have slopes of 1 and 0,

  • because I could start out with slopes of 1 and 0.

  • For example, if I had a rectangular region,

  • then the slopes are just going to be 0,

  • and this code is still going to work.

  • But most of the slopes that I'm dealing with

  • are going to be a negative 1, because those

  • are the new slopes that I'm generating.

  • Any questions?

  • So this code is very concise.

  • It turns out that the even-odd trick still works here.

  • You can still keep around just two rows,

  • because you're guaranteed that you're not

  • going to overwrite any value until all the things

  • that depend on it are computed.

  • But when you're using just two rows,

  • the values in a particular row might not all

  • correspond to the same time step,

  • because we're not finishing an entire row before we

  • move on to the next one.

  • We're actually partially computing rows.

  • OK.

  • So let's analyze the cache complexity of this algorithm.

  • Again, we're going to use the recursion tree approach

  • that we talked about last time.

  • And our code is going to split itself

  • into two subproblems at every level until it gets to a leaf.

  • And each leaf represents theta of hw points

  • where h is theta of w.

  • So h is the height. w is the width.

  • And we have the property that h is equal to theta w because

  • of the nature of the algorithm.

  • When the trapezoid becomes too wide,

  • we're going to make it less wide by doing a space cut.

  • And if it becomes too tall, we're

  • going to do a horizontal cut.

  • So we're guaranteed that the height and the width

  • are going to be within a constant factor of each other

  • when we get to the base case.

  • And each leaf in the base case is

  • going to incur theta of w over B misses

  • because we have to load in the base of the trapezoid

  • from main memory.

  • And once that's in cache, we can compute

  • all of the other points in the trapezoid

  • without incurring any more cache misses.

  • So the cache misses per leaf is just theta w over B.

  • And we're going to set w equal to theta of M in the analysis,

  • because that's the point when the trapezoid fits into cache.

  • The algorithm doesn't actually have any knowledge

  • of this M parameter.

  • So it's still going to keep divide

  • and conquering until it gets the base case of size 1.

  • But just for the analysis, we're going to use a base

  • case when w is theta of M.

  • So the number of leaves we have is theta of NT

  • divided by hw, because each leaf is of size theta of hw.

  • The number of internal nodes we have

  • is equal to the number of leaves minus 1,

  • because we have a binary recursion tree here.

  • But the internal nodes don't contribute substantially

  • to the cache complexity, because each of them

  • is just doing a constant number of cache

  • misses to set up the two recursive calls.

  • And it's not doing anything more expensive than that.

  • So we just need to compute the number of cache

  • misses at the leaves.

  • We have theta of NT over hw leaves, each one of which

  • takes theta of w over B cache misses.

  • And we're going to multiply that out.

  • So the w term cancels out.

  • We just have NT over hB and we set h and w to be theta of M.

  • So we're just left with NT over MB as our cache complexity.
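In symbols, the leaf-level arithmetic just described is:

```latex
Q = \Theta\!\left(\frac{NT}{hw}\right) \cdot \Theta\!\left(\frac{w}{B}\right)
  = \Theta\!\left(\frac{NT}{hB}\right)
  = \Theta\!\left(\frac{NT}{MB}\right),
\quad \text{since } h = \Theta(w) = \Theta(M).
```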

  • Yes?

  • AUDIENCE: Can you explain why hB [INAUDIBLE]??

  • JULIAN SHUN: Sure.

  • So each leaf only incurs theta w over B cache

  • misses because we need to compute the values

  • of the base of the trapezoid.

  • And that's going to incur theta of w over B

  • cache misses because it's w wide,

  • and therefore, everything else is going to fit into cache.

  • So when we compute them, we already

  • have all of our previous values that we want in cache.

  • So that's why it's not going to incur any more cache misses.

  • So does that makes sense?

  • AUDIENCE: Yeah, I forgot that it was [INAUDIBLE]..

  • JULIAN SHUN: Yeah.

  • OK.

  • So this was just analysis for one dimension.

  • But it actually generalizes to more than one dimension.

  • So in general, if we have d dimensions,

  • then the cache complexity is going to be theta of NT

  • divided by M to the 1 over d times B.

  • So if d is 1, then that just reduces to NT over MB.

  • If d is 2, then it's going to be NT over B times square root

  • of M and so on.

  • And it turns out that this bound is also optimal.
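In symbols, the general bound for a d-dimensional stencil is:

```latex
Q = \Theta\!\left(\frac{NT}{M^{1/d}\,B}\right)
```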

  • So any questions?

  • So compared to the looping code, this code

  • actually has much better cache complexity.

  • It's saving by a factor of M. The looping code

  • had a cache complexity that was theta of NT over B.

  • It didn't have an M in the denominator.

  • OK.

  • So we're actually going to do a simulation now.

  • We're going to compare how the looping

  • code in a trapezoid code runs.

  • And in this simulation, the green points

  • correspond to a cache hit, the purple points correspond

  • to a cache miss.

  • And we're going to assume a fully associative cache using the LRU

  • replacement policy where the cache line size is 4 points

  • and cache size is 32 points.

  • And we're going to set the cache hit latency to be one cycle,

  • and the cache miss latency to be 10 cycles.

  • So an order of magnitude slower for cache misses.

  • And we're doing this for a rectangular region

  • where N is 95.

  • And we're doing it for 87 time steps.

  • So let me pull out the simulation now.

  • OK.

  • So on the left hand side, we're going to have the looping code.

  • On the right hand side, we're going

  • to have the trapezoidal code.

  • So let's start this.

  • So you can see that the looping code

  • is going over one row at a time whereas the trapezoidal code is

  • not doing that.

  • It's partially computing one row and then moving

  • on to the next row.

  • I can also show the cuts that are generated

  • by the trapezoidal algorithm.

  • I have to remember how to do this.

  • So C--

  • So there there's the cuts that are generated

  • by the trapezoidal algorithm.

  • And I can speed this up.

  • So you can see that the trapezoidal algorithm

  • is incurring much fewer cache misses than the looping code.

  • So I just make this go faster.

  • And the trapezoid code is going to finish,

  • while the looping code is still slowly making

  • its way up the top.

  • OK.

  • So it's finally done.

  • So any questions on this?

  • Yeah?

  • AUDIENCE: Why doesn't the trapezoid [INAUDIBLE]??

  • JULIAN SHUN: In which of the regions?

  • So it's loading-- you get a cache miss for the purple dots

  • here.

  • And then the cache line size is 4.

  • So you get a cache miss for the first point,

  • and then you hit on the next three points.

  • Then you get another cache miss for the fifth point.

  • And then you hit on the 6, 7, and 8 points.

  • AUDIENCE: I was indicating the one above it.

  • JULIAN SHUN: So for the one above it--

  • so we're assuming that the two arrays fit in cache already.

  • So we don't actually need to load them from memory.

  • The thing above it just depends on the values

  • that we have already computed.

  • And that fits in cache.

  • Those are ready in cache.

  • This base of the trapezoid is already in cache.

  • And the row right above it, we just

  • need to look at those values.

  • Does that makes sense?

  • AUDIENCE: OK.

  • Because of the even odd?

  • JULIAN SHUN: Yeah.

  • Yeah.

  • Any other questions on this?

  • Does anyone want to see this again?

  • Yeah?

  • So I could let this run for the rest of the lecture,

  • but I have more interesting material

  • that I want to talk about.

  • So let's just stop after this finishes.

  • And as you see again, the looping code

  • is slowly making its way to the top.

  • It's much slower than the trapezoid code.

  • OK.

  • So that was only for one dimension.

  • Now let's look at what happens in two dimensions.

  • And here, I'm going to show another demo.

  • And this demo, I'm actually going to run the code for real.

  • The previous demo was just a simulation.

  • So this is going to happen in real time.

  • And I'm going to simulate the heat in a 2D space.

  • And recall that the colors correspond to the temperature.

  • So a brighter color means it's hotter.

  • So let me pull out the other demo.

  • OK.

  • So here, my mouse cursor is the source of heat.

  • So you see that it's making the points around my mouse cursor

  • hot.

  • And then it's slowly diffusing its way to the other points.

  • Now I can actually move this so that my heat source changes,

  • and then the heat I generated earlier slowly goes away.

  • OK.

  • So in the lower left hand corner,

  • I'm showing the number of iterations

  • per second of the code.

  • And we can see that the looping code is taking--

  • it's doing about 1,560 iterations per second.

  • Let's see what happens when we switch to the trapezoid code.

  • So the trapezoid code is doing about 1,830 iterations

  • per second.

  • So it's a little bit faster, but not by too much.

  • Does anyone have an idea why we're seeing this behavior?

  • So we said that the trapezoid code

  • incurs many fewer cache misses than the looping code,

  • so we would expect it to be significantly faster.

  • But here it's only a little bit faster.

  • Yeah?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Right.

  • So that's a good point.

  • So in 2D you're only saving a factor

  • of square root of M instead of M. But square root of M

  • is still pretty big compared to the speed

  • up we're getting here.

  • So any other ideas?

  • Yeah.

  • So there is a constant factor in the trapezoidal code.

  • But even after accounting for the constant factor,

  • you should still see a better speed up than this.

  • So even accounting for the constant factors,

  • what other problem might be going on here?

  • Yeah?

  • AUDIENCE: Is it that we don't actually have an ideal cache?

  • JULIAN SHUN: Yeah.

  • So that's another good observation.

  • But the caches that we're using, they still

  • should get pretty good cache behavior.

  • I mean, they should still have cache complexity

  • that's pretty close to the ideal cache model.

  • I mean, you might be off by small constant factor,

  • but not by too much.

  • Yeah?

  • AUDIENCE: Maybe because [INAUDIBLE]

  • JULIAN SHUN: Sorry.

  • Could you repeat that?

  • AUDIENCE: There are [INAUDIBLE] this time like [INAUDIBLE]

  • JULIAN SHUN: Yeah.

  • So OK.

  • So if I move the cursor, it's probably

  • going to be a little bit slower, go slower by a little bit.

  • But that doesn't really affect the performance.

  • I can just leave my cursor there and this is what

  • the iterations per second is.

  • Yes?

  • AUDIENCE: Maybe there's like, a lot of similar programs

  • doing this [INAUDIBLE].

  • JULIAN SHUN: Yeah.

  • So there is some other factor dominating.

  • Does anyone have an idea what that factor might be?

  • AUDIENCE: Rendering?

  • JULIAN SHUN: No.

  • It's not the rendering.

  • I ran the code without showing the graphics,

  • and performance was similar.

  • Yes?

  • AUDIENCE: Maybe similar to what she said there could be other

  • things using cache [INAUDIBLE].

  • JULIAN SHUN: Yes.

  • Yeah.

  • So there could be other things using the cache.

  • But that would be true for both of the programs.

  • And I don't actually have anything that's intensive

  • running, except for PowerPoint.

  • I don't think that uses much of my cache.

  • All right.

  • So let's look at why this is the case.

  • So it turns out that the hardware is actually helping

  • the looping code.

  • So the question is how come the cache oblivious

  • trapezoidal code can have so many fewer cache misses,

  • but the advantage gained over the looping version

  • is so marginal?

  • Turns out that for the looping code,

  • the hardware is actually helping it

  • by doing hardware prefetching.

  • And hardware prefetching for the looping code

  • is actually pretty good, because the access pattern

  • is very regular.

  • It's just going one row at a time.

  • So the hardware can predict the memory access pattern

  • of the looping code, and therefore, it

  • can bring in the cache lines that the looping code would

  • need before it actually gets to that part of the computation.

  • So prefetching is helping the looping code.

  • And it's not helping the trapezoid code

  • that much because the access pattern is less regular there.

  • And prefetching does use memory bandwidth.

  • But when you're using just a single core,

  • you have more than enough bandwidth

  • to take advantage of the hardware prefetching

  • capabilities of the machine.

  • But later on, we'll see that when we're running in parallel

  • the memory bandwidth does become more

  • of an issue when you have multiple processors

  • all using the memory.

  • Yeah?

  • Question?

  • AUDIENCE: Is there a way of touching a cache [INAUDIBLE]

  • or touching a piece of memory before you need it so that you

  • don't need [INAUDIBLE]

  • JULIAN SHUN: You can do software prefetching.

  • There are instructions to do software prefetching.

  • Hardware prefetching is usually more efficient,

  • but it's like less flexible than the software prefetching.

  • But here we're not actually doing that.

  • We're just taking advantage of hardware prefetching.

  • AUDIENCE: [INAUDIBLE]?

  • JULIAN SHUN: Yeah.

  • So we didn't actually try that.

  • It could benefit a little bit if we used a little bit

  • of software prefetching.

  • Although, I think it would benefit the looping code

  • probably as well if we did that.

  • Yes?

  • AUDIENCE: Is hardware prefetching [INAUDIBLE]??

  • JULIAN SHUN: Sorry?

  • Sorry?

  • AUDIENCE: Is hardware prefetching always enabled?

  • JULIAN SHUN: Yeah.

  • So hardware prefetching is enabled.

  • It's always done by the hardware on the machines

  • that we're using today.

  • This was a pretty surprising result.

  • But we'll actually see the effect of the memory bandwidth

  • later on when we look at the parallel code.

  • Any other questions before I continue?

  • OK.

  • So let's now look at the interplay between caching

  • and parallelism.

  • So this was a theorem that we proved in the previous lecture.

  • So let's recall what it says.

  • It says let Q sub p be the number of cache

  • misses in a deterministic Cilk computation

  • when run on p processors, each with a private cache,

  • and let s sub p be the number of successful steals

  • during the computation.

  • In an ideal cache model with a cache size of M

  • and a block size of B, the number of cache

  • misses on p processors is equal to the number of cache

  • misses on one processor plus order

  • number of successful steals times M over B.
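In symbols, the theorem says:

```latex
Q_p = Q_1 + O\!\left(S_p \cdot \frac{M}{B}\right)
```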

  • And last time we also said that the number of successful steals

  • is upper bounded by the span of the computation

  • times the number of processors.

  • If you minimize the span of your computation,

  • then you can also minimize the number of successful steals.

  • And then for low span algorithms,

  • the first term is usually going to dominate the Q1 term.

  • So I'm not going to go over the proof.

  • We did that last time.

  • The moral of the story is that minimizing

  • cache misses in the serial elision

  • essentially minimizes them in the parallel execution,

  • assuming that you have a low span algorithm.

  • So let's see whether our trapezoidal algorithm works

  • in parallel.

  • So does the space cut work in parallel?

  • Recall that the space cut, I'm cutting it

  • with a slope of negative 1 through the center,

  • because it's too wide.

  • So can I execute the two trapezoids in parallel here?

  • No.

  • The reason is that the right trapezoid

  • depends on the result of the left trapezoid.

  • So I can't execute them at the same time.

  • But there is a way that I can execute trapezoids in parallel.

  • So instead of just doing the cut through the center,

  • I'm actually going to do a V-cut.

  • So now I have three trapezoids.

  • The two trapezoid in black--

  • I can actually execute those in parallel,

  • because they're independent.

  • And everything in those two trapezoids

  • just depends on the base of that trapezoid.

  • And after I'm done with the two trapezoids labeled 1,

  • then I can compute the trapezoid label 2.

  • And this is known as a parallel space cut.

  • It produces two black trapezoids as well as a gray trapezoid

  • and the two black trapezoids are executed in parallel.

  • And afterwards, the gray trapezoid executes.

  • And this is done recursively as well.

  • Any questions?

  • Yeah?

  • No.

  • OK.

  • We also have the time cut.

  • Oh, sorry.

  • So if the trapezoid is inverted, then we're

  • going to do this upside down V-cut.

  • And in this case, we're going to execute the middle trapezoid

  • before we execute the two trapezoids on the side.

  • For the time cut, it turns out that we're just going

  • to use the same cut as before.

  • And we're just going to execute the two trapezoids serially.

  • So we do get a little bit of parallelism

  • here from the parallel space cut.

  • Let's look at how the parallel codes perform now.

  • So, OK.

  • So this was a serial looping code.

  • Here's the parallel looping code.

  • So we had 1,450 before.

  • About 3,700 now.

  • So little over twice the speed up.

  • And this is on a four core machine.

  • It's just on my laptop.

  • AUDIENCE: [INAUDIBLE]?

  • JULIAN SHUN: Sorry?

  • AUDIENCE: [INAUDIBLE]?

  • JULIAN SHUN: Oh yeah, sure.

  • Yeah, it's slowing down a little bit, but not by too much.

  • OK.

  • Let's look at the trapezoidal code now.

  • So as we saw before, the trapezoid code

  • does about 1,840 iterations per second.

  • And we can parallelize this.

  • So now it's doing about 5,350 iterations per second.

  • So it's getting about a factor of three speed up.

  • I can move it around a little bit more if you want to see it.

  • So serial trapezoid and parallel trapezoid.

  • Is everyone happy?

  • OK.

  • Because I had to do this in real time,

  • the input size wasn't actually that big.

  • So I ran the experiment offline without the rendering

  • on a much larger input.

  • So this input is a 3,000 by 3,000 grid.

  • And I did this for 1,000 time steps using four processor

  • cores.

  • And my cache size is 8 megabytes.

  • So last level cache size.

  • So the input size here is much larger than my last level cache

  • size.

  • And here are the times that I got.

  • So the serial looping code took about 129 seconds.

  • And when we did it in parallel, it was about 66 seconds.

  • So it got about a factor of two speed up,

  • which is consistent with what we saw.

  • For the trapezoidal code, it actually got a better speed up

  • when we increased the input size.

  • So we got about a factor of four speed up.

  • And this is because for larger input size,

  • the cache efficiency plays a much larger role,

  • because the cache is so small compared to our input size.

  • So here we see that the parallel looping code achieves

  • less than half of the potential speed

  • up, even though the parallel looping

  • code has much more potential parallelism

  • than the trapezoidal code.

  • So the trapezoidal code only had a little bit of parallelism, just

  • from the space cuts, but it's actually

  • getting pretty good speed up.

  • So this is near linear speed up, since I'm using four cores

  • and it's getting 3.96 x speed up.

  • So what could be going on here?

  • Another thing to look at is to compare

  • the serial trapezoid code with the serial looping code, as

  • well as the parallel trapezoid code with the parallel looping

  • code.

  • So if you look at the serial trapezoid code,

  • you see that it's about twice as fast

  • as the serial looping code.

  • But the parallel trapezoidal code

  • is about four times faster than the parallel looping code.

  • And the reason here is that the hardware prefetching

  • can't help the parallel looping code that much.

  • Because when you're running in parallel, all of the cores

  • are using memory.

  • And there's a memory bandwidth bottleneck here.

  • And prefetching actually needs to use memory bandwidth.

  • So in the serial case, we had plenty of memory bandwidth

  • we could use for prefetching, but in the parallel case,

  • we don't actually have much memory

  • bandwidth left over that we can use here.

  • So that's why in a parallel case,

  • the trapezoid code gets a better speed up

  • over the parallel looping code, compared to the serial case.

  • And the trapezoid code also gets better speed up

  • because it does things more locally,

  • so it needs to use less--

  • fewer memory operations.

  • And there's a scalability bottleneck

  • at the memory interconnect.

  • But because the trapezoidal code is cache oblivious,

  • it does a lot of work in cache, whereas the looping code

  • does more accesses to the main memory.

  • Any questions on this?

  • So how do we know when we have a parallel speed up bottleneck,

  • how do we know what's causing it?

  • So there are several main things that we should look at.

  • So we should see if our algorithm has

  • insufficient parallelism, whether the scheduling

  • overhead is dominating, whether we

  • have a lack of memory bandwidth, or whether there

  • is contention going on.

  • And contention can refer to either locking or true and

  • false sharing, which we talked about in the last lecture.

  • So the first two are usually quite easy to diagnose.

  • You can compute the work in the span of your algorithm,

  • and from that you can get the parallelism.

  • You can also use Cilkscale to help you diagnose the first two

  • problems, because Cilkscale can tell you how much parallelism

  • you're getting in your code.

  • And it can also tell you the burden of parallelism

  • which includes the scheduler overhead.

  • What about for memory bandwidth?

  • How can we diagnose that?

  • So does anyone have any ideas?

  • So I can tell you one way to do it.

  • I can open up my hardware and take out all of my memory chips

  • except for one of them, and run my serial code.

  • And if it slows down, then that means

  • it was probably memory bandwidth bound

  • when I did it in parallel.

  • But that's a pretty heavy handed way to diagnose this problem.

  • Is there anything we can do was just software?

  • Yes?

  • AUDIENCE: Can we simulate like Valgrind would do it and count

  • how memory accesses [INAUDIBLE]

  • JULIAN SHUN: Yeah, so you could use a tool like Valgrind

  • to count the number of memory accesses.

  • Yes?

  • AUDIENCE: It's like toolset or something

  • where you can make sure that only one processor is

  • being used for this.

  • JULIAN SHUN: So you can make sure only one processor is

  • being used, but you can't--

  • but it might be using like, more memory

  • than just the memory from one chip.

  • There's actually a simpler way to do this.

  • The idea is that we'll just run p identical copies

  • of the serial code.

  • And then they will all be executing in parallel.

  • And if the execution slows down, then that

  • means they were probably contending

  • for memory bandwidth.

  • Does that make sense?

  • One caveat of this is you can only

  • do this if you have enough physical memory, because when

  • you're running p identical copies,

  • you have to use more DRAM than if you just ran one copy.

  • So you have to have enough physical memory.

  • But oftentimes, you can usually isolate some part of the code

  • that you think has a performance bottleneck,

  • and just execute that part of the program

  • with p copies in parallel.

  • And hopefully that will take less memory.

  • There are also hardware counters you

  • can check if you have root access to your machine

  • that can measure how much memory bandwidth

  • your program is using.

  • But this is a pretty simple way to do this.

  • So there are ways to diagnose lack of memory bandwidth.

  • Turns out that contention is much harder to diagnose.

  • There are tools that exist that detect lock contention

  • in an execution, but usually they only detect contention

  • when the contention actually happens,

  • but the contention doesn't have to happen

  • every time you run your code.

  • So these tools don't detect a potential for lock contention.

  • And potential for true and false sharing

  • is even harder to detect, especially false sharing,

  • because if you're using a bunch of variables in your code,

  • you don't know which of those map to the same cache line.

  • So this is much harder to detect.

  • Usually when you're trying to debug the speed up bottleneck

  • in your code, you would first look at the first three things

  • here.

  • And then once you eliminated those first few things,

  • then you can start looking at whether contention

  • is causing the problem.

  • Any questions?

  • OK.

  • So I talked about stencil computation.

  • I want to now talk about another problem, sorting.

  • And we want to do this cache efficiently.

  • OK.

  • So let's first analyze the cache complexity

  • of a standard merge sort.

  • So we first need to analyze the complexity of merging,

  • because this is used as a subroutine in merge sort.

  • And as you recall in merging, we're

  • given two sorted input arrays.

  • And we want to generate an output array that's

  • also sorted containing all the elements from the two input

  • arrays.

  • And the algorithm is going to maintain a pointer to the head

  • of each of our input arrays.

  • And then it's going to compare the two elements

  • and take the smaller one and put it into the output,

  • and then increment the pointer for that array.

  • And then we keep doing this, taking the smaller of the two

  • elements until the two input arrays become empty,

  • at which point we're done with the algorithm.

  • We have one sorted output array.
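Here is a minimal sketch of the two-way merge just described; the function and parameter names are illustrative assumptions.

```c
// Merge two sorted arrays a (length na) and b (length nb) into out.
void merge(const double *a, int na, const double *b, int nb, double *out) {
  int i = 0, j = 0, k = 0;
  while (i < na && j < nb)                   // take the smaller head element each time
    out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
  while (i < na) out[k++] = a[i++];          // copy any remaining elements of a
  while (j < nb) out[k++] = b[j++];          // copy any remaining elements of b
}
```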

  • OK.

  • So to merge n elements, the time to do this is just theta of n.

  • Here n is the sum of the sizes of my two input arrays.

  • And this is because I'm only doing a constant amount of work

  • for each of my input elements.

  • What about the number of cache misses?

  • How many cache misses will incur when I'm merging n elements?

  • Yes?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Yeah.

  • So I'm going to incur theta of n over B cache

  • misses because my two input arrays are stored contiguously

  • in memory so I can read them at B bytes

  • at a time with just one cache miss.

  • And then my output array is also stored contiguously

  • so I can write things out B bytes

  • at a time with just one cache miss.

  • I might waste a cache line at the beginning and end of each

  • of my three arrays, but that only affects the bound

  • by a constant factor.

  • So it's theta of n over B.

  • So now let's look at merge sort.

  • So recall that merge sort has two recursive calls to itself

  • on inputs of half the size.

  • And then it does a merge at the end

  • to merge the two sorted outputs of its recursive calls.

  • So if you look at how the recursion proceeds,

  • it's first going to divide the input array into two halves.

  • It's going to divide it into two halves

  • again and again, until you get to the base

  • case of just one element, at which point you return.

  • And then now we start merging pairs of these elements

  • together.

  • So now I have these arrays of size 2 in sorted order.

  • And I merged pairs of those arrays.

  • And I get subarrays of size 4.

  • And then finally, I do this one more time

  • to get my sorted output.

  • OK.

  • So let's review the work of merge sort.

  • So what's the recurrence for merge sort

  • if we're computing the work?

  • Yes?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Yeah.

  • So that's correct.

  • So I have two subproblems of size N over 2.

  • And then I need to do theta n work to do the merge.

  • And this is case two of the master theorem.

  • So I'm computing log base b of a, which is log base 2 of 2.

  • And that's just 1.

  • And that's the same as the exponent in the term

  • that I'm adding in.

  • So since they're the same, I add in an additional log factor.

  • And my overall work is just theta of n log n.
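In symbols, the work recurrence and its solution are:

```latex
W(n) = 2\,W(n/2) + \Theta(n) = \Theta(n \log n)
```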

  • OK.

  • So now I'm going to solve this recurrence again using

  • the recursion tree method.

  • I'm still going to get theta of n log n.

  • But I'm doing this because it's going

  • to be useful when we analyze the cache complexity.

  • So at the top level I have a problem of size n.

  • And I'm going to branch into two problems of size n over 2.

  • And when I'm done with them, I have

  • to do a merge, which takes theta of n work.

  • And I'm just putting n here.

  • I'm ignoring the constants.

  • And I'm going to branch again.

  • And each one of these is going to do n over 2 work to merge.

  • And I'm going to get all the way down to my base case

  • after log base 2 of n levels.

  • The top level is doing n work.

  • Second level is also doing n work.

  • And it's going to be the same for all levels down

  • to the leaves.

  • The leaves are also doing a linear amount of work.

  • So the overall work is just the work per level times

  • the number of levels.

  • So it's just theta of n log n.

  • OK.

  • So now let's analyze this with caching.

  • OK.

  • So we said earlier that the cache complexity

  • of the merging subroutine is theta of n over B.

  • And here's the recurrence for the cache

  • complexity of merge sort.

  • So my base case here is when n is less than or equal to cM,

  • for some sufficiently small constant c.

  • And this is because at this point,

  • my problem size fits into cache.

  • And everything else I do for that problem

  • is still going to be in cache.

  • And to load it into cache, the base case,

  • I need to incur theta of n over B cache misses.

  • And otherwise, I'm going to have two recursive calls of size n

  • over 2.

  • And then I need to do theta of n over B cache

  • misses to do the merge of my two results.

  • So here, my base case is larger than what I did for the work.

  • The algorithm actually is still recursing down

  • to a constant size base case.

  • But just for analysis, I'm stopping the recursion when

  • n is less than or equal to cM.
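
Written out, the cache-miss recurrence being described is:

```latex
Q(n) = \begin{cases}
\Theta(n/B) & \text{if } n \le cM \text{ (the subproblem fits in cache)},\\
2\,Q(n/2) + \Theta(n/B) & \text{otherwise.}
\end{cases}
```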

  • So let's analyze this.

  • So again, I'm going to have the problems of size n

  • at the beginning.

  • And then I'm going to split into two problems of size n over 2.

  • And then I'm going to have to pay n over B cache

  • misses to merge the results together.

  • Similarly for the next level, now I'm

  • paying n over 2B cache misses for each of my two problems

  • here to do the merge.

  • And I keep going down until I get to a subproblem of size

  • theta of cM.

  • At that point, it's going to fit in cache.

  • And I don't need to recurse anymore in my analysis.

  • So number of levels of this recursion tree

  • is just log base 2 of n over cM.

  • So I'm basically chopping off the bottom of this recursion

  • tree.

  • The number of levels I had below this is log base 2 of cM.

  • So I'm taking a log base 2 of n and subtracting

  • log base 2 of cM.

  • And that's equivalent to log base 2 of n divided by cM.

  • The number of leaves I have is n over cM since each leaf

  • is of size theta of cM.

  • And the number of cache misses I need to incur--

  • to process a leaf is just theta of M over B,

  • because I just need to load the input for that subproblem

  • into cache.

  • And then everything else fits in cache.

  • So for the top level, I'm incurring n

  • over B cache misses.

  • The next level, I'm also incurring n over B cache

  • misses.

  • Same with the third level.

  • And then for the leaves, I'm incurring M over B times n

  • over cM cache misses.

  • The n over cM is the number of leaves

  • I have and theta of M over B is the number of cache

  • misses per leaf.

  • And that also equals theta of n over B.

  • So overall, I multiply theta of n over B

  • by the number of levels I have.

  • So the number of levels I have is log base 2 of n over cM.

  • And I just got rid of the constant

  • here, since it doesn't affect the asymptotic bound.

  • So the number of cache misses I have is theta of n over B times

  • log base 2 of--

  • or any base for the log of n over M.
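
Summing the recursion tree level by level gives:

```latex
Q(n) = \underbrace{\Theta\!\left(\frac{n}{B}\right)}_{\text{cost per level}} \cdot \underbrace{\Theta\!\left(\log_2 \frac{n}{cM}\right)}_{\text{number of levels}}
     = \Theta\!\left(\frac{n}{B}\,\log \frac{n}{M}\right).
```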

  • So any questions on this analysis?

  • So I am saving a factor of B here in the first term.

  • So that does reduce the cost, giving me a better cache

  • complexity than just the work bound.

  • But for the M term, it's actually

  • inside the denominator of the log.

  • And that doesn't actually help me that much.

  • So let's look at how much we actually save.

  • So when n is much greater than M, log base 2 of n over M

  • is approximately equal to log base 2 of n.

  • So the M term doesn't contribute much to the asymptotic costs,

  • and therefore, compared to the work,

  • we're only saving a factor of B. When

  • n is approximately equal to M, then log base 2 of n over M

  • is constant, and we're saving a factor of B log n.
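
To make the two regimes explicit, the ratio of the work to the cache bound is:

```latex
\frac{W(n)}{Q(n)} = \frac{\Theta(n\log n)}{\Theta\!\left(\frac{n}{B}\log\frac{n}{M}\right)} =
\begin{cases}
\Theta(B) & \text{when } n \gg M, \text{ since } \log\frac{n}{M} \approx \log n,\\
\Theta(B\log n) & \text{when } n \approx M, \text{ since } \log\frac{n}{M} = \Theta(1).
\end{cases}
```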

  • So we save more when the memory size--

  • or when the problem size is small.

  • But for large enough problem sizes,

  • we're only saving a factor of B. So

  • does anyone think that we can do better than this?

  • So I've asked this question several times before,

  • and the answer is always the same.

  • Yes?

  • It's a good answer.

  • So we're going to do this using multiway merging.

  • So instead of just merging two sorted subarrays,

  • we're going to merge together R sorted subarrays.

  • So we're going to have R subarrays, each of size

  • n over R. And these are sorted.

  • And I'm going to merge them together

  • using what's called a tournament tree.

  • So how the tournament tree works is I'm

  • going to compare the heads of each pair of these subarrays

  • and store the winner in a node of the tournament tree.

  • And then after I do that, I compare these two nodes.

  • And I get the minimum of those two.

  • Eventually, after I compare all of the elements,

  • I'm just left with the minimum element

  • at the root of the tournament tree.

  • And then I can place that into my output.

  • So the first time I want to fill this tournament tree,

  • it's going to take theta of R work because there are

  • R nodes in my tournament tree.

  • So when I compare these two elements, the smaller one is 6.

  • For these two, the smaller one is 2.

  • And then I compare 2 and 6, take the smaller one.

  • And then on the other side, I have a 7 here.

  • So I compare 2 and 7.

  • 2 is smaller, so it appears at the root.

  • And then I know that that's going

  • to be my minimum element among the heads of all of the R

  • subarrays that I'm passing in.

  • So the first time to generate this tournament tree

  • takes theta of R work because I have to fill in R nodes.

  • But once I generated this tournament tree,

  • for all subsequent rounds, I only need to fill in the path

  • from the element that won, up to the root

  • of the tournament tree,

  • because those are the only values that would have changed.

  • So now I'm going to fill in this path here.

  • And this only takes theta of log R work to do it,

  • because the height of this tournament tree

  • is log base 2 of R.

  • So I'm going to fill this in.

  • Now 14 goes here.

  • 6 is the smaller of the two.

  • And then 6 is the smaller of the two again.

  • So my next element is 6.

  • And I keep doing this until all of the elements for my R

  • subarrays get put into the output.

  • The total work for merging is going

  • to be theta of R for the first round,

  • plus n log R for all the remaining rounds.

  • And that's just equal to theta of n log R,

  • because we're assuming that n is bigger than R here.
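
Here is a hedged C sketch of the R-way merge with a tournament ("winner") tree; the function name, the implicit-array tree layout, and the INT_MAX sentinel for exhausted subarrays are illustrative simplifications, not details from the lecture. Building the tree touches Theta(R) nodes once, and each of the n output elements then replays only one root-to-leaf path of length Theta(log R), matching the Theta(R + n log R) work bound above.

```c
#include <stdlib.h>
#include <limits.h>

// Merge R sorted integer subarrays (arrays[i] has length len[i], R >= 1)
// into out, which must have room for the total number of elements.
void multiway_merge(int **arrays, size_t *len, size_t R, int *out) {
    size_t *pos = calloc(R, sizeof(size_t));            // next unread index per subarray
    size_t *winner = malloc(2 * R * sizeof(size_t));    // winner[v] = subarray winning at node v

    // Current head of subarray i, or "infinity" once it is exhausted.
    #define KEY(i) (pos[i] < len[i] ? arrays[i][pos[i]] : INT_MAX)

    // Leaves R..2R-1 represent the subarrays; internal nodes 1..R-1 hold winners.
    for (size_t i = 0; i < R; i++) winner[R + i] = i;
    for (size_t v = R - 1; v >= 1; v--) {                // build bottom-up: Theta(R) work
        size_t l = winner[2 * v], r = winner[2 * v + 1];
        winner[v] = (KEY(l) <= KEY(r)) ? l : r;
    }

    size_t total = 0;
    for (size_t i = 0; i < R; i++) total += len[i];

    for (size_t k = 0; k < total; k++) {
        size_t w = winner[1];                            // overall minimum lives at the root
        out[k] = arrays[w][pos[w]++];
        // Replay only the path from the winning leaf to the root: Theta(log R) work.
        for (size_t v = (R + w) / 2; v >= 1; v /= 2) {
            size_t l = winner[2 * v], r = winner[2 * v + 1];
            winner[v] = (KEY(l) <= KEY(r)) ? l : r;
        }
    }
    #undef KEY
    free(pos);
    free(winner);
}
```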

  • Any questions on how the multiway merge works?

  • No?

  • OK.

  • So let's analyze the work of this multiway

  • merge when used inside merge sort.

  • So the recurrence is going to be as follows.

  • So if n is 1, then we just do a constant amount of work.

  • Otherwise, we have R subproblems of size n over R.

  • And then we're paying theta of n log R to do the multiway merge.

  • So here's the recursion tree.

  • At the top level, we have a problem of size n.

  • And then we're going to split into R subproblems of size n/R.

  • And then we have to pay n log R work

  • to merge the results of the recursive call together.

  • And then we keep doing this.

  • Turns out that the work at each level

  • sums up to n log base 2 of R. And the number of levels

  • we have here is log base R of n, because each time we're

  • branching by a factor of R. For the leaves, we have n leaves.

  • And we just pay linear work for that,

  • because we don't have to pay for the merge.

  • We're not doing any more recursive calls.

  • So the overall work is going to be theta of n log

  • R times log base R of n, plus n for the leaves,

  • but that's just a lower order term.

  • And if you work out the math, some of these terms

  • are going to cancel out, and you just

  • get theta of n log n for the work.

  • So the work is the same as the binary merge sort.
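
Symbolically, the recurrence and the cancellation being described are:

```latex
W(n) = R\,W(n/R) + \Theta(n\log R)
     = \Theta\!\left(n\log R \cdot \log_R n\right)
     = \Theta\!\left(n\log R \cdot \frac{\log n}{\log R}\right)
     = \Theta(n\log n).
```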

  • Let's now analyze the cache complexity.

  • So let's assume that we have R less than cM

  • over B for a sufficiently small constant c less than

  • or equal to 1.

  • We're going to consider the R way merging of contiguous

  • arrays of total size n.

  • And if R is less than cM over B, then we

  • can fit the entire tournament tree into cache.

  • And we can also fit one block from each of the R subarrays

  • into cache.

  • And in that case, the total number of cache

  • misses to do the multiway merge is just theta of n over B,

  • because we just have to go over the n elements in our input

  • arrays.

  • So the recurrence for the R way merge sort is as follows.

  • So if n is less than cM, then it fits in cache.

  • So the number of cache misses we pay is just theta of n over B.

  • And otherwise, we have R subproblems of size n over R.

  • And that we add theta of n over B

  • to do the merge of the results of the subproblems.

  • Any questions on the recurrence here?

  • Yes?

  • AUDIENCE: So how do we pick up the value of R?

  • Does it make it cache oblivious?

  • JULIAN SHUN: Good question.

  • So we didn't pick the value of R.

  • So this is not a cache oblivious algorithm.

  • And we'll see what to choose for R in a couple of slides.

  • So let's analyze the cache complexity of this algorithm

  • again, using the recursion tree analysis.

  • So at the top level, we're going to have R subproblems of size

  • n over R. And we have to pay n over B cache

  • misses to merge them.

  • And it turns out that at each level, the number of cache

  • misses we have to pay is n over B, if you work out the math.

  • And the number of levels of this recursion tree

  • is going to be log base R of n over cM,

  • because we're going to stop recursing

  • when our problem size is cM.

  • And on every level of recursion, we're

  • branching by a factor of R. So our leaf size is cM, therefore

  • the number of leaves we have is n over cM.

  • And for each leaf, it's going to take theta of M over B cache

  • misses to load it into cache.

  • And afterwards, we can do everything in cache.

  • And multiplying the number of leaves by the cost per leaf,

  • we get theta of n over B cache misses.

  • And therefore, the number of cache

  • misses is n over B times the number of levels.

  • Number of levels is log base R of n over M.
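
In symbols, for R at most cM over B:

```latex
Q(n) = \begin{cases}
\Theta(n/B) & \text{if } n \le cM,\\
R\,Q(n/R) + \Theta(n/B) & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
Q(n) = \Theta\!\left(\frac{n}{B}\,\log_R \frac{n}{M}\right).
```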

  • So compared to the binary merge sort algorithm,

  • here we actually have a factor of R in the base of the log.

  • Before, the base of the log was just 2.

  • So now the question is what we're going to set R to be.

  • So again, we have a voodoo parameter.

  • This is not a cache oblivious algorithm.

  • And we said that R has to be less than or equal to cM over

  • B in order for the analysis to work out.

  • So let's just make it as large as possible.

  • Let's just set R equal to theta of M over B.

  • And now we can see what this complexity works out to be.

  • So we have the tall cache assumption.

  • We also have the fact that log base M of n over M

  • is equal to theta of log n over log M.

  • So the cache complexity is theta of n over B times log base

  • M over B of n over M. But with the tall cache assumption,

  • log base M over B is the same as log base M

  • asymptotically.

  • So that's how we get to the second line.

  • And then you can rearrange some terms

  • and cancel some terms out.

  • And we'll end up with theta of n log n divided by B log M.
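
Written out, with R set to theta of M over B: the tall-cache assumption gives log of (M over B) equal to theta of log M, and for n sufficiently larger than M we also have log of (n over M) equal to theta of log n, so

```latex
Q(n) = \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{M}\right)
     = \Theta\!\left(\frac{n}{B}\cdot\frac{\log(n/M)}{\log(M/B)}\right)
     = \Theta\!\left(\frac{n}{B}\cdot\frac{\log n}{\log M}\right)
     = \Theta\!\left(\frac{n\log n}{B\log M}\right).
```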

  • So we're saving a factor of theta

  • of B log M compared to the work of the algorithm,

  • whereas for the binary version of merge sort,

  • we were only saving a factor of B for large enough inputs.

  • So here we get another factor of log M in our savings.

  • So as I said, the binary one incurs n log n over B cache misses,

  • whereas the multiway one incurs n log n over B log M.

  • And as long as n is much greater than M,

  • then we're actually saving much more than the binary version.

  • So we're saving a factor of log M in cache misses.

  • And let's just ignore the constants here and look at what

  • log M can be in practice.

  • So here are some typical cache sizes.

  • The L1 cache size is 32 kilobytes,

  • so that's 2 to the 15th.

  • And log base 2 of that is 15.

  • So we get a 15x savings.

  • And then for the larger cache sizes,

  • we get even larger savings.

  • So any questions on this?

  • So the problem with this algorithm

  • is that it's not cache oblivious.

  • We have to tune the value of R for a particular machine.

  • And even when we're running on the same machine,

  • there could be other jobs running

  • that contend for the cache.

  • Turns out that there are several cache oblivious sorting

  • algorithms.

  • The first one that was developed was in a paper

  • co-authored by Charles Leiserson, and it's called funnel sort.

  • The idea here is to recursively sort n to the 1/3 groups of n

  • to the 2/3 elements, and then merge the sorted groups

  • with an n to the 1/3 funnel.

  • And this funnel is called a k funnel, more generally,

  • and it merges together k cubed elements from k sorted lists.

  • And the costs for doing this merge is shown here.

  • And if you plug this into the recurrence,

  • you'll get that the asymptotic number of cache

  • misses is the same as that of the multiway

  • merge sort algorithm while being cache oblivious.

  • And this bound is actually optimal.

  • So I'm not going to have time to talk

  • about the details of the funnel sort algorithm.

  • But I do have time to just show you

  • a pretty picture of what the funnel looks like.

  • So this is what a k funnel looks like.

  • It's recursively constructed.

  • We have a bunch of square root of k funnels inside.

  • They're all connected to some buffers.

  • So they feed elements into the buffers,

  • and then the buffers feed elements

  • to this output square root of k funnel, which becomes

  • the output for the k funnel.

  • So this whole blue thing is the k funnel.

  • And the small green triangles are square root of k funnels.

  • And the number of cache misses incurred by doing this merge

  • is shown here.

  • And it uses the tall cache assumption for analysis.

  • So a pretty cool algorithm.

  • And we've posted a paper online that describes this algorithm.

  • So if you're interested in the details,

  • I encourage you to read that paper.

  • There are also many other cache oblivious algorithms out there.

  • So there's been hundreds of papers

  • on cache oblivious algorithms.

  • Here are some of them.

  • Some of these are also described in the paper.

  • In fact, I think all of these are described

  • in the paper we posted.

  • There are also some cool cache oblivious data structures

  • that have been developed, such as for B-trees,

  • ordered-file maintenance, and priority queues.

  • So it's a lot of literature on cache oblivious algorithms.

  • And there's also a lot of material online

  • if you're interested in learning more.
