  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high-quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: Today, we're going to talk

  • about multicore programming.

  • And as I was just informed by Charles, it's 2018.

  • I had 2017 on the slide.

  • So first, congratulations to all of you.

  • You turned in the first project's beta.

  • Here's a plot showing the tiers that different groups reached

  • for the beta.

  • And this is in sorted order.

  • And we set the beta cutoff to be tier 45.

  • The final cutoff is tier 48.

  • So the final cutoff we did set a little bit aggressively,

  • but keep in mind that you don't necessarily

  • have to get to the final cutoff in order

  • to get an A on this project.

  • So we're going to talk about multicore processing today.

  • That's going to be the topic of the next project

  • after you finish the first project.

  • So in a multicore processor, we have a whole bunch

  • of cores that are all placed on the same chip,

  • and they have access to shared memory.

  • They usually also have some sort of private cache, and then

  • a shared last level cache, so L3, in this case.

  • And then they all have access to the same memory controller,

  • which goes out to main memory.

  • And then they also have access to I/O.

  • But for a very long time, chips only had a single core on them.

  • So why do we have multicore processors nowadays?

  • Why did semiconductor vendors start

  • producing chips that had multiple processor

  • cores on them?

  • So the answer is because of two things.

  • So first, there's Moore's Law, which

  • says that we get more transistors every year.

  • So the number of transistors that you can fit on a chip

  • doubles approximately every two years.

  • And secondly, there's the end of scaling of clock frequency.

  • So for a very long time, we could just

  • keep increasing the frequency of the single core on the chip.

  • But at around 2004 to 2005, that was no longer the case.

  • We couldn't scale the clock frequency anymore.

  • So here's a plot showing both the number of transistors

  • you could fit on the chip over time,

  • as well as the clock frequency of the processors over time.

  • And notice that the y-axis is in log scale here.

  • And the blue line is basically Moore's Law,

  • which says that the number of transistors

  • you can fit on a chip doubles approximately every two years.

  • And that's been growing pretty steadily.

  • So this plot goes up to 2010, but in fact, it's

  • been growing even up until the present.

  • And it will continue to grow for a couple

  • more years before Moore's Law ends.

  • However, if you look at the clock frequency line,

  • you see that it was growing quite

  • steadily until about the early 2000s, and then at that point,

  • it flattened out.

  • So at that point, we couldn't increase the clock frequencies

  • anymore, and the clock speed was bounded

  • at about four gigahertz.

  • So nowadays, if you go buy a processor,

  • it's usually still bounded by around 4 gigahertz.

  • It's usually a little bit less than 4 gigahertz,

  • because it doesn't really make sense to push it all the way.

  • But you might find some processors

  • that are around 4 gigahertz nowadays.

  • So what happened at around 2004 to 2005?

  • Does anyone know?

  • So Moore's Law basically says that we

  • can fit more transistors on a chip

  • because the transistors become smaller.

  • And when the transistors become smaller,

  • you can reduce the voltage that's

  • needed to operate the transistors.

  • And as a result, you can increase the clock frequency

  • while maintaining the same power density.

  • And that's what manufacturers did until about 2004 to 2005.

  • They just kept increasing the clock frequency

  • to take advantage of Moore's Law.

  • But it turns out that once transistors become

  • small enough, and the voltage used

  • to operate them becomes small enough,

  • there's something called leakage current.

  • So there's current that leaks, and we're

  • unable to keep reducing the voltage while still having

  • reliable switching.

  • And if you can't reduce the voltage anymore,

  • then you can't increase the clock frequency

  • if you want to keep the same power density.
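
To make the relationship between voltage, clock frequency, and power density concrete, here is a standard back-of-the-envelope model; it is not from the lecture slides, and the scaling factor $\kappa$ and the symbols used are my own notation:

```latex
% Dynamic (switching) power of a chip:
P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f
\quad\text{($\alpha$: activity factor, $C$: capacitance, $V$: voltage, $f$: clock frequency)}

% Classic Dennard scaling shrinks features by $1/\kappa$ per generation:
% $C \to C/\kappa$, $V \to V/\kappa$, $f \to \kappa f$, chip area $A \to A/\kappa^{2}$, so
\frac{P_{\mathrm{dyn}}}{A} \;\to\;
\frac{\alpha\,(C/\kappa)\,(V/\kappa)^{2}\,(\kappa f)}{A/\kappa^{2}}
\;=\; \frac{\alpha\, C\, V^{2} f}{A}
\quad\text{(power density unchanged).}

% Once leakage stops $V$ from scaling down, raising $f$ at a fixed $V$ makes
% $P_{\mathrm{dyn}}/A$ grow linearly in $f$ -- the wall hit around 2004--2005.
```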

  • So here's a plot from Intel back in 2004

  • when they first started producing multicore processors.

  • And this is plotting the power density versus time.

  • And again, the y-axis is in log scale here.

  • So the green data points are actual data points,

  • and the orange ones are projected.

  • And they projected what the power density

  • would be if we kept increasing the clock

  • frequency at a trend of about 25% to 30% per year,

  • which is what happened up until around 2004.

  • And because we couldn't reduce the voltage anymore,

  • the power density would go up.

  • And you can see that eventually, it

  • reaches the power density of a nuclear reactor, which

  • is pretty hot.

  • And then it reaches the power density of a rocket nozzle,

  • and eventually you get to the power

  • density of the sun's surface.

  • So if you have a chip that has a power density

  • equal to the sun's surface--

  • well, you don't actually really have a chip anymore.

  • So basically if you get into this orange region,

  • you basically have a fire, and you can't really

  • do anything interesting, in terms of performance

  • engineering, at that point.

  • So to solve this problem, semiconductor vendors

  • didn't increase the clock frequency anymore,

  • but we still had Moore's Law giving us

  • more and more transistors every year.

  • So what they decided to do with these extra transistors

  • was to put them into multiple cores,

  • and then put multiple cores on the same chip.

  • So we can see that, starting at around 2004,

  • the number of cores per chip becomes more than one.

  • And each generation of Moore's Law

  • will potentially double the number of cores

  • that you can fit on a chip, because it's doubling

  • the number of transistors.

  • And we've seen this trend up until about today.

  • And again, it's going to continue for a couple

  • more years before Moore's Law ends.

  • So that's why we have chips with multiple cores today.

  • So today, we're going to look at multicore processing.

  • So I first want to introduce the abstract multicore

  • architecture.

  • So this is a very simplified version,

  • but I can fit it on this slide, and it's a good example

  • for illustration.

  • So here, we have a whole bunch of processors.

  • They each have a cache, so that's

  • indicated with the dollar sign.

  • And usually they have a private cache as well as

  • a shared cache, so a shared last level cache, like the L3 cache.

  • And then they're all connected to the network.

  • And then, through the network, they

  • can connect to the main memory.

  • They can all access the same shared memory.

  • And then usually there's a separate network for the I/O

  • as well, even though I've drawn them as a single network here,

  • so they can access the I/O interface.

  • And potentially, the network will also

  • connect to other multiprocessors on the same system.

  • And this abstract multicore architecture

  • is known as a chip multiprocessor, or CMP.

  • So that's the architecture that we'll be looking at today.
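
As a rough, hands-on illustration of this chip-multiprocessor picture, the sketch below asks the operating system how many cores and which cache sizes it reports. It assumes a Linux machine with glibc; the `_SC_LEVEL*` sysconf names are glibc extensions, and none of this comes from the lecture itself.

```c
// Sketch: query the OS about the multicore topology described above.
// Assumes Linux with glibc (the _SC_LEVEL* names are glibc extensions).
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long cores     = sysconf(_SC_NPROCESSORS_ONLN);       // cores currently online
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); // private L1 data cache line
    long l3_size   = sysconf(_SC_LEVEL3_CACHE_SIZE);      // shared last-level (L3) cache

    printf("cores online:       %ld\n", cores);
    printf("L1 cache line size: %ld bytes\n", line_size);
    printf("shared L3 size:     %ld bytes\n", l3_size);
    return 0;
}
```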

  • So here's an outline of today's lecture.

  • So first, I'm going to go over some hardware challenges

  • with shared memory multicore machines.

  • So we're going to look at the cache coherence protocol.

  • And then after looking at hardware,

  • we're going to look at some software solutions

  • to write parallel programs on these multicore machines

  • to take advantage of the extra cores.

  • And we're going to look at several concurrency

  • platforms listed here.

  • We're going to look at Pthreads.

  • This is basically a low-level API

  • for running your code in parallel.

  • And if you program on Microsoft products,

  • the WinAPI threads are pretty similar.

  • Then there's Intel Threading Building Blocks,

  • which is a library solution to concurrency.

  • And then there are two linguistic solutions

  • that we'll be looking at--

  • OpenMP and Cilk Plus.

  • And Cilk Plus is actually the concurrency platform

  • that we'll be using for most of this class.
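
As a tiny preview of the lowest-level option on that list, here is a minimal Pthreads sketch (my own example, not from the course materials) that creates a few threads, runs them in parallel, and joins them; the thread count and the work each thread does are placeholders.

```c
// Minimal Pthreads sketch: spawn threads, run them in parallel, wait for them.
// Compile with: cc -pthread example.c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4   // placeholder thread count

static void *worker(void *arg) {
    long id = (long)arg;                   // thread index passed by value
    printf("hello from thread %ld\n", id); // placeholder "work"
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);    // wait for all workers to finish
    return 0;
}
```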

  • So let's look at how caches work.

  • So let's say that we have a value in memory

  • at some location, and that value is--

  • let's say that value is x equals 3.

  • If one processor says, we want to load

  • x, what happens is that processor reads

  • this value from a main memory, brings it into its own cache,

  • and then it also reads the value, loads it

  • into one of its registers.

  • And it keeps this value in cache so

  • that if it wants to access this value again in the near future,

  • it doesn't have to go all the way out to main memory.

  • It can just look at the value in its cache.

  • Now, what happens if another processor wants to load x?

  • Well, it just does the same thing.

  • It reads the value from main memory,

  • brings it into its cache, and then also loads it

  • into one of the registers.

  • And then same thing with another processor.

  • It turns out that you don't actually

  • always have to go out to main memory to get the value.

  • If the value resides in one of the other processors' caches,

  • you can also get the value through the other processor's

  • cache.

  • And sometimes that's cheaper than going all the way out

  • to main memory.

  • So the second processor now loads x again.

  • And it's in cache, so it doesn't have

  • to go to main memory or anybody else's cache.

  • So what happens now if we want to store

  • x, if we want to set the value of x to something else?

  • So let's say this processor wants to set x equal to 5.

  • So it's going to write x equals 5

  • and store that result in its own cache.

  • So that's all well and good.

  • Now what happens when the first processor wants to load x?

  • Well, it seems that the value of x is in its own cache,

  • so it's just going to read the value of x there,

  • and it gets a value of 3.

  • So what's the problem there?

  • Yes?

  • AUDIENCE: The value is stale.

  • JULIAN SHUN: Yeah.

  • So the problem is that the value of x in the first processor's

  • cache is stale, because another processor updated it.

  • So now this value of x in the first processor's cache

  • is invalid.

  • So that's the problem.

  • And one of the main challenges of multicore hardware

  • is to try to solve this problem of cache coherence--

  • making sure that the values in different processors' caches

  • are consistent across updates.
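
Here is a small sketch (my own example, not the lecture's) of the access pattern just described: several threads load a shared x while one thread stores 5 into it. The hardware's coherence protocol is what guarantees the readers eventually observe 5 rather than a permanently stale 3; C11 atomics are used only so that the program itself has no data race.

```c
// Sketch of the shared-variable scenario: readers load x, one writer stores 5.
// Compile with: cc -pthread example.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 3;   // the shared location, initially 3

static void *reader(void *arg) {
    (void)arg;
    // The load may be served from this core's cache; coherence ensures the
    // cached copy is invalidated or updated when another core writes x.
    printf("reader sees x = %d\n", atomic_load(&x));
    return NULL;
}

static void *writer(void *arg) {
    (void)arg;
    atomic_store(&x, 5);   // other cores' cached copies must be invalidated
    return NULL;
}

int main(void) {
    pthread_t r1, r2, w;
    pthread_create(&r1, NULL, reader, NULL);
    pthread_create(&w,  NULL, writer, NULL);
    pthread_create(&r2, NULL, reader, NULL);
    pthread_join(r1, NULL);
    pthread_join(w,  NULL);
    pthread_join(r2, NULL);
    printf("final x = %d\n", atomic_load(&x));   // always 5 after the joins
    return 0;
}
```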

  • So one basic protocol for solving this problem

  • is known as the MSI protocol.

  • And in this protocol, each cache line is labeled with a state.

  • So there are three possible states--

  • M, S, and I. And this is done on the granularity of cache lines.

  • Because it turns out that storing this information

  • is relatively expensive, so you don't

  • want to store it for every memory location.

  • So they do it on a per cache line basis.
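
As a rough illustration of what those per-cache-line states do (M = Modified, S = Shared, I = Invalid), here is a toy software model; it is my own sketch, not anything from the lecture, and real hardware implements this with snooping or directory logic. A read leaves copies Shared, and a write makes the writer's copy Modified while invalidating everyone else's, which is exactly what prevents the stale-read problem above.

```c
// Toy model of MSI states for one cache line across NPROC caches.
#include <stdio.h>

#define NPROC 3
typedef enum { I_STATE, S_STATE, M_STATE } state_t;   // Invalid, Shared, Modified
static state_t line[NPROC] = { I_STATE, I_STATE, I_STATE };

static void read_line(int p) {
    for (int q = 0; q < NPROC; q++)
        if (q != p && line[q] == M_STATE)
            line[q] = S_STATE;        // modified copy is written back, then shared
    if (line[p] == I_STATE)
        line[p] = S_STATE;            // reader fetches a shared copy
}

static void write_line(int p) {
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            line[q] = I_STATE;        // invalidate every other copy
    line[p] = M_STATE;                // writer holds the only, modified copy
}

int main(void) {
    read_line(0);    // processor 0 loads x: its copy becomes S
    read_line(1);    // processor 1 loads x: both copies S
    write_line(1);   // processor 1 stores x = 5: P1 is M, P0 is invalidated
    printf("P0=%d P1=%d P2=%d (0=I, 1=S, 2=M)\n", line[0], line[1], line[2]);
    return 0;
}
```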

  • Does anyone know what the size of a cache line

  • is, on the machines that we're using?

  • Yeah?

  • AUDIENCE: 64 bytes.

  • JULIAN SHUN: Yeah, so it's 64 bytes.

  • And that's typically what you see today on most Intel and AMD

  • machines.

  • There are some architectures that have different cache line sizes,

  • like 128 bytes.

  • But for our class, the machines that we're using