  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality, educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: So welcome to the second lecture of 6.172,

  • Performance Engineering of Software Systems.

  • Today, we're going to be talking about Bentley rules

  • for optimizing work.

  • All right, so work, does anyone know what work means?

  • You're all at MIT, so you should know.

  • So in terms of computer programming,

  • there's actually a formal definition of work.

  • The work of a program on a particular input

  • is defined to be the sum total of all the operations executed

  • by the program.

  • So it's basically a gross measure

  • of how much stuff the program needs to do.

  • And the idea of optimizing work is

  • to reduce the amount of stuff that the program needs

  • to do in order to improve the running time of your program,

  • improve its performance.

  • So algorithm design can produce dramatic reductions

  • in the work of a program.

  • For example, if you want to sort an array of elements,

  • you can use an n log n time quicksort.

  • Or you can use an n squared time sort, like insertion sort.

  • So you've probably seen this before

  • in your algorithm courses.

  • And for large enough values of n,

  • an n log n time sort is going to be much

  • faster than an n squared sort.

  • So today, I'm not going to be talking about algorithm design.

  • You'll see more of this in other courses here at MIT.

  • And we'll also talk a little bit about algorithm design

  • later on in this semester.

  • We will be talking about many other cool tricks for reducing

  • the work of a program.

  • But I do want to point out that reducing

  • the work of a program doesn't automatically translate

  • to a reduction in running time.

  • And this is because of the complex nature

  • of computer hardware.

  • So there's a lot of things going on that aren't captured

  • by this definition of work.

  • There's instruction level parallelism, caching,

  • vectorization, speculation and branch prediction, and so on.

  • And we'll learn about some of these things

  • throughout this semester.

  • But reducing the work of a program

  • does serve as a good heuristic for reducing

  • the overall running time of a program, at least

  • to a first order.

  • So today, we'll be learning about many ways

  • to reduce the work of your program.

  • So rules we'll be looking at, we call

  • them Bentley optimization rules, in honor of Jon Louis Bentley.

  • So Jon Louis Bentley wrote a nice little book back

  • in 1982 called Writing Efficient Programs.

  • And inside this book there are various techniques

  • for reducing the work of a computer program.

  • So if you haven't seen this book before, it's very good.

  • So I highly encourage you to read it.

  • Many of the original rules in Bentley's book

  • had to deal with the vagaries of computer architecture

  • three and a half decades ago.

  • So today, we've created a new set

  • of Bentley rules just dealing with the work of a program.

  • We'll be talking about architecture-specific

  • optimizations later on in the semester.

  • But today, we won't be talking about this.

  • One cool fact is that Jon Louis Bentley is actually

  • my academic great grandfather.

  • So Jon Bentley was one of Charles Leiserson's

  • academic advisors.

  • Charles Leiserson was Guy Blelloch's academic advisor.

  • And Guy Blelloch, who's a professor at Carnegie Mellon,

  • was my advisor when I was a graduate student at CMU.

  • So it's a nice little fact.

  • And I had the honor of meeting Jon Bentley a couple of years

  • ago at a conference.

  • And he told me that he was my academic great grandfather.

  • [LAUGHING]

  • Yeah, and Charles is my academic grandfather.

  • And all of Charles's students are my academic aunts

  • and uncles--

  • [LAUGHING]

  • --including your T.A. Helen.

  • OK, so here's a list of all the work optimizations

  • that we'll be looking at today.

  • So they're grouped into four categories: data structures,

  • logic, loops, and functions.

  • So there's a list of 22 rules on this slide today.

  • In fact, we'll actually be able to look at all of them today.

  • So today's lecture is going to be structured

  • as a series of mini lectures.

  • And I'm going to be spending one to three slides on each one

  • of these optimizations.

  • All right, so let's start with optimizations

  • for data structures.

  • So first optimization is packing and encoding

  • your data structure.

  • And the idea of packing is to store more than one data value

  • in a machine word.

  • And the related idea of encoding is

  • to convert data values into a representation that

  • requires fewer bits.

  • So does anyone know why this could possibly reduce

  • the running time of a program?

  • Yes?

  • AUDIENCE: Need less memory fetches.

  • JULIAN SHUN: Right, so good answer.

  • The answer was, it might need fewer memory fetches.

  • And it turns out that that's correct,

  • because a computer program spends a lot of time moving

  • stuff around in memory.

  • And if you reduce the number of things

  • that you have to move around in memory,

  • then that's a good heuristic for reducing the running

  • time of your program.

  • So let's look at an example.

  • Let's say we wanted to encode dates.

  • So let's say we wanted to encode this string, September 11,

  • 2018.

  • You can store this using 18 bytes.

  • So you can use one byte per character here.

  • And this would require more than two double words,

  • because each double word is eight bytes, or 64 bits,

  • and you have 18 bytes.

  • And you have to move around these words

  • every time you want to manipulate the date.

  • But turns out that you can actually

  • do better than using 18 bytes.

  • So let's assume that we only want to store years

  • between 4096 BCE and 4096 CE.

  • So there are about 365.25 times 8,192 dates

  • in this range, which is three million approximately.

  • And you can use log base two of three million bits

  • to represent all the dates within this range.

  • So the notation lg here means log base two.

  • That's going to be the notation I'll be using in this class.

  • And L-O-G will mean log base 10.

  • So we take the ceiling of log base two of three million,

  • and that gives us 22 bits.

  • So a good way to remember how to compute the log

  • base two of something, you can remember that the log base

  • two of one million is 20, log base two of 1,000 is 10.

  • And then you can factor this out and then add in log base

  • two of three, rounded up, which is two.

  • So that gives you 22 bits.

  • And that easily fits within one 32-bit word.

  • Now, you only need one word instead of three words,

  • as you did in the original representation.

  • But with this modified representation,

  • now determining the month of a particular date

  • will take more work, because now you're not explicitly

  • storing the month in your representation.

  • Whereas, with the string representation,

  • you are explicitly storing it at the beginning of the string.

  • So this does take more work, but it requires less space.

  • So any questions so far?

  • OK, so it turns out that there's another way

  • to store this, which also makes it easy

  • for you to fetch the month, the year, or the day

  • for a particular date.

  • So here, we're going to use the bit fields facilities in C.

  • So we're going to create a struct called date underscore t

  • with three fields: the year, the month, and the day.

  • And the integer after the semicolon

  • specifies how many bits I want to assign

  • to this particular field in the struct.

  • So this says, I need 13 bits for the year,

  • four bits for the month, and five bits for the day.

  • So the 13 bits for the year is, because I

  • have 8,192 possible years.

  • So I need 13 bits to store that.

  • For the month, I have 12 possible months.

  • So I need log base two of 12 rounded up, which is four.

  • And then finally, for the day, I need

  • log base two of 31 rounded up, which is five.

  • So in total, this still takes 22 bits.

  • But now the individual fields can now

  • be accessed much more quickly, than if we had just

  • encoded the three million dates using sequential integers,

  • because now you can just extract a month just by saying whatever

  • you named your struct.

  • You can just say that struct dot month.

  • And that gives you the month.
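
A minimal C sketch of the bit-field struct just described (field names and exact types are illustrative, not necessarily the lecture's slide code):

```c
#include <stdio.h>

// 13 bits for the year, 4 for the month, 5 for the day: 22 bits
// total, though the compiler may pad the struct out to a full word.
typedef struct {
  signed int year    : 13;  // 8,192 possible years, roughly 4096 BCE to 4096 CE
  unsigned int month : 4;   // 12 possible months
  unsigned int day   : 5;   // up to 31 days
} date_t;

int main(void) {
  date_t d = {2018, 9, 11};
  printf("month = %d\n", d.month);  // fields are accessed directly
  return 0;
}
```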

  • Yes?

  • AUDIENCE: Does C actually store it like that?

  • Because I know in C++ it gets padded.

  • So then you end up taking more space.

  • JULIAN SHUN: Yeah, so this will actually

  • pad the struct a little bit at the end, yeah.

  • So you actually do require a little bit more than 22 bits.

  • That's a good question.

  • But this representation is much easier to access

  • than if you had just encoded the dates

  • as sequential integers.

  • Another point is that sometimes unpacking and decoding

  • are the optimization, because sometimes it

  • takes a lot of work to encode the values and to extract them.

  • So sometimes you want to actually unpack the values

  • so that they take more space, but they're faster to access.

  • So it depends on your particular application.

  • You might want to do one thing or the other.

  • And the way to figure this out is just to experiment with it.

  • OK, so any other questions?

  • All right, so the second optimization

  • is data structure augmentation.

  • And the idea here is to add information to a data structure

  • to make common operations do less work,

  • so that they're faster.

  • And let's look at an example.

  • Let's say we had two singly linked lists

  • and we wanted to append them together.

  • And let's say we only stored the head pointer to the list,

  • and then each element in the list

  • has a pointer to the next element in the list.

  • Now, if you want to append one list to another list,

  • well, that's going to require you walking down the first list

  • to find the last element, so that you

  • can change the pointer of the last element

  • to point to the beginning of the next list.

  • And this might be very slow if the first list is very long.

  • So does anyone see a way to augment this data structure so

  • that appending two lists can be done much more efficiently?

  • Yes?

  • AUDIENCE: Store a pointer to the last value.

  • JULIAN SHUN: Yeah, so the answer is

  • to store a pointer to the last value.

  • And we call that the tail pointer.

  • So now we have two pointers, both the head and the tail.

  • The head points to the beginning of the list.

  • The tail points to the end of the list.

  • And now you can just append two lists in constant time,

  • because you can access the last element in the list

  • by following the tail pointer.

  • And then now you just change the successor pointer

  • of the last element to point to the head of the second list.

  • And then now you also have to update the tail

  • to point to the end of the second list.

  • OK, so that's the idea of data structure augmentation.

  • We added a little bit of extra information

  • to the data structure, such that now appending

  • two lists is much more efficient than in the original method,

  • where we only had a head pointer.
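
A small C sketch of the augmentation just described; the node_t and list_t names and the append interface are illustrative:

```c
#include <stddef.h>

typedef struct node {
  int value;
  struct node *next;
} node_t;

typedef struct {
  node_t *head;
  node_t *tail;  // the augmentation: a pointer to the last element
} list_t;

// Append list b onto the end of list a in constant time.
void append(list_t *a, list_t *b) {
  if (a->tail != NULL) {
    a->tail->next = b->head;  // last element of a now points to b's head
  } else {
    a->head = b->head;        // a was empty
  }
  if (b->tail != NULL) {
    a->tail = b->tail;        // update the tail to the end of b
  }
}
```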

  • Questions?

  • OK, so the next optimization is precomputation.

  • The idea of precomputation is to perform some calculations

  • in advance so as to avoid doing these computations

  • at mission-critical times, to avoid doing them at runtime.

  • So let's say we had a program that needed

  • to use binomial coefficients.

  • And here's a definition of a binomial coefficient.

  • So it's basically the choose function.

  • So you want to count the number of ways

  • that you can choose k things from a set of n things.

  • And the formula for computing this

  • is, n factorial divided by the product of k factorial and n

  • minus k factorial.

  • Computing this choose function can actually

  • be quite expensive, because you have

  • to do a lot of multiplications to compute the factorial,

  • even if the final result is not that big,

  • because you have to compute one term in the numerator and then

  • two factorial terms in the denominator.

  • And then you also might run into integer overflow issues,

  • because n factorial grows very fast.

  • It grows super exponentially.

  • It grows like n to the n, which is even faster than two

  • to the n, which is exponential.

  • So doing this computation, you have

  • to be very careful with integer overflow issues.

  • So one idea to speed up a program that

  • uses these binomial coefficients

  • is to precompute a table of coefficients

  • when you initialize the program, and then

  • just perform table lookup on this precomputed table

  • at runtime when you need the binomial coefficient.

  • So does anyone know what the table that

  • stores binomial coefficients is called?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, Pascal's triangle, good.

  • So here is what Pascal's triangle looks like.

  • So on the vertical axis, we have different values of n.

  • And then on the horizontal axis, we have different values of k.

  • And then to get n choose k, you

  • just go to the nth row and the kth column

  • and look up that entry.

  • Pascal's triangle has a nice property,

  • that every element can be computed

  • as the sum of the element directly above it

  • and the element above and to the left of it.

  • So here, 56 is the sum of 35 and 21.

  • And this gives us a nice formula to compute

  • the binomial coefficients.

  • So we first check if n is less than k in this choose function.

  • If n is less than k, then we just

  • return zero, because we're trying

  • to choose more things than there are in a set.

  • If n is equal to zero, then we just

  • return one, because here k must also be equal to zero,

  • since we had the condition n less than k above.

  • And there's one way to choose zero things

  • from a set of zero things.

  • And then if k is equal to zero, we also

  • return one, because there's only one way

  • to choose zero things from a set of any number of things.

  • You just don't pick anything.

  • And then finally, we recursively call this choose function.

  • So we call choose of n minus one k minus one.

  • This is essentially the entry diagonally above and to the left.

  • And then we add in choose of n minus one k, which

  • is the entry directly above it.

  • So this is a recursive function for generating this Pascal's

  • triangle.
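
A C sketch of the recursive choose function just described:

```c
#include <stdint.h>

// Pascal's identity: C(n, k) = C(n-1, k-1) + C(n-1, k).
int64_t choose(int64_t n, int64_t k) {
  if (n < k) return 0;   // can't choose more things than there are
  if (n == 0) return 1;  // here k must also be 0, and C(0, 0) = 1
  if (k == 0) return 1;  // one way to choose zero things
  return choose(n - 1, k - 1) + choose(n - 1, k);
}
```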

  • But notice that we're actually still not doing precomputation,

  • because every time we call this choose function,

  • we're making two recursive calls.

  • And this can still be pretty expensive.

  • So how can we actually precompute this table?

  • So here's some C code for precomputing Pascal's triangle.

  • And let's say we only wanted coefficients up

  • to choose sides of 100.

  • So we initialize a matrix of 100 by 100.

  • And then we call this init choose function.

  • So first it goes from n equal zero, all the way up

  • to choose size minus one.

  • And then it sets choose of n, zero to be one.

  • It also sets choose of n, n to be one.

  • So the first line is, because there's

  • only one way to choose zero things

  • from any number of things.

  • And the second line is, because there's only one way

  • to choose n things from n things, which

  • is just to pick all of them.

  • And then now we have a second loop,

  • which goes from n equals one, all the way up to choose

  • size minus one.

  • Then first we set choose of zero n

  • to be zero, because here n is--

  • or k is greater than n.

  • So there's no way to pick more elements

  • from a set of things that is less than the number of things

  • you want to pick.

  • And then now you loop from k equals one, all the way up to n

  • minus one.

  • And then you apply this recursive formula.

  • So choose of n, k is equal to choose of n minus

  • one, k minus one plus choose of n minus one k.

  • And then you also set choose of k, n to be zero.

  • So this is basically all of the entries above the diagonal

  • here, where k is greater than n.

  • And then now inside the program whenever

  • we need a binomial coefficient that's less than 100,

  • we can just do table lookup into this table.

  • And we just index into this choose array.
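
A C sketch of the precomputation just described, with a choose size of 100 as in the lecture:

```c
#include <stdint.h>

#define CHOOSE_SIZE 100

int64_t choose[CHOOSE_SIZE][CHOOSE_SIZE];

// Fill in Pascal's triangle once at startup; afterwards every lookup
// of a coefficient below CHOOSE_SIZE is a simple array index.
void init_choose(void) {
  for (int64_t n = 0; n < CHOOSE_SIZE; ++n) {
    choose[n][0] = 1;  // one way to choose zero things
    choose[n][n] = 1;  // one way to choose all n things
  }
  for (int64_t n = 1; n < CHOOSE_SIZE; ++n) {
    choose[0][n] = 0;  // k greater than n: no way to choose
    for (int64_t k = 1; k < n; ++k) {
      choose[n][k] = choose[n - 1][k - 1] + choose[n - 1][k];
      choose[k][n] = 0;  // entries above the diagonal are zero
    }
  }
}
```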

  • So does this make sense?

  • Any questions?

  • It's pretty easy so far, right?

  • So one thing to note is, that we're still

  • computing this table at runtime, because we

  • have to initialize this table at runtime.

  • And if we want to run our program many times,

  • then we have to initialize this table many times.

  • So is there a way to only initialize this table once,

  • even though we might want to run the program many times?

  • Yes?

  • AUDIENCE: Put in the source code.

  • JULIAN SHUN: Yeah, so good, so put it in the source code.

  • And so we're going to do compile-time initialization.

  • And if you put the table in the source code,

  • then the compiler will compile this code

  • and generate the table for you at compile time.

  • So now whenever you run it, you don't

  • have to spend time initializing the table.

  • So the idea of compile-time initialization

  • is to store the values of constants

  • during compilation, therefore

  • saving work at runtime.

  • So let's say we wanted choose values up to 10.

  • This is the table, the 10 by 10 table storing

  • all of the binomial coefficients up to 10.

  • So if you put this in your source code,

  • now when you run the program, you

  • can just index into this table to get the appropriate constant

  • here.

  • But this table was just a 10 by 10 table.

  • What if you wanted a table of 1,000 by 1,000?

  • Does anyone actually want to type this in, a table of 1,000

  • by 1,000?

  • So probably not.

  • So is there any way to get around this?

  • Yes?

  • AUDIENCE: You could make a program that uses it.

  • And the function will be defined [INAUDIBLE] prints

  • out the zero [INAUDIBLE].

  • JULIAN SHUN: Yeah, so the answer is

  • to write a program that writes your program for you.

  • And that's called metaprogramming.

  • So here's a snippet of code that will

  • generate this table for you.

  • So it's going to call this init choose function

  • that we defined before.

  • And then now it's just going to print out C code.

  • So it's going to print out the declaration of this array

  • choose, followed by a left bracket.

  • And then for each row of the table,

  • we're going to print another left bracket

  • and then print the value of each entry in that row, followed

  • by a right bracket.

  • And we do that for every row.

  • So this will give you the C code.

  • And then now you can just copy and paste this and place it

  • into your source code.
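
A C sketch of such a metaprogram, reusing CHOOSE_SIZE, choose, and init_choose from the previous sketch:

```c
#include <stdio.h>

// Prints the precomputed table as C source code; pipe the output to a
// file or paste it into the program that needs the table.
int main(void) {
  init_choose();
  printf("int64_t choose[%d][%d] = {\n", CHOOSE_SIZE, CHOOSE_SIZE);
  for (int n = 0; n < CHOOSE_SIZE; ++n) {
    printf("  {");
    for (int k = 0; k < CHOOSE_SIZE; ++k) {
      printf("%lld, ", (long long)choose[n][k]);
    }
    printf("},\n");
  }
  printf("};\n");
  return 0;
}
```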

  • This is a pretty cool technique to get your computer

  • to do work for you.

  • And you're welcome to use this technique in your homeworks

  • and projects if you'd need to generate large tables

  • of constant values.

  • So this is a very good technique to know.

  • So any questions?

  • Yes?

  • AUDIENCE: Is there a way to write the output other programs

  • to a file, as oppose to having to copy and paste

  • into the source code?

  • JULIAN SHUN: Yeah, so you can pipe the output of this program

  • to a file.

  • Yes?

  • AUDIENCE: So are there compiler tools that can--

  • so we have preprocessor tools.

  • Is there [INAUDIBLE] processor can do that?

  • We compile the code, run it, and then [INAUDIBLE]..

  • JULIAN SHUN: Yeah, so I think you

  • can write macros to actually generate this table.

  • And then the compiler will run those macros

  • to generate this table for you.

  • Yeah, so you don't actually need to copy and paste it yourself.

  • Yeah?

  • CHARLES: And you know, you don't have

  • to write it in C. If it's quicker to write it in Python,

  • write it in Python, and just put it in the makefile

  • for the system you're building.

  • So the makefile says,

  • well, we're making this thing, so first

  • generate the file with the table, and then you

  • include that in whatever you're compiling,

  • and it's just one more step in the process.

  • And for sure, it's generally easier

  • to write these tables with a scripting language like Python

  • than writing them in C. On the other hand,

  • if you need experience writing in C, practice writing in C.

  • JULIAN SHUN: Right, so as Charles says,

  • you can write your metaprogram using any language.

  • You don't have to write it in C. You can write it in Python

  • if you're more familiar with that.

  • And it's often easier to write it using a scripting

  • language like Python.

  • OK, so let's look at the next optimization.

  • So we've already gone through a couple

  • of mini lectures.

  • So congratulations to all of you who are still here.

  • So the next optimization is caching.

  • The idea of caching is to store results

  • that have been accessed recently,

  • so that you don't need to compute them again

  • in the program.

  • So let's look at an example.

  • Let's say we wanted to compute the hypotenuse

  • of a right triangle with side lengths A and B.

  • So the formula for computing this is, you

  • take the square root of A times A plus B times B.

  • OK, so it turns out that the square root operator is actually

  • relatively expensive, more expensive than additions

  • and multiplications on modern machines.

  • So you don't want to have to call the square root

  • function if you don't have to.

  • And one way to avoid doing that is to create a cache.

  • So here I have a cache just storing the previous hypotenuse

  • that I calculated.

  • And I also store the values of A and B

  • that were passed to the function.

  • And then now when I call the hypotenuse function,

  • I can first check if A is equal to the cached value of A

  • and if B is equal to the cached value of B.

  • And if both of those are true, then

  • I already computed the hypotenuse before.

  • And then I can just return cached of h.

  • But if it's not in my cache, now I need to actually compute it.

  • So I need to call the square root function.

  • And then I store the result into cached h.

  • And I also store A and B into cached A

  • and cached B respectively.

  • And then finally, I return cached h.
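
A C sketch of this size-one software cache, with illustrative names:

```c
#include <math.h>

static double cached_A, cached_B, cached_h;

double hypotenuse(double A, double B) {
  if (A == cached_A && B == cached_B) {
    return cached_h;  // cache hit: skip the expensive sqrt
  }
  cached_h = sqrt(A * A + B * B);  // cache miss: compute and remember
  cached_A = A;
  cached_B = B;
  return cached_h;
}
```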

  • So this example isn't actually very realistic,

  • because my cache is only a size one.

  • And it's very unlikely, in a program,

  • you're going to repeatedly call some function

  • with the same input arguments.

  • But you can actually make a larger cache.

  • You can make a cache of size 1,000,

  • storing the 1,000 most recently computed hypotenuse values.

  • And then now when you call the hypotenuse function,

  • you can just check if it's in your cache.

  • Checking the larger cache is going

  • to be more expensive, because there are more values

  • to look at.

  • But it can still save you time overall.

  • And hardware also does caching for you,

  • as we'll talk about later on in the semester.

  • But the point of this optimization

  • is that you can also do caching yourself.

  • You can do it in software.

  • You don't have to let hardware do it for you.

  • And it turns out, for this particular program here,

  • actually, it is about 30% faster if you do hit the cache

  • about 2/3 of the time.

  • So it does actually save you time

  • if you do repeatedly compute the same values over and over

  • again.

  • So that's caching.

  • Any questions?

  • OK, so the next optimization we'll look at is sparsity.

  • The idea of exploiting sparsity, in an input,

  • is to avoid storage and computing

  • on zero elements of that input.

  • And the fastest way to compute on zero

  • is to just not compute on them at all,

  • because we know that any value plus zero

  • is just that original value.

  • And any value times zero is just zero.

  • So why waste a computation doing that when

  • you already know the result?

  • So let's look at an example.

  • This is matrix-vector multiplication.

  • So we want to multiply an n by n matrix by an n by one vector.

  • We can do dense matrix-vector multiplication

  • by just doing a dot product of each row in the matrix

  • with the column vector.

  • And then that will give us the output vector.

  • But if you do dense matrix-vector multiplication,

  • that's going to perform n squared or 36,

  • in this example, scalar multiplies.

  • But it turns out, only 14 of the entries in this matrix

  • are non-zero.

  • So you just wasted work doing the multiplication on the zero

  • elements, because you know that zero times any other element

  • is just zero.

  • So a better way to do this, is instead of

  • doing the multiplication for every element,

  • you first check if one of the arguments is zero.

  • And if it is zero, then you don't

  • have to actually do the multiplication.

  • But this is still kind of slow, because you still

  • have to do a check for every entry in your matrix,

  • even though many of the entries are zero.

  • So there's actually a pretty cool data structure

  • that doesn't store these zero entries.

  • And this will speed up your matrix-vector multiplication

  • if your matrix is sparse enough.

  • So let me describe how this data structure works.

  • It's called compressed sparse row or CSR.

  • There is an analogous representation

  • called compressed sparse column or CSC.

  • But today, I'm just going to talk about CSR.

  • So we have three arrays.

  • First, we have the rows array.

  • The length of the rows array is just equal to the number

  • of rows in a matrix plus one.

  • And then each entry in the rows array

  • just stores an offset into the columns array

  • or the cols array.

  • And inside the cols array, I'm storing

  • the indices of the non-zero entries in each of the rows.

  • So if we take row one, for example,

  • we have rows of one is equal to two.

  • That means I start looking at the second entry

  • in the cols array.

  • And then now I have the indices of the non-zero columns

  • in the first row.

  • So it's just one, two, four, and five.

  • These are the indices for the non-zero entries.

  • And then I have another array called vals.

  • The length of this array is the same as the cols array.

  • And then this array just stores the actual value

  • in these indices here.

  • So the vals array for row one is going

  • to store four, one, five, and nine, because these

  • are the non-zero entries in the first row.

  • Right, so the rows array just serves as an index

  • into this cols array.

  • So you can basically get the starting index

  • in this cols array for any row just

  • by looking at the entry stored at the corresponding location

  • in the rows array.

  • So for example, row two starts at location six.

  • So it starts here.

  • And you have indices three and five,

  • which are the non-zero indices.

  • So does anyone know how to compute

  • the length, the number of non-zeros in a row

  • by looking at the rows array?

  • Yes, yes?

  • AUDIENCE: You go to the rows array

  • and just drag the [INAUDIBLE]

  • JULIAN SHUN: Right.

  • AUDIENCE: [INAUDIBLE] the number of elements that are

  • [INAUDIBLE].

  • JULIAN SHUN: Yeah, so to get the length of a row,

  • you just take the difference between that row's offset

  • and the next row's offset.

  • So we can see that the length of the first row is four,

  • because its offset is two.

  • And the offset for row two is six.

  • So you just take the difference between those two entries.

  • We have an additional entry here.

  • So we have an entry for a sixth row here, because we

  • want to be able to compute the length of the last row

  • without overflowing in our array.

  • So we just created an additional entry in the rows array

  • for that.

  • So it turns out that this representation will save you

  • space if your matrix is sparse.

  • So the storage required by the CSR format

  • is order n plus nnz, where nnz is the number of non-zeros

  • in your matrix.

  • And the reason why you have n plus nnz,

  • well, you have two arrays here, cols and vals,

  • whose length is equal to the number of non-zeros

  • in the matrix.

  • And then you also have this rows array,

  • whose length is n plus one.

  • So that's why we have n plus nnz.

  • And if the number of non-zeros is much less than n squared,

  • then this is going to be significantly more compact

  • than the dense matrix representation.

  • However, this isn't always going to be the most

  • compact representation.

  • Does anyone see why?

  • Why might the dense representation

  • sometimes take less space?

  • Yeah?

  • Sorry.

  • AUDIENCE: Less space or more space?

  • JULIAN SHUN: Why might the dense representation sometimes take

  • less space?

  • AUDIENCE: I mean, if you have not many zeros,

  • then you'd end up with n squared plus something else

  • with the sparse representation.

  • JULIAN SHUN: Right.

  • So if you have a relatively dense matrix,

  • then it might take more space than storing it.

  • It might take more space in the CSR representation,

  • because you have these two arrays.

  • So if you take the extreme case where all of the entries

  • are non-zero, then both of these arrays

  • are going to be of length n squared.

  • So you already have 2n squared there.

  • And then you also need this rows array, which

  • is of length n plus one.

  • OK, so now I gave you this more compact representation

  • for storing the matrix.

  • So how do we actually do stuff with this representation?

  • So turns out that you can still do

  • matrix-vector multiplication using

  • this compressed sparse row format.

  • And here's the code for doing it.

  • So we have this struct here, which

  • is the CSR representation.

  • We have the rows array, the cols array, and then the vals array.

  • And then we also have the number of rows, n, and the number

  • of non-zeros, nnz.

  • And then now what we do, we call this SPMV or sparse

  • matrix-vector multiply.

  • We pass in our CSR representation,

  • which is A, and then the input array, which is x.

  • And then we store the result in an output array y.

  • So first, we loop through all the rows.

  • And then we set y of i to be zero.

  • This is just initialization.

  • And then for each of my rows, I'm

  • going to look at the column indices

  • for the non-zero elements.

  • And I can do that by starting at k equals to rows of i

  • and going up to rows of i plus one.

  • And then for each one of these entries,

  • I just look up the index, the column index

  • for the non-zero element.

  • And I can do that with cols of k, so let that be j.

  • And then now I know which elements to multiply.

  • I multiply vals of k by x of j.

  • And then now I just add that to y of i.

  • And then after I finish with all of these multiplications

  • and additions, this will give me the same result

  • as if I did the dense matrix-vector multiplication.
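
A C sketch of the CSR structure and the SpMV routine as described (the struct name is illustrative):

```c
#include <stdint.h>

typedef struct {
  int64_t *rows;  // offsets into cols/vals, length n + 1
  int64_t *cols;  // column indices of the non-zeros, length nnz
  double  *vals;  // values of the non-zeros, length nnz
  int64_t n;      // number of rows
  int64_t nnz;    // number of non-zeros
} sparse_matrix_t;

// y = A * x, touching only the non-zero entries of A.
void spmv(const sparse_matrix_t *A, const double *x, double *y) {
  for (int64_t i = 0; i < A->n; ++i) {
    y[i] = 0.0;
    for (int64_t k = A->rows[i]; k < A->rows[i + 1]; ++k) {
      int64_t j = A->cols[k];     // column index of this non-zero
      y[i] += A->vals[k] * x[j];  // nnz scalar multiplies in total
    }
  }
}
```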

  • So this is actually a pretty cool program.

  • So I encourage you to look at this program offline,

  • to convince yourself that it's actually

  • computing the same thing as the dense matrix-vector

  • multiplication version.

  • So I'm not going to prove this during lecture today.

  • But you can feel free to ask me or any of your TAs

  • after class, if you have questions about this.

  • And the number of scalar multiplications

  • that you have to do using this code

  • is just going to be nnz, because you're just operating

  • on the non-zero elements.

  • You don't have to touch all of the zero elements.

  • And in contrast, the dense matrix-vector multiply

  • algorithm would take n squared multiplications.

  • So this can be significantly faster for sparse matrices.

  • So turns out that you can also use a similar structure

  • to store a sparse static graph.

  • So I assume many of you have seen graphs

  • in your previous courses.

  • So here's what the sparse graph representation

  • looks like.

  • So again, we have two arrays.

  • We have offsets and edges.

  • The offsets array is analogous to the rows array.

  • And the edges array is analogous to the columns array

  • for the CSR representation.

  • And then in this offsets array, we store for each vertex

  • where its neighbors start in this edges array.

  • And then in the edges array, we just

  • write the indices of its neighbors there.

  • So let's take vertex one, for example.

  • The offset of vertex one is two.

  • So we know that its outgoing neighbors

  • start at position two in this edges array.

  • And then we see that vertex one has outgoing edges to vertices

  • two, three, and four.

  • And we see in the edges array two, three, four listed there.

  • And you can also get the degree of each vertex, which

  • is analogous to the length of each row,

  • by taking the difference between consecutive offsets.

  • So here we see that the degree of vertex one

  • is three, because its offset is two.

  • And the offset of vertex two is five.
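
A small C sketch of walking this representation; the function is illustrative, not from the lecture:

```c
#include <stdint.h>
#include <stdio.h>

// Print the degree and the outgoing neighbors of vertex v, given the
// offsets array (length number-of-vertices + 1) and the edges array.
void print_neighbors(const int64_t *offsets, const int64_t *edges,
                     int64_t v) {
  int64_t degree = offsets[v + 1] - offsets[v];  // difference of offsets
  printf("vertex %lld, degree %lld:", (long long)v, (long long)degree);
  for (int64_t k = offsets[v]; k < offsets[v + 1]; ++k) {
    printf(" %lld", (long long)edges[k]);
  }
  printf("\n");
}
```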

  • And it turns out that using this representation,

  • you can run many classic graph algorithms

  • such as breadth-first search and PageRank

  • quite efficiently, especially when the graph is sparse.

  • So this would be much more efficient

  • than using a dense matrix to represent the graph

  • and running these algorithms.

  • You can also store weights on the edges.

  • And one way to do that is to just create an additional array

  • called weights, whose length is equal to the number of edges

  • in the graph.

  • And then you just store the weights in that array.

  • And this is analogous to the values array in the CSR

  • representation.

  • But there's actually a more efficient way to store this,

  • if you always need to access the weight whenever

  • you access an edge.

  • And the way to do this is to interleave the weights

  • with the edges, so to store the weight for a particular edge

  • right next to that edge, and create an array of twice

  • number of edges in the graph.

  • And the reason why this is more efficient

  • is, because it gives you improved cache locality.

  • And we'll talk much more about cache locality

  • later on in this course.

  • But the high-level idea is, that whenever you access an edge,

  • the weight for that edge will also

  • likely be on the same cache line.

  • So you don't need to go to main memory

  • to access the weight of that edge again.

  • And later on in the semester we'll

  • actually have a whole lecture on doing optimizations

  • for graph algorithms.

  • And today, I'm just going to talk about one representation

  • of graphs.

  • But we'll talk much more about this later on.

  • Any questions?

  • OK, so that's it for the data structure optimizations.

  • We still have three more categories

  • of optimizations to go over.

  • So it's a pretty fun lecture.

  • We get to learn about many cool tricks for reducing

  • the work of your program.

  • So the next class of optimizations

  • we'll look at is logic. The first one is constant folding

  • and propagation.

  • The idea of constant folding and propagation

  • is to evaluate constant expressions

  • and substitute the result into further expressions, all

  • at compile time.

  • You don't have to do it at runtime.

  • So again, let's look at an example.

  • So here we have this function called orrery.

  • Does anyone know what orrery means?

  • You can look it up on Google.

  • [LAUGHING]

  • OK, so an orrery is a model of a solar system.

  • So here we're constructing a digital orrery.

  • And in an orrery we have this whole bunch

  • of different constants.

  • We have the radius, the diameter, the circumference,

  • cross area, surface area, and also the volume.

  • But if you look at this code, you

  • can notice that actually all of these constants

  • can be defined in compile time once we fix the radius.

  • So here we set the radius to be this constant here,

  • 6,371,000.

  • I don't know where that constant comes from, by the way.

  • But Charles made these slides, so he probably does.

  • CHARLES: [INAUDIBLE]

  • JULIAN SHUN: Sorry?

  • CHARLES: Radius of the Earth.

  • JULIAN SHUN: OK, radius of the Earth.

  • Now, the diameter is just twice this radius.

  • The circumference is just pi times the diameter.

  • Cross area is pi times the radius squared.

  • Surface area is circumference times the diameter.

  • And finally, volume is four times pi times the radius cubed

  • divided by three.
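
A C sketch of the orrery constants just listed (M_PI comes from the POSIX math.h; the slide code may differ):

```c
#include <math.h>
#include <stdio.h>

// With a sufficiently high optimization level, the compiler folds all
// of these into constants at compile time.
void orrery(void) {
  const double radius = 6371000.0;  // radius of the Earth, in meters
  const double diameter = 2 * radius;
  const double circumference = M_PI * diameter;
  const double cross_area = M_PI * radius * radius;
  const double surface_area = circumference * diameter;  // 4*pi*r^2
  const double volume = 4 * M_PI * radius * radius * radius / 3;
  printf("%e %e %e %e %e\n",
         diameter, circumference, cross_area, surface_area, volume);
}
```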

  • So you can actually evaluate all of these to constants

  • at compile time.

  • So with a sufficiently high level of optimization,

  • the compiler will actually evaluate all of these things

  • at compile time.

  • And that's the idea of constant folding and propagation.

  • It's a good idea to know about this, even though the compiler

  • is pretty good at doing this, because sometimes

  • the compiler won't do it.

  • And in those cases, you can do it yourself.

  • And you can also figure out whether the compiler

  • is actually doing it when you look at the assembly code.

  • OK, so the next optimization is common subexpression

  • elimination.

  • And the idea here is to avoid computing the same expression

  • multiple times by evaluating the expression once and storing

  • the result for later use.

  • So let's look at this simple four-line program.

  • We have a equal to b plus c.

  • The we set b equal to a minus d.

  • Then we set c equal to b plus c.

  • And finally, we set d equal to a minus d.

  • So notice here that the second and the fourth lines

  • are computing the same expression.

  • They're both computing a minus d.

  • And they evaluate to the same thing.

  • So the idea of common subexpression elimination

  • would be to just substitute the result of the first evaluation

  • into the place where you need it in a future line.

  • So here, we still evaluate the first line for a minus d.

  • But now in the second time we need a minus d.

  • We just set the value to b.

  • So now d is equal to b instead of a minus d.
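
A C sketch of the four-line example before and after the elimination:

```c
// Before: the fourth line recomputes a - d.
void before(double a, double b, double c, double d) {
  a = b + c;
  b = a - d;
  c = b + c;  // not a common subexpression: b changed on the line above
  d = a - d;
}

// After: reuse the a - d already computed into b.
void after(double a, double b, double c, double d) {
  a = b + c;
  b = a - d;
  c = b + c;
  d = b;
}
```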

  • So in this example, the first and the third line,

  • the right hand side of those lines actually look the same.

  • They're both b plus c.

  • Does anyone see why you can't do common subexpression

  • elimination here?

  • AUDIENCE: b changes in the second line.

  • JULIAN SHUN: Yeah, so you can't do common subexpression

  • for the first and the third lines, because the value of b

  • changes in between.

  • So the value of b changes on the second line.

  • So on the third line when you do b plus c,

  • it's not actually computing the same thing

  • as the first evaluation of b plus c.

  • So again, the compiler is usually

  • smart enough to figure this optimization out.

  • So it will do this optimization for you in your code.

  • But again, it doesn't always do it for you.

  • So it's a good idea to know about this optimization

  • so that you can do this optimization by hand when

  • the compiler doesn't do it for you.

  • Questions so far?

  • OK, so next, let's look at algebraic identities.

  • The idea of exploiting algebraic identities

  • is to replace more expensive algebraic expressions

  • with equivalent expressions that are cheaper to evaluate.

  • So let's look at an example.

  • Let's say we have a whole bunch of balls.

  • And we want to detect whether two balls collide

  • with each other.

  • Say, each ball has an x-coordinate, a y-coordinate, a z-coordinate,

  • as well as a radius.

  • And the collision test works as follows.

  • We set d equal to the square root

  • of the sum of the squares of the differences between each

  • of the three coordinates of the two balls.

  • So here, we're taking the square of b1's x-coordinate

  • minus b2's x-coordinate, and then

  • adding the square of b1's y-coordinate

  • minus b2's y-coordinate, and finally,

  • adding the square of b1 z-coordinate

  • minus b2's z-coordinate.

  • And then we take the square root of all of that.

  • And then if the result is less than

  • or equal to the sum of the two radii of the ball,

  • then that means there is a collision,

  • and otherwise, that means there's not a collision.

  • But it turns out that the square root operator,

  • as I mentioned before, is relatively expensive compared

  • to doing multiplications and additions and subtractions

  • on modern machines.

  • So how can we do this without using the square root operator?

  • Yes.

  • AUDIENCE: You add the two radii, and the distance is more than

  • the distance between the centers,

  • then you know that they must be overlapping.

  • JULIAN SHUN: Right, so that's actually a good fast path

  • check.

  • I don't think it necessarily always gives you

  • the right answer.

  • Is there another?

  • Yes?

  • AUDIENCE: You can square the sum of the radii

  • and compare that instead of taking

  • the square root of [INAUDIBLE].

  • JULIAN SHUN: Right, right, so the answer

  • is, that you can actually take the square of both sides.

  • So now you don't have to take the square root anymore.

  • So we're going to use the identity that

  • says, that if the square root of u

  • is less than or equal to v exactly when u is less than

  • or equal to v squared.

  • So we're just going to take the square of both sides.

  • And here's the modified code.

  • So now I don't have this square root anymore on the right hand

  • side when I compute d squared.

  • But instead, I square the sum of the two radii.

  • So this will give you the same answer.
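
A C sketch of the sqrt-free collision test just described; the ball_t layout is assumed from the description:

```c
#include <stdbool.h>

typedef struct {
  double x, y, z, r;  // coordinates and radius
} ball_t;

bool collides(const ball_t *b1, const ball_t *b2) {
  double dx = b1->x - b2->x;
  double dy = b1->y - b2->y;
  double dz = b1->z - b2->z;
  double dsquared = dx * dx + dy * dy + dz * dz;
  double rsum = b1->r + b2->r;
  return dsquared <= rsum * rsum;  // compare squares: no sqrt needed
}
```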

  • However, you do have to be careful with floating point

  • operations, because they don't work exactly

  • in the same way as real numbers.

  • So some numbers might run into overflow issues or rounding

  • issues.

  • So you do have to be careful if you're

  • using algebraic identities and floating point computations.

  • But the high-level idea is that you

  • can use equivalent algebraic expressions to reduce

  • the work of your program.

  • And we'll come back to this example

  • late on in this lecture when we talk

  • about some other optimizations, such as the fast path

  • optimization, as one of the students pointed out.

  • Yes?

  • AUDIENCE: Why do you square the sum

  • of these squares [INAUDIBLE]?

  • JULIAN SHUN: Which?

  • Are you talking about--

  • AUDIENCE: Yeah.

  • JULIAN SHUN: --this line?

  • So before we were comparing d.

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: Yeah, yeah, OK, is that clear?

  • OK.

  • OK, so the next optimization is short-circuiting.

  • The idea here is, that when we're

  • performing a series of tests, we can actually

  • stop evaluating this series of tests

  • as soon as we know what the answer is. So here's an example.

  • Let's say we have an array, a, containing

  • all non-negative integers.

  • And we want to check if the sum of the values

  • in a exceed some limit.

  • So the simple way to do this is, you just

  • sum up all of the values of the array using a for loop.

  • And then at the end, you check if the total sum

  • is greater than the limit.

  • So using this approach, you always

  • have to look at all the elements in the array.

  • But there's actually a better way to do this.

  • And the idea here is, that once you

  • know the partial sum exceeds the limit that you're

  • testing against, then you can just return true,

  • because at that point you know that the sum of the elements

  • in the array will exceed the limit,

  • because all of the elements in the array are non-negative.

  • And then if you get all the way to the end of this for loop,

  • that means you didn't exceed this limit.

  • And you can just return false.
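
A C sketch of both versions, with illustrative names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Always sums the whole array of non-negative values.
bool sum_exceeds(const int64_t *A, size_t n, int64_t limit) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += A[i];
  }
  return sum > limit;
}

// Short-circuits as soon as the partial sum passes the limit.
bool sum_exceeds_fast(const int64_t *A, size_t n, int64_t limit) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += A[i];
    if (sum > limit) return true;  // the answer is already known
  }
  return false;
}
```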

  • So this second program here will usually

  • be faster, if most of the time you

  • exceed the limit pretty early on when

  • you loop through the array.

  • But if you actually end up looking at most of the elements

  • anyways, or even looking at all the elements,

  • this second program will actually

  • be a little bit slower, because you have this additional check

  • inside this for loop that has to be done for every iteration.

  • So when you apply this optimization,

  • you should be aware of whether this will actually

  • be faster or slower, based on the frequency of when

  • you can short-circuit the test.

  • Questions?

  • OK, and I want to point out that there are short-circuiting

  • logical operators.

  • So if you do double ampersand, that's

  • short-circuiting logical and operator.

  • So if it evaluates the left side to be false,

  • it means that the whole thing has to be false.

  • So it's not even going to evaluate the right side.

  • And then the double vertical bar is going

  • to be a short-circuiting or.

  • So if it knows that the left side is true,

  • it knows the whole thing has to be true, because or just

  • requires one of the two sides to be true.

  • And it's going to short circuit.

  • In contrast, if you just have a single ampersand

  • or a single vertical bar, these are not

  • short-circuiting operators.

  • They're going to evaluate both sides of the argument.

  • The single ampersand and single vertical bar

  • turn out to be pretty useful when

  • you're doing bit manipulation.

  • And we'll be talking about these operators

  • more on Thursday's lecture.

  • Yes?

  • AUDIENCE: So if the left hand side evaluates to false,

  • and there were a call to a function

  • on the right hand side of an ampersand,

  • would it mean that function would never get called--

  • and you'd possibly never

  • find out that the right hand side would crash-- simply

  • because the left hand side was false?

  • JULIAN SHUN: Yeah, if you use a double ampersand,

  • then that would be true.

  • Yes?

  • AUDIENCE: [INAUDIBLE] check [INAUDIBLE]

  • that would cause the cycle of left hand,

  • so that the right hand doesn't get [INAUDIBLE]..

  • JULIAN SHUN: Yeah.

  • I guess one example is, if you might possibly index

  • an array out of bounds, you can first

  • check whether you would exceed the limit or be out of bounds.

  • And if so, then you don't actually do the index.

  • OK, a related idea is to order tests, such

  • that the tests that are more often successful come earlier.

  • And the ones that are less frequently successful

  • are later in the order, because you

  • want to take advantage of short-circuiting.

  • And similarly, inexpensive tests should precede expensive tests,

  • because if you do the inexpensive tests and your test

  • short-circuit, then you don't have

  • to do the more expensive tests.

  • So here's an example.

  • Here, we're checking whether a character is whitespace.

  • So a character is whitespace if it's

  • equal to the carriage return character,

  • if it's equal to the tab character,

  • if it's equal to space, or if it's

  • equal to the newline character.

  • So which one of these tests do you

  • think should go at the beginning?

  • Yes?

  • AUDIENCE: Probably the space.

  • JULIAN SHUN: Why is that?

  • AUDIENCE: Oh, I mean [INAUDIBLE]..

  • Well, maybe the newline [INAUDIBLE]..

  • Either of those could be [INAUDIBLE]..

  • JULIAN SHUN: Yeah, yeah, so it turns out

  • that the space and the newline characters

  • appear more frequently than the carriage return

  • and the tab. And the space is the most frequent,

  • because you have a lot of spaces in text.

  • So here I've reordered the test, so that the check for space

  • is first.

  • And then now if you have a character, that's a space.

  • You can just short circuit this test and return true.

  • Next, the newline character, I have it as the second test,

  • because these are also pretty frequent.

  • You have a newline for every new line in your text.

  • And then less frequent is the tab character,

  • and finally, the carriage return character, which

  • isn't that frequently used nowadays.

  • So now with this ordering, the most frequently successful

  • tests are going to appear first.

  • Notice that this only actually saves you

  • work if the character is a whitespace character.

  • If it's not a whitespace character,

  • then you're going to end up evaluating all of these tests

  • anyways.
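
A C sketch of the reordered whitespace test, relying on the short-circuiting || operator:

```c
#include <stdbool.h>

bool is_whitespace(char c) {
  return c == ' '    // most frequent: lots of spaces in text
      || c == '\n'   // one per line
      || c == '\t'
      || c == '\r';  // carriage return: rare nowadays
}
```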

  • OK, so now let's go back to this example of detecting collision

  • of balls.

  • So we're going to look at the idea of creating a fast path.

  • And the idea of creating a fast path

  • is, that there might possibly be a check that

  • will enable you to exit the program early,

  • because you already know what the result is.

  • And one fast path check for this particular program

  • here is, you can check whether the bounding boxes of the two

  • balls intersect.

  • If you know the bounding boxes of the two balls

  • don't intersect, then you know that the balls cannot collide.

  • If the bounding boxes of the two balls do intersect,

  • well, then you have to do the more expensive test,

  • because that doesn't necessarily mean

  • that the balls will collide.

  • So here's what the fast path test looks like.

  • We're first going to check whether the bounding boxes

  • intersect.

  • And we can do this by looking at the absolute value

  • of the difference on each of the coordinates

  • and checking if that's greater than the sum of the two radii.

  • And if so, that means that for that particular coordinate

  • the bounding boxes cannot intersect.

  • And therefore, the balls cannot collide.

  • And then we can return false if any one

  • of these tests returned true.

  • And otherwise, we'll do the more expensive test

  • of comparing d square to the square of the sum of the two

  • radii.

  • And the reason why this is a fast path

  • is, because this test here is actually cheaper

  • to evaluate than this test below.

  • Here, we're just doing subtractions, additions,

  • and comparisons.

  • And below we're using the square operator, which

  • requires a multiplication.

  • And multiplications are usually more expensive than additions

  • on modern machines.

  • So ideally, if we don't need to do the multiplication,

  • we can avoid it by going through our fast path.

  • So for this example, it probably isn't

  • worth it to do the fast path check since it's

  • such a small program.

  • But in practice there are many applications and graphics

  • that benefit greatly from doing fast path checks.

  • And the fast path check will greatly

  • improve the performance of these graphics programs.

  • There's actually another optimization

  • that we can do here.

  • I talked about this optimization a couple of slides ago.

  • Does anyone see it?

  • Yes?

  • AUDIENCE: You can factor out the sum of the radii

  • for [INAUDIBLE].

  • JULIAN SHUN: Right.

  • So we can apply common subexpression elimination here,

  • because we're computing the sum of the two radii four times.

  • We can actually just compute it once, store it in a variable,

  • and then use it for the subsequent three calls.

  • And then similarly, when we're taking

  • the difference between each of the coordinates,

  • we're also doing it twice.

  • So again, we can store that in a variable

  • and then just use the result in the second time.
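
A C sketch of the fast path combined with the common subexpression elimination just discussed (ball_t as in the earlier sketch):

```c
#include <math.h>
#include <stdbool.h>

bool collides_fast(const ball_t *b1, const ball_t *b2) {
  // Common subexpressions: each difference and the radius sum are
  // computed once and reused in both paths.
  double dx = b1->x - b2->x;
  double dy = b1->y - b2->y;
  double dz = b1->z - b2->z;
  double rsum = b1->r + b2->r;
  // Fast path: bounding-box test with no multiplications.
  if (fabs(dx) > rsum || fabs(dy) > rsum || fabs(dz) > rsum) {
    return false;  // the boxes don't intersect, so no collision
  }
  // Slow path: compare squared distance to the squared radius sum.
  return dx * dx + dy * dy + dz * dz <= rsum * rsum;
}
```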

  • Any questions?

  • OK, so the next idea is to combine tests together.

  • So here, we're going to replace a sequence of tests

  • with just one test or switch statement.

  • So here's an implementation of a full adder.

  • So a full adder is a hardware device

  • that takes as input three bits.

  • And then it returns the carry bit and the sum bit as output.

  • So here's a table that specifies for every possible input

  • to the full adder of what the output should be.

  • And there are eight possible inputs to the full adder,

  • because it takes three bits.

  • And there are eight possibilities there.

  • And this program here is going to check all the possibilities.

  • It's first going to check if a is equal to zero.

  • If so, it checks if b is equal to zero.

  • If so, it checks if c is equal to zero.

  • And if that's true, it returns zero and zero for the two bits.

  • And otherwise, it returns one and zero and so on.

  • So this is basically a whole bunch

  • of if else statements nested together.

  • Does anyone think this is a good way to write the program?

  • Who thinks this is a bad way to write the program?

  • OK, so most of you think it's a bad way to write the program.

  • And hopefully, I can convince the rest of you

  • who didn't raise your hand.

  • So here's a better way to write this program.

  • So we're going to replace these multiple if else clauses

  • with a single switch statement.

  • And what we're going to do is, we're going

  • to create this test variable.

  • That is a three-bit variable.

  • So we're going to place the c bit

  • in the least significant digit.

  • The b bit, we're going to shift it over by one,

  • so in the second least significant digit,

  • and then the a bit in the third least significant digit.

  • And now the value of this test variable

  • is going to range from zero to seven.

  • And then for each possibility, we

  • can just specify what the sum and the carry bits should be.

  • And this requires just a single switch statement, instead of

  • a whole bunch of if else clauses.
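
A C sketch of the combined test; the bit packing follows the description (a in the third least significant bit, b in the second, c in the first), with output parameters for the two result bits:

```c
void full_adder(int a, int b, int c, int *sum, int *carry) {
  int test = (a << 2) | (b << 1) | c;  // one three-bit value, 0 to 7
  switch (test) {
    case 0: *sum = 0; *carry = 0; break;
    case 1: *sum = 1; *carry = 0; break;
    case 2: *sum = 1; *carry = 0; break;
    case 3: *sum = 0; *carry = 1; break;
    case 4: *sum = 1; *carry = 0; break;
    case 5: *sum = 0; *carry = 1; break;
    case 6: *sum = 0; *carry = 1; break;
    case 7: *sum = 1; *carry = 1; break;
  }
}
```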

  • There's actually an even better way to do this,

  • for this example, which is to use table lookups.

  • You just precompute all these answers, store them in a table,

  • and then just look it up at runtime.

  • But the idea here is that you can combine multiple tests

  • in a single test.

  • And this not only makes your code cleaner,

  • but it can also improve the performance of your program,

  • because you're not doing so many checks.

  • And you won't have as many branch misses.

  • Yes?

  • AUDIENCE: Would coming up with logic gates for this

  • be better or no?

  • JULIAN SHUN: Maybe.

  • Yeah, I mean, I encourage you to see if you can write a faster

  • program for this.

  • All right, so we're done with two categories

  • of optimizations.

  • We still have two more to go.

  • The third category is going to be about loops.

  • So if we didn't have any loops in our programs,

  • well, there wouldn't be many interesting programs

  • to optimize, because most of our programs

  • wouldn't be very long running.

  • But with loops we can actually optimize these loops

  • and then realize the benefits of performance engineering.

  • The first loop optimization I want to talk about is hoisting.

  • The goal of hoisting, which is also called loop-invariant code

  • motion, is to avoid recomputing a loop-invariant code

  • each time through the body of a loop.

  • So if you have a for loop where in each iteration

  • are computing the same thing, well, you

  • can actually save work by just computing it once.

  • So in this example here, I'm looping

  • over an array of length N. And then I'm setting Y of i

  • equal to X of i times the exponential of the square root

  • of pi over two.

  • But this exponential square root of pi over two

  • is actually the same in every iteration.

  • So I don't actually have to compute that every time.

  • So here's a version of the code that does hoisting.

  • I just moved this expression outside of the for loop

  • and stored it in a variable factor.

  • And then now inside the for loop,

  • I just have to multiply by factor.

  • I already computed what this expression is.

  • And this can save running time, because computing

  • the exponential, the square root of pi over two,

  • is actually relatively expensive.
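
A minimal sketch of the hoisted version, assuming a function named scale and that math.h provides M_PI (as most implementations do):

```c
#include <math.h>
#include <stddef.h>

void scale(const double *X, double *Y, size_t N) {
  // Loop-invariant computation hoisted out and done once.
  double factor = exp(sqrt(M_PI / 2));
  for (size_t i = 0; i < N; i++) {
    Y[i] = X[i] * factor;  // only a multiply per iteration
  }
}
```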

  • So it turns out that for this example, you know,

  • the compiler can probably figure it out

  • and do this hoisting for you.

  • But in some cases, the compiler might not

  • be able to figure it out, especially

  • if these functions here might change throughout the program.

  • So it's a good idea to know what this optimization is,

  • so you can apply it in your code when the compiler doesn't do it

  • for you.

  • OK, sentinels, so sentinels are special dummy values

  • placed in a data structure to simplify the logic of handling

  • boundary conditions, and in particular the handling of loop

  • exit tests.

  • So here, I again have a program that checks

  • whether the sum of the elements in some array A

  • will overflow if I add all of the elements together.

  • And here, I've specified that all of the elements of A

  • are non-negative.

  • So how I do this is, in every iteration

  • I'm going to increment sum by A of i.

  • And then I'll check if the resulting sum is less than A

  • of i.

  • Does anyone see why this will detect if I had an overflow?

  • Yes?

  • AUDIENCE: We're a closed algorithm.

  • It's not taking any values.

  • JULIAN SHUN: Yeah, so if the thing I added in

  • causes an overflow, then the result is going to wrap around.

  • And it's going to be less than the thing I added in.

  • So this is why the check here, that

  • checks whether the sum is less than A of i,

  • will detect an overflow.

  • OK, so I'm going to do this check in every iteration.

  • If it's true, I'll just return true.

  • And otherwise, I get to the end of this for loop

  • where I just return false.
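
A sketch of this two-check version; the name overflow is an assumption, and note that signed overflow is technically undefined behavior in C, so this treats it as wrapping purely for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool overflow(const int64_t *A, size_t n) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; i++) {  // check 1: loop exit test
    sum += A[i];
    if (sum < A[i]) return true;    // check 2: wrapped => overflow
  }
  return false;
}
```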

  • But here on every iteration, I'm doing two checks.

  • I'm first checking whether I should

  • exit the body of this loop.

  • And then secondly, I'm checking whether the sum is less than A

  • of i.

  • It turns out that I can actually modify this program,

  • so that I only need to do one check

  • in every iteration of the loop.

  • So here's a modified version of this program.

  • So here, I'm going to assume that I

  • have two additional entries in my array A.

  • So these are A of n and A of n plus one.

  • So I assume I can use these locations.

  • And I'm going to set A of n to be the largest possible

  • 64-bit integer, or INT64 MAX.

  • And I'm going to set A of n plus one to be one.

  • And then now I'm going to initialize my loop variable

  • i to be zero.

  • And then I'm going to set the sum equal to the first element

  • in A or A of zero.

  • And then now I have this loop that

  • checks whether the sum is greater than or equal to A

  • of i.

  • And if so, I'm going to add A of i plus one to the sum.

  • And then I also increment i.

  • OK, and this code here does the same thing

  • as a thing on the left, because the only way

  • I'm going to exit this while loop is, if I overflow.

  • And I'll overflow if A of i becomes greater than sum,

  • or if the sum becomes less than A

  • of i, which is what I had in my original program.

  • And then otherwise, I'm going to just increment sum by A of i.

  • And then this code here is going to eventually overflow,

  • because if the elements in my array A

  • don't cause the program to overflow,

  • I'm going to get to A of n.

  • And A of n is a very large integer.

  • And if I add that to what I have,

  • it's going to cause the program to overflow.

  • And at that point, I'm going to exit this

  • for loop or this while loop.

  • And then after I exit this loop, I can check why I overflowed.

  • If I overflowed because of some element of A,

  • then the loop index i is going to be less than n,

  • and I return true.

  • But if I overflowed because I added in this huge integer,

  • well, then i is going to be equal to n.

  • And then I know that the elements of A

  • didn't cause me to overflow; the A of n value here did.

  • So then I just return false.

  • So does this make sense?

  • So here in each iteration, I only

  • have to do one check instead of two checks,

  • as in my original code.

  • I only have to check whether the sum is greater than

  • or equal to A of i.

  • Does anyone know why I set A of n plus one equal to one?

  • Yes?

  • AUDIENCE: If everything else in the array was zero, then

  • you still wouldn't have overflowed.

  • If you had been at INT64 MAX, it would overflow.

  • JULIAN SHUN: Yeah, so good.

  • So the answer is, because if all of my elements

  • were zero in my original array, then even

  • though I add in this huge integer,

  • it's still not going to overflow.

  • But now when I get to A of n plus one,

  • I'm going to add one to the sum.

  • And then that will cause the sum to overflow.

  • And then I can exit there.

  • So this is to deal with the boundary condition

  • when all the entries in my array are zero.
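
A sketch of the sentinel version just described. It assumes A has two writable extra slots at A[n] and A[n+1], that all real elements are non-negative, and, again, that signed arithmetic wraps on overflow for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool overflow_sentinel(int64_t *A, size_t n) {
  A[n] = INT64_MAX;  // sentinel that eventually forces an overflow
  A[n + 1] = 1;      // handles the all-zero boundary case
  size_t i = 0;
  int64_t sum = A[0];
  while (sum >= A[i]) {  // the single check per iteration
    sum += A[++i];
  }
  return i < n;  // true iff a real element of A caused the overflow
}
```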

  • OK, so next, loop unrolling, so loop unrolling

  • attempts to save work by combining

  • several consecutive iterations of a loop

  • into a single iteration, thereby reducing the total number

  • of iterations of the loop and consequently

  • the number of times that the instructions that control

  • the loop have to be executed.

  • So there are two types of loop unrolling.

  • There's full loop unrolling, where

  • I unroll all of the iterations of the for loop,

  • and I just get rid of the control-flow logic entirely.

  • Then there's partial loop unrolling, where I only

  • unroll some of the iterations but not all of the iterations.

  • So I still have some control-flow code in my loop.

  • So let's first look at full loop unrolling.

  • So here, I have a simple program that

  • just loops for 10 iterations.

  • The fully unrolled loop just looks like the code

  • on the right hand side.

  • I just wrote out all of the lines of code

  • that I have to do in straight-line code,

  • instead of using a for loop.

  • And now I don't need to check on every iteration,

  • whether I need to exit the for loop.

  • So this is full loop unrolling.
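
A sketch of a fully unrolled 10-iteration loop; the names sum_ten and A are assumptions following the slide:

```c
int sum_ten(const int *A) {
  int sum = 0;
  // All ten iterations written out as straight-line code,
  // with no loop-exit check at all.
  sum += A[0]; sum += A[1]; sum += A[2]; sum += A[3]; sum += A[4];
  sum += A[5]; sum += A[6]; sum += A[7]; sum += A[8]; sum += A[9];
  return sum;
}
```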

  • This is actually not very common,

  • because most of your loops are going

  • to be much larger than 10.

  • And oftentimes, many of your loop bounds

  • are not going to be determined at compile time.

  • They're determined at runtime.

  • So the compiler can't fully unroll that loop for you.

  • For small loops like this, the compiler

  • will probably unroll the loop for you.

  • But for larger loops, there actually

  • isn't a benefit to unrolling the loop fully,

  • because you're going to have a lot of instructions.

  • And that's going to pollute your instruction cache.

  • So the more common form of loop unrolling

  • is partial loop unrolling.

  • And here, in this example here, I've

  • unrolled the loop by a factor of four.

  • So I reduce the number of iterations of my for loop

  • by a factor of four.

  • And then inside the body of each iteration

  • I have four instructions.

  • And then notice, I also changed the logic

  • in the control-flow of my for loops.

  • So now I'm incrementing the variable j

  • by four instead of just by one.

  • And then since n might not necessarily

  • be divisible by four, I have to deal with the remaining

  • elements.

  • And this is what the second for loop is doing here.

  • It's just dealing with the remaining elements.

  • And this is the more common form of loop unrolling.
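
A sketch of unrolling by a factor of four, with a second cleanup loop for the remaining n mod 4 elements; the names sum_unrolled, A, and n are assumptions following the slide:

```c
#include <stddef.h>

int sum_unrolled(const int *A, size_t n) {
  int sum = 0;
  size_t j;
  // Main loop: four additions per iteration, so the exit condition
  // is checked only once every four elements.
  for (j = 0; j + 3 < n; j += 4) {
    sum += A[j];
    sum += A[j + 1];
    sum += A[j + 2];
    sum += A[j + 3];
  }
  // Cleanup loop: handle the remaining n mod 4 elements.
  for (; j < n; j++) {
    sum += A[j];
  }
  return sum;
}
```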

  • So the first benefit of doing this

  • is, that you have fewer checks to the exit condition

  • for the loop, because you only have

  • to do this check every four iterations instead

  • of every iteration.

  • But the second and much bigger benefit

  • is, that it allows more compiler optimizations,

  • because it increases the size of the loop body.

  • And it gives the compiler more freedom

  • to play around with code and to find ways to optimize

  • the performance of that code.

  • So that's usually the bigger benefit.

  • If you unroll the loop by too much,

  • that actually isn't very good, because now you're

  • going to be polluting your instruction cache.

  • And every time you fetch an instruction,

  • it's likely going to be a miss in your instruction cache.

  • And that's going to decrease the performance of your program.

  • And furthermore, if your loop body is already very big,

  • you don't really get additional improvements

  • from having the compiler do more optimizations,

  • because it already has enough code to work with.

  • So giving it more code doesn't actually give you much there.

  • OK, so I just said this.

  • The benefits of loop unrolling are a lower number of instructions

  • in loop control code.

  • And then it also enables more compiler optimizations.

  • And the second benefit here is usually the much more important

  • benefit.

  • And we'll talk more about compiler optimizations

  • in a couple of lectures.

  • OK, any questions?

  • OK, so the next optimization is loop fusion.

  • This is also called jamming.

  • And the idea here is to combine multiple loops

  • over the same index range into a single loop,

  • thereby saving the overhead of loop control.

  • So here, I have two loops.

  • They're both looping from i equal zero, all the way up

  • to n minus one.

  • The first loop, I'm computing the minimum of A of i

  • and B of i and storing the result in C of i.

  • The second loop, I'm computing the maximum of A of i

  • and B of i and storing the result in D of i.

  • So since these are going over the same index range,

  • I can fuse together the two loops,

  • giving me a single loop that does both of these lines here.

  • And this reduces the overhead of loop control code,

  • because now instead of doing this exit condition check two n

  • times, I only have to do it n times.

  • This also gives you better cache locality.

  • Again, we'll talk more about cache locality

  • in a future lecture.

  • But at a high level here, what it

  • gives you is, that once you load A of i and B of i into cache

  • when you compute C of i, they're also

  • going to be in cache when you compute D of i.

  • Whereas, in the original code, when you compute D of i,

  • it's very likely that A of i and B of i

  • are going to be kicked out of cache already,

  • even though you brought them in when you computed C of i.

  • For this example here, again, there's

  • another optimization you can do, common subexpression

  • elimination, since you're computing this expression

  • A of i is less than or equal to B of i twice.
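
A sketch of the fused loop, which also applies the common subexpression elimination just mentioned; the function name min_and_max is an assumption:

```c
#include <stddef.h>

void min_and_max(const double *A, const double *B,
                 double *C, double *D, size_t n) {
  for (size_t i = 0; i < n; i++) {
    // One pass computes both results while A[i] and B[i] are in cache.
    int a_smaller = (A[i] <= B[i]);  // common subexpression, done once
    C[i] = a_smaller ? A[i] : B[i];  // minimum
    D[i] = a_smaller ? B[i] : A[i];  // maximum
  }
}
```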

  • OK, next, let's look at eliminating wasted iterations.

  • The idea of eliminating wasted iterations

  • is to modify the loop bounds to avoid

  • executing loop iterations over essentially empty loop bodies.

  • So here, I have some code to transpose a matrix.

  • So I go from i equal zero to n minus one,

  • from j equals zero to n minus one.

  • And then I check if i is greater than j.

  • And if so, I'll swap the entries in A of i, j and A of j, i.

  • The reason why I have this check here

  • is, because I don't want to do the swap twice.

  • Otherwise, I'll just end up with the same matrix I had before.

  • So I only have to do the swap when i is greater than j.

  • One disadvantage of this code here is, I still

  • have to loop for n squared iterations,

  • even though only about half of the iterations

  • are actually doing useful work, because about half

  • of the iterations are going to fail this check here,

  • that checks whether i is greater than j.

  • So here's a modified version of the program, where I basically

  • eliminate these wasted iterations.

  • So now I'm going to loop from i equals one to n minus one,

  • and then from j equals zero all the way up to i minus one.

  • So now instead of going up to n minus one,

  • I'm going just up to i minus one.

  • And that basically puts this check,

  • whether i is greater than j, into the loop control code.

  • And that saves me the extra wasted iterations.
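
A sketch of the transpose with the i > j test folded into the loop bounds; the dimension N is an assumed compile-time constant for illustration:

```c
#define N 1024  // assumed matrix dimension

// In-place transpose: every iteration does useful work,
// because we only visit the lower triangle (j < i).
void transpose(double A[N][N]) {
  for (int i = 1; i < N; i++) {
    for (int j = 0; j < i; j++) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}
```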

  • OK, so that's the last optimization on loops.

  • Are there any questions?

  • Yes?

  • AUDIENCE: Doesn't the check still have [INAUDIBLE]?

  • JULIAN SHUN: So the check is--

  • so you still have to do the check in the loop control code.

  • But here, you also had to do it.

  • And now you just don't have to do it again

  • inside the body of the loop.

  • Yes?

  • AUDIENCE: In some cases, where it

  • might be more complex to do it, is it

  • also [INAUDIBLE] before you optimize it,

  • but it's still going to be fast enough [INAUDIBLE]..

  • Like in the first example, even though the loop is empty,

  • most of the time you'll be able to process [INAUDIBLE]

  • run the instructions.

  • JULIAN SHUN: Yes, so most of these

  • are going to be branch hits.

  • So it's still going to be pretty fast.

  • But it's going to be even faster if you just

  • don't do that check at all.

  • So I mean, basically you should just test it out

  • in your code to see whether it will give you

  • a runtime improvement.

  • OK, so last category of optimizations is functions.

  • So first, the idea of inlining is

  • to avoid the overhead of a function call

  • by replacing a call to the function with the body

  • of the function itself.

  • So here, I have a piece of code that's

  • computing the sum of squares of elements in an array A.

  • And so I have this for loop that in each iteration

  • is calling this square function.

  • And the square function is defined above here.

  • It just does x times x for input argument x.

  • But it turns out that there is actually some overhead

  • to doing a function call.

  • And the idea here is to just put the body

  • of the function inside the function that's calling it.

  • So instead of calling the square function,

  • I'm just going to create a variable temp.

  • And then I set sum equal to sum plus temp times temp.

  • So now I don't have to do the additional function call.

  • You don't actually have to do this manually.

  • So if you declare your function to be static inline,

  • then the compiler is going to try

  • to inline this function for you by placing

  • the body of the function inside the code that's calling it.

  • And nowadays, the compiler is pretty good at doing this.

  • So even if you don't declare static inline,

  • the compiler will probably still inline this code for you.

  • But just to make sure, if you want to inline a function,

  • you should declare it as static inline.
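
A sketch of the sum-of-squares example with the function declared static inline, as suggested; the names follow the discussion:

```c
#include <stddef.h>

// static inline asks the compiler to substitute the body
// of square at each call site.
static inline double square(double x) {
  return x * x;
}

double sum_of_squares(const double *A, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += square(A[i]);  // no call overhead once inlined
  }
  return sum;
}
```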

  • And you might ask, why can't you just use a macro to do this?

  • But it turns out that inline functions nowadays are just

  • as efficient as macros.

  • But they're better structured, because they evaluate

  • each of their arguments exactly once.

  • Whereas, macros just do a textual substitution.

  • So if you have an argument that's

  • very expensive to evaluate, the macro

  • might actually paste that expression multiple times

  • in your code.

  • And if the compiler isn't good enough

  • to do common subexpression elimination,

  • then you've just wasted a lot of work.
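
A hypothetical macro illustrating this double-evaluation hazard; expensive_call is a made-up name for an argument that is costly to evaluate:

```c
// The macro pastes its argument textually, so the argument
// expression appears (and runs) twice.
#define SQUARE_MACRO(x) ((x) * (x))

// SQUARE_MACRO(expensive_call())
//   expands to ((expensive_call()) * (expensive_call())),
//   calling expensive_call twice, whereas an inline function
//   would evaluate its argument exactly once.
```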

  • OK, so there's one more optimization--

  • or there's two more optimizations

  • that I'm not going to have time to talk about.

  • But I'm going to post these slides on Learning Modules,

  • so please take a look at them, tail-recursion elimination

  • and coarsening recursion.

  • So here is a list of most of the rules

  • that we looked at today.

  • There are two function optimizations

  • I didn't get to talk about; please take

  • a look at those offline, and ask your TAs if you

  • have any questions.

  • And some closing advice is, you should

  • avoid premature optimizations.

  • So all of the things I've talked about today

  • improve the performance of your program.

  • But you first need to make sure that your program is correct.

  • If you have a program that doesn't do the right thing,

  • then it doesn't really benefit you to make it faster.

  • And to preserve correctness, you should do regression testing,

  • so develop a suite of tests to check

  • the correctness of your program every time

  • you change the program.

  • And as I said before, reducing the work of a program

  • doesn't necessarily decrease its running time.

  • But it's a good heuristic.

  • And finally, the compiler automates

  • many low-level optimizations.

  • And you can look at the assembly code

  • to see whether the compiler did something.
