[MUSIC PLAYING] SPEAKER: OK. So what is dynamic programming? Normally, of course, I like to ask you to contribute ideas that we can discuss. But in this particular case, the point that I want to make is either you already knew the answer, or you're wrong. Dynamic programming is designed to be buzzword compliant. It's something that was invented by Richard Bellman, and he came up with the name. And it's actually-- and this is according to his own autobiography-- the problem was that Rand is a defense contractor. So he was working at Rand, working on computer algorithms, and Rand worked for the Air Force, and still works for-- it's part of the big military-industrial complex-- which meant that their budget depended ultimately on the Secretary of Defense. And if the Secretary of Defense didn't like what he was doing, or Congress didn't like what he was doing, and they got wind of it, that would be the end of it. So apparently, according to Bellman, the Secretary of Defense at the time had a pathological aversion to research. The way he described it is that his face would suffuse and he would turn red if you mentioned the word research in his presence. And he really didn't like it if you said something like mathematics. And so Bellman wanted to be sure that whatever he was doing, if it showed up in a Congressional report, or in a report that landed on the Secretary of Defense's desk, it wouldn't say anything objectionable. So he came up with this term programming, which in the math world means optimization-- find the best answer to a problem. And dynamic-- the idea of dynamic was that this is an adjective that, yes, sort of expresses that things are changing, but mostly it has no negative connotations. Dynamic just sounds good. So that's where the name comes from. So dynamic programming is basically a name that nobody could possibly object to, and therefore decide to cancel the project. There is a term that is much more descriptive of what's going on, which is a lookup table. So what dynamic programming really is, when it comes down to it, is a way of looking at certain kinds of problems. Linked data structures-- linked lists, tries, hash tables-- all of these sort of fall into a category of data structure that you use when you've got data and you want to map the connections between them. And so there's a whole category of different data structures that involve using pointers to map the connections between things. Dynamic programming is a category of algorithms where you say, in my problem, when I want to solve it, I end up asking the same question over and over again as part of solving it-- the same question with the same parameters, if you will. And so rather than doing all the same work over and over again, let me just remember the answer after I figure it out once, and the next time I ask the same question, I'll just go look up what the answer was. That's the idea of dynamic programming, and of a lookup table. You've already used lookup tables. For example, going way back to the beginning of the class, here's a for loop: for (int i = 0; i < strlen(str); i++), where str is some string variable that I've declared. This is a loop that goes through every character of the string str, one by one. But the problem is that every time you go through the loop, you recalculate the length of the string.
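As a rough sketch in C-- not the exact code shown in lecture-- the loop being described looks something like this, with str as a placeholder string:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    char str[] = "hello, world";  // placeholder string

    // strlen(str) gets re-evaluated on every single pass through the loop,
    // and strlen has to walk the whole string to find the '\0' each time.
    for (int i = 0; i < strlen(str); i++)
    {
        printf("%c\n", str[i]);
    }
    return 0;
}
```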
And calculating the length of the string in C takes some time, because you have to walk down the whole string until you reach that null character at the end. So going all the way back again to the beginning of the class, remember, we talked about how it would be better from a performance perspective not to keep recalculating the length of the string. Instead, why not say something like int length = strlen(str), save that value to a variable, and just look up that value every time we go through the loop and check whether we've reached the last value of i? There, you can think of length as a lookup table that stores one value. That's an example you've already seen. We're going to go on to lookup tables that store the answers to more than one thing. And so I'm going to go through today a few different examples of problems that you can solve using dynamic programming. These are problems where you can break them down in a way that you keep asking the same question over and over again and just look up the answer in a table or an array. And the first such problem I want to talk about is rod cutting. Now the idea of this is I have a rod of something valuable-- maybe it's a Tootsie Roll, or plutonium-- there's a wide range of things that come as sort of a long cylinder. And you want to chop it up into shorter pieces to sell. And what you know is that different length pieces-- different length rods-- sell for different amounts. And the question is, how should I cut this long thing up into smaller pieces to maximize the money that I can sell it for? So I know, for example, that a rod that's one inch long sells for $1, and a rod that's two inches long turns out to be a lot more valuable-- I can sell that for $5. And when it's three inches long, I can sell it for $8. Now there's a lot more demand for things that are three inches long than things that are four inches long, so if it's four inches long, I can sell it for $9, which is a little more but not a lot more. And then comes the question-- I've got a rod of length 10-- how should I chop it down into 1s and 2s and 3s and 4s to maximize the profit? And there's different ways I could do this. I could chop it into 10 little one-inch rods and sell each one for $1, and I've made $10. I could chop it into five two-inch rods, and that would get me 5 times $5, which is $25. That's a lot more money. So being the unabashed capitalist that I am, that's what I'm likely to do. And there are some in-between options that might make me some money. The question is, how do I quickly figure out what I should chop this rod into to maximize my profit, and be sure that that's the most I can sell it for-- that there wasn't some other way to chop this up that would be better? And one way to think about the problem is, if I've got a rod that's 10 inches long and I can cut it at every inch, there are 1, 2, 3, 4, 5, 6, 7, 8, 9 possible places I could cut the rod. And at any one of those places, I could either decide to make a cut or not make a cut, and that will affect the end result. So that's kind of 9 bits-- I either cut or I don't, a 1 or a 0. So I end up, in terms of the possible different combinations of how I could cut this, with every possible 9-bit number in binary, which has 2 to the 9th possible values, which is 512. So I could try 512 possible different ways to cut this rod, and I can figure out which one does the best.
And then I could try having a rod that was 20 inches long instead, and discover there's 19 possible places to decide whether I'm going to cut or not, and there's 2 to the 19th possibilities now, which is about 500,000. 2 to the 10th is about 1,000, 2 to the 20th is about a million, so 2 to the 19th is about a million divided by 2-- about 500,000. So in general, if we say this rod is n inches long, so that there are n minus 1 places where I could make a decision to cut it, there are roughly 2 to the n possible different combinations of cuts. And that's a lot of things to try. At the same time, you can sort of see that splitting it into rods of 1, 1, 2, 2, and 4 inches long-- this is not the only combination of cuts that would give me that. There are other combinations that would also give me two 1s, two 2s, and a 4. I could do the 4 first and then the two 2s and then the two 1s, and I'd get the same value out of it. So a lot of the possible ways that we could cut would give the same result in terms of the value. And so maybe I want to reorganize the problem so that I don't have to try everything, because I realize a lot of the time I'm just recalculating the same thing. And one way to do that-- and this is where this concept of dynamic programming, or a lookup table, comes in-- is to say, let me decide first where the leftmost cut will be. So instead of deciding, am I going to cut at 1 inch or not, am I going to cut at 2 inches or not, am I going to cut at 3 inches or not, I'm going to decide, what's the first place that I'm going to cut? That could be at 1 inch or 2 inches or 3 inches or 4 inches or 5 inches or 6 or 7 or 8 or 9 inches, or I could decide I'm not going to cut at all, I'm just going to leave the rod whole. So with this rod of length 10, there's 10 possible choices for the first cut. And the thing that I'm going to decide is that that splits the rod into two-- and the part on the left I'm not going to cut again. I'm not going to cut this 2 part-- or the 3 part-- I'm making that decision. This is the first piece that I'm going to sell. So the total value that I can get will be whatever I can sell the piece on the left for-- like $1 or $5 or $8 or $9-- plus whatever I can get for the right part if I chop it into smaller pieces in the best possible way. So for the part on the left, which I won't cut again, I can just look up the value of a rod of that length in the table. But I also need to figure out, for the part on the right, what is the best way to cut it up. And even that, if you draw out all the possibilities, you're going to end up with the same 2 to the n, except you can start seeing some commonalities. So for example, if you decided to cut off a rod of length 2, you've got eight inches left over, and you want to find the best way to cut that. Now, if instead you only chopped off one inch first-- that was your first cut-- and then you did another cut where you chopped off another inch, suddenly you end up with the same question: what is the best way to cut a rod of length 8? So in the possible ways of chopping this thing of length 9, one possibility is, again, we take one inch off to sell, and we have eight inches left. If right from the start we had taken two inches off, we would also have eight inches left. And so we're asking the same question both times. We've said, I've got some stuff I've already cut that I'm going to sell, and now I've got an eight-inch rod left-- how do I cut that? There's no point figuring that out twice.
We could just keep a little array in which we store, what's the best way to cut a rod of length 8? What's the best way to cut a rod of length 9, of 7, of 6, of 5, of 4? And for each one of those, we just look in the array and say, have we put a value in there yet? If so, we just use it. We don't recompute it. And the result is, if you start looking at it, we've got 10 choices for our first cut, and then on average we've got something like up to nine choices for our second cut, and up to eight choices for our third cut. And so for each of the possible 10 choices the first time, we've got nine choices to make for our second cut, so we're up to 90-- and that would get us back to exponential. But the point is that a lot of these cuts end up being the same. And so you end up saying, I only have to deal with 10 choices of what to make as the first cut for a rod of length 10, and nine choices for the first cut of a rod of length 9, and eight choices for the first cut to make on a rod of length 8, because after that I'm getting down to something shorter, and I'll figure that out eventually. And you end up with something like n squared. So for a rod of length 10, I'd have roughly 10 times 10 choices that I actually have to do work for. And so you get down to n squared as opposed to 2 to the n. Another way to do this is to say, let me start with a rod of length 1. I have no choices for how to cut it, so I know the answer for how I'm going to sell that. For a rod of length 2, I can either sell it whole, or make a cut and have rods of length 1 left over-- and I can look up what those are worth. And I'll store which of those was the best value for a rod of length 2. I'll say, for a rod of length 2, I can get $5, and the way to do that is to not cut. And now for a rod of length 3-- well, I could decide to cut it after 1 inch, and I've got 2 inches left over. So we'll look and say, well, the best thing to do with a rod of 2 inches is get $5 for it, so great-- for something with 3 inches, I can get $6 by making a cut after one inch. If I make the cut after 2 inches, I get $5 for selling a rod of length 2, and then I look up what to do with a rod of length 1, and the answer is you just sell it for $1, so that gets me $6. I look up what happens if I just sell a rod of length 3-- and that's $8. So I'll record-- great, for rods of length 3, don't chop it up, just sell it. For a rod of length 4, now I can chop it after 1 inch. I look up-- ah, for a rod of length 1, I sell it for $1, plus what do I do with a rod of length 3? I just sell it. I'm done. I've found that that gets me $9. What if I cut at 2 inches? Well, I've got a two-inch bar I sell for $5, and then I look up what to do with the right half, which is 2 inches long-- my table says don't cut it, just sell it for $5. And after I've tested all of those, I know that cutting 4 into 2 and 2 is going to be the best thing to do. So what's going to happen is, as you build this up, eventually you get to where you're dealing with your rod of length 10. For every first cut that you could make, the left half you know you're going to sell as a single piece, and the right half you can look up in the table, which says, what's the maximum value I can get for this? And the table will say, you can get this many dollars, and here's where you make the first cut. And then you've got a shorter piece left over, and you can look up where to make the next cut. Questions? So I'm going to move on to some other examples. And the next one I'm going to go through fairly quickly.
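A minimal sketch in C of that bottom-up rod cutting table-- not the lecture's code. The prices for lengths 1 through 4 are the ones from the example; lengths 5 through 10 aren't priced in the lecture, so they're left at 0 here (an assumption), which just means a longer piece has to be cut up before it can be sold:

```c
#include <stdio.h>

#define N 10

int main(void)
{
    // Prices from the lecture: $1, $5, $8, $9 for lengths 1 through 4.
    // Remaining entries default to 0 (assumed: long pieces aren't sellable whole).
    int price[N + 1] = {0, 1, 5, 8, 9};

    int best[N + 1];      // best[len] = most money we can get for a rod of length len
    int first_cut[N + 1]; // first_cut[len] = length of the first piece to cut off

    best[0] = 0;
    for (int len = 1; len <= N; len++)
    {
        best[len] = 0;
        // Try every possible first piece; the remaining (len - cut) inches are a
        // smaller problem we've already solved, so just look its value up.
        for (int cut = 1; cut <= len; cut++)
        {
            int value = price[cut] + best[len - cut];
            if (value > best[len])
            {
                best[len] = value;
                first_cut[len] = cut;
            }
        }
    }

    printf("Best value for a rod of length %i: $%i\n", N, best[N]);

    // Walk the table to recover the actual pieces, just like following the
    // "here's where you make the first cut" entries described above.
    printf("Pieces:");
    for (int len = N; len > 0; len -= first_cut[len])
    {
        printf(" %i", first_cut[len]);
    }
    printf("\n");
    return 0;
}
```

The first_cut array plays the role of the "here's where you make the first cut" entry stored alongside each value in the table.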
And the one after that, I'm going to go into even more detail than I did for the rod cutting, and we'll draw out that table, filling in some of the values. If you think back to networking last week, we've got all sorts of computers that are connected together. For instance, my computer, at least when I'm in my office, is probably hooked up through a department-wide server, or maybe just through Yale's university server. And Natalie also has a computer that's connected, and because our offices are close, they may actually have a connection directly to each other. Now, David and Doug up in the CS50 office at Harvard-- their computers may also be directly connected to each other, so they can send messages back and forth directly, and they're also connected to Harvard's server. But if I want to send a message to David-- and that might be an HTTP request, for example, if he's running a web server on his computer-- I'm going to need to find a way to hop from computer to computer. I can't talk directly to David's computer, because there's no wire running between them. So I need to find some computer where I say, I've got a message for David, do you know how to deliver that? And it will say, yeah, I can do that for you. And then it will pass it on to some other computer, saying this is for David, and it keeps getting passed from server to server until it gets to one that is actually connected to David's computer. And the problem of how to find who to pass that message to is called routing. And it's one of the problems in networking. It's particularly difficult because in networking, you have to deal with other people. When you're just writing a program of your own on your laptop, if something goes wrong, it was probably your fault, and you can probably fix it. If you're talking about a network, it could be that somebody turned off the lights somewhere else, or unplugged their computer without warning. That's completely out of your control, but you still have to deal with it. And so how to get from my computer to David's computer is something that you have to keep track of, and it may keep changing. And to get there, I need to be able to find not just how I could get to David's computer, but essentially who to pass that message to to get it there the fastest. I may have more than one way of doing it. And one way might go pretty fast, and the other way might go through China and Sierra Leone and a congested undersea cable to Africa, which has very poor internet links, and it will take a long time for the message to get to David. Unless, of course, somebody from MIT went and found the cable that connects Harvard to Yale and snipped it just for fun, in which case I might be stuck taking the longer route for a while, because something broke outside of my control. And the question I want to talk about today is, how do you figure out who to pass that message to? And this is another place where dynamic programming turns out to be really helpful. So the idea is, it's not just that I need to know how to get to David's computer-- I need to know how to get to everywhere on the internet. And everywhere else on the internet also needs to know how to get to everywhere on the internet. Everybody is trying to solve this problem of how to find the quickest way to get to any other computer. And to start out, I'm just plugged in-- I don't even know who I'm connected to, let alone who else is out there.
So the process is, rather than me trying to go snooping and exploring the entire internet-- and by the time I'm done, somebody will have unplugged their computer and the answer will need to change-- I want to try and make sure that we are all doing this work together and sharing as much information as possible. So what I will do is I will just broadcast a message to all of my neighbors, saying, aha, I'm here, I'm Benedict. And now Natalie will know, and Yale's server will know, aha, Benedict's computer is connected directly to us. So if we need to get a message to Benedict, we can just give it to him. Now at the same time, everybody else is going to do that. So I'll get a message from Natalie saying, here I am, and I'll get one from Yale saying, here I am. And now I know that I'm connected directly to Natalie and Yale. This is what we call being one hop away. So if I need to send a message, it needs to make one hop, just like an airplane makes one hop from city to city, to get to the next computer. And as far as I know, that's all there is in the world. But everybody's just done this, so now everybody knows all of the computers that they're adjacent to. The next thing is, we can all send that information around, and say, here I am, and oh, by the way, here's everybody that I'm connected to. So if you give me a message for any of those people-- anybody who passes a message to me that needs to go to Natalie, I can get it to her in one hop. So then you can figure out how many hops it takes you-- one more hop than that-- if you pass the message to me. So after we share that, we can update our lists. And what will happen is, Yale's server has learned from me, aha, I can get a message to Natalie for you-- just give it to me, and in one hop I'll get it to Natalie. And Yale's server says, yeah, but I can just give it to Natalie myself, so why bother doing that? So Yale's server will say, I just found a way to get to Natalie in two hops-- I don't care, I can get there in one hop by just giving it straight to Natalie. However, I will have learned that Yale can reach Qwest's server in one hop. And so if I need to pass a message to Qwest, I can give the message to Yale and ask it to give it to Qwest, and that's the fastest way I know to get there. And so at each step, everybody sends a pretty short message to all their neighbors. It takes a fixed amount of time. And we're accumulating this information and always keeping track of, what is the fewest number of hops it takes us to get to a server, and who do we give the message to to get there? And as long as I give the message to Yale, Yale can worry about how it gets to Qwest-- it's just promised me it takes one hop. If I do this again, now it turns out I can get to Google in three hops. I can get to Harvard in three hops. But all I care about is, I can get there in three hops, and the way I do it is I give the message to Yale's server. Yale's server doesn't even know directly how to talk to Harvard's server or to Google's server. It just knows to pass the message to Qwest's server. Qwest, by the way, owns a lot of the big fibers that connect big data centers and things. So when I traced what happens if I try to connect to Google's server from my office, it goes up through a department server-- the CS department's server-- then a Yale server, then it goes to Qwest, and then it goes to Google.
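What each machine keeps is essentially a little table of fewest-hops-so-far and who to hand the message to. A rough sketch in C of the update it does when a neighbor shares its table-- not code from the lecture; the array size, names, and initialization are made up for illustration:

```c
#include <limits.h>

#define MAX_SERVERS 100          // placeholder number of destinations
#define UNREACHABLE INT_MAX      // "I don't know a route yet"

// Assume hops[] starts out as UNREACHABLE everywhere except hops[myself] = 0.
// hops[d]     = fewest hops I currently know of to reach destination d
// next_hop[d] = which neighbor I hand the message to in order to get there
int hops[MAX_SERVERS];
int next_hop[MAX_SERVERS];

// Called when neighbor n tells me: "I can reach destination d in n_hops hops."
// If going through n beats what I already know, remember the shorter route.
void update_route(int n, int d, int n_hops)
{
    if (n_hops != UNREACHABLE && n_hops + 1 < hops[d])
    {
        hops[d] = n_hops + 1;  // one hop to the neighbor, plus its route
        next_hop[d] = n;       // just hand the message to that neighbor
    }
}
```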
There are some other ones, like Internet2, which connects a lot of universities, and America Online used to own a lot of this-- I don't know if they ever sold it off. There's a few companies that own most of those cables. And eventually, after four rounds, I find out that I can even reach David and Doug in four hops by giving the message to Yale. And Yale knows that it needs to just give that message to Qwest, which gives that message to Harvard, and then Harvard knows how to pass the message on. And this is in fact how the major routers on the internet share this information. Now, your computer or my computer-- what they typically know is, any message I need to send, I just send it to the server I'm connected to, to the wireless access point. And the wireless access point basically says, I send it to Yale's server. Or maybe it has a very small table-- it knows the following laptops are connected directly to me, and anything else I just send to Yale's server. But when you get up to the level of all of Yale University, for example, it may well start to participate in this process of sharing this information, to know, I could talk to Qwest, I could talk to Internet2, I could talk to Comcast-- there's a bunch of people I could talk to. Which cable should I send this message out on to make sure it gets where it's going? And the dynamic programming aspect of this is, at each step I'm reusing all the information that was computed before to just add another layer on top. How do I get to places in three hops? I'm not recomputing the route to get from Yale to David, let's say, when I figure out that I can get there in four hops. I'm just using the fact that I know the Yale server can do it. And so that reduces the amount of actual work that I have to do. Any questions? Now, the third category of algorithm that I want to talk about-- I'm going to talk about this one in a little more detail, and I'm going to talk about a few different applications of exactly the same algorithm, so not just of dynamic programming but of this specific algorithm, of different problems that it can solve. And the one I'm going to start with is DNA sequence alignment. You can also do this for aligning protein sequences. If you have a strand of DNA, it's made up of a chain of what are called bases, which are chemicals-- they're molecules-- and there's four of them that are used in DNA. There's adenine, and thymine, and guanine, and cytosine. And we'll typically just abbreviate each one with a letter. So if you were to unroll a strand of DNA and just write down the list of these bases-- these molecules that are connected in a long chain-- you essentially get a string that's made up of four letters. So it's a string of some combination of the characters A, C, G, and T. Now, it might be that I know by heart the entire genome of a mouse and I know what every single gene does-- or at least I know what some of them do. Now, over time we diverged from mice, and various mutations happened to our DNA, because whenever it gets copied, there are sometimes mistakes, and somehow, instead of being a proto-mouse, we ended up being modern humans, and that same proto-mouse, through different mutations, ended up being a modern mouse. So if I start sequencing my DNA, I might find a stretch of it that corresponds to a particular gene, and I don't know what it does, but I'd like to find out.
One way to do that would be to try and match it against the mouse genome and find the best possible place for it to match-- which gene of mouse DNA does this match the best? Keeping in mind that there might be some gaps-- a base might have disappeared, or gotten added-- and sometimes one of these bases might have gotten replaced by a different one, so we might have a mismatch. So there's different things that could happen in the process-- it won't be a perfectly identical gene in the mouse to what's in the human, it will just be mostly the same. And I could use that to say, what's the best match, and then use that to predict what this gene codes for, or what the function of the protein it codes for is. You can do the same thing with proteins-- they're made up of amino acids. You can look at that chain and look for similar things between one protein and a database of proteins where you know what they do, to see if you can predict the function. But the problem is there's a lot of different ways of doing it. So let me write these on the board. We've got AACAGTTACC. Now, let me try and come up with the worst possible match that I can: TAAGGTCA. So there's different possible ways that I could line these strings up. They're not the same length, so I'm going to need some gaps somewhere. There's different places I could try and put them and see how the characters line up. And the rule is going to be-- and this is based, in fact, on how common it is for a mutation to occur where one base switches to another one, versus a base actually being deleted or added in genetic mutations-- it turns out that it's roughly twice as likely during mutation that a base just changes to another base than that a base completely disappears or appears out of nowhere. So we will say that when we line up the strings, any time we have a mismatch between two characters-- two bases that are not the same-- we're going to give that a penalty of 1, a cost of 1. That's what we pay to force the strings to match up that way. And any time we've got a gap-- where, say, the C from the first string, or DNA strand, doesn't match anything in the second one, so that corresponds to a base getting inserted or deleted-- we're going to assign that a penalty, or a cost, of 2. So two mutations where a base turned into something else seem about as likely as having something just get added or deleted. And then I can add up all those costs, and I get the cost of that particular way of matching the strings. And I'm looking for the way to do this with the lowest possible cost-- the lowest cost match. This is also called edit distance, because it's also measuring how many edits I would have to make to a text-- how many changes I would have to make to turn it from one thing into another. This doesn't just have to be these four letters; this could be anything. But the thing that I've written on the board that I want you to see is that the worst possible way I could match these would be to say, what if I just deleted the entire gene from the mouse and then did a bunch of insertions to produce the entire gene from the human? So each one of these characters, we say, matches to a gap. That is one possibility. It's not a very good one. That would be 2 plus 2 plus 2 plus 2 plus 2 plus 2, and so on. That would cost a lot. That's a very unlikely way that we would have gotten from this to this, and we're looking for the most likely way.
But what it does say is that there's not really any point inserting yet more gaps in the middle, because those parts line up perfectly and won't change the matching score-- that's kind of silly. Once this thing is completely gone, I might as well start adding the new one in. So the longest possible sequence where we try and line the two of these up is the length of one plus the length of the other, which here is 18. And now, if we look at each of these columns, we sort of have a choice, the same way that with the rod we had a choice-- do we cut or do we not cut? Here we have a choice of, do we do a deletion-- so we had a letter in the first string, and we're going to match it to a gap in the second string, as if this letter got deleted. Or we could do an insertion, in which case it's kind of like this situation where we had nothing and we added a letter-- we could do that at the beginning of the string. Or we could actually try and match two letters. So at each point here, we've got three choices we can make-- an insertion, a deletion, or a match. And if we could make that choice independently between each of the letters, we'd have 3 to the 17th possible combinations of things to try. That's a lot. How much is 3 to the 17th? Nobody knows this off the top of your head? OK, the technical term is a lot. That's a number. It's defined as 3 to the 17th. So again, we don't want to try everything. And we really want to do this. Computational biologists really want to be able to look through a database of genes or of proteins and try and match something to figure out what it might do-- which parts of the genome that we just sequenced, or the DNA strand that we just sequenced, are worth zeroing in on and doing more studies on, because they might code for something important? It turns out that we like to do this also on occasion-- or at least we do do it-- when we want to look at two people's submissions and see how similar they are. So fundamentally, cheat checkers do this. They want to quickly compute, how many changes would I have to make to one student's program to turn it into another student's program? And once we've computed all of that, we want to look and find the ones where-- for most people maybe it took 1,000 edits, but here we've got a pair where it only took two. And that's kind of suspicious. So I'm going to sort of rephrase the problem. And it's going to be similar to rod cutting, where I'm going to say, let's pretend that I already know how to match the right portion of the two strings. So if I already know the best way to match GTTACC to GTCA, then to figure out the best way to match the whole thing, I just have to figure out how to do the first part. And this is kind of like saying, let me decide where I'm going to put the first letter of the first word-- where I want to put the first gap, let's say, or the first insertion-- and then figure out the rest of it. But I'm phrasing it going from the end to the beginning, kind of like I've shown here, just because when we actually compute the table, that's going to end up putting the answer in a more logical place. So we can kind of say, at the last position in this match, are we going to have an insertion, are we going to have a deletion, or are we going to try and match two characters? And that gives us three possibilities. And each one has a cost.
And so whatever the best way to line up these two strings is, the last thing that happens is going to be one of these three things-- those are the only three possible ways for the last part of the match to happen. It has to be an insertion or a deletion or a match, because those are the only options. So if we know the costs of those, then we can say the cost of matching the entire string is the cost of whichever choice we make here, plus the best edit distance cost-- the best alignment-- of whatever's left, which is slightly different in each of these three cases. Except that if, for instance, we do a deletion, and then-- for this choice where we've got TAAGGTCA at the bottom-- the next time around we do an insertion, and then the best of whatever's left, we would have AACAGTTACC matching to TAAGGTC-- go figure that out. Or we could have done the insertion first and then the deletion-- we would end up with the same two strings, but instead of having to try to match C to A, we would have modeled one insertion and one deletion. We'd have the same thing left here. So as we build these things up, we're going to try to avoid finding the best match for the same pair of strings twice. And realistically, what's going to happen is we can start with, well, what if both strings were empty? Then they match perfectly. No problem. What happens if, at the end, I would like to have the string C-- or the character C-- be deleted? So I would like to do some sort of match where the very last thing that happens is C is deleted, and the rest of that TAA-blah-blah-blah is somewhere over here, but I'll worry about where later. So in that case, we have a cost of 2, and when we're done, we would have used up this character C, because we deleted it. And so what we're left with after that is what is in the corresponding cell to the right in the table, which is the cost of matching nothing to nothing. Similarly, if we said the last thing that we want to have in the table is CC matching to nothing, the cost of that is 2 for deleting one C, plus the cost of matching the remaining C to nothing, which was this cell in the table. So we have 2 plus the 2 that's right here, which is 4. So in general, what's going to happen is, if I look at the cell where the little hand is right now-- that just disappeared-- right here, this is the cost-- we'll ultimately record the cost of matching GTTACC to GGTCA. And it will say, in order to get the best match between these-- the smallest edit distance, if you will-- here's what the total cost will be, and it will tell you, should you try and match those two G's to each other, or should you have a deletion or an insertion? That's what's going to be stored in this cell of the table. And ultimately that means entry 0, 0 of your array, at the beginning of the table, will tell you the cost of matching AACAGTTACC to TAAGGTCA. It will tell you what the total cost is, and it will tell you whether you should start with an insertion or a deletion or a match. And depending on which one, that's going to use up characters from one or both of the strings, which will move you to a new cell in the table that will tell you what to do next. So once it's primed, I can say, well, at the end, if I just look at the end of each string, I have to match the letter C to the letter A. And what is the best way to do that? Well, I've got three options. One option is I could delete the C. Oh, there's an error here-- this should read i plus 1.
So I could say, let me go over to cost i plus 1. So I've used up the C-- but I'm going to stay in-- sorry, i plus 1 is this row, but I'm going to stay in column j-- so I'm going to come down to here, which says I've used up the A. So this A right here got matched to a gap, which is like an insertion. It says, somewhere at the end of the table, I'm going to end with that A from the second string. And, oh, I still have the C floating around. So let me switch back to some white chalk. Left over, we still have a C here, because we were trying to match A to C, and so far we haven't dealt with that C. So we would have the cost of matching A to a gap, which is 2, plus whatever the cost of matching C to nothing is, which is 2. Or we could say, what if, instead of doing that, we delete the C, as opposed to inserting the A-- but it means we still have an A floating around here that we haven't used yet. That was an alternative way of matching C and A. So then we have a cost of 2, because we're doing a deletion, plus we have left over the cost, from the cell to the right in the table, of matching the string A to nothing-- so that would be a total cost of 4. Or we could try and match C and A. And when you match C and A, well, the cost of this is zero if the letters are the same and one if they're different. So in this case it's one, because C and A are not the same letter. And we're left with two empty strings, because we've used up the C and the A. So we end up with 1 plus the cost of whatever's over one and down one in the table, which is 0. So that total cost is 1. And of those three possibilities, we just take the smallest one. We say, ah, the smallest one was if I went diagonal-- if I matched A to C. So that's what I'll record in the table. The total cost of matching A to C is one, and the best thing to do is to actually try and match these two characters-- it would be a mismatch-- and then look up in the table what to do with whatever is left. And now I could say, well, what about matching the string CC to A? I can ask the same question. I could start by matching the letter C and the letter A, and now I've got a C left over. I could delete the first C, and now I've got an A and a C left over-- that would be going this way. Sorry-- if I match both, I have a C left over. If I just start with the C and do a deletion, I end up with a C and an A left over. If the first thing I do is insert the A, I've got a CC left over. So for each one of those, I've got a penalty of 2 to go right, 2 to go down, one to go diagonal, that I add to whatever's in that cell of the table. I discover that if I try and match A to C, I pay a penalty of one, and I've got a C left over that I have to delete, so there's a total cost of three. And that was as good an option as I had, so that's what I filled into the table. And then I can figure out the same thing for the string ACC to A, and just work backwards through this table and up the rows until it's all filled in. And the nice thing about this is, every time when I look at the table, I need three pieces of information: I need the piece of information in the cell just to my right, I need the piece of information in the cell immediately below me, and I need the piece of information that's one down and one to the right. And because I'm working backwards through the table, it's nice and easy, because that data's already there.
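A minimal sketch in C of filling in that table-- not the lecture's code-- using the penalties from the board (2 for a gap, 1 for a mismatch, 0 for a match), where cost[i][j] means the best way to align what's left of the two strings from positions i and j onward:

```c
#include <stdio.h>
#include <string.h>

#define MAX 100  // assumed maximum string length for this sketch

int min3(int x, int y, int z)
{
    int m = x < y ? x : y;
    return m < z ? m : z;
}

int edit_distance(const char *a, const char *b)
{
    int n = strlen(a);
    int m = strlen(b);
    int cost[MAX + 1][MAX + 1];

    // Base cases: aligning leftover characters against an empty string means
    // paying the gap penalty of 2 for each one.
    for (int i = 0; i <= n; i++)
    {
        cost[i][m] = 2 * (n - i);
    }
    for (int j = 0; j <= m; j++)
    {
        cost[n][j] = 2 * (m - j);
    }

    // Work backwards from the bottom right, so the three cells each entry
    // needs (right, below, and diagonal) are already filled in.
    for (int i = n - 1; i >= 0; i--)
    {
        for (int j = m - 1; j >= 0; j--)
        {
            int mismatch = (a[i] == b[j]) ? 0 : 1;
            cost[i][j] = min3(2 + cost[i + 1][j],            // a[i] matched to a gap (deletion)
                              2 + cost[i][j + 1],            // b[j] matched to a gap (insertion)
                              mismatch + cost[i + 1][j + 1]); // match or mismatch
        }
    }

    // Entry 0, 0 is the cost of aligning the two whole strings.
    return cost[0][0];
}

int main(void)
{
    printf("%i\n", edit_distance("AACAGTTACC", "TAAGGTCA"));
    return 0;
}
```

To recover the alignment itself, you would also record which of the three choices won in each cell, as described above, and walk the table from entry 0, 0.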
I started by filling in the bottom row and the last column, which are sort of special cases, and now I just have some loops that run backwards through this. And if one string is n letters long-- that's n as in number-- and the other string is m letters long-- that's m as in the mnemonic-- then the total is m times n. And I see Doug smiling at mnemonic. I have to attribute that particular joke to Professor Mitzenmacher from Harvard, as reported to me by my roommate, because I wasn't actually in the class where he made the joke. But I liked it. So you can fill up the table. And now, instead of having 3 to the n, essentially, things that you had to try, you only have something like n squared. So 18 letters-- 18 squared is a little less than 20 squared, and 20 squared is 400. So we've got at most about 400-- but in this case it's not even n-plus-m squared, it's really just m times n, which is really like 10 times 10. It's like 100. So we've got roughly 100 different steps to find the answer. If we tried everything, we'd be doing the same work over and over and over again, and we'd do more like 3 to the 17th, which again, is a really big number-- 2 to the 20th is a million, and 3 to the 17th is a lot bigger than that. It's about 130 million, there you go. I knew we had an idiot savant in the room somewhere. And again, each one of these cells in the table tells you essentially what to do. Do you match this first pair of letters? Do you do a deletion? Or do you do an insertion? And from that you can figure out how much of each of the strings is left, and then you look up, in the next cell of the table that corresponds to that, what the next thing to do is. And so to figure out exactly how these things match up, as well as just what the cost of matching them up in the best possible way is, you work your way through the table. Rather than trying to store all that information in each cell, and store the same information over and over again, you store only one piece of the answer in each cell. Questions? Yes. AUDIENCE: In the runtime, it says there's O of mn-- is the m a constant, or is it another-- what is that [INAUDIBLE]? SPEAKER: Right, so in this running time, where it says O of mn, what does that mean? What we're saying is we've got two different strings. One of them is n characters long, the other one is m characters long. So in terms of a function of the input, rather than just saying we've got two words that are each n letters long-- and then it's sort of n squared-- we're saying maybe they're very different lengths, and so we can be a little more precise as to how many steps it takes. But the key thing there is you see one thing times another, so that's kind of like something squared, as you think about, as these strings get longer and longer, how much work is this? So this is not too bad. And it's really actually pretty feasible to do this for DNA sequences that might be a few hundred base pairs long. It's feasible to do this for a source file from homework for cheat checking-- that's maybe a few thousand characters. No big deal, even though we've got 1,000 students, and we've got all the past submissions, so we've got 10,000 submissions, so we've got millions of pairs of files-- something like that, and that's not a big deal for a computer. And each one of those takes something like a million steps, because you've got 1,000 characters times 1,000 characters. We can manage that. Other questions? Yes.
AUDIENCE: Do you have to do brute force to fill out the table initially? SPEAKER: So, do you have to do brute force to fill out the table initially? That depends on how you define brute force. You do have to fill in all the cells of the table, because in order to find the answer at the top left, even though you're ultimately only going to use the value from one of these three, you need to look at all three values-- the one to the right and the two below. So in order to find the final answer, you have to have filled in the whole table. And there's m by n entries in the table, which is where you get this m times n, or n squared-ish, kind of thing. But compared to what you might think of as brute force-- trying every possible alignment of the two strings, which involves lots of subalignments, lots of parts that align the same way, in lots of different places-- we're not redoing those over and over. So we're not doing this really brutal brute force that would say there's something like 3 to the 2,000 steps to compare a single pair of files, and then we have to do a million of those. That would be a problem. Other questions? So I've sort of folded two applications into one right there with edit distance. One is this computational biology, where you treat a DNA strand or a protein sequence as a string. And the other was some sort of fancy file comparison, the kind you might use for cheat checking. You might also use this for spell checking, to figure out the best word in the dictionary-- not just, is the word there, but of the words that are in the dictionary, which one is most likely what I meant to type, based on some model of how frequent different kinds of errors are. And you could compute the edit distance to every word in the dictionary to propose a list of suggestions. But there's another application that I want to talk about, which is image stitching. This is what happens when you take out your cell phone and you build a panoramic image. So you sweep the camera around, and it's going to take a video, or it might just end up taking a series of pictures every second or two, and then it stitches them together into one big image. And there's a few parts to that process. One part is, for each of the pictures that got taken, figuring out how your camera moved between each one, so it can sort of transform the images to where they overlap in just the right places. But even then, they won't line up perfectly, because maybe somebody's walking around in the scene, or, it turns out, if you're actually walking along with your camera like this, as opposed to just rotating around, you get a parallax effect, where things that are closer to you appear to be moving faster than things that are further away. So right now I see part of the middle chair-- the right-hand side of it, to me, is hidden behind the tripod. But if I walk over here, that's come into view, and the left side of the seat is hidden behind the tripod. So within that panorama, maybe there's no perfect way to line up the images, because there's different information visible even in the overlapping areas. And to resolve that problem, the key is, once you figure out how the two images overlap, you want to take part of that overlapping area from one image and part of it from the other image. So there's a seam between the two. And a seam is just a connected line of pixels-- everything on one side comes from image A, everything on the other side comes from image B. And the question is, where should that seam go?
Now, that seam should go, ideally, somewhere where we won't notice it. So let's switch from one image to the other at a place where they are really similar. And somewhere where I'm seeing the parallax from the tripod moving, or where somebody was walking along in the scene, I try to make this seam where I connect the two images go around that, so it's not picking up a jump where the two images were actually different. Well, the simplest way to do that is to look at the overlapping rectangle of this pair of images and say, let's draw the seam where the pixels are really similar. And I could do that by looking, for instance, at the difference in pixel color between the two. In this case, it's nice and easy, because the images are black and white right here, so I just take the difference in intensity, which is a number between 0 and 255, and I'll get a difference that's between negative 255 and 255-- I might subtract white from black-- and I take the absolute value of that. If that's 0, it means that pixel was the same color in both images. If it's really big, it means the colors were really different. And now I can say the cost of a seam going through a particular pixel is that difference in values between the image pixels-- the difference in intensity. And so for the total seam, I want to figure out, from there, should I go right? Should I go down? Or should I go down and to the right? Which I can do by looking up in the table the cost of a seam passing through the pixel to the right, or the pixel down and to the right, or the pixel below. Here's the table-- it's just that now, instead of a cell in the table saying the cost of matching ACC, for example, to TCA, I'm going to interpret that cell as the cost of having the image seam run from here to the lower right corner. And so in that sense, if you're looking at image stitching, you can see that matrix-- it's this rectangle of overlapping pixels between the two images. And the algorithm to find the optimal seam is identical to edit distance. Absolutely identical. The only difference-- even though I said it was absolutely identical, there is of course a difference, otherwise it would be no fun-- the difference, and the reason I say the algorithm is identical, is that this cost function, which decides the total cost of this cell based on the cost of the thing to the right, the thing down and to the right, and the thing down below, is different. That cost function for edit distance was some number-- depending on whether they matched or mismatched, or it was an insertion or a deletion-- plus the cost of the thing to the right or down or diagonal. So there was a different penalty that you paid for moving right in this table, or down, or diagonal. With this image stitch right now, the way that I've defined it, the cost is the same-- the penalty that you pay for having a seam go through a particular pixel is the same whether you then go on to the right or diagonal or down. It's just the difference in intensity. So you would take the difference in intensity plus the minimum of what's to the right, what's down below, or what's diagonal to the right. Slightly different cost function, but really the same algorithm. And there's ways to implement this where you can essentially say to the algorithm, OK, go fill in this table, use this cost function-- so that you could use exactly the same code. You implement this once, and you can do edit distance and you can also do image stitching.
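A rough sketch in C of that seam table-- not the lecture's code. It assumes diff[][] already holds the per-pixel absolute intensity differences over an H-by-W overlap region (both placeholders):

```c
#define H 480  // placeholder height of the overlapping rectangle
#define W 640  // placeholder width of the overlapping rectangle

int diff[H][W];  // |intensity in image A - intensity in image B| per pixel,
                 // assumed to be filled in already
int cost[H][W];  // cost of the cheapest seam from this pixel to the lower right

void fill_seam_table(void)
{
    // Work backwards from the lower right corner, just like the edit distance
    // table, so the cells we need are already filled in.
    for (int i = H - 1; i >= 0; i--)
    {
        for (int j = W - 1; j >= 0; j--)
        {
            if (i == H - 1 && j == W - 1)
            {
                cost[i][j] = diff[i][j];                  // lower right corner
            }
            else if (i == H - 1)
            {
                cost[i][j] = diff[i][j] + cost[i][j + 1]; // bottom row: only right
            }
            else if (j == W - 1)
            {
                cost[i][j] = diff[i][j] + cost[i + 1][j]; // right column: only down
            }
            else
            {
                // Same penalty no matter which way the seam continues: the
                // pixel's difference plus the cheapest of right, down, diagonal.
                int right = cost[i][j + 1];
                int down  = cost[i + 1][j];
                int diag  = cost[i + 1][j + 1];
                int best  = right < down ? right : down;
                if (diag < best)
                {
                    best = diag;
                }
                cost[i][j] = diff[i][j] + best;
            }
        }
    }
}
```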
And this is one of the respects in which algorithms and the abstraction of computer science become really interesting-- when you start to see two problems that seem really unrelated and they turn out to be the same. And it's even more fun when you can write one program that does two completely different things for you and you don't even have to re-implement it. So, any questions? I'm going to end, then, with a demo of one more algorithm that uses dynamic programming-- and you can find various demos of this on the web, this is just one of them-- called seam carving. This is very similar to image stitching, where we found that seam going from the upper left corner to the lower right corner. The idea here is that we want to resize the image on the right. And you did this in p set 3-- you resized images-- or p set 4, excuse me. You resized images. And if you made them narrower, they got kind of squashed, and if you made them wider, they got kind of stretched, and everything looked wrong. Wouldn't it be nice if we could resize the image but have it still look normal? So the idea of the seam carving algorithm is essentially to resize the image, but instead of scaling every pixel to be a little narrower than it was before, let's just take a column of pixels and delete it. And so we might take this column of pixels right here and delete it, and maybe nobody would notice. The problem is, you can only do this so many times in an image before somebody notices, because you get some sort of jump where you've deleted useful information. And one of the reasons for that is that, for example, in this column right here, in the water you could get away with deleting a pixel. In the sky you could get away with deleting a pixel. But in the balloon, you'd better not delete a pixel, because that would be really obvious-- the balloon will start to look really funny really fast. So instead of finding a straight line of pixels to delete, let's find a wiggly line. We're still deleting one pixel from every row, so every row will end up being one pixel narrower than it was before, which means the image will still be rectangular, but we have a choice: if we delete a particular pixel at the top of the image, on the next row down we could either delete the pixel right below it, or the one to the left, or the one to the right. So this is also pretty much the same-- we've got these three choices, just like we had before-- with a slightly different cost function. And then you could start by deleting any one of the pixels at the top, and you pick the one that's going to have the lowest cost. And that cost might be how different you are from your neighboring pixels on the left and right. So if you're pretty much the same color as the thing to your left and the thing to your right, nobody will notice if you go away. And with just a little bit of extra work, you can then update that information-- that table-- to reflect the fact that you've deleted this one pixel from every row. You can patch up all of the seams that that would have affected, to figure out, if I need to squash by another pixel, what do I throw out? And you can start resizing the image. And you can see, it's deleting a lot of sky. It's deleting a lot of water. It's being pretty careful about how it thins out these bushes, and thinning them a lot less than it thins some other things, trying to do it in sensible ways, so the stuff that we're used to seeing is mostly staying normal and the empty space is getting squashed.
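A rough sketch in C of the table for one vertical seam-- not the lecture's or the paper's code. The energy function here is a simple stand-in (how different a pixel is from its left and right neighbors, as described above); gray[][] and the dimensions are placeholders:

```c
#include <stdlib.h>

#define H 480  // placeholder image height
#define W 640  // placeholder image width

int gray[H][W];  // grayscale intensity of each pixel, assumed already loaded
int cost[H][W];  // cheapest seam cost from this pixel down to the bottom row

// Stand-in energy: how different a pixel is from its horizontal neighbors.
int energy(int i, int j)
{
    int left  = (j > 0)     ? abs(gray[i][j] - gray[i][j - 1]) : 0;
    int right = (j < W - 1) ? abs(gray[i][j] - gray[i][j + 1]) : 0;
    return left + right;
}

void fill_seam_costs(void)
{
    // Bottom row: a seam ending here just pays that pixel's energy.
    for (int j = 0; j < W; j++)
    {
        cost[H - 1][j] = energy(H - 1, j);
    }

    // Every other row: pay this pixel's energy, then continue down,
    // down-left, or down-right, whichever is cheapest.
    for (int i = H - 2; i >= 0; i--)
    {
        for (int j = 0; j < W; j++)
        {
            int best = cost[i + 1][j];
            if (j > 0 && cost[i + 1][j - 1] < best)
            {
                best = cost[i + 1][j - 1];
            }
            if (j < W - 1 && cost[i + 1][j + 1] < best)
            {
                best = cost[i + 1][j + 1];
            }
            cost[i][j] = energy(i, j) + best;
        }
    }
    // The seam to delete starts at whichever top-row pixel has the lowest
    // cost[0][j]; each row's down/down-left/down-right choice can be recovered
    // by checking which neighbor produced the minimum.
}
```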
But that empty space is in different places depending on where you are in the image. And down here, for example, we don't have a lot of empty space. It's hard to delete grass without you knowing it, because you're going to get half a blade of grass and it's going to look funny. And there's only so much water you can delete before that becomes obvious. So somewhere, we had to start deflating the reflection of the balloon. So it looks funny, but it looks less funny than if everything had just been squashed by an equal amount. As long as you save the information about what you were deleting, you can re-expand it. So you see, even if we'd squashed a whole bunch, this balloon is still whole, because it's so easy to delete pixels in the sky and still have the sky look good that there's no need to shrink the balloon. The reflection of the balloon-- that got harder, because we're balancing it off with deleting stuff in the grass. So it takes just a little bit of work beyond that image seam and edit distance problem to do this seam carving. And it takes just a little bit more work to be able to resize both horizontally and vertically, which is pretty cool. And this is something that was published at the ACM SIGGRAPH conference, which is a giant computer graphics conference every August, about 10 years ago, I want to say. I can tell you the exact date by looking at the reference-- 2007. So yes, 10 years ago. And it's actually a pretty short paper. And it's really easy to understand, because it's a pretty simple algorithm. And it's a really cool idea. And so it's one of these rare papers that you can sit down and read and understand. You can go to the talk and understand it. And you can go home afterwards, and if you've got a little bit of a background in computer science, and particularly in computer graphics, you can understand this. And you can just sit down, and in an hour you can implement it. And so you get these web demos that popped up. Photoshop can do this. And it started appearing all over the place really fast, because it was a really good idea, and it worked surprisingly well. And it does that through this dynamic programming. Let's not recompute information about how to get from some particular point-- how to delete a seam from there down to the bottom. It doesn't compute that for every possibility. It just sort of figures out, well, from here, the best thing to do would be to go down, or the best thing to do would be to go down and to the left. And that pixel worries about how to go on from there. Any questions? OK, well, thank you all for coming. Remember that tomorrow morning there will be another lecture streamed from Harvard. And if you're at Harvard, you can always have fun going. We'll talk about an introduction to Python, which is another programming language, like C. It's got slightly different syntax, but the same basic deal. It tends to get used for certain kinds of programs where its syntax makes it a little bit easier to write them. Then, instead of a programming assignment, you'll be getting a more fun sort of written exercise to work on over the weekend. Remember your exam-- that will come out after lecture tomorrow, and you'll have three days for that. Then next week at Yale, there are no sections because of our fall break. So there will be no sections Tuesday or Wednesday. You've got some time off. You've got some time to recover. You've got some time for sleep.
And next Friday, which is during our fall break, there will be another lecture streamed from Harvard that will be more about Python. And p set 6 will be coming out. It's likely to tie a lot of the things we've been talking about this week and next week together into a programming assignment. And you will of course have 10 days to work on that. So you can look at it when it comes out, but it doesn't have to ruin the rest of your fall break. I promise. And with that, again, thank you for coming. This was CS50.