字幕列表 影片播放 列印英文字幕 [MUSIC PLAYING] SPEAKER 1: Cryptography. What is it, and why is it important? We're going to answer those two questions in exactly that order. Let's start with what cryptography is. It's the art and science of obscuring, and ideally protecting, information. Now it's an art and a science because there's math involved with it. It's pretty straightforward to manipulate characters in some way by adding some constant number to them or to change them in some systematic manner. But it's an art, because doing so in a way to defend against potential attacks is not as easy as it might first appear. There's a lot of guesswork and calculation that needs to go into play to find a really strong cipher. Cryptography gives us the opportunity to have a basic level of security against an adversary who might do bad things with the information. We usually contrast, in cipher information, with information that is presented in the clear, which is to say there's no protection surrounding it at all. And it's generally considered better to protect information using cryptography than to have information just freely available out there. Now a cipher, we're going to start by talking about cryptography sort of through history. We'll lead up to more modern forms of cryptography, which are derived from more ancient forms of cryptography. But a cipher is one of the most fundamental forms of cryptography. And ciphers are algorithms. And recall that an algorithm is just a step-by-step set of instructions that we use to complete a task. And in case, the task is to obscure or encipher information. And ciphers can also be used in reverse to unobscure, or decipher, that same information that was previously encoded or enciphered. Now there are many different ciphers out there that have varying levels of security potential. Some of the more ancient ciphers that we're going to start with should be [INAUDIBLE] be considered to have no security potential at all considering how easy they are to crack. But again, this leads into the more modern approach to cryptography, which is much more secure than some of these basic ones. And now let's start by imagining that we have possession of this device. Now if you're looking at this device and it seems somewhat familiar to you, it may be because you've recently seen the movie A Christmas Story, where Ralphie, the character there, obtains one of these, which is a little orphan Annie's secret society decoder pin. And this decoder pin has a set of numbers going sequentially one through 26 around the inner edge, and a set of letters, which is not presented in any particular order, around the outer edge. And what would happen is the radio announcer would provide, set your pins to some combination. So line up one number with one letter. And then it would read off some secret message that, ostensibly, only individuals who possessed this pin, or many of the duplicate versions of this pin that were distributed to children around the country, could then decipher by taking the numbers that were given over the radio and transforming them back into letters so that it makes sense. So if you can, if you zoom in on this image, it might be a little difficult to see, but you can see that the 3 corresponds to the letter L, and the 4 corresponds to an M based on this particular setting of this decoder pin. So this is one potential, what we would call a substitution cipher, where we're changing, we're substituting a letter in this case for a number, and that number will henceforth represent that letter for the rest of this message. But what is the problem with this cipher? Or more generally, when we think about issues in computer science where we have adversaries who are trying to penetrate some system, or break a code, or break in, or hack into anything, hack your password, we sometimes frame this in terms of asking the question, what is the attack vector? Where is the vulnerability that is potentially part of this particular cipher? And in this case, it's that anybody who has access to this pin is able to break any cipher that is made with this pin. And again, this pin was distributed pretty extensively in 1930s and 40s to children who listened to this very popular radio program. So these pins were in the hands of many people. And anybody who had access to the pin would be able to understand the message. And so that is, how we might frame this attack vector, is the key, in this case, the pin, which we will call a key for this purpose, is just very prevalent. It's pretty well known how to use this key and manipulate this key. A lot of people have access to that key. But that's just one example of a substitution cipher. We have many different examples of substitution ciphers that we could use. Let's just take another very simple, straightforward one, which is imagine we have all of the letters of the alphabet and we're just going to assign the ordinal position of that letter as its cipher value. So with the secret society pin, there was this sort of random element to it, right? The letters were being skipped. There wasn't a rhyme or reason to them, although the numbers were sequential. Here let's just line up both. Let's use sequential letters and map them to their sequential numbers. So A becomes 1, B becomes 2, and so on. Both of these things are increasing linearly. Now you may recall that as computer scientists, we ordinarily start counting from zero rather than counting from one. I'm counting from one here because this mapping of A to 1 and Z to 26 is much more familiar to us intuitively as humans, and I want to keep us grounded in this discussion of cryptography right now. But ordinarily, you might actually instead see this as 0 to 25, 0 being A, through Z being 25 as opposed to 1 through 26. But this cipher would work exactly the same and has roughly the same security potential as Annie's secret society cipher does. And we can actually make this a little bit better because we are consistently increasing the letters, A through Z, and consistently increasing the numbers, 1 through 26. We could also, instead of just doing this direct mapping, we could rotate around. We could start the 1 somewhere else as opposed to being A. And now instead of having just one cipher where A maps to 1, B maps to 2, we have a variety of different ciphers, depending on where we decide we want to have our starting point. So for example, we might instead add two to every number. So instead of going from 1 to 26, we go from 3 to 28. Now think about it. If you're trying to break this cipher and you see patterns like this with all these numbers in them, what might jump out at you? Well, if you're used to seeing ciphers that are 1 through 26, for example, something where you don't see any 1s or 2s and suddenly you're seeing 27s and 28s potentially in the message that might be long enough to have, in this case, Ys or Zs in it might seem to you that this is slightly off. Like this cipher must be shifted in some way. Instead of being this straightforward line, there's some modification that's been made to it. That's kind of a tip off if you're trying to defend against somebody figuring that out. And so instead of going 27, 28 at the end, we might instead wrap around the alphabet. Once we have exhausted the 26 possible values that we started with, the 26 letters of the alphabet, we might instead, once we have X is 26, say, well, instead of Y being 27, Y is 1 and Z is 2. And this is not a massive improvement on the security of this cipher. Like I said, it's still quite fragile and quite easy to break. But it doesn't give quite as much of a clue to a potential adversary as to how to crack it, how to decipher the message. And this can be done for any different value to obtain any number of different ciphers. Instead of going forward by two positions, we could add 20 to every letter's value, again, wrapping around the alphabet when we exhaust, when we get to 26, instead of having 27, 28, we would just reset at 1 and continue on. But we can also add 26 to it. But that doesn't look very different than what we had before. And that's where this cipher's vulnerability comes into play. There's only 26 possible ways to rotate the alphabet while keeping the order of the letters preserved, right? Unless we start skipping A, D, G, and then, you know, rearranging the other letters in some other way. If we want to keep everything straightforward in a line, again, wrapping around 26 when necessary, there's only 26 ways to do it. That is to say that shifting the alphabet forward by 26 is exactly the same as shifting the alphabet forward by 0. And so that's our limitation. We have a very small number of, again, this word keys that can be used to decipher using this cipher. Now this is an example of something called a rotational cipher, and it's actually a rather famous rotational cipher known as the Caesar Cipher. It's attributed to Julius Caesar and was apparently used more than two millennia ago for him to encode messages to his troops on the line. And at the time, this was revolutionary. And generally what you're going to find with cryptography is there's just this pattern of breaking the mold and doing something new and trying to stay one step ahead. And oftentimes, other people will then catch up. And this cipher, which was once, you know, lauded as being a wonderful cipher, is no longer as strong as it once was thought to be. And so we keep having to advance and improve and get ahead of it for whatever kind of adversary that is, whether that's a potential enemy on the battle line, as might have been the case with Julius Caesar, or whether that's a hacker who's trying to break into your system as might be the case today. And fortunately, again, we're not using Caesar Cipher today to uncipher any of our information. We're using much more modern techniques. But these modern techniques evolved from seeing codes being created, ciphers being created and broken, and then having to be created anew to try and defend against new vulnerabilities that have been exposed. So like I said previously, very easy to decipher or to crack the Caesar Cipher, but at the time, very, very difficult. The limitation, again, limited number of keys. There's only 26 ways to rotate the alphabet for it to make sense. In the English alphabet, of course. If you're using a different alphabet, you're number of keys might be different if you're using the same rotational approach. But the fundamental limitation is you are confined by how many letters are in your alphabet that you're using to encipher information. So let's take things one step further. What is an improvement that we might be able to make to Caesar? That would lead us to this idea potentially of the Vigenere Cipher. So Caesar had this limitation of there's one key and there's only 26 possible values for that key. What Vigenere Cipher does is it, instead of using a single key, uses multiple keys. Instead of picking a number to shift by, we're instead going to define a keyword. And we're going to use the letters of that keyword in sequence as we go to change what our key is at any given time, such that our enciphered message, instead of being enciphered using one key, might use three keys or five keys or 10 keys, depending on the length of the keyword that we use, if that keyword is three or five or 10 letters long. So this keyword becomes the interesting twist that made Caesar much more challenging for an adversary to crack by using different keys. Now let's walk through an example of how the Vigenere Cipher works because I think it makes more sense to see this visually rather than just discussing it verbally. So what we want to do here is encrypt the message HELLO using the keyword LAW. So here our message HELLO is what also might be called plain text. It is in the clear. It is not enciphered. It is not hidden against any adversary. And our key is LAW. All right, so let's take a look at how we might do this. So it oftentimes helps, especially when trying to encipher or decipher using the Vigenere Cipher, to consider all of the inputs that go into determining the final outputted character. So we're going to take a look at plain text, and we're going to convert it, just like we did with Caesar, to its ordinal position. We're going to see where in the alphabet that is. Is it the first letter? Then it's 1. If it's the last letter, it's 26. And so on. We're going to do the exact same thing with each letter of our keyword. So we're going to take a look at the keyword, figure out what that letter's numerical correspondence would be. We're going to then add those two things together. If we go over 26, just as we did with the Caesar Cipher, we're going to wrap back around such that we're confining ourselves to that range of 1 through 26. And then we're going to take that number and transform it into a letter. So for example, if the result there is 2, we're going to change that into a B. And the reason for that is that B is the second letter of the alphabet. So let's walk through this with HELLO as our plain text and LAW as our key. So the first letter of our plain text is H, and the ordinal position of that H is 8. It is the eighth letter of the alphabet. We do the same thing with the first L for LAW, the first letter of LAW. L is 12, it's the 12th letter of the alphabet. So our next step is to add those two values, eight and 12 together. We get 20. We don't need to wrap around. We didn't go over 26, so we're still OK. And the 20th letter of the alphabet is T. So the first step of this is enciphering process with HELLO, using the Vigenere Cipher, using the key LAW, is to turn the H into a T. So we can do this again, we can take a look at the E, the second letter of our plain text. We use the second letter of our keyword now. So we're not using the same key. We're not using 12 over and over and over. We're using a different key. We're now using the A, the second letter of our keyword, whose ordinal position is 1. So 5 plus 1 is 6, and that results in F. Next, we use the first L of HELLO, and the W of LAW. So L is 12, W is the 23rd letter of the alphabet, we add those together, we're at 35. 35 is not a legal value in terms of this cipher. We are confined to 1 through 26. And so we just subtract 26 and we get down to 9, and now we have I. So now what do we do? We've exhausted our keyword, but we still have plain text that we need to encipher. Well, as you might expect, the logical thing to do is just go back to the beginning of the keyword and continue on. And so we will. So we'll use the L, the second L of our plain text, and the first L-- because we've now exhausted all of those letters, we have to go back to the beginning-- the L for LAW. 12 plus 12 is 24. 24, the 24th letter of the alphabet is X. And we do that finally as well for the O, advancing it one position, because of the A in LAW, to 16, and that is P. So ultimately, HELLO in this case becomes this random set of characters, TFIXP. And some advantages might also immediately jump out at you. With the Caesar Cipher, anytime we changed a letter, it always was that same letter every time we saw it in the enciphered message. So if we had a B and we were advancing everything by two characters, every B in the original message would always be a D because D comes two letters after B. So again, if our Caesar Cipher key is two, every time we see a B, it becomes a D, every time we have an A, it becomes a C, always. Here with the Vigenere Cipher, because we have different keys and we're rotating these keys differently, depending on which letter of the keyword we are and which letter of the plain text we are, those two Ls are not the same, right? Instead of H-E-L-L-O, we don't have some mapping. Those two Ls are I and X. They are not the same character. And so already we're seeing a bit more security here because there's not this potential to guess. Caesar is also much more secure when you consider how many keys are available to you. With the Caesar Cipher we had 26 keys available to us. With the Vigenere Cipher we have 26 to the n keys, where n is the length of our keyword. So for example, if we're using a two letter long keyword, for example, AA or AB or all the way up, that leaves us with 26 squared, or 676 possibilities. Now if we extend to three letter keywords or four letter keywords, we're getting even more and more possibilities. And as we start to increase the number of possibilities, we start to really increase the difficulty for some adversary to figure out what the key is. And that's really the goal of cryptography, right? We want to be able to protect information and we want to defend that information from being determined by other people. So the more work we put into making more challenging keys, the more likely we are to be successful in our attempt to encipher information. So again, Vigenere much more of a secure cipher. It's still not secure and it's definitely not a cipher that is used today. There are computer programs that are capable of figuring out how to decipher using the Vigenere Cipher pretty well. But it's more secure than Caesar for sure because of its changing alphabets and its much larger number of keys. Let's go back to this decoder pin and think about another potential problem that we have. Now assume that your adversary is actually not a member of Annie's secret society. They don't have this pin. So that's already a step up. We previously had assumed that anybody who had the pin could crack it, and that's still true. But let's assume your adversary, lucky you, doesn't have this pin. Is there still a way that they would be able to crack the code without the pin? Think about it for a second. Think about what our characteristics of the English language are that might suggest people figure out what this cipher is. Think about some unique features of the English language, which is one letter words, like I and A, which might appear in the message. If you see a single letter word in a message, you're probably going to guess that it's either the letter I, and every time I see that character or that number I'm going to assume it's an I, or you're going to assume it's an A and you're going to try and plug in an A everywhere. And some trial and error might reveal some patterns that emerge. And there is a very prevalent pattern in the English language, which is that letters appear with a pretty regular frequency. Given any arbitrary text in the English language, it's pretty likely that the distribution of letters within that text is going to follow this pattern roughly 13% of the time, give or take. Any arbitrary letter selected from a text is going to be the letter E. And only 1/10 of 1% of the time will it be a Z. And only 2/10 might it be a J. So there are some letters that appear very frequently and there are other letters that appear very infrequently. And that is still a problem in this generic substitution cipher, even with the letters being scrambled, which seems at first blush to perhaps be much more secure than one where the letters are increasing sequentially and the numbers are increasing sequentially. Even this scattershot mapping of letters to numbers, as long as we're still confined to these two domains where we have A through Z and 1 through 26 and there's always a mapping between them, whether they're ordered or not ordered, is still a problem, in the English language anyway, because of frequency analysis. These are actually very common puzzles. Humans might find it kind of tedious to try and solve these puzzles, but otherwise, this is well known as a cryptogram. You may, if you are the puzzling type, this type of puzzle is called a cryptogram. And this pattern is definitely something that is across all messages that appear in the English language. There are plenty of other ciphers that appear, that are used, that are more secure than any of these what we might call one-to-one ciphers, mapping a single character to a different character or to a number. There are some ciphers that substitute pairs or triples of characters at a time. And these ciphers, again, form the basis for what eventually becomes more modern cryptography, which we're getting to in just a moment. There are also transposition ciphers, where instead of substituting one character for another, we simply use an algorithm to rearrange all the letters in some systematic way. And the defect there is that all the letters of our original plain text message are still there and all we need to do is unscramble them. And because there's an algorithm that was used to scramble them in the first place, there's got to be a way to undo it as well. With a little bit of trial and error, we can probably sort that out. Finally, the most egregious issue with these classical ciphers is, how do you distribute the key? How do you tell someone who you want to share information with? How do you tell your ally what the key is for the cipher that you are going to use? You can't encrypt it because if you encrypt the key, how will they know what the real key is? If you say, if you send them a message and they don't know how to interpret it, or they see it and they interpret it as something else, that's not going to be helpful to you. You want them to see the key in the plain text. You want them to see the key in the clear, rather. You want them to just have it. You don't want to encrypt that as you hand it to them. That doesn't do them any good. But if you're giving the key to your ally and your adversary is within earshot, or they have access to that same piece of paper because your ally carelessly throws it away and they can just pick it up, now all of a sudden all of your messages using basic ciphers are fairly insecure. But let's take a step forward in modern cryptography. Perhaps you've seen a screen that looks like this at some point when you're trying to log in to some system. Enter your email and we'll email you a link to change your password. Well, why don't you just email me my password? Like you're going to give me a link to change it, you must know it if I use my credentials to log in to your service any given day. But OK, I guess, sure. The reason for this is actually a reason of security. So let's distinguish ciphers, which we've been talking about, from hashes. So one of the most critical distinctions is that ciphers are generally reversible. You can undo what you did. That's the whole reason why it's important to share with your allies the key. But hashes are generally not reversible. Or certainly, they're not supposed to be reversible. And so it turns out, and we'll learn about this a little bit later, when you log in to some service, if that service is doing a good job of protecting your data, the reason they can't just send you your password is because they actually don't know your password. And that might seem strange because clearly, there must be something-- if I type in my password then I get logged in. But a good service is one that does not store your password in the database. That's probably a good thing if you think about it. In case there was ever a data breach, you wouldn't want your password to be in their database. Instead what they do is they store a hash of your password in the database. And then when you provide your password to them, they run that hash through the same things, called a hash function, which is just a generic idea for a function that takes any arbitrarily large amount of data and maps it to some other range or some other set of values. Now that might be a arbitrarily long string of information. It might be some fixed string where if I run my password through this, I'm going to get back something that is always 20 characters long. But it looks nothing like my original password. I've just made some weird manipulations to it. And that's what happens in log-in systems more generally is you will log in to some service, you'll type in your password, when that information is then submitted to the organization to check your log-in credentials, they will run your password through that same hash function again. And if that value matches what they have in their database for you, that is how they know that you have provided the correct credentials. They're mapping-- they're matching some mapping of your password to the one that they have stored, but they're not actually checking your actual password. And that should probably give you some sense of security. And if you ever use a service where you end up having to click on that link and they actually send you your password, you probably don't want to use that service anymore because they're not taking strong enough precautions to protect your data. So as I said, once we have a password stored in the database, it is actually stored as a hash rather than as the password itself. The service should not be able to tell you what your password really is. So this idea of a hash function-- what is it? Well, as I said, it's something that takes any arbitrary data-- and eventually we'll get into hashing things like files and not just words or strings, but for now let's keep it to strings, strings being a sequence of characters or letters, like a word or a phrase or a sentence-- and mapping it to some other range. So we'll start out by just mapping a string, a set of letters, to a number. But it could be to a different string, a string that's always 10 characters long, and so on. So there are some properties that good hash functions have. Let's take a look at what some of these are. So they should use only the data being hashed. There shouldn't be anything else that comes into play. They shouldn't be bringing in any outside information. It should rely exclusively on whatever data is being passed in to the hash function. They should also use all of the data being hashed. It becomes a bit less effective if every time I provide a word or a string to my hash function, I'm only using the first letter of that string, such that my hash function for every word or every string I provide that starts with A is going to return the same value. That's not terribly useful to me. I want to get a better distribution of values. Your hash function should be deterministic. And when we say deterministic, we mean no random elements to it. Oftentimes we think that random numbers are nice to jumble things up. But the problem is we want our hash function to always output the same value for the same inputs. So if I give you my password and hash it and I get some output, every time I provide my password and run it through that same hash function, I want to get the same output every time. And that's what sites rely on when they're using hashed passwords as part of the credentialing check. They're relying on the fact that they will always get the same output given the same input. So that's a requirement of a hash function. Hash functions should uniformly distribute data. So oftentimes you're mapping these strings, let's say, to some set of values. Those could be numbers, again, those could be strings. You want to spread those out evenly, ideally, across all of the possible values that you have. You don't want everything to hash to 15 if your range is 0 to 100. You'd ideally like everything to be spread out such that there's an equal number of 0s, 1s, 99s, and so on, as we talked about a little bit when we discussed hash tables. Finally, we also want to be able to generate very different hash codes, very different values for very similar data. For example, LAW and LAWS should hash two very different values. That would be ideal if a tiny bit of variation created a really dramatic ripple effect. And creating this really dramatic ripple effect is pretty key when we're talking about cryptographic hash functions, which we'll get to in a second, which form the basis of almost all modern cryptography, which form the basis of everything that we do that we rely on when we think of security in the computational field, it's almost always relying on these hash functions being really, really good at making small changes have very dramatic ripple effects in the hash code or the hash value, the data that comes out of the hash function. So after all this talk about good hash functions, let's take a look at a pretty bad hash function. And we'll talk about why. We'll talk about one of its virtues, but some of its potential problems as well. So instead, let's add up all of the ordinal positions of all the letters in the hash string. So this ordinal position idea is exactly the same as we had a moment ago when we were talking about Caesar and Vigenere. So A is 1, B is 2, and so on. So for example, for a word like STAR, if we want to add up the ordinal positions of all of the letters in that word, we have S-T-A-R. That's 19 plus 20 plus 1 plus 18. So if you do that math quickly, that ends up being 58. So what is a good thing about this hash function? Well, it's not reversible. If I get a 58, I don't necessarily know that the input that I had there was STAR. It could have been any one of a whole variety of things. It could have been ARTS or RATS or SWAP or PAWS or WASP or MULL or this whole random set of 29 Bs in a row. All of these things, when run through this really terrible hash function that I've defined here, all add up to 58 when I follow the rules of this algorithm. So I never know what my input was given my output. That is a good thing. That is what a hash function should do. Hash functions, unlike ciphers, should not be reversible. But the problem that I have here is that I have a lot of collisions, right? There are a lot of different things that map to 58. And when we talked about collisions a little bit previously, we were talking about them in the context of a hash table. And collisions were OK in that context. We were just clustering things together. If they all happened to have the same hash value, we'll just put them in the same bucket. When we're talking about cryptography though, when we start to get into relying on cryptography to keep our data secure, we can't have collisions at all. In fact, pretty much we rely on the fact that it is so mathematically unlikely, neigh impossible to have a collision in order for these things to work. And so collisions, when we're talking about cryptographic hash functions, are definitely not a good thing. So to recap, to check that a user gave us the correct password, if we're storing a hash of the password in the database versus just storing the plain text password in the database, which hopefully no one is storing a plain text password in the database, we run the actual password, the real password through the hash function. We get a hash value as an output, some string or some number or what have you as the output. And if we get a match, odds are they entered the right password. Now I'm saying odds are because we can't be 100% sure. And we can never be 100% sure. We can be really, really, really sure, but there's always a chance of a collision. Even with the best designed hash functions, even with the best designed cryptographic hash functions, there's always a chance of a collision. But ideally, that chance is quite infinitesimal. Very, very, very, very, very, very unlikely. So odds are if we get this hash, comes out of this hash function, it's quite likely, like 99.9% plus likely that they entered the correct password, this is, in fact, the user whose credentials are being verified, and we should log them in. Modern cryptography is just hashing. It's just hashing that's quite a bit more clever, certainly than the example that I just talked about a moment ago. Also, these algorithms tend not to work on a character by character basis. It's the algorithm that I just did as well where I was adding up every single letter. I was looking at each one individually. They tend to take, these modern algorithms tend to take clusters of letters, pairs or triples or so on at a time, maybe do even more things. They might rearrange the letters before they do things to them. So there's multiple layers going on with these encryption algorithms. And unlike some of the ones I've discussed earlier, most of these also have the property where given data of arbitrary size-- and now we're starting to really expand our minds into not just words or strings, but also images, files, videos, documents, PDFs, and so on; anything can be run through a hash function to get a value-- but we're always going to get a string of bits, a bit string, that is always exactly the same size. So depending on the algorithm, maybe it's going to be a 160-bit long string, or a 256-bit long string. But our range is finite. It's always going to be exactly 256 bits. But the combination of those bits will be different, ideally, for every single piece of data we might throw at it, no matter what. OK, so let's expand our definition of a hash function to this idea of a cryptographic hash function. What properties should they have? They should be very difficult, very, very difficult, basically impossible to reverse. It should be computationally impossible for anybody to undo the encryption. That's pretty much the same as a regular hash function. We're just really hammering the point home when we say this here. They should still be deterministic. We don't want any random elements to it. We still want to a hash a value and always get the same output no matter what if we run that same value through the hash function an arbitrary number of times. They should still generate very different hash codes for very similar data. We still want things to be spread out and we want minor changes to have dramatic effect. And they should never-- and this is one of those words that computer scientists love-- they should never allow two different sets of data to hash to the same value. Do you see a potential problem when we frame it in this way? When we say they should never be able to do that? We've already restricted ourselves to a finite domain, right? I said a moment ago, maybe this hash function maps to 160-bit long strings. There's only so many combinations of 160 bits. Now that might be an unfathomably large number, but using the word never there becomes a bit dangerous. We can't really rely on that. And we'll see why this could potentially be a problem. This static length string, by the way, is usually referred to as a digest in this context. When we start to talk about more modern cryptography techniques, the output of a cryptographic hash function is usually referred to as a digest. Let's take a look at one of these cryptographic hash functions. And certainly I'm not going to dive into the mathematics of it. I wouldn't be able to explain the mathematics. I wouldn't be able to do it justice if I tried to explain the mathematics of it. But let's just take a look at some of the basics of this. So SHA-1. SHA-1 is quite a famous algorithm. It was designed by the National Security Agency in the mid-1990s. So these are really smart people who are tasked with working with things like military intelligence. These are people who are dedicating their lives to trying to protect data as best as they possibly can. Far more brilliant minds than I, for sure. And this hash function-- and this is a published paper. Hash functions tend to be, actually it's this very strange dichotomy where you describe exactly how the function works, but it still should be irreversible. And this just really becomes a question of incredibly complicated mathematics involved, such that even if you knew so many of the pieces going in, you still might not-- you still wouldn't be able to undo it, even if you tried. It's kind of amazing actually. SHA-1's digests are always 160 bits in length. So this is one of those ones I just said a moment ago. That means that there are 2 to the 160 different SHA-1 digests, which is a bit over 10 to the 48th power. And again, 2 to 160 means for every single one of the 160 bits, that could be a 0 or a 1. So we have that, two options times two options times two options, 160 times. Just to try and make it fathomable, to understand how large this number is, let me try and paint a picture for you. So imagine that you are looking on Earth for a specific grain of sand. You're looking for one specific grain of sand on Earth. That is easier by far than trying to have SHA-1 have a collision where two values would map to the same thing. There's about 10 to the 18 grains of sand on Earth. So that's eight quintillion-- I had to look up that word-- eight quintillion grains of sand. So way easier to find the grain of sand on Earth than it is to have a collision. In fact, we go even further and say that imagine that every single one of those grains of sand was another planet Earth, each of which also had sand on it. So you have eight quintillion planet Earths. You're trying to find a specific grain of sand on one of those eight quintillion planets. It's still easier than trying to have a collision with SHA-1. SHA-1 is such an important algorithm that it's actually one of the algorithms that is required in federal regulations to be used by the government for encrypting information. There are others as well, but SHA-1 is listed by the National Institute for Science and Technology as a standard algorithm. But there's a problem, which is that SHA-1 is broken. And it has this clever website called SHAttered, shattered.io. So the research team that figured out how to create a collision intentionally create a collision. And intentionally creating collision has the effect of basically saying, this cryptographic hash function is broken. And they have proven that there is a way that they can systematically generate collisions. So that's bad. And we'll see why that's bad in just a moment. But you can go to this URL, shattered.io, and read quite a bit about how the researchers do it. They explain it in different levels. So if you really want to dive into the technology and the mathematics of it, you're certainly welcome to. If you just want to understand it at a base level and why this is a problem, I definitely encourage you to take a look at this site and read more about this. So what did these researchers do? So they said, It is now practically possible to craft two colliding PDF files and obtain a SHA-1 digital signature on the first PDF file, which can also be abused as a valid signature on the second PDF file. In short, what they're basically saying here is we were able to create two PDF files such that if I run them through the SHA-1 algorithm, the digest that I get is the same. Why is this potentially bad? For example, by crafting the two colliding PDF files as two rental agreements with different rent, it is possible to trick someone to create a valid signature for a high-rent contract by having him or her sign a low-rent contract. If you can take a PDF and twist it into anything you want it to be, but have a valid signature, a valid SHA hash associated with it, that's not great. Now before alarm bells start going off because SHA-1 is still use quite extensively, even now, this SHAttered research result was developed in 2017 it was released, but SHA-1 is still being used now, even then. Before you panic though, it has not been broken that many times, although they did very-- they worked for two years to create this PDF collision. And they demonstrated a method for how to do it. It has still not happened that many times. Cryptographic hash functions, once they've demonstrated one collision, are broken. That is certainly true. But the actual effects of this have not yet really materialized. The computational power required to create this is well beyond the capabilities of most people, or most syndicates even. So no cause for alarm yet. But it does show that there is a limitation with SHA-1, and we still want to always be staying one step ahead. Just like when Julius Caesar's enemies figured out how to crack the Caesar Cipher, the goal was, we need to get one step ahead. As technologists, we always want to stay one step ahead to make sure that we are doing our best job protecting our data. And as lawyers, we want to make sure we're doing our best job protecting our clients' data against potential adversarial attacks. So as I mentioned, there are other standards that are in use by other organizations, including the federal government. SHA-1, as I mentioned, is just one of a few different options that they use. SHA-2 and SHA-3 are much more robust algorithms. They use more bits, basically, in their digest. So instead of being 160 bits, you can have anywhere between 220 and 500 or so bits. So way larger of a domain, even reducing the likelihood of a collision that much more. Again, imagine how unlikely it was with 2 to the 160. Now we make it even more so. 500 bits, that's unfathomably large and difficult to duplicate. MD5 and MD6 are other cryptographic hash functions, or hash functions that you may encounter. MD5 in particular I've highlighted here in yellow because it's not actually considered secure anymore, but it's still very, very commonly used as a checksum. Basically, what we do is we run a file through MD5. And say we're a distributor of a file and we want people to come download our source, and they want to be able to trust our source, we might run our file through MD5 and say, if you run this file through MD5, the hash will be blah blah blah blah blah. And other people can then download the file and run it through MD5. It's usually a program that is available on computers for people to just run any arbitrary data through to get a hash result. And they can check, OK, the hash value that I received from this trusted source matches the hash value that I was told I would receive, and so I will trust this. Versus perhaps getting that same software versus some corner of the internet that you don't really trust. If you find the MD5 hash of the trusted source does not match what you downloaded and what you thought was that same file, it's probably a sign that something has changed in it and you don't really want to-- you might want to be skeptical about trusting that file rather than just diving right into it. So what do we do that relies on cryptography on the internet today? Or you know, just using our computers every day. Email. Email relies pretty extensively on cryptography, particularly when we start to use secure email services, of which Gmail might not be considered one, but there are services out there, for example, ProtonMail and others, that do encrypt email completely from point to point. Much safer in terms of protecting one's communications. Similarly, you may be familiar with the mobile app Signal is also used to encrypt communications between two people over the text messaging network rather than over email and the internet. Secure web browsing, you may be familiar with this distinction between HTTP and HTTPS. And if you're not, that's OK. We're going to be talking about that a little bit later as well. But you want to make sure that your web traffic is encrypted against people who are able to just monitor the network for all the traffic that is going by. You probably don't want your searches to be someone else's fodder for entertainment. VPNs. If you use a VPN, that's a great thing to do if you're traveling, for example, and you may be on less secure networks than you might find at your business or at home or at a university institution, for example. VPNs allow you to encrypt communications with a network, and also allow the network to pretend to do something on your behalf so that your web traffic cannot be traced back to you directly, which might be advantageous in some situations as well. Document storage as well. So if you use services like Dropbox, for example, generally what Dropbox is going to do is break your document into pieces and encrypt those pieces. Rather than just storing the whole file writ large in some server somewhere on the cloud, it's going to encrypt it before it sends it over so that you have some more comfort that your data is being protected by these cloud services. And certainly, we're going to talk a bit more about what the cloud is and what cloud services are and what they can be used for a little bit later in the course as well. Hash functions and cryptographic hash functions are great, but they are well documented and there's only the one. There's only one version of SHA-1. There's only one version of SHA-3. And that is a limitation. Now it might not be a severe one because it's pretty strong. They're pretty strong algorithms. But are there ways that we can improve our own cryptographic techniques if we're trying to protect data that we are receiving, data that we are sending, and so on? And that leaves this idea of public-key cryptography, or public- and private-key cryptography, or asymmetric encryption. You'll hear these terms kind of used interchangeably. Let's start by talking about public-key cryptography by way of an analogy. We're going to go way back to arithmetic and algebra days here. So imagine we have something like this. We have 14 times 8 equals 112. Multiplication we can think of as a function. It is a function. If 14 is our input and our function is times 8, the result is 112. Now multiplication is not a hash function because it is reversible. I can take that 112, multiply it by 1/8, or equivalently divide by 8, and get back the original input. So multiplication is a function, but it is not a hash function. It is reversible because if we multiply any number x by some other number y, we get a result z. And we can undo that whole process by taking z, multiplying it by 1 over y, or the reciprocal of y, and getting back the original x. Reversible. Goes in both directions. Now let's take this function and kind of obscure it. We know for ourselves that this function that I'm using is n times 8. Whatever I pass in is going to be multiplied by 8. But I don't tell you what that is. I don't tell my friends what that is. I just say, hey, if you want to send me a message, just run it through this function. So again, we're going to just use math as an example here. If my message is 14, I might say, f of 14-- and again, this is getting back to algebra, maybe a little bit back in the day-- f of 14 is 112. That is my public key, you might think. And you might say, having just gone through this whole example, that, well, it's pretty easy to undo that. If I know that 14 is the plain text and 112 is the cipher text, I can probably figure out that your function is n times 8. And so I've broken your encryption scheme. I have figured out how to reverse your cryptography. Well, it's true that n times 8 is certainly one function that I could use to turn that plain text, 14 in this example, into that cipher text, 112 in this example. But there are other ways that I can do it. My actual function could have been n times 10 minus 28. So 14 times 10 is 140, minus 28 is 112. And there are other contrived mathematical examples that I could continue to do pretty much ad infinitum to define ways to transform 14 into 112. So just because you see that 112, that doesn't mean you have figured out how to break my hash function. You haven't figured out what my encryption technique is. If all I say is, here's a black box that I would like you to feed an input into, even if you see the output, you, or really more concernedly an adversary who sees that output as well should not be able to, or cannot in this case, undo it. Because yes, I could have been using n times 8. I could have been using this crazy thing involving the square of n. And that's kind of the idea behind public-key cryptography. I am going to publicize that I have a function that can be used, but I'm not going to tell you what that function is, and I'm certainly not going to tell you how to reverse it. So public- and private-key cryptography are actually two hash functions where the goal is to reverse them. We kind of talked about this as hash functions are supposed to be irreversible. But the distinction here is that we are creating two functions, f and g, which are intended to reverse one another. So it's not that the function does the single function that is reversible, it is that we have two functions that, working together, create a circuit. If I take data and I run it through function f, I get some output. If I run that output through function g, I get back the original data. I have deciphered the information. And the same thing works in reverse. If I take some data and I run it through function g, I get some hashed output that makes no sense. And if I run that hashed output through function f, I get back the original data once again. Now the key is that-- pun intended-- the key is that one of these functions is public and the other one is private. One of them is available to everybody, and everybody uses that function to send you messages. If you want to send me a message using encryption, using public and private key encryption, you take the message and you use my public key to encrypt it, and you send me the result, the hashed encrypted result. And I use my private key to decrypt it. And I am, ostensibly, the only person who has my private key, even though I've broadcasted, made my public key widely available. Now the math that goes into this is well beyond the scope of a discussion that we're going to have here today. But basically, and most encryption, most cryptography involves the use of prime numbers, particularly very, very large prime numbers. And you're looking for prime numbers that have a particular pattern. And when I say "you're" looking for it, don't worry, you don't have to do this yourself. There are plenty of programs out there, RSA being a very popular one, that can be used to generate these public and private key pairs. But the amazing thing is that it can generate these pairs very quickly, but it's almost impossible to break or figure out what the underlying functions, or even in this case what the underlying two prime numbers are that are the foundation for your own encryption strategy. So it's pretty amazing that it's easy to define these functions and almost impossible to reverse engineer them, so to speak. So we start with a huge prime number, we find some other prime number that has a property, a special property related to it, and from those two numbers we generate two functions whose goal in life is to undo whatever the first one does. So f's job is to undo what g does, g's job is to undo what f does. And this is called a public and private key pair. So your public key is really some complicated hash function that does work. And that hash function is represented as a very long string of numbers and letters. It looks just like a hash digest. But it's just a human representation, a readable representation of a mathematical function. And your private key is the same-- or your private key is also a representation of letters and numbers. It's not exactly the same as your public key, but it undoes the work that your private key does. And again, these keys are generated using a program called RSA. So let's take a look at exactly how we would go about doing some asymmetric encryption using public and private keys. So here we have some original data. It's a message perhaps that I want to send. And I want to send it to you. I want to send this message to you, but I don't want to send it to you in the clear. I don't want to, you know, it's sensitive information. I don't want to send it via plain text. And I don't want to use a generic hash function because if I use a generic hash function, like SHA for example, it's irreversible. You will not be able to figure out what I tried to say. So instead, I take this original data and I use your public key. Your public key, again, is just a mathematical-- a very complex-- mathematical function. So I take this data, I feed it into your public key, your public hash function, and I get some garbled stuff out. OK? And this is what I send to you. I send you this garbled stuff. In order for you to figure out what the original message is, you use your private key. Not your public key-- your public key is what I use to encipher the information-- but your private key, which is known only to you, hypothetically. It should not be distributed to others. It undoes the work that your public key did. And so if I give you the scrambled data and you use your private key to try and decipher it, you will get back that original data. But here's the great thing. No one else's private key will be able to do that. If anybody intercepts that message other than you and they use their private key or they use your public key again, they will not be able to decipher the message that I sent to you. And so public and private keys are very interesting because they create these pairs. They're these unique encryption schemes that are unique to two people, or really even to one person. If you were to send me a message back, you would send me a message using my public key. You would then send me whatever the encrypted sort of scrambled data is for the message that you sent using my public key. I would then use my private key, which is not known to you or to, hypothetically, anyone else to decipher what you sent me. And I would get back the secret message, or the perhaps not-so-secret, but sensitive message that you sent to me. And so that's this idea of asymmetric encryption. You can encrypt using someone's public key. And anybody can do so. And for that reason, you'll often find technically-minded people will sometimes post their public key literally on the internet, such that anybody who wants to send them a message using a secure channel can do so. And programmers as well. So if I'm doing some work using a tool called GitHub, a popular service available online for sharing and posting source code, if I want to send something from my computer to GitHub's servers in the cloud, I might authenticate using a public key and private key encryption scheme so that they see that I'm using their public key to send them information, they're decrypting it. When they send information back to me, they're using my public key and I use my private key to decrypt it. It's actually part of-- it's part of a communication strategy used by technically-minded folks. And you're not restricted to just having one public and private key. For example, I have one public and private key that I use for a secure email, I have one public and private key that I would use for secure texting on my phone, and I have one public and private key that I use for my GitHub repository. So I have different sets and different combinations of these keys. But the key is that-- the key, again, pun intended-- is that the decryption can only be done by someone who has the private key, not the public key, because only those two functions are reciprocals of one another. They undo the work that the other did in the first place. But interestingly enough, that's not the only thing we can do with public and private keys. So instead of just encryption, we also have this idea of a digital signature, which is different than e-signature, an e-signature just being the tracing of a pen typically along some surface and just logging where all the pen strokes happen to be. So we're talking about something much more complex than that. We're talking about something cryptographically based when we talk about digital signature. It's kind of the opposite of encryption. And using someone's digital signature, you can verify the authenticity of a document and verify, more precisely, the authenticity of the sender of a document. And we're going to explain this in great detail in just a moment, but the basic idea is they're signing the document using their private key. You still don't see what the key is. And because these public and private key pairs are specific to an individual person, if you were able to verify that that document could only have been signed using someone's private key, then you have quite a serious belief that that person is the person who signed the document, who sent the document, and so on. Digital signatures are 256 bits long pretty consistently, which means there are 2 to the 256th power distinct digital signatures, which makes the potential of a forgery effectively zero. Again, I'm using this-- I'm trying to avoid saying never because computer scientists don't like never. But effectively, there is no chance of a forgery. Now the process for how one verifies a digital signature is quite-- there's quite a few steps involved. And I have a diagram here that I sourced from online. And what I'd like us to do now is walk through this process to hopefully give you an understanding of how these work and how you might be able to rely on digital signatures. And states and different entities are recognizing digital signatures as a valid way to sign documents, but it really helps to have a good understanding of them such that you, as an attorney, are comfortable with the fact that this does represent a specific individual. So let's take a look at how this process works. So we start with data. Data in this case is any document. Perhaps it's a scanned, signed version of some PDF with somebody's actual ink signature. But again, the whole thing is just scanned. The next step is to use a hash function. The hash function that we could use in this context could be anything. It could be SHA-1. It could be something very complex. In general, the hash function that's going to be used here is actually not a cryptographic hash function. It's going to be something like MD5. So something that anybody has access to. And that's going to result in a hash, a set of zeros and ones. In the case of MD5, it's going to be about 160 or so different characters. Now where things get very interesting is we take that hash, that set of zeros and ones, and we encrypt it using the signer's private keys. Remember, these functions are reciprocals of one another. A public key can undo what the private key does, and the private key can undo what the public key does. Notice in this case we're still not sending anyone our private key. We are just using our private key to encrypt something. So we take this hash that we received from running our file through MD5, we encrypt it using our private key, and we get some other result out of it. This number that comes out of running the hash through our private key is called the signature. We then just couple that-- so when we send this off, we send the signature plus the original document, and that would be considered a digital signature. So that's the signing part of the process. That's where we go. We start with a file. We run that file through a generic hash function. Not our public and private keys, something that is generally pretty accessible. We take that hash, we encrypt it using our private key to get some other hash that looks similar, different zeros and ones, but totally different pattern of zeros and ones. We attach the original document and the digital signature when we send it off, and that is considered a digitally signed document. Now the real crux is how do you prove that I'm the person who sent you this document, right? If you want-- if you're receiving something that has a digital signature, which is supposed to be as good as any other kind of signature, it's supposed to have legal effect. How do we verify that that person who sent you the document was actually the correct one? So then we go to the verification step. So we start, we've now received this digitally signed data. This is the same as this digitally signed data here that was sent by the sender. We also received two pieces of information. We received the document, the original document, and we received the signature. And recall, again, that the signature is what happens when we take the hash of the document and run it using our private key to get a result. Now the interesting step here is remembering that the public and private keys are reciprocals of one another. So we can take this complicated signature hash and we can use the public key, which, again, is publicly available. Anybody should ostensibly have access to someone's public key, not their private key. And notice that the signer has never sent their private key. They've only used it to encrypt some data, but they never sent the private key. The public key has always been available though. We take the signature, we run it through the public key function, and we get a hash. We take the data, the document, and we run it through MD5, the same hash function that the sender was supposed to use, and we get a hash. And we're checking to make sure that these two hashes are equal to one another. If they are equal to one another, that means the signature is valid. Let's talk about why that would be the case. If we use the MD5 of this file, the generic hash of this file, and we encrypt it using our private key, we get some result, OK? But this is very easy to calculate. It's MD5. We're taking a basic document, we're running it through a publicly known, well-defined hash function. Anybody who has access to this document and a program on their computer called MD5 can literally run this document through it and get this number. This is not the tricky part of this. We then take this hash function, we encrypt it using our private key to get some secret number. The public key though will undo that. Remember, the public and private keys are reciprocals of one another. Whatever one does, the other one can undo. And so only my public key will undo the work of my private key. So if I take this value and I encrypt it using my private key, and then I run this value through the public key, I should get the original result again, the original MD5 hash. And that's why we have to send the document as well, not just the digital signature, the numbers that we get by running it through our private key in the first place. That way we have a way to validate that yes, this file has this checksum, and the sender took that checksum, they ran it through their own private key, and when I used their public key to undo it, I get the same value, which is effectively proving, but is, we'll term it as it's very, very, very, very likely that this person who claimed to have sent the document is, in fact, the person who sent that document. And so that's what digital signatures can be used for. It is a mathematical, cryptographic way to verify the identity of the sender of a document or an individual. Or in whatever context you might be using or receiving digital signatures, it is purely a verification step that is based entirely in mathematics. There's one other potentially interesting use of digital signatures that's also quite buzzy right now, and that's blockchain technology. And what is the blockchain? Digital signatures are really key to knowing how the blockchain works and why it is trusted as a decentralized source of information for individuals. So understanding digital signatures means you are in a position to understand blockchain. And I use here the term the blockchain, but it really is a blockchain. There's no such thing as the one blockchain. There are many different-- this is just an idea that is implemented. Generally, we're hearing it in the context of a cryptocurrency, but it does not need to be restricted to that, although cryptocurrencies are so discussed in the media and have been dissected by so many researchers that they provide an interesting vehicle, an interesting lens through which to consider blockchain. And so our example today is going to focus on Bitcoin. It is the most well-documented of the cryptocurrencies. It is the most well-documented implementation of the blockchain, or among the most well-documented implementations. But this is not specifically a lecture about Bitcoin. We're just using Bitcoin as a lens through which to understand blockchain. There's also an outside source that I strongly encourage. This channel on YouTube provides interesting mathematical dissections of topics, and they tackle blockchain and Bitcoin pretty extensively. And this is an excellent supplementary resource to consider if you're trying to dig into this or understand it a little bit more, because in this video I'm going to omit some of the more technical details for the sake of, hopefully, broader understanding. But if you want to dive into it more deeply, this is a resource that I would recommend. And I really like talking about Bitcoin in the context of blockchain because it's actually how I kind of got started almost as an attorney. When I was practicing, when I graduated from law school, I decided to go out on my own and start my own firm. I live in a small town and so a lot of my early work was doing estate plans, wills and such for individuals in my town, getting to know them. But I had studied extensively technology-related law in law school and I really wanted to use it. And a few years into my practice, I had a friend who needed an estate plan prepared, and he asked if he could pay me in Bitcoin. And I had no idea what that meant. I didn't really know anything about Bitcoin at the time. And I looked it up and thought it sounded interesting, and so I said sure. So I learned how to set up an account. And it's also worth mentioning at the outset, as we're talking about cryptocurrency, that you need to understand how Bitcoin works to use Bitcoin. You don't need to understand how the federal banking system works to use a bank. And the same is true here with Bitcoin. But I ended up accepting a Bitcoin payment by creating what's called a Bitcoin wallet. I immediately sold the Bitcoin that I received and turned it into cash, such that I could use it for more generic purposes. And what I decided to do was send out a press release saying, oh, I accept Bitcoin, because it was something that was novel and I hadn't really heard that much about it. And this got the attention of my local paper and companies in the area that were technically minded as well. And so Bitcoin sort of provided this forum to meet new clients that also allowed me to explore fields of the law about which I am passionate. So it's kind of an interesting segue to be able to share that with you now. All right, so stepping away from Bitcoin again more broadly to blockchain. What is the blockchain? It's very similar to something you've already learned about, which is a linked list. So recall that a linked list is a set of nodes, each of which have connections forward and backward to other nodes in the chain. They are linked together. And similarly, with a blockchain, all of the blocks are chained together. It's basically the same terminology slightly modified. So a linked list is a set of nodes, each of which is connected to the one prior and the one after it. We learned about linked lists as having generally three pieces of information associated with them-- a previous pointer, which is basically a reference to the prior node, or in this case, the prior block; we have the next pointer, which is a reference to the next node or the next block; and we had data. And in this case, the data is actually two different things. There's the real data. And again, in the context of a cryptocurrency blockchain we're going to be talking about a list of transactions, a numbered list of transactions from person A to person b, each of those transactions being digitally signed such that you can verify that the person who logs that transaction is actually the one who made that transaction. And also, something called a proof of work. And this proof of work is very interesting because this is how Bitcoin ostensibly derives its authority. There is no central controller of the Bitcoin currency, and it is very decentralized. And there needs to be some way for people to agree as to what the true ledger is, or what the true set of transactions that have happened are. And the way that is done is by relying on something called the proof of work. And we'll dive into that shortly as well. So again, cryptocurrencies, that data is a ledger of transactions, each of which is digitally signed using the digital signature technique we've just discussed by the person who made or initiated that transaction. And that ledger is decentralized, which means that any time there's ever a change, any time any transaction is recorded, in this case, using Bitcoin, again, our lens through which to consider blockchain, that message is broadcast out. So if I make a transaction in Bitcoin, I pay you $10, I'm going to announce to everyone else who has a Bitcoin wallet or who is monitoring the blockchain, the list of transactions, hey, please add the following transaction to this list, Doug pays you $10. And that is announced to everybody, everybody records it in their ledger, and then some stuff is going to start happening. But here is a potential issue. How do you know that the blockchain is legitimate? How do you know that your copy of what is being said is the truth? How do you know that your copy of the blockchain is accurate with respect to all other transactions that have happened? Everybody else has their own copy as well. It's decentralized. We all maintain, anybody who's using Bitcoin maintains their own copy of the blockchain. How do you defend against people modifying it? That's a very interesting question. The way that cryptocurrencies do it is to assume-- and this is defined in the Bitcoin paper-- the way the cryptocurrencies do it is to assume that the chain that has the most computational work put into it is the true chain. This decision is completely arbitrary. There's no reason why one needs to be vetted over the other. But something had to be agreed upon by, collectively, users of Bitcoin to say in the event of a dispute, between which person's chain is the accurate de facto definitive list of transactions? We're going to go with the one that has been verified the most times. And again, this word verified is sort of a sketchy word. There's nothing inherently about proof of work or anything else that proves that a transaction has taken place in the way that we normally think of this term verified. Rather it is the collective standard by which we all agree to adhere, that the person-- or that the blockchain that has the most proof of work in it is the list. That is just something we must subscribe to as users and consumers of blockchain. Now how do we determine which blockchain has had the most computational work into it, which copy of the blockchain has had the most computational work put into it? Well, this is proof of work. So proof of work is how the correct blockchain of all the copies that are decentralized is determined. So recall how hashing works. Hashing allows us to take any arbitrary data and run it through a hash function and get an outcome. And that outcome is going to be, let's say 256 bits, each of those bits being, of course, 0 or 1. Now there's a lot of different combinations there. But some of them will be very unique. And the way Bitcoin works, Bitcoin's blockchain works is to prove a particular block. We are asking people who are oftentimes called miners-- that's where this term comes from because they are mining. , Ultimately the reward for doing this proof of work is to receive Bitcoin that are sort of generated out of thin air. And so these people are termed miners. But we are asking anyone who has a computer to hash the entire block. So hash the entire list of transactions, the reference to the previous block and the next block. And remember, all of that is contained in a single node of this blockchain, basically. And we're looking for a highly unusual pattern. We're looking for maybe the first 30 bits or the first 40 bits to all be zeros. That's really weird. Like, that's a really difficult pattern to find. And the only way to do it is to guess. So you take this entire block, you attach a single piece of data to the bottom of it, like 1, 2, 3. You can just count in that way trying to guess. And if you hash that entire thing together, do you eventually find a block that, when hashed in this way, produces this very, very unique pattern? If so, you just say, here's the number that I attach. So let's say I took the entire block and I hashed it with 12345 was the number, right? It's very difficult to find a value that would create this unique pattern of zeros and ones, in particular, zeros, 30 zeros in a row. But it's really, really easy to verify that someone has done it. To verify that someone has done it, all you have to do is if they announce the number that they used, 12345, as their proof of work-- and that's what the proof of work really is, it's that number that they use to figure it out-- if they announce that and you hash the block with that number, you can verify, yes, that pattern is actually 30 zeros in a row. So I guess you have proven it. Now this is, again, kind of arbitrary. Like, this seems weird. Why are you spending all your time trying to figure out a specific pattern that exists somewhere? That is a question that I cannot answer other than to say that it is the standard by which people who have ascribed to the Bitcoin standard have just agreed to be bound by. The person who finds this number is probably the-- is proving the validity of all the transactions above it. And this gets interesting when you think about somebody trying to perpetrate a fraudulent transaction. So imagine I'm trying to perpetrate a fraudulent transaction by initiating a transaction that says, I'm going to pay you $100. And I announce that to you, but I don't broadcast it to everybody else who maintains the blocks, who are maintaining their own copies of blockchains. Which is interesting because you think that I have spent $100, and as far as you're concerned I have spent $100 to you, but no one else is aware of that. So no one else thinks that I have spent $100. They all think I am $100 wealthier than I actually am. The problem then arises that I need to verify that block. I need to verify that transaction. So I append the transaction to my own copy of the blockchain because I am the only person other than you-- the two of us maybe have these copies of the blockchain, but everybody else, I didn't broadcast this transaction so no one else knows about it. In order for it to have a proof of work attached to it, in order for it to be considered the valid chain, I would need to prove that block. I would need to find that secret number that when hashed with the entire block, produces a pattern of 30 consecutive zero bits before anybody else does. So that's a 1 in 2 over 2 to the 30th power chance because I'm looking for a pattern of 30 consecutive zeros. There's a 1 in 2 to the 30th power chance that I'm going to find that pattern. And I have to find that pattern before somebody else. And in the meantime, other transactions are coming in on my ledger. On my-- other people are broadcasting their transactions. And I have to keep adding them to my ledger and keep proving that work over and over and over, all the while trying to stay ahead so that my fraudulent transaction is considered ultimately the correct blockchain. Now the odd-- you just can't beat the odds of that. One malicious person trying to perpetrate a fraudulent transaction using the blockchain cannot stay ahead. They can't win the find the secret number game over and over and over and over. Eventually, some other chain, which contains valid transactions, will win out over my attempted fraudulent chain. And it will be disregarded. Nobody will consider that to be a valid part of the chain anymore. And so that's kind of how this works. Again, it's arbitrary the way they decide to resolve or verify. There's nothing about this process that proves that person A sent person B money. It's just the consensus that we have decided, well, if people have gone through the effort to try and find these secret numbers, and many different people are doing it, and this one chain is longer than the others because it's been verified-- again, using this term verified-- it's been proven with work over and over and over, we're just going to agree that that's the right one. So again, it's kind of strange. And I do, again, refer you to that video that I shared earlier to get into some of the more technical details of this, which I'm glossing over a little bit here in this discussion. But proof of work is basically the collective consensus of blockchain users, or in this case specifically, of Bitcoin users, for which transactions they are going to consider valid. Because changing any one-- and if you go back in time, as opposed to trying to forward think I want to add a new fraudulent transaction, if you try and go back in time to modify a transaction from the past, say there was a transaction that was you pay me $10 and I maintain a copy of the blockchain, so I can go back in time and modify that file, technically, I change it to you pay me $100, well, because I've changed even the tiniest thing and I'm hashing that block, that means that when I hash it with that secret number, I'm no longer getting that secret pattern of 30 numbers, 30 zeros in a row. And so that kind of calls that transaction into question. It also, because each of those blocks contains a reference to the next block and the previous block, it also invalidates all of the other transactions in that blockchain. And so because of this weird technique we're doing of hashing blocks, hashing data, trying to look for specific patterns, but realizing that any cryptographic hash function with the tiniest change to the input creates a totally different output, we actually are pretty well defended against people who try and go back in time and make fraudulent transactions using the blockchain. So it's mathematical and it's quirky, but it does provide a clever way to defend against that kind of thing, considering we don't have a central authority to rely on to adjudicate these kinds of disputes. We are collectively, not trusting one another enough, but agreeing to trust the mathematics of the blockchain in order for it to succeed. So as I mentioned, we can very easily verify the correctness of someone's proof of work. That proof of work is just the number that is hashed with the block to produce the secret pattern of 30 zeros and then some other bits, and so on. The longer a chain gets, the more and more likely it is that all the transactions in it are "verified." Again, I keep putting air quotes around that word because it doesn't mean in exactly the same way that we might consider verified colloquially to mean. It doesn't prove anything about the transaction itself, just that we accept it as the standard. We accept this as the de facto truth because of all the mathematics that have been put into it. So the longer a chain gets, the more likely it is that it consists of only verified, legitimate transactions. But that brings up a question of, what is a transaction? A transaction is just an exchange between two people. And if we start to really spread things out, we can almost think about a transaction as a contract. I offer you $10 for you to do something on my behalf, and assuming that we're intending for me to actually give you these $10, and you're intending to actually do something on my behalf, and the thing that you're doing for me is not illegal, we've basically formed a contract. And so while Bitcoin can be used, the blockchain for Bitcoin can be used to send money back and forth between people, the data that goes into the data block of any blockchain is arbitrary. And there's no reason why, instead of being a list of transactions, that data couldn't be something much more significant than that. There's no reason it couldn't be a digitally signed PDF scan of a contract between two people. There's no reason it can't be a message from me typed to you saying, I will pay you $100 if you paint my house on Tuesday, and you sending something back in that same chain saying, I will paint-- I accept your offer for this payment. I accept your offer. I will paint your house on Tuesday in exchange for $100. We've just formed a contract with no middleman at all. We are announcing our intentions. It is being recorded publicly in everybody's version of the blockchain. There is verified, again, verified in the sense that we collectively term to be accurate rather than proving that I definitely sent this although the digital signatures associated with these transactions do, again, suggest yes, I am the person who made this transaction because I digitally signed it. If I do the same thing with a contract, if I send you an offer and you accept, and both of those items are in the chain, we arguably have formed a contract. And that is what the blockchain associated with the Ethereum technology is actually more akin to. So Bitcoin is kind of restricted in how it approaches cryptocurrency and approaches transactions between people. And Ethereum opens up a little bit more. And there are other blockchain technologies and other services that rely on the blockchain in order to do things far beyond what a cryptocurrency could do. But all these things are only possible because we rely on-- we rely so extensively on cryptography. We use computers to send information securely, encrypt information. And the mathematical unlikelihood of someone being able to duplicate our work, or certainly reverse engineer this encryption is what gives us the confidence to make these transactions in the first place. And so cryptography forms the basis of almost everything that we do when we talk about security on a computer. But ultimately, cryptography just relies on mathematics. So the moral of the story is probably this. You are probably not going to be implementing your own version of the blockchain. And really, you don't need to understand it completely in order to use it. Like I said, you can use Bitcoin without knowing the mathematics of how Bitcoin works, just like you can use a bank without knowing the minutia of how the banking system works. The point of the blockchain is to remove a central authority. We don't rely on one person or one entity or one government to determine what has happened, what the transactions are like we do with a bank. Your bank has a ledger of everybody's accounts. With blockchain technology, we are decentralizing this and making it so that everybody has access to all of the information at once, and it is everybody's responsibility to keep that ledger accurate. And because these ledgers rely so extensively on cryptography, because this technology relies on cryptography, we can use the power of cryptography, the fact that things are very difficult to reverse engineer mathematically to verify that yes, these are the things, these are the things that have happened, these are the transactions that have been logged, and everybody knows about it at the same time.
B1 中級 2019年律師CS50 - 密碼學 (CS50 for Lawyers 2019 - Cryptography) 8 0 林宜悉 發佈於 2021 年 01 月 14 日 更多分享 分享 收藏 回報 影片單字