[MUSIC PLAYING]
BRIAN YU: OK, let's get started.
Welcome, everyone, to the final day of CS50 Beyond.
And the goal for today is going to be to take a look at things
at a bit of a higher level.
There is going to be less code in today's lecture.
The focus of today is on two main topics--
security and scalability-- which are both important as you
begin to think about what happens once you've written all this code for your web application.
You're ready to deploy it so that people can actually use it.
What are the sorts of considerations you need to bear in mind?
What are the security considerations in making
sure that wherever you're hosting the application, you and the application
itself are secure and that your users are secure from potential vulnerabilities
or potential threats?
And also, from a scalability perspective,
we've been designing applications that so far probably only you
or a couple other people have been using.
But what sorts of things do you need to think about
as your application begins to scale, as more and more people begin to use it,
and you have to begin to think about this idea of multiple people trying
to use the same application at the same time?
So a number of different considerations come about there.
We'll show a couple of code examples.
But the main idea of this is going to be high level, just thinking abstractly,
sort of trying to design the product, trying to design the project,
trying to figure out how exactly we need to be adjusting our application
to make sure that it's secure and to make sure that it's scalable.
So we'll go ahead and start with security.
And on the topic of security, we're going
to look at a number of different security considerations
as we move all throughout the week, from the beginning of the week
until the end of the week, thinking about the types of security
implications that come about.
And so one of the first things we introduced in the class was Git,
the version control tool that we were using
to keep track of different versions of our code
in order to manage different branches of our code, so on and so forth.
And so there are a couple of important security considerations to be aware of with
regard to Git.
You all probably created GitHub repositories
over the course of this week, maybe for the first time.
And GitHub repositories by default are public.
And this is in the spirit of the idea of open source software, the idea
that anyone can see the code.
Anyone can contribute to the code.
And that, of course, comes with its trade-offs.
On one hand, everyone being able to see the code certainly
means that anyone can help you find and identify bugs.
But it also means that anyone on the internet can see the code,
look for potential vulnerabilities, and then
potentially take advantage of those vulnerabilities.
So there are definitely trade-offs, costs, and benefits that
come along with open source software.
And another thing just to be aware of, we mentioned this earlier in the week,
but your Git commit history is going to store the entire history of any
of the commits that you have made, as the name might imply.
And so if you make a commit and you do something
you shouldn't have done-- for instance, you make a commit that accidentally
includes database credentials or a password inside of the commit
somewhere-- you can later on
make another commit that removes those credentials.
But the credentials are still there inside of the history.
If you go back, you could still find the credentials
if you had access to the entire Git repository
and could go back and find that point in Git's history.
So what are the potential solutions for if you do something like this,
accidentally expose credentials at some point in the repository
and then remove them?
What could you do?
Yeah?
AUDIENCE: Change the credentials.
BRIAN YU: Certainly.
Changing the credentials, something you should almost definitely do.
Change the password.
It's not enough just to remove them and make another commit.
And there's also something you can do to purge the Git history, where
you effectively rewrite the commit history,
so to speak, in order to replace that, as well.
But even that, if it's been online on GitHub,
who knows who may have been able to access the credentials?
So definitely always a good idea to remove those, as well.
On the first day, we also took a look at HTML.
We were designing basic HTML pages.
And there are a number of security vulnerabilities
you could create just with HTML alone.
Perhaps one of the most basic is just the idea that the contents of a link
can differ from where the link takes you to.
It's probably a pretty obvious point that you often
have text that links you to a particular page.
But this can often be misleading and is commonly
used in phishing email attacks, for instance,
whereby a link's text shows you one URL,
but clicking on it actually takes you to a different URL entirely.
Or I can have situations where I could--
let's go into link.html--
I have a link that presumably takes me to google.com.
But if I click on google.com, it could take me anywhere else--
to some other site, for instance.
And the way that it does that is quite simply by just
having a link that takes you to a URL, but the contents of that URL
are something different or something else entirely.
And so that alone is something to be aware of.
But that problem is compounded when you consider the idea
that even though your server-side code-- application code
you write in Python and Flask, for instance--
you can keep secret from your users, HTML code is not
kept secret from users.
Any users can see HTML and do whatever they want with it.
And so on the first day, you may have been
trying to take a look at an HTML page and try and replicate it
using your own HTML and CSS, for example.
The simplest way to do something like that
would just be to copy the source code.
So I could go to bankofamerica.com, for instance, Control-Click on the page,
view the page source, and all right.
Here's all the HTML on Bank of America's home page.
I could copy that, create a new file, and call it bank.html.
Paste the contents of it in here.
Go ahead and save that.
And now, open up bank.html.
And now, I've got a page that basically looks like Bank of America's website.
And now, I could go in.
I could modify the links, change where Sign In takes you to,
make it take you to somewhere else entirely.
And so these are potential threats, vulnerabilities,
to be aware of on the internet that are quite easy to actually do.
So this is less about when you're designing your own web applications
but, when you're using web applications, the types of security
concerns to definitely be aware of.
So let's keep moving forward in the week-- yeah, question?
AUDIENCE: Can you copy JavaScript source code in the same way?
BRIAN YU: Yes.
Any JavaScript code that is on the client, you can access
and you can modify.
You can change variables and so on and so forth.
And this is actually a pretty easy thing to do.
So if I go to like, I don't know, The New York Times website, for instance,
and I look at the source code there--
let me go ahead and inspect the element, and I'll
try and hover over a main headline.
OK.
This is the name of a CSS class.
You could access any JavaScript.
You can also run any JavaScript in the console arbitrarily.
So I could say, all right, document.querySelectorAll-- let's
get everything with that CSS class.
Or maybe it's just the first one, because it's two CSS classes.
All right.
Great.
I'll take the first one, set its innerHTML to be,
like, welcome to CS50 Beyond.
And you can play around with websites in order to mess around, change them.
So all of the JavaScript CSS classes, all of that,
is accessible to anyone who is using the page, for example.
Other questions before I go on?
Yeah.
AUDIENCE: Any thoughts on JavaScript obfuscation?
BRIAN YU: JavaScript obfuscation-- certainly something you can do.
So since JavaScript is available to anyone who has access to the web page,
there are programs called JavaScript obfuscators
that basically take plain, ordinary-looking JavaScript
and convert it into something that's still JavaScript
but that's very difficult for any human to decipher.
It changes variable names and does a bunch of tricks in JavaScript
to still execute the exact same way but that looks quite obscure.
Definitely something you can do.
Still not totally foolproof, because there are ways
of trying to deobfuscate JavaScript code, at least to some extent.
So it's not perfect, but definitely something that you can do.
Other things?
All right.
Let's take a look at--
OK, when we were writing Flask applications,
we were writing web servers.
And so one thing that's just good to know from a security perspective
is the difference between HTTP, the Hypertext Transfer Protocol,
and the secure version of it, HTTPS.
And that has to do with the idea that on the internet,
we have computers and servers that are trying to communicate
with each other, trying to send information back and forth.
And when these computers send information back and forth,
we would like for that to happen securely,
because when one computer is sending information to another computer,
that information is going through a number of different routers.
And at each of those routers, information could hypothetically
be intercepted.
Someone could try and intercept a packet on its way from computer number
one to computer number two.
So how do we securely try and transfer information from one location
to the other?
And this has to do with the entire field of cryptography,
which is a huge field that we're only going to be
able to barely scratch the surface of.
But the basic idea here is that we would like some way
to encrypt our information, that if I have some plain text that I would like
to send from my computer to someone else's computer,
I would like to encrypt that plain text, send it across in some encrypted way,
such that the person on the other end could decrypt it.
And so this is perhaps a more sophisticated version
of what you might have done in CS50's problem set two
when you were using the Caesar or the Vigenere cipher
in order to encrypt something.
The ciphers that are used in computing on the internet, for instance,
are just much more secure, for example.
But they follow a similar principle.
And so one form of cryptography is called secret-key cryptography,
where the idea is that if I am a computer up here
and I have some plain text that I want to encrypt,
I also have some key that only I know.
And I can take the plain text, and I can take that key
and run an algorithm on it.
And that generates some ciphertext, some encrypted version of the plain text
that was encrypted using the key.
I can then send that ciphertext along to the other person.
And so long as the other person has both the ciphertext and the key
that was used to encrypt it, they can reverse the process
and decrypt it, generating the plain text from it.
That way, the ciphertext is transferred, not the plain text,
from one side to the other side of this communication.
And so long as both parties in this instance have access to the same key,
they can encrypt and decrypt messages at will.
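To make that concrete, here is a minimal Python sketch of secret-key encryption.
It assumes the third-party cryptography package is installed; any symmetric
cipher library would illustrate the same point.

from cryptography.fernet import Fernet

key = Fernet.generate_key()                       # the shared secret key
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"some plain text")   # encrypt using the key
plaintext = cipher.decrypt(ciphertext)            # decrypt using the same key
print(plaintext)                                  # b'some plain text'

Both parties need that same key, which is exactly where the trouble starts.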
Why doesn't this quite work on the internet, though?
What is the problem with this model?
Yeah?
AUDIENCE: If you're sending the key as well as the ciphertext,
then it's just as revealing as sending the plain text in the first place.
BRIAN YU: Exactly.
When we transfer the ciphertext across, the other person
also needs access to the key.
We need to transfer the key across the internet,
as well, to give it to the other person.
And so anyone who is intercepting the ciphertext
could also have intercepted the key and therefore could
have decrypted the information and gotten the plain text
as a result of it.
So this secret-key cryptography, ultimately, it
doesn't work in the context of the internet
if it needs to be the case that the key is just
transferred across the internet.
Now, you could try encrypting the key, for example.
But then whatever key you used to encrypt the key,
that also needs to be sent across the internet,
and you end up with this recursive problem where you can never figure out a way
to make sure that information can be transferred securely.
So the solution to this lies in a different idea called public-key
cryptography, where the idea here is that instead of having one key,
we'll have two keys--
one called a public key, one called a private key.
And the idea here is that a public key is something you can share with anyone.
Doesn't matter who has it.
And a private key is a key that you keep to yourself
that you don't give to anyone, even the person that you're
trying to communicate with.
And because we have two keys, each key is going to serve a different purpose.
They're going to be mathematically related.
And take a theory of computing class if you
want to understand the exact mathematics behind this.
But the basic idea is that the public key can be used to encrypt messages,
and the private key can be used to decrypt messages that
were encrypted using the public key.
And so what does this model look like?
Well, I have some public and private key.
And if I want some other person to send me information,
I will give them my public key.
Just give the other person the public key so that they have access to it.
Remember, the public key is used to encrypt data.
So they can use the public key and encrypt the plain text,
generate some ciphertext.
And then all the other person needs to do is send me that ciphertext.
The ciphertext comes across to me.
And I now have the private key, the key that I
can use to decrypt the information.
And using the private key and the ciphertext,
I can then decrypt the message and generate the plain text.
So this is the basic idea of public-key cryptography,
this idea that we use a public key to encrypt information and a private key
to decrypt information.
And by separating this out into two different keys,
we can share the public key freely without needing
to worry about the potential for internet traffic
to be intercepted and decrypted, for example.
And so this is the basis on which internet security works.
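As a rough sketch of what this looks like in code-- again assuming the
third-party cryptography package-- the receiver generates the key pair, hands
out the public key, and keeps the private key to themselves:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# The receiver generates the key pair once and shares only the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Anyone with the public key can encrypt...
ciphertext = public_key.encrypt(b"hello", oaep)

# ...but only the holder of the private key can decrypt.
plaintext = private_key.decrypt(ciphertext, oaep)

The exact function names don't matter for the concept; the point is just that
encryption uses the public key and decryption requires the private one.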
Yeah?
AUDIENCE: What if someone else intercepts the ciphertext
and they also have a private key?
Would they be able to decrypt it?
BRIAN YU: If someone else intercepts the ciphertext and they have a private key,
they won't be able to decrypt it, because the private key
and the public key are mathematically related in such a way
that if you encrypt something with a public key,
you can only decrypt it with the corresponding private key.
And so generally speaking, you'll generate both the public
and the private key at the same time, such that only messages encrypted
with one can be decrypted with the other.
So you can't just have some other random private key and decrypt the message.
It can only decrypt messages from the public key.
AUDIENCE: So how did this person get that specific [INAUDIBLE]??
BRIAN YU: So this person down here generated both the public
and the private key at the same time.
There's just an algorithm that you can use to randomly generate
a public and private key.
You share the public key with anyone you want to be able to send you messages.
That person you share it with can use the public key to encrypt the message.
And then you, the person who generated these keys,
can take the encrypted message, use the private key that you generated,
and get the plain text out of that.
Yeah?
AUDIENCE: How difficult is it to get the private key from the public key?
Is it impossible?
BRIAN YU: How difficult is it to get the private key from the public key?
Long story short, we don't really know.
We think it is very difficult to do.
We think that it would take a very long time.
If you took a computer and tried to get it to go from the public key
to the private key, we think it would probably take billions or trillions of years
or more, even if the computer was operating at top speed doing this calculation.
But no one has been able to technically prove that it is difficult.
And so this is a big open question in computing right now.
You can take a theory of computation class
for more information on this sort of thing.
But there are some open unsolved problems in computing,
and this happens to be one of them.
Yeah?
AUDIENCE: Is it based on primes and very large primes, and you
multiply them together?
BRIAN YU: Yes, this is basically the idea of very large prime numbers
that you multiply together.
The long story short of it is that it's based on the idea
that there are some mathematical operations that are easy
and some mathematical operations that are believed to be difficult.
And if you take two very big prime numbers,
a computer can multiply those numbers very easily
and calculate what the product of those two numbers is.
It's just a simple multiplication algorithm.
But if you have that result, that big multiplied prime number,
it's very difficult to factor that number
and figure out which two prime numbers were multiplied together
in order to generate that number.
And nobody has been able to come up with an efficient algorithm for factoring
it.
And so as a result, because we believe factoring numbers to be
a very difficult problem, we use it as the basis
for computing security on the internet.
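As a purely illustrative toy (not real RSA), you can see the asymmetry in a
few lines of Python: multiplying two primes is instant, while even naive
trial-division factoring of their product is visibly slower-- and hopeless at
the hundreds-of-digits sizes real keys use.

def factor(n):
    # Naive trial division: find the smallest factor pair of n.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return n, 1   # n is prime

def next_prime(n):
    # Smallest prime greater than or equal to n (fine for a toy example).
    while factor(n)[1] != 1:
        n += 1
    return n

p = next_prime(10**7)        # a small prime
q = next_prime(10**7 + 100)  # another small prime
product = p * q              # multiplying: effectively instant
print(factor(product))       # factoring: already takes a few seconds in Python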
Brief teaser of theory of computation.
Take any of the 120 series here at Harvard, at least,
for more information about that.
Other things?
Some other security considerations when designing web applications
to be aware of-- we mentioned this before,
but when it comes to storing credentials,
you should generally always store credentials
in environment variables inside of your application
rather than having some password inside of your Python code,
whether it's the secret key of your application,
whether it's the credentials to your database,
or whether it's some other credential, like an API key,
that you're using the server to access.
Usually best not to put that in the code in case someone else
gets access to the code.
Generally best to put it in an environment
variable, a variable that's just stored in the command line environment
where your server's being run from.
And then add code that just pulls the credentials from the environment.
In Python, at least, you can use os.environ.get
to get some information from the application's environment.
And this is generally going to be a more secure way of doing the same thing.
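A minimal sketch of what that looks like in a Flask application-- the variable
names DATABASE_URL and SECRET_KEY here are just illustrative assumptions:

import os

from flask import Flask

app = Flask(__name__)

# Pull secrets from the environment rather than hard-coding them in the source.
app.config["SECRET_KEY"] = os.environ.get("SECRET_KEY")
database_url = os.environ.get("DATABASE_URL")

if database_url is None:
    raise RuntimeError("DATABASE_URL is not set")

You would then set the variables in the shell before starting the server, with
something like export DATABASE_URL=... on Mac or Linux.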
Yeah?
AUDIENCE: How do we do that in Heroku if we
want to upload our code to the website?
BRIAN YU: Yeah.
So if you're uploading this to Heroku, if you go to your Heroku application
and go to the Settings panel, there is a section,
I think it's called config vars, that basically just lets you add environment
variables to the Heroku application.
And that will automatically set those environment variables such
that when you run the application, it can
draw from those environment variables.
Yeah?
AUDIENCE: Is it [INAUDIBLE] yesterday, or is that something
you can't have access to?
Because if you just did [INAUDIBLE] and then the key,
it goes away when you close the terminal, correct?
BRIAN YU: Yes.
So that's true.
So you can certainly, on your own computer,
set aliases or environment variables inside
of your profile that automatically set credentials in a particular way.
The idea is that you never want to be taking those credentials
and committing them to a repository that other people might
be able to see, for instance.
That's where things start to get less secure.
OK.
Moving on in the week to talk about some other security considerations.
We'll talk about SQL, the idea of databases.
And when we introduce databases, there are a lot of security considerations
that come about.
But we'll just touch on a couple of them.
The first is how you store passwords.
So you can imagine that inside of a database,
you might be storing users and passwords together.
And maybe we have a whole users table that has an ID column,
a column for people's usernames, and a column for people's passwords.
And you could imagine just storing passwords inside of the row.
But why is this not particularly secure?
Yeah?
AUDIENCE: If anyone gets access to the data table,
they can see what all the passwords are.
BRIAN YU: Exactly.
If anyone gets access to the database, they immediately
have access to all of the passwords.
And this is probably not a secure way to go about things,
because you probably hear in the news from time
to time that databases aren't perfectly secure, that every once in a while,
there's some big security vulnerability where someone's able to get access
to passwords inside of a database.
And that becomes a major security concern.
And so one way to try and mitigate this problem
is, instead of storing passwords inside of the database,
store a hashed version of the password.
A hash function, as you might recall from CS50, just takes some input
and returns some deterministic output.
And a hash function can generally take any input password
and turn it into what looks like a random sequence of letters
and numbers.
And the idea here is that it's deterministic.
The same password will always result in the same hash value.
So when someone tries to log in, when they type in their password,
rather than literally comparing their password
against the password stored in this column,
you can say, all right, let's hash the password first.
And if the hashes match up, then with very high probability,
the user actually signed in to the website with the correct password.
And you can then log the user in.
And now, if someone was able to get access to the database,
they wouldn't get access to all the passwords.
They would only get access to the password hashes.
Now, it's still a security vulnerability,
because someone could, in theory, be able to figure out
information about the password from the password hashes.
But better, certainly, than literally storing the raw text
of the password in the database.
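A minimal sketch of that flow, using the password-hashing helpers from
Werkzeug, which ships with Flask. The users table and the SQLAlchemy-style db
object here are illustrative assumptions:

from werkzeug.security import check_password_hash, generate_password_hash

def register(db, username, password):
    # Store only the hash, never the raw password.
    pw_hash = generate_password_hash(password)
    db.execute("INSERT INTO users (username, hash) VALUES (:u, :h)",
               {"u": username, "h": pw_hash})

def login_ok(db, username, password):
    # Hash the submitted password and compare against the stored hash.
    row = db.execute("SELECT hash FROM users WHERE username = :u",
                     {"u": username}).fetchone()
    return row is not None and check_password_hash(row[0], password)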
Yeah?
AUDIENCE: Do we know how the hash functions generate that code?
BRIAN YU: Yeah.
The hash functions tend to be deterministic,
and you can look up what the hash functions themselves are.
So there are a couple of quite popular hash functions
that are out there that do this sort of thing.
But the idea of the hash function is similar to the idea
of public and private keys, that it's very easy to hash something,
and it's very difficult to go in the other direction.
I can easily hash a password and generate
something that looks like this.
But it's a difficult operation to take something that looks like this
and go backwards and figure out what it was that the original password was.
And so that's one of the properties of a good hash function.
Yes?
AUDIENCE: Did you actually hash these, or did you just hit the keyboard?
BRIAN YU: I think these are probably--
there might be hidden messages here if you look carefully.
But separate issue.
Other things?
OK.
So how is it that potential data is leaked as a result of using a database?
Well, there are a number of ways that applications can inadvertently
leak information.
Take a simple example.
Oftentimes, you'll see websites that have a Forgot Your Password
screen where you type in an email address, and you click Reset Password.
And that sends you an email that allows you
to reset your password, for example.
And you imagine that you type in an email address,
and you get, OK, password reset email has been sent.
But maybe some applications work such that if you type
in an email address that doesn't exist, then
you get an error that says, OK, error.
There is no user with that email address.
What data has this application now exposed?
What information can you get just by using this part of a web application,
for instance?
Yeah?
AUDIENCE: You know that that email address is not in the system,
so you know that person is not using that app.
BRIAN YU: Yeah, exactly.
Just using the Forgot Password part of this application,
you can tell exactly who has an account for this application
and who doesn't just by typing email addresses and seeing what comes back.
So there's potential vulnerabilities in terms of data
that gets leaked there, as well.
And there are all sorts of different ways that information can get leaked.
There's also a growing class of attacks, often called timing attacks, whereby
you can get information about the data inside of a database
just based on the amount of time it takes for an HTTP request
to come back.
If you make a request that takes a long time, that
can tell you something different than if a request comes back
very quickly, because that might mean fewer database queries
were required in order to make that particular operation work,
or any number of different things.
And so there are security vulnerabilities there, as well.
Final one.
I'll briefly mention the SQL injection.
We've already talked about that.
But again, something to be aware of just to make sure
that whenever you're making database queries,
you're protecting yourself against SQL injection,
that you're making sure to either use a library that takes care of this for you
or escape any characters that you might be using that
could ultimately result in vulnerabilities in SQL.
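For instance, here is a minimal sketch with Python's built-in sqlite3 module--
the database file and table are hypothetical-- where the placeholder lets the
library do the escaping for you:

import sqlite3

def find_user(username):
    db = sqlite3.connect("application.db")   # hypothetical database file
    cursor = db.cursor()

    # Vulnerable: input like "alice' OR '1'='1" would change the query's meaning.
    # cursor.execute(f"SELECT * FROM users WHERE username = '{username}'")

    # Safe: pass the value separately and let the library escape it.
    cursor.execute("SELECT * FROM users WHERE username = ?", (username,))
    return cursor.fetchone()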
Yeah?
AUDIENCE: How about the websites or tools
like LastPass that store your credentials for other sites?
Don't they have to have some way of reversing their own hash on it
in order to give you that credential when you go to another site?
So when it auto fills your username and password,
it has to-- if they're storing a hashed version on their side but filling
in the plain text version in the password field,
how are they able to reverse that in a way that is secure?
They would have to have a table of keys or something
that then is just as vulnerable as leaving the password.
BRIAN YU: Yeah.
So for password manager-type applications, it's a good question.
I think the way most of them do this is that you have a master password that
unlocks the entire database of the passwords that are stored there.
And the idea would be that the stored passwords are encrypted
using the master password as the key that unlocks them.
And only by getting the master password correct
can you then decrypt the information and then
access the plain text version of the passwords that are inside.
And so hashing and encryption and decryption are slightly different.
In the case of encryption and decryption,
you still want to be able to go from the ciphertext back to the plain text,
whereas in the case of the password hashing,
you don't really care about the ability to reverse engineer it to go backwards.
All right.
And finally, on the topic of security, we'll
talk a little bit about JavaScript.
JavaScript opens a whole host of different potential vulnerabilities
from a security standpoint.
But we'll talk about a couple.
The first is this idea called cross-site scripting,
or the idea of taking a script and being effectively able to inject it
into some other site by putting some JavaScript that the web
application didn't intend into the web application itself.
And so here's a very simple web application written in Flask.
And this is the entire web application.
It's got a route, a default route, called / that just returns, "Hello,
world!"
And it's got an error handler that we didn't really see in the class.
But basically, it handles whenever there's
a 404 error, whenever you're trying to access a page that was not found.
And it just returns, "Not found," followed by request.path, whatever
URL it was that you requested.
And so I could run this application.
I'll go ahead and start up Chrome, and I'll go ahead
and go to the source code for XSS1.
I'll run this application.
Go here.
It says, "Hello, world!"
And if I go to helloworld/foo, for example, some route that doesn't exist,
I get not found, /foo, because that's not a route that's available on this
page.
I go to /bar.
Not found, /bar.
What could go wrong here?
Where's the security vulnerability, again,
thinking in the context of JavaScript?
The page my application is returning is literally just "not found"
followed by whatever was typed into the request path.
And so you could imagine that instead of requesting /foo,
I could instead make a request that looks something like
/<script>alert('hi')</script>, for instance,
injecting some JavaScript into the request path.
So if I do that-- I type /<script>alert('hi')</script>.
Press Return.
And OK, Chrome is being smart about this.
Chrome actually isn't allowing me to do this,
because Chrome has some more advanced features that basically
say Chrome detected unusual code on this page
and blocked it to protect your personal information-- an error
blocked by the XSS auditor.
That's cross-site scripting.
So Chrome is automatically auditing for this.
But not all browsers are like that.
And I can, I think--
let's see if I can disable--
if I disable cross-site scripting protections,
I think I can get this to-- yeah, OK.
Disabling cross-site scripting protections,
we can still type in the URL and actually get some JavaScript
that the page didn't intend to run on this particular web page.
And so if someone were to send you a link that took you to this page,
/<script>alert('hi')</script>, you could get JavaScript to run that you
didn't intend.
And maybe that's not a big deal.
But it could be a bigger deal in a situation that
looks like this, where we have JavaScript
and document.write is a function that just adds something to the page.
And here, we're loading an image, img src,
and the source is some hacker's website.
And then we say, cookie= and then document.cookie.
Document.cookie stores the cookie for this particular page.
And so effectively, what's happening in this script
is that your page, when you load it, is going to make a web
request to the hacker's URL.
And it's going to provide it as an argument whatever
the value of your cookie is, for instance.
And that cookie could be something that you use in order
to log in as the credentials for some website,
like a bank application or whatnot.
And as a result, the hacker now has access
to whatever the value of your cookie is, because they
can look at their list of all the requests
that have been made to the application much in the same way
that you've been able to do in the terminal
to see all the requests for your Flask application.
And they can see that someone requested hacker_url?cookie= this cookie,
and they can then use that cookie to be able to sign in to other sites,
as well.
So most modern browsers, like Chrome, are
pretty good at defending against this sort of thing.
But definitely something that is a potential vulnerability, especially
for older browsers.
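On the server side, the usual fix for an application like the one above is to
escape anything that came from the request before echoing it back. Here is a
minimal sketch using markupsafe, which Flask already depends on; rendering
through a Jinja template would escape it automatically, too.

from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, world!"

@app.errorhandler(404)
def not_found(error):
    # escape() turns <script> into &lt;script&gt;, so the browser displays it
    # as text instead of executing it.
    return "Not found: " + str(escape(request.path)), 404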
Questions about this cross-site scripting?
Yeah?
AUDIENCE: Are you getting the user's cookie,
or whose cookie are you getting there?
BRIAN YU: Whoever opens the page.
So the user's cookie, potentially on an entirely different site.
The idea is that if your site is vulnerable to cross-site
scripting in this form, then you open up a possibility
where someone could generate a link to your website that
includes some JavaScript injected like this whereby someone else could
steal the cookies of your users on your website.
And they could get the cookies for themselves
and use those cookies to sign into your website
and pretend to be people that they're not, for example.
There's a potential security threat there.
So cross-site scripting is one example of a JavaScript vulnerability.
Another vulnerability is called cross-site request forgery.
Imagine that you have a bank website, for instance,
and that bank gives you a way to transfer money.
And if you go to that URL /transfer and then you provide arguments as to who
you're transferring money to and how much money you're transferring,
you can transfer money.
Might be a web request that allows you to do that.
Imagine some other website, some website where
hackers are trying to steal money, where they have code that
looks a little something like this.
They have a link that says, "Click Here!"
And when you click on the link, that takes you to yourbank.com/transfer
transferring to a particular person, transferring a particular amount.
And some unsuspecting user on this website could click the button.
And as a result, that takes them to their bank.
And if they happen to be logged into their bank at the time,
that could result in actually making that transfer.
So cross-site request forgery is the idea
that some other site can make a request on your site by, in this case,
linking to it.
This still isn't an enormous threat, because the person actually still needs
to click on the link in order to actually go
to yourbank.com/transfer/whatever.
But you can imagine that a clever hacker might be able to get around this
by doing something like this--
rendering an image, for example, and saying the source of the image
is going to be this.
And when a browser sees an image tag in the HTML, it's just going to go to that URL
and try and download that image.
It's going to go to the URL, try and fetch that resource.
And here, that resource is yourbank.com/transfer and then
transferring that money.
So the user doesn't even have to click on anything.
And by making a GET request to yourbank.com/transfer,
if yourbank.com isn't implemented particularly securely and just allows
you to go to a URL like this to transfer money, then that could be the result.
So how do you protect against this?
How would you protect against your website
being able to do something like this?
Because your website probably wants some way
of being able to transfer money if you have a bank application,
but you don't want to allow people to make requests like that.
Answer, yeah?
AUDIENCE: Yeah.
It's facetious.
BRIAN YU: Go for it.
AUDIENCE: You get a better bank.
BRIAN YU: Get a better bank.
OK.
Certainly something that would work.
Other thoughts?
Yeah?
AUDIENCE: Change the form request type so it's not literally in your own
[INAUDIBLE].
BRIAN YU: Yeah.
Change the form request type so that it's not literally here.
So this right here is a GET request.
You might imagine that instead, it's a form that's submitted by a POST,
like a POST request, a form that you actually
have to submit, click on a Submit button, in order to submit that form.
And so now, you could imagine that someone could still
create a vulnerability by doing something like this.
They have a form whose action is yourbank.com/transfer submitting
by a method POST.
And now, they have these inputs that are type hidden,
which are just input fields that don't show up inside of a page.
And they can have hidden input fields that
specify who it's to, what the amount is, and then just some button that says,
"Click Here!"
And if they click here, then unwittingly,
the user could be submitting a form to the bank that's
initiating some transfer.
And in fact, if the hacker is being particularly clever,
you don't even need the user to click anything,
because we can use event listeners to get around this.
I could say body onload--
in other words, when the body of the page is done loading,
run this JavaScript.
Document.forms returns an array of all the forms in the web document.
Square bracket 0 says get the first form.
And there's a function in JavaScript called .submit that submits a form.
So you can say, all right, get all the forms, get the first form,
and run submit.
And that's going to result in submitting this form,
making a POST request to yourbank.com/transfer,
which results in some amount being transferred.
So this is a potential vulnerability, as well.
If you're writing this bank application, you
don't want to allow a code like this to be able to get through your security,
because that opens up a whole host of potential security vulnerabilities.
And in general, the way that people tend to deal
with this is by adding what's called a CSRF token, a Cross-Site Request
Forgery token, basically adding some special, changing value
into their own forms and then, anytime someone submits
the form, checking to make sure the value of that token
is, in fact, a valid token.
And that way, someone couldn't fake it because some other form
on some other hacker's website isn't going to have a valid CSRF
token inside of their form page.
And so larger scale web application frameworks, like Django,
offer easy ways to add CSRF tokens to your forms, as well.
But just something to be aware of as you begin
to think about, when you're designing a web application,
how could someone exploit it?
How could someone make requests on behalf of users
that they don't intend to in order to get
some malicious result to come about?
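Here is a hand-rolled sketch of the idea in Flask. The route and template
names are illustrative assumptions, and in practice an extension like
Flask-WTF or a framework like Django would handle this for you.

import secrets

from flask import Flask, abort, render_template, request, session

app = Flask(__name__)
app.config["SECRET_KEY"] = "set-this-from-an-environment-variable"

@app.route("/transfer", methods=["GET", "POST"])
def transfer():
    if request.method == "GET":
        # Hand out a fresh token, remember it in the session, embed it in the form.
        session["csrf_token"] = secrets.token_hex(16)
        return render_template("transfer.html", csrf_token=session["csrf_token"])

    # On POST, the hidden csrf_token field must match the token we handed out.
    if request.form.get("csrf_token") != session.get("csrf_token"):
        abort(403)

    # ...actually perform the transfer here...
    return "Transfer complete"

The transfer.html template would then include a hidden input named csrf_token
whose value is {{ csrf_token }}, and a form posted from some other hacker's
site wouldn't know that value.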
So lots of security things to be thinking about.
Questions about security or any of the security topics
that we've covered or talked about?
Yeah?
AUDIENCE: [INAUDIBLE] the token is generated [INAUDIBLE] event,
or it's a unique token for every user?
BRIAN YU: Yeah.
Imagine that in the case of CS50 Finance,
for instance, that when I click on the Buy page that takes me
to the page where I can buy stocks, my route for buy
is going to basically generate a new token
and insert it into the form that then gets displayed to me.
And then when I submit that form, it gets submitted back
to the same application.
And the application can then check.
Did the token that came back match the token that I inserted into the page?
And if they do, in fact, match, then that's
a way of sort of verifying that the user was actually
submitting the actual form and not some fake form
that they were tricked into submitting.
All right.
In that case, let's switch gears a little bit,
and let's talk about scalability.
Here again, there's going to be even less code.
And the idea is just going to be, all right, what happens when
we begin to scale our web application?
We've got some web server, and we've got some users
that are using that web server, which we're going to represent as that line.
And so what happens when that server starts
to have more users that are all trying to use
the application at the same time?
What do we do?
Well, the first thing to probably do is figure out how many users
our website can actually support.
How many can it handle before it stops being able to support users?
And so this is where benchmarking is quite important.
Benchmarking is just this process by which we can test and sort of load test
our application to see what we can do to see how many users we could potentially
handle on our server.
And so what happens if we find out via benchmarking that,
OK, our server can only hold 100 users?
What if we need to support 101 users or 102 users?
What can we do?
One thing we can do is called vertical scaling, where the idea here
is, all right, we have a server.
And that server only supports 100 users.
All right, well, let's just get a bigger server, right?
Let's get a server that supports 200 users or 300 users.
And that's going to be able to better handle that load.
But there's a limit to this, right?
There's a limit to how much you can just increase the size of a server
and increase its ability to handle load.
And so what could you do to be able to handle more users?
AUDIENCE: More servers.
BRIAN YU: More servers.
Great.
And this is an idea called horizontal scaling, where
the idea is that we have some server.
And let's say, instead of having one server,
let's go ahead and have two servers that are running the exact same web
application.
And now, we have two servers that are able to run the application
and handle twice as many people.
What problems come about now, logistically?
User tries to access our website, and now what?
Yeah?
AUDIENCE: That means you could have a race condition situation
or how the servers communicate to each other [INAUDIBLE]..
BRIAN YU: Yeah.
How do the servers communicate with each other?
Certainly, race conditions become a threat, as well.
And then a fundamental problem is a user comes to the site,
and which server do they go to, right?
We need some way of deciding which server to direct a particular user to.
And so generally, this is solved by adding yet another piece of hardware
into the mix, adding some load balancer in between the user
and the servers whereby a user, when they request the page,
rather than going straight to the server, they go to the load balancer
first.
And from there on, the load balancer can split people up,
say certain people go to this server, certain people go to that server,
and try and decide how it is that people are going to be
divided into the different servers.
And so how could a load balancer decide?
If there are five servers and a user comes along,
how should a load balancer decide which server to send a user to?
There is no one right answer to this.
There are a number of possible options, a number of different
what are called load balancing methods.
But how could you decide where to send a user?
Yeah?
AUDIENCE: The server with the least amount of users currently.
BRIAN YU: Sure.
The server with the fewest users currently, what's often
called the fewest connections load balancing method.
You try and figure out which server has the fewest people on it.
And whichever one has the fewest people on it, send the user there.
Definitely good for trying to make sure that each one has about an equal load,
but potentially computationally expensive.
You're doing a lot of calculation now, so there's a trade-off.
Yeah?
AUDIENCE: You could just do it randomly.
BRIAN YU: You could do it randomly.
You could just generate a random number between 1 and 5
and randomly assign someone to a particular server.
Definitely something you could do.
Other things?
Certainly the random approach is quick.
It doesn't involve having to do any calculation across all
the different servers.
But if you're unlucky, you could end up putting
a lot of people on server number two and not many people on server number eight
or whatnot.
And so what else could we do?
Yeah?
AUDIENCE: Just set up a counter [INAUDIBLE]..
BRIAN YU: Sure.
Some sort of counter.
If you only have two, you just alternate odd, even, odd, even.
Go to this server.
Go to that one.
If you've got eight, you just rotate amongst the eight--
1, 2, 3, 4, 5, 6, 7, 8 and go back to 1.
And so these are probably three of the most common load balancing methods--
random choice, whereby you just pick a random server, direct the user there;
round robin, where we do exactly that, just basically go one up until the end
and then go back to server number one; and then fewest connections, whereby
you try and actually calculate which server currently
has the fewest number of people on it and then
try and direct the user to that one with the fewest connections.
There are other methods in addition to this,
but these are perhaps three of the most intuitive
where you can start to see their trade-offs.
Depending upon the type of user experience
you want, depending on how computationally
expensive certain operations are, you might choose different load balancing
methods.
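To make the three methods concrete, here is a toy Python sketch choosing among
a list of hypothetical servers:

import itertools
import random

servers = ["server1", "server2", "server3", "server4", "server5"]
connections = {server: 0 for server in servers}   # current open connections
rotation = itertools.cycle(servers)               # 1, 2, ..., 5, 1, 2, ...

def pick_random():
    # Random choice: fast, but can be unlucky and pile users onto one server.
    return random.choice(servers)

def pick_round_robin():
    # Round robin: rotate through the servers in order.
    return next(rotation)

def pick_fewest_connections():
    # Fewest connections: more balanced, but requires tracking every server's load.
    return min(servers, key=lambda server: connections[server])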
Yeah?
AUDIENCE: [INAUDIBLE] benchmarking, and what are some common ways to do that?
BRIAN YU: Yeah, there are software tools that can do this.
There are a number of different ones-- the names are escaping me
at the moment--
where you can basically test on a particular URL
and get a sense for how well it's able to handle that load.
And if you have particular use cases, I can chat with you about that, as well.
So all right, let's imagine we have two servers now.
And every time a user makes an HTTP request
to a server, every time they request a page,
we direct them to one server or the other server using
one of these methods, either by choosing randomly or by round robin
or by figuring out which one currently has the fewest users connected to it
or is handling the fewest connections.
What can go wrong?
Whenever we're dealing with issues of scale, we just try and solve a problem
and figure out what new problems have arisen.
Yeah?
AUDIENCE: You only have five servers, and now you need six.
BRIAN YU: Yeah.
Certainly, if you only have five servers and suddenly you need six,
that could potentially become a problem, as well.
But let's even assume that we have enough servers.
We have five servers, and every time someone loads a page,
they get sent to a different server based on one of these methods.
What can still go wrong with the user experience?
And in particular, I'll give you a hint.
Let's think about sessions.
What can go wrong?
Remember, sessions were ways of storing information-- in our case,
inside of the server--
about the user's current interaction with the server.
It stored which user was logged in.
It stored the current state of the tic-tac-toe game.
It stored other information.
Yeah?
AUDIENCE: You have to pick one [INAUDIBLE]..
BRIAN YU: Yeah, exactly.
If I initially load a page and I go to server one and some information
about me is stored in the session, like whether I'm logged in
or the current state of my game or something else,
and then I load another page and it takes
me to server four this time, well, now, that server
doesn't have access to the same session information
that server one had if the information about the session
was stored in the server.
And now, that information is lost.
So I could load a page, and suddenly, now, I'm
logged out of the page for no apparent reason
even though I've logged in just a moment ago.
And then I could go to another page, and maybe by chance,
I'm back to server one, and now I'm logged in again.
So strange things can begin to happen.
And so to solve that, what could we do?
How can we make sure that sessions are preserved
when the user is requesting pages?
Again, no one correct answer.
Multiple possibilities here.
How do we solve this problem?
Yeah?
AUDIENCE: Would there be any way to store the session on the load balancer?
BRIAN YU: Store the session on the load balancer.
That's a good idea.
And that actually gets at the first idea here,
which is this idea of sticky sessions.
And this is slightly different.
Rather than store all the session information in the load balancer,
it just needs to store for this particular user which
server has their session information.
So if I went to server number one initially,
the load balancer will remember me based on my IP address, cookie, or whatever
and say, all right, next time I try and request a page,
let me direct them back to server number one, for instance.
That way, whenever I come back, I'm always going to go to the same place.
There are other ways to solve this problem, as well.
You could store session information in the database
that all the servers have access to.
You could store session information on the client side, whereby
it doesn't matter what server you go to, because all the session information is
inside the client.
So there are a number of ways to solve this problem,
but these generally fall under the heading of session-aware load
balancing.
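As one sketch of the shared-storage approach, the Flask-Session extension can
keep session data in a Redis server that every application server talks to.
This assumes you've installed Flask-Session and redis and have a Redis server
running; the details are illustrative, not the only way to do it.

import redis
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
app.config["SESSION_TYPE"] = "redis"
app.config["SESSION_REDIS"] = redis.Redis(host="localhost", port=6379)
Session(app)

@app.route("/")
def index():
    # Because the session lives in Redis, any server behind the load balancer
    # sees the same data for this user.
    session["visits"] = session.get("visits", 0) + 1
    return f"Visits this session: {session['visits']}"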
Someone mentioned the problem of, OK, well, I have five servers,
but what happens when I need six?
To solve this in the world of cloud computing--
where nowadays most people don't maintain their own hardware
for their web applications but instead rent
hardware on someone else's servers, for instance Amazon's servers on AWS--
you can take advantage of auto scaling, which automatically will grow or shrink
the number of servers based upon load, whereby you could initially
have two servers.
But if more users come about and you need more,
we can add a third server into the mix.
More people come along, and we need even more.
We add a fourth server.
And auto scaling goes in both directions.
So if suddenly we find, all right, we had a lot of load
at this particular peak time of the day but now there are
fewer users on the site, the auto scaler can sort of say,
all right, we don't need four servers anymore.
Let's go back to three and then later on, if it needs to,
go back up to four again.
And it can automatically, dynamically reconfigure the number of servers
in order to figure out what the optimal number is
given the number of users that are currently using the application.
What happens, though, when one of the servers fails for some reason?
The server just dies, for instance.
The load balancer doesn't necessarily know about that.
And so if it's still directing people across four different servers,
it could direct users to that server that is no longer operational.
Any thoughts on how we might solve that problem?
Yeah?
AUDIENCE: Have the load balancer ping the server at determined intervals
to see if it's still there.
BRIAN YU: Yeah, some sort of ping to make sure
that the server is still there.
And often, one of the easiest ways that this is done
is via what's called a heartbeat, whereby each of the servers
gives off a heartbeat every fixed number of seconds or minutes--
whereby if, say, every 10 seconds the server sends out its heartbeat,
that gets sent to the load balancer.
If ever the load balancer doesn't hear the heartbeat from the server,
it can know that that server is no longer operational, and it can say,
all right, you know what?
Let's stop sending users there and only send users to the other three servers.
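A minimal sketch of the two halves of that: a health endpoint each server
exposes, and a poller that checks it on an interval. The /health route name
and the use of the requests package are assumptions.

import requests
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    # Each application server answers this; silence means the server is down.
    return "OK", 200

def is_alive(server_url):
    # The load balancer (or a monitoring process) calls this every few seconds.
    try:
        return requests.get(f"{server_url}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False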
Questions about that or any of the ideas of how we scale our servers
to be able to handle load?
We decided, all right, if too many people are on one server,
we need to split up into two different servers.
But that introduced a bunch of problems that we
had to solve-- problems about load balancing, problems about what to do
about sessions, so on and so forth.
Yeah?
AUDIENCE: Do you hear a lot about distributed servers?
I'm wondering how they [INAUDIBLE].
BRIAN YU: Sure.
How do servers share data?
Well, they use databases.
And of course, as we start to figure out what to do with more and more servers,
we also need to figure out what to do about databases,
figure out how to scale databases and make sure that as we scale them,
the databases are able to handle that load, as well.
And so in the past, we've had, all right, a load balancer.
We've got servers.
And in our model right now, we have a database that both of these servers
are connected to.
But of course, the problem is soon going to arise of, all right,
now we've got a lot of servers that are all
trying to connect to the same database.
And now, we've got yet another single point
where things could potentially go wrong or where
we could potentially be overloaded.
So how do we solve this type of problem?
One of the most common ways is database partitioning.
One form of database partitioning you've, in fact, already seen,
and it's just an extension of what we've been doing with SQL,
whereby we have this flights table.
And we could say, all right, rather than store the origin and the origin code,
let's go ahead and separate what's in one table
into a couple different tables.
Let's separate the flights table into a locations table
where the locations table has a number for each possible location.
And then it also, in the flights table, now,
only needs to store a single number for the origin ID and the destination ID.
We could also separate tables in different ways.
If we have some general way we could partition
a table into different parts that are generally
going to be queried separately, then we can
do another partition where I could say, all right,
my flights table is getting big.
Let's split it up.
And all right, at my airline, the international departures and arrivals
are handled separately from the domestic departures and arrivals.
So no need for those to be in the same table.
Let me just go ahead and take flights and separate it
into a domestic flights table and an international flights table,
for instance.
That's one way to just partition things into two different tables that
could potentially be stored in different places, which ultimately
allows for handling of scale.
But ultimately, all of these are problems
that are still going to lead to the fundamental problem of if I only
have one database and 10 or dozens of servers that are all
trying to communicate with that same database,
we're going to run into problems.
The database can only handle some fixed number of connections.
And so one solution to this is database replication.
So all right, how does database replication work?
Well, probably the simplest form of database replication
is what's called single-primary replication, whereby
I have one database that's called the primary database and maybe
three databases in total, but only one that I'm
going to consider the primary one.
And you can read data from any of the databases.
You can get data out of any of the three databases,
whereby if there are three servers and each one wants to read data,
they can just share among the three databases reading data
to make sure that we're not overloading any one
database with too many connections.
But you can only write data to a single database.
And by only writing data to a single database,
that means that anytime this database is updated,
then this database, our primary database,
just needs to update the other two databases.
Say, all right, there's been a change made to the primary database.
And it's the primary database's responsibility
to then communicate to the other two databases what those changes are.
And so that's single-primary replication.
Yeah?
AUDIENCE: How is that more efficient than just communicating with all three
of them?
Because I think you're sending information
from the first database to the second and third.
[INAUDIBLE] information sent that's just rewriting to all three of them.
BRIAN YU: That's true, though.
Databases could potentially batch information
together into transactions and things and groups
so as to be a little bit more efficient.
So certainly ways around that problem.
But yeah, a good point.
Of course, this helps the read problem.
It makes it easier to be able to read data out of databases.
But it leaves open a potential vulnerability
or a potential scalability problem with regard to writing data,
because there is still only a single database on which I can actually
write data to if that one database is responsible for updating
all of the other databases.
And so a more complex version of this is what's
known as multi-primary replication, where
the idea is that each database can be read from and written to.
But now, updates get a lot more complicated.
All of the databases need to have some notion and some way
of being able to update each other.
And there, conflicts begin to arise.
You can have update conflicts where two different databases
have updated the same row.
All right, how do you resolve that problem?
You can have uniqueness conflicts, whereby
if you add a row to each of two databases at the same time, maybe
they get the same ID.
Maybe this one only has 27 rows, so this database
adds a new row with ID number 28, and this database does the same thing.
And now, when they try to update each other,
we have two rows with the same ID.
And now, we need some way of resolving those,
because the IDs are supposed to be unique.
And so that can create problems, as well.
And then there are other types of conflicts, too-- delete conflicts,
whereby one database tries to delete a row at the same time
that another database tries to update that row.
So which do you do?
Do you update the row?
Do you delete the row?
And so these are all conflicts that when you're setting up
a multi-primary replication system, you need
to figure out how you're going to ultimately resolve those conflicts.
You gain the ability to write to all the databases,
but new problems arise as you begin to do that.
Yeah?
AUDIENCE: So is the information in each database the same?
Are they [INAUDIBLE] with each other?
BRIAN YU: Yeah.
In this model, the databases in general are
going to be the same, though they're not always perfectly going
to be in sync, which is yet another problem, whereby there might
be some time after I write to this database
before that data propagates through all of the databases, for instance.
AUDIENCE: So why not keep it in one?
BRIAN YU: You could keep all the information in one database.
But a single database server can only handle so many connections.
And so you might imagine that having three different servers, three
different computers that are all able to handle incoming requests,
just increases the capacity of your application
to be able to handle that kind of load.
All right.
Questions about databases, database replication, any of the scale problems
that come about there?
All right.
Final thing I'll mention on the topic of scaling that can be helpful
is just the idea of caching.
Caching is something we've talked about a lot before.
But a general idea could be that in order to try and solve this problem
of constantly having to request information from the database,
if we could store data in some other place-- in particular,
inside of a cache--
then we don't need to access the database as often, because we've
got the information already stored.
And so one way to do this is via client-side caching.
And so inside of the HTTP headers, when an HTTP response
is sending back information to a user, you
can add an HTTP header called Cache-Control that basically
says, for up to this number of seconds, you can just store information
about this page and not request it again if you try
and request the page for a second time.
And this helps to make sure that if the browser tries to request
the page again, it doesn't need to.
It can just use the version that's stored inside of the cache.
And a more recent development is this idea of an ETag, or an entity tag.
And the idea here is that if we have some web resource, some document,
some piece of data from a database that our web application is sending out
to users, when I send users that resource, that document,
I'll send that document, and I'll also send an entity tag that
corresponds to that particular version of the document
and send them both to the user.
And imagine this is a big document.
It's a lot of data, so it's expensive to query and to send to the user.
The next time the user tries to request this page, what the user can do
is the user can send the entity tag, the ETag, along with their request.
I would like to request this resource, and, oh, by the way,
I already have this version of the entity stored
locally inside of my computer's cache.
And if the web application then looks at that ETag and says,
all right, you know what?
That's the latest version of the document.
The web application can just respond--
in particular, with an HTTP status code of 304, meaning not modified,
to just say, you know what?
This entity tag is the most recent entity tag.
Don't bother trying to request the document again.
Just use the version you saved locally in your cache.
And if, on the off chance, the document's been updated
and therefore has a new ETag value, then the web application
goes through the process of sending that entire document back to the user.
But by taking advantage of technologies like this,
this can allow us to make sure that we're not
making too many requests to the database,
that we don't make redundant requests if a particular resource hasn't changed.
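Here is a minimal Flask sketch of both ideas. The document contents and the
way the tag is computed are illustrative assumptions; Werkzeug's
make_conditional handles the If-None-Match comparison and the 304 for you.

import hashlib

from flask import Flask, make_response, request

app = Flask(__name__)

def load_document():
    # Stand-in for an expensive database query or a large file.
    return "...a large, expensive-to-generate document..."

@app.route("/document")
def document():
    body = load_document()
    response = make_response(body)

    # Tag this version of the document and let the client cache it briefly.
    response.set_etag(hashlib.md5(body.encode()).hexdigest())
    response.headers["Cache-Control"] = "max-age=60"

    # If the client sent a matching If-None-Match header, this becomes a
    # 304 Not Modified response with no body.
    return response.make_conditional(request)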
So caching can be done on the client side.
Caching can also be done on the server side, which
changes our diagram slightly so as to look a little bit more
like this, whereby now, we've got some more complications here.
We've got some load balancer that's communicating
with a bunch of different servers.
All of those servers have to interact with the database,
and maybe you've got multiple databases going on here that are each able to do
reads and writes, either in a single-primary model
or a multi-primary model.
And those servers also have access to some cache that makes it easier
to access data quickly, in a sense, saying,
if there's some expensive database query,
don't bother performing the database query again and again and again.
Take the results of that database query once.
Save it inside of the cache.
And from then on, the server can just look to the cache
and get information out of there.
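A toy sketch of that server-side pattern-- the db object is an assumption, and
in production the dictionary here would typically be an external cache like
memcached or Redis shared by all the servers:

import time

CACHE_SECONDS = 60
cache = {}   # maps a query string to (expiration time, result)

def cached_query(db, query):
    now = time.time()
    entry = cache.get(query)
    if entry is not None and entry[0] > now:
        return entry[1]                        # cache hit: skip the database

    result = db.execute(query).fetchall()      # cache miss: run the query once
    cache[query] = (now + CACHE_SECONDS, result)
    return result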
So there are a lot of security and scalability concerns
that can potentially come about as you begin web application development.
And so the goal of today was really just to give you
a sense for the types of concerns to be aware of,
the types of things to be thinking about,
and the types of issues that will come about
if you decide to take a web application and begin to have more and more people
actually start to use it.
So questions about that or about any of the other topics
we've covered this week?
All right.
So with the remainder of this morning, between now and about 12:30 or so,
we'll leave it open to more project time, an opportunity
to work on any of the projects you've worked on
so far over the course of this week and also an opportunity to work
on something new if you would like to.
I know many of you yesterday decided to start on new projects, projects
of your own choosing built in React or Flask
or using JavaScript or any of the other technologies
we've talked about this week.
Before we conclude, though, I do have to say a couple of thank yous,
first to David for helping to advise the class, to the teaching fellows--
Josh and Christian and Athena and Julia--
for being excellent in helping to answer questions
and helping to make sure that the course can run smoothly, to Andrew up
in the back, who's been taking care of the production side of everything
over the course of this week, making sure that all the lectures are recorded
and making sure they're posted online, such that afterwards,
whether you're here or not,
you're able to come online to see them.
So thank you to everyone for helping to make the course possible.
Thank you to all of you for coming to the course.
Hope you enjoyed it.
Hope you got things out of it.
We've really only scratched the surface, though,
of a lot of the topics that we've covered
over the course of the past week.
There's a lot more to CSS and HTML and JavaScript and Flask and Python
and React than we were really able to touch on over the course of the week.
It was really meant to be more of an opportunity
to give you some exposure to some of the fundamentals of these ideas,
some of the tools and the concepts that you can ultimately
use as you begin to design web applications of your own.
So I do hope that you've learned something from the week but,
in particular, that you found things that are interesting to you, such
that you continue to take those ideas and explore them.
Go beyond just what we've been able to cover over the course of this week
and explore what else these technologies and these tools and these ideas
ultimately have to offer.
So thank you so much.
We'll stick around until 12:30 to help with project time.
[APPLAUSE]
But this was CS50 Beyond.