CS50 2017--閱讀10--SQL (CS50 2017 - Lecture 10 - SQL)

字幕列表影片播放

DAVID MALAN: All right, this is CS50 and this is lecture 10.
Welcome to those parents among you.
Today, for those tuning in online, is freshmen family weekend.
We have quite a few guests with us and we
thought we would begin today in fact by perhaps putting everyone
at ease since I know this is perhaps the first time in a while
that you've been back to college.
Maybe this is your first class for the fall semester.
And so we thought we'd share with you by proxy
what you might be feeling by way of a short film
that CS50's video team put together with a little friend of ours
just a few semesters ago.
[VIDEO PLAYBACK]
[MUSIC PLAYING]
- Hi.
- Hello
[END PLAYBACK]
DAVID MALAN: All right, well, welcome to Harvard and CS50.
So if you'd like to actually return in just a few weeks time,
your children or siblings or friends or here in the room
will be exhibiting their CS50 final projects in just a few weeks' time.
This is the so-called CS50 fair, which is
meant to be a campus wide exhibition of students' final projects.
And it's worth noting that, as the students in the class know,
68% of the students in CS50 this year had no prior CS experience.
And so what you and they will see on display at the CS fair
in a few weeks' time in December is all of their accomplishments.
And in fact, the goal of this event really
is to delight in what everyone has achieved and glancing
at each other's screens and phones to see with great surprise what
each of our students has pulled off.
So you are invited back if you would like for that as well.
Now CS50 of course is characterized as a class
for those less comfortable and those more comfortable
and those somewhere in between.
And I daresay some of our guests today are perhaps
at one end of that spectrum or the other.
And so rest assured that what we'll dive into today,
the world of databases and SQL with applications to data science
and beyond, is new to most everyone in the room as well.
So you are in good company, and I daresay 68% or more
of percent of the parents in the room are perhaps feeling exactly as you are.
So if we can now turn our attention back to some things lower level,
those students among you will recall that we focused in recent days
on Python, this other higher level language by which you
can solve problems a bit more readily than you can using, say, something
lower level like C. Now you might have noticed I'm ready if you dived
into problem set 6 as I'm sure you have, the debug 50 only
supports C and not Python.
So you have relatively fewer tools at your disposal for debugging.
But do realize that within the CS50 library for Python
is a function like this called eprint, which
you might recall if you've used eprintf with C.
This is a Python version of the same.
And so if you need help debugging your programs,
it suffices to write a program in Python.
And anywhere you want to see diagnostic output,
simply call this eprint function.
And it will print the file name and the line number for which you
have written that line of code.
But let's turn our attention now to things more web-oriented
and perhaps something with which you're familiar Facebook
and the login screen thereof.
So there's so many websites out there today where when you visit them,
you're prompted to log in-- something like this.
But you're not prompted to log in every time you click a link on facebook.com.
You're not prompted to log in every minute or every two minutes
because indeed, the websites, if they required as much,
they would degrade into something downright unusable.
So instead, these websites remember that you're logged in somehow.
So you log in once per day, per week, per year.
It varies by website.
And then somehow the website remembers that you are Maria or you are Brian
and you have already logged into this website.
Now how might that work?
How does Facebook or any site like it, Gmail and beyond,
remember that you are logged in once you are logged in even though you only
type your username and password once?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: OK, so they store information on the server.
And indeed, that's exactly right.
On the server, as we're now starting to see with Python and today with SQL,
you have the ability to store information on a server.
We saw very simply about a week ago the ability to store data
in, like, CSV files, just text files on the server.
That's not going to be very high performing,
so we're going to do better than that today.
But you can also store data in variables inside a flask
while it's running as we saw as well.
But we'll need some more sophistication than that.
But yes, we can certainly remember that Maria or Brian have logged in.
But how do we remember from which computer or browser
Maria or Brian have logged in?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Via the IP address-- so could work.
You could just recall that from those virtual envelopes
which I no longer have today.
We've been writing the to address in the from address, the IP address, on those
envelopes.
So you could certainly remember, oh, I saw Brian from 1.2.3.4 and Maria
from 4.5.6.7 or whenever the numbers actually are.
But that's a problem in, like, households and frankly even
on campuses these days because of NAT, Network Address Translation,
that technique whereby a bunch of us probably in this room right now
on our laptops or phones are sharing the same IP address, at least so
far as the outside world is concerned.
So we could confuse things.
And that would be bad because that means if Brian sort of logs into Facebook
and Maria logs into Facebook, suddenly Brian
might actually see Maria's profile page or the like.
So what else might they be doing?
What's that?
AUDIENCE: Cookie?
DAVID MALAN: Cookie, so-- eh, correct answer, but what in the world
does that mean?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Don't know--
OK.
So, perfect-- so you're in the right place.
And most everyone in this room probably, if you use the web frequently,
have generally heard about cookies and probably that they're dangerous or bad
or some way or you've probably googled how
to clear your cookies if you've indeed heard they're
a bad thing or something like that.
And indeed, cookies are the answer to this question.
But how does that actually work?
Well, if you've been to like an amusement park or a club
or someplace where you typically have a ticketed access gate, sometimes if you
want to come in go to and from the parking lot or out to the street
or wherever, they might stamp your hand using like a little red stamp
or invisible stamp or the like.
And then you just show it to the bouncer or the person at the checker
that you had been there before because you have this hand
stamp via the remembering you.
So web servers and browsers are kind of the same.
When you visit Facebook or Gmail or wherever for the first time
and moreover log in, they essentially put like a digital hand
stamp on your browser called a cookie.
And that handstand happens to be implemented
not obviously as ink but like as a big random number that
is given just to you.
And it's stored either in your computer's RAM
or on your computer's hard drive longer term.
And your browser, because it speaks HTTP,
that protocol we've been talking about, it
has been designed to show its hand, so to speak.
Every time you revisit Facebook, every time you click an email,
every time you send a message, anytime you browse around on that same website,
the browser is, without you realizing it, reminding
the server of that hand stamp.
So how might it be doing it?
Where can a browser sort of tuck away information like this such
that it's always in that envelope?
Yeah.
AUDIENCE: It can put it like into a list like you did before with the invites
or something like that.
DAVID MALAN: The invites, the invites--
oh, so that would be lower lev-- so, OK, I remember now.
So that would be not quite there because we
want to make sure that this has nothing to do fundamentally
with our application.
This has to do with HTTP.
So this was invented years ago before there was like a list of registrants
for [INAUDIBLE].
So we can't put it there.
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: A concept of a cache-- not here, no, no.
Yeah.
AUDIENCE: [INAUDIBLE] header.
DAVID MALAN: The header-- so the header is
where we have access to metadata, sort of lower level technical information
that most of us never see unless you poke around Chrome
or actually sniff your own internet traffic and see.
But we do have the ability, within these HTB headers, to send, so to speak,
this virtual hand stamp.
So let's recall what these headers look like.
So the headers that get sent in this virtual envelope
when you request Facebook's home page, for instance,
might look like this-- get slash, which means get me the home page,
using version 1.1.
The host I'm visiting, just as a reminder,
is facebook.com or www.facebook.com.
And then below that is maybe more headers.
We've generally waived our hands at it because they're not as interesting
but there might be some more information there too.
Well, when Facebook replies, especially after you've
logged in, it turns out one of the headers we haven't looked
at in addition to these which we have seen, the response,
the 200 OK from the server, is that Facebook and servers
like it are probably sending this header as well, literally
Set-Cookie colon and then a name and then a value or key and a value.
The name here, for instance, I'm going to start calling session,
as is the term of art.
And then the value to the right can actually be anything.
So value might be 123, 456, or any unique value
that's sort of randomly generated by the server to remember who you are.
And so session equals value is like this is your hand and this is your value.
That is now the cookie that's been planted on the computer.
And what your browser thereafter is supposed to do is not send set cookie.
That only comes from server to browser.
But whenever you on Facebook request another page,
click a friend's profile page or the like,
your virtual envelope will contain not these familiar headers
but also cookie session equals value.
And so that is the presentation of that same key value pair.
So if Facebook now to your point earlier remembers
that user 123456 has logged in and that user happens to be named Maria
and we can store that information on the server,
every time Facebook sees this same hand stamp
or sees this same session equals value, it'll be like, oh, that must be Maria.
And so I'll show Maria her messages and her profile page
and all that comes with it in a website.
So let's take a step back because the mechanism is as relatively
simple as this, even though it's certainly arcane
and you have to kind of know what you're doing to leverage this.
But is there a threat here?
Like, why might cookies be dangerous and why do people
tell us occasionally to delete them?
AUDIENCE: If some one gains access to [INAUDIBLE]..
DAVID MALAN: Yeah, if someone gains access to your cookies,
you could masquerade as someone else like Maria.
And so in the sort of equivalent at a club or a part
would be if you kind of do this with someone else's hand and try to say,
hey, we're both here even though one is now backwards, that would be kind
of a duplication of the session value.
So you probably want to make sure these are encrypted.
And what's the easiest way to ensure that the traffic between you
and a server is encrypted?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Not hashing necessarily.
AUDIENCE: SSL.
DAVID MALAN: SSL or the HTTPS prefixed URLs.
And thankfully-- and the story has changed over the years.
Thankfully, web servers are getting better
at supporting HTTPS because then all of this stuff,
even though the browser can see it and the server can see it,
it's at least encrypted or scrambled to anyone
who's eavesdropping and trying to see what's going on between the two.
But there have been bugs along these lines.
In fact, it was just a few years ago in CS50
where just days or hours prior, a security researcher
had presented at a conference some very easy to use
software via which you could hijack other people's sessions,
or session highjacking being the term of art here.
And what he had implemented was this plug-in for Firefox, another browser.
And that would allow you, when you ran this guy's plug-in,
to just sniff all of the wireless traffic in the same room.
And it would show you, so user friendly, a list of all of the session cookies
that were kind of going back and forth through the air, so to speak,
that were not encrypted.
And this was, I don't know, four or five years ago.
When HTTPS existed, it's just not, a lot of websites
were actually really using it.
So a lot of really popular websites weren't using it.
And so the demonstration in class, which thankfully we
can't do anymore because most websites have been more secured,
it was as easy as clicking a button for me
to log into that here Rose's Facebook account because she was using Facebook
during lecture and unbeknownst to her, so now was I.
And it was simply a matter of capturing this kind of value.
Moreover, if the session cookies are stored, as they're called,
on your computer's hard drive, there's just more remnants
of where you have been on the web.
Not just that little drop down menu, but elsewhere on your file system
might be reminders for better or for worse of where you have actually been.
So for these and other reasons do you sometimes
want to clear these or at least be mindful of them.
And frankly, from a development perspective,
one of the reasons why we keep encouraging you students to use
incognito mode occasionally for development purposes
is to essentially clear all of your cookies.
Incognito mode gets rid of all of your cookies
temporarily so that you can interact with the server
in a more predictable way.
And for software developers, that's just useful
not to be confused by older stale cookies.
Any questions then about cookies?
All right, so who cares?
Like, that's all fine and good.
Now we can be a little more paranoid when using the web.
But what can we actually do with this now that's actionable?
Well, if I go ahead into CS50 IDE, let me go ahead and open up an example
that you can also look at later on via the course's web site.
And this is a little web-based store that I whipped up.
It's super common these days to, like, buy things on the web.
And of course when you're buying things on the web, hopefully
the website is remembering what you are interested in.
If you go to amazon.com and click Add to Cart, where does that go?
Well, let's think about it.
Amazon is somehow remembering, after I logged in, who I am.
And we can assume that they're doing that via cookies already.
So they're putting some handstamp on my hand
and my browser is reminding Amazon on every page click
that I am me who logged in previously.
So what do you think a server is doing any time you add something to your cart
or quantity two of your car or something else in your cart?
Where Where's that information being stored as well?
Say again?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Also-- so in the header, it suffices just
to put that cookie because you proposed actually
earlier that we can store stuff in a server, whether it's a CSV file
or in RAM or today in a database.
If you can store anything you want on the server,
it suffices just to remember via this cookie
that the person has logged in before and uniquely identify him or her
thereafter by that cookie's value--
123456 or whatever it is.
Yeah.
AUDIENCE: But there's a list associated with your cookie?
DAVID MALAN: There's a list associated with my cookie.
That is actually the perfect segue to, yes, exactly.
And let's take a look at what that actually means.
So this is a web based application.
To those of you in the audience who weren't here last week
or the week prior or the whole semester, this
is a programming language called Python.
It's the language we've now started using over the past week.
And this is using what's called a framework, freely available software
that other people wrote, called Flask that makes it easier to implement web
based applications or websites.
And what we're doing in CS50, now recall, is not making static websites.
We're not talking about web design, per se,
which tends to be the art of creating static content, very
beautiful informational websites, but web programming, whereby we're actually
using logic to generate web applications or web sites dynamically.
And so Python lets us do this.
So this, recall from last week, is the magical line
that says give me a Flash based web application.
And with this comes certain features, like the ability to say,
hey Flask, define a route for slash.
That means anytime the user visits slash, the default page
on a web site, return what file, just to be clear?
Index.html.
And then we looked last week at where those index.html are
and that they are templates and so forth.
But this simply means render index.html.
So recall that I can run this web app as follows.
If I go into the store directory and type flaskrun,
I have now turned on a web server, so to speak, in the cloud,
and it's literally telling me what URL to visit.
So I'm going to go ahead and click on this and click Open.
And we see a super simple store.
Let me zoom in.
It's really ugly-- haven't used any CSS.
This is indeed not web design.
This is all functional today.
And you'll see, even though the projector is a little cloudy,
that there's three text fields there.
There's a number for Foo, a number for Bar,
and a number for Bas, which if I'm familiar
are just go-to words in computer science when
you need to talk about something that doesn't need a specific name.
It's like x, y, and z in math.
So right now, here's a store a la Amazon.com that sells three things--
Foo, Bar, and Baz.
And I can buy any number of these things.
But what we haven't had yet is the ability
really to allow a user to interact with a web form
and then store that information and change that information.
We kind of scratched the surface last time with the CSV file for [INAUDIBLE]..
But here now we have the ability to actually store something in a database.
So let me go back to the IDE for a moment and focus on exactly that.
So first of all, here is how this program works.
I'm going to go ahead and buy myself one Foo, two Bars and three Bazes,
whatever those are.
I'm going to go ahead and click Purchase.
And now I see my shopping cart.
And just to be clear, if I hover over the URL, I am now at slash cart
preceded by the long URL, which is where I began.
So notice what's neat here.
I can reload and it's still the same.
I can reload, still the same.
I can even close the tab, open a new one, go back to it, and it's still me.
Now let's see what's actually going on underneath the hood in terms of code.
Well, to C/cart is actually super simple.
Just render cart.html.
But cart.html, which I wrote before this morning,
didn't know in advance presumably that I was
going to buy one Foo, two Bars, and three Bazes.
So it's probably not hardcoded.
So there's this other feature of templates
in a web application like this one whereby
in addition to telling the computer show me cart.html,
recall that you can pass in values.
And I think we did this once last time.
So cart is just the name I've given to this concept, like a shopping cart.
And I'm going to pass it something called a session.
And a session is the term of art that describes
whatever it is the server is remembering on that user's
behalf per his or her cookie.
So the user's browser has the hand stamp, the cookie--
123456.
The server actually stores the session, which
is just going to be a box filled with values, something
that has been put into that box before.
And so what I have now in my cart.html is this.
Let me go into templates, open up cart.html.
And this is going to look especially cryptic.
But for those who've seen it before, recall
that this just means descend from or inherit from layout.html.
That's where, where, like, of the standard HTML is-- head and body
and all that.
Then this says, hey, server.
Here comes the body of my website.
And there's relatively little in there.
Here's my big bold title--
cart.
And that's why a moment ago, we saw exactly that, big bold cart.
And then over here, notice that I'm using
Jinja, that templating language-- very lightweight, very Python-like,
and you'll get more used to it over time.
Notice that we have here a for loop saying for each item in cart,
go ahead and do this colon that, which looks super cryptic.
But does anyone recall perhaps from [INAUDIBLE] last time,
like, what does this syntax mean with the two curly braces on left and right?
AUDIENCE: Variables.
DAVID MALAN: Variables-- just plug in these values here.
And now as an aside, it turns out that these curly braces are actually
escaping our data, so to speak, in a way so
that if the user types in kind of dangerous characters
at his or her keyboard, this prevents that from becoming a problem,
but more on that some other time.
And so this is just going to say item colon cart bracket item.
And this might be new syntax to some.
Where have we seen square brackets before?
What does that mean in C?
AUDIENCE: Arrays?
DAVID MALAN: Arrays.
So in Python, we still have arrays, but they're called lists.
But we have another feature too, and you wouldn't know this
just by glancing at the syntax.
But when you see square brackets next to a word
like cart inside of which is not a number,
but it's going to turn out a word.
This word is literally going to be Foo or Bar or Baz.
This is a way of using a hash table or a dictionary or a dict in Python,
as it's called, whereby if the cart is what's been passed into this template
and inside of the cart are a bunch of items like Foo and Bar and Baz,
you can get the number of Foos by doing cart brackets Foo, which would look,
if I hardcoded it, like this.
You can get the number of Bars by doing this.
And you can get the number of Bazes by doing that.
And yet if I just plug in item, which is a variable,
I'm going to get out, within this loop, the Foos,
then the Bars, then the Bazes, and so forth.
So what is the end result?
Let's go back to the browser.
When I hit reload here, the reason we're seeing Foo colon 1,
Bar colon 2, Baz colon three, is because of that template.
And if I reload again and again and again,
I'm seeing the same thing because my browser is probably doing something.
So let me do that.
Let we go to inspect.
Let me go to Network.
And just for good measure, and you should get into this habit too,
click preserve log.
Because this way, if you change pages, the log doesn't go away.
You can still poke around what you're seeing.
And again, this is my Network tab.
So I want to look at what Chrome is doing to talk to this web app.
I'm going to go ahead now and I'm going to hold Shift and then click reload.
Or if you click and hold--
you've probably never noticed this.
But if you click and hold on your own reload button in Chrome,
you can do a normal reload, which is what all of us do all the time,
or a hard reload or an empty cache and hard reload, which is just
a way of saying get rid of all of my cookies,
all remnants, and just forcibly reload everything, which
is great for software development so we know what state we're always in.
But you can also do that by holding Shift--
whoops-- and clicking reload.
So let's do that.
Reload-- and here we go.
That made one request because there's no images or CSS or fancy stuff
to send more requests.
This is going to be a little overwhelming at first glance.
The IED itself uses cookies for other purposes.
So you are seeing bunches of cookies that
have nothing to do with this application-- has
to do with the programming environment.
But what we can focus on is this here.
Notice that any time I get a response from the server,
it's indeed sending me set cookie and then some value.
Now this is a longer value than 123456, but it's
the same thing-- a big random number.
So what's happening is that is the server's way
of telling my browser you are user 123456 or whatever
that big long number was.
And my browser is now supposed to present that again and again and again.
So lastly, what is actually going on when we continue shopping and let's
say buy 10 of these and 10 of these and 10 of these and click Purchase?
Notice that if I look at the form, this is going to slash update.
And it's via Post.
And so if I now go back to the code to see what happens--
let me go back into application.py, focus on update.
This now is how I'm updating the contents of my server side cart.
I am saying for each item in the form that
was submitted, the requests form, go ahead and update session bracket
item, which is going to be Foo or Bar or Baz because that is what
was in the form, and store there this--
request.form.get.
So get me the value of that item, whether it's 10 and 10 and 10 or 1
and 2 and 3, whatever the numbers are that the human typed in.
And why do you think we put this word int?
Why might that be necessary?
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, it's simple as that.
It's kind of a casting in C. What the user typed in was text.
It might look like numbers, like 1, 0.
Certainly looks like 10.
But it's not a number.
That's a string that looks like a number.
So using int converts it to an actual number so we can do math on it.
So long story short, what is it that made this possible?
We've not used this before, but you will have seen this perhaps in P set six
or rather P set seven's distribution code, which is now on the web site.
There's a few lines of code that just enable sessions.
With those few lines, are we able to tell Flask, hey Flask,
start sending cookies and expecting cookies
back so that the programmer can use what is called a session, which
I've additionally imported up here?
So I've just added some additional functionality.
And if you've ever wondered where all this data is going,
let me go ahead and stop the server, clear my terminal and type LS.
A moment ago, the only two things in this folder
were application.pi and templates.
Now there's apparently a directory called Flask session.
So if you've ever wondered where Amazon stores the contents of your shopping
cart or where Facebook stores Maria or Brian once they've logged in,
it's probably in a folder in a file like this
where these big long strings represent your cookie value
or, better yet, it's in a database.
And so that's why those are there-- because I apparently
created four cookies by reloading and experimenting a few different times.
OK, any questions on what the potential then
or the usage of cookies and sessions are?
All right, so let's then transition to solving
problems a little more powerfully with these things called databases.
So here is, like, a depiction of what we've been doing the past week
and we'll do a bit more today and you'll do with problem set seven as well.
We have been having browsers and servers talk to one another
using kind of a certain paradigm.
And the paradigm is generally called MVC, Model View Controller, though we
didn't call it that last week.
And MVC means this--
you are writing code that is called controller.
This is like the logic, the intellectual property,
the business logic of your application.
Like, if a user submits form, then add update contents of shopping cart.
Any of those kinds of ifs and elses and loops
are in what's called your controller code, AKA application.pi.
You might also have views, V in MVC, which specifically are your templates,
the idea being that you separate all of the interesting, juicy, intellectual
part of your program in the controller code, the application.pi,
and all of the important but kind of aesthetic stuff
in your templates directory, AKA views.
This way, among other things, you, for instance,
can implement all of the logic of a web site
whereas a friend of yours who is much better at design certainly than me
can do all of the aesthetic stuff, all of the views,
in the templates directory, and you can collaborate a lot more effectively.
But there's one more letter in MVC.
It's Model View Controller.
And what we'll focus now on today is the M in MVC.
Model is just a generic way of talking about data--
like where does the actual data go.
So the mental model to have moving forward as you write code,
even though the lines are not going to be always as
clear as this picture suggests, is any time
you're writing code in application.pi that's probably controller code.
Like, that's the intellectual workflow of your application.
Anytime you're doing HTML or CSS or Jinja,
that's the view of your application, the aesthetics, the presentation
of information.
And any time you're reading and writing data, long term or short term.
That's the model.
And sometimes the lines will be very blurry
and it could all be in the same file.
But those three types of operations are the mental model to begin to adopt.
So let's get there.
You'll recall that in week zero, we introduced this guy, Scratch,
and the graphical programming language with which you began the semester.
A week later, we introduced C, which had the same ideas and capabilities
for the most part but using a keyboard only instead of graphics and a GUI.
Then we transitioned a week ago to Python, a couple of weeks
ago now to Python, which is very similar in spirit to see.
But as you may recall with great delight or frustration,
you can get so much more work done quickly--
for instance, P set five--
using the right language.
And so here too we begin to use a new tool
for the job called SQL, or Structured Query Language,
or S-Q-L. This is yet another language, but it's much smaller than Python and C
as we shall see today.
But it is a language that you typically use
when you want to talk to a database.
So based on your familiarity already, what
is a database even if you've never actually used or programmed
with one before?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Place where your store data.
So a database is a place where you store data- a base where you store data even
perhaps.
And so that's true.
And what does that actually mean?
Well, a database, like a web server, is actually a piece of software.
We might think of it is this, like, physical entity,
but a database is technically a database program
that is listening for connections just like a web server or an email server.
And any time it gets requests, it serves up a response.
So we have the same client server relationship that we do with the web
itself.
Now the database hopefully is better and smarter and faster at storing data
than certainly little old me.
And it's hopefully better than something like CSV files.
Like, recall we used CSV to store the registrants for [INAUDIBLE] last week.
I think it was me and Rob and Maria all at once.
Then what was good about that?
What problem did we solve by introducing CSVs with that example?
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: You didn't have to go through my inbox, right?
One of the other versions of the registration program
were to just email the proctor in charge of [INAUDIBLE],, which
is fine because he or she can just use labels and keep track of it there.
But that's not a database.
You're Just kind of abusing Gmail as a database.
So why not put in a Text file instead like a CSV?
Because what kinds of programs can open CSVs?
Yeah, like Excel and Numbers and similar spreadsheet programs.
You can import it into Google Spreadsheets as well.
So that was a plus of using CSVs.
But what's perhaps a minus even though we didn't really
experience the minus so much?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Harder to what?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: It is harder to do stuff with it because at the end of the day,
it's just a stupid text file, and we humans
have agreed to put commas in between values we care about.
But that's it.
Like, there's no features.
You open up that file, you're just going to see ASCII text and commas with them.
You get no, like, features, no functionality.
And do you have another thought?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: It's inefficient why?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: It's harder to find things.
Certainly not in the case of, like, me and Rob and Maria.
But as soon as you start having hundreds of students registered or thousands
of customers in your database or hundreds of thousands, like, in CSV,
the best they can do is linear time for everything.
You have to search the entire file top to bottom.
So wouldn't it be better if we can borrow ideas from week zero,
integrate them into the context of this problem, databases,
and actually query for data, ask for it, change it, delete it, and edit it
much more readily than simple little text files allow?
So what should you be thinking about when you think about a database?
Well, there's different types.
And today, we're going to focus on what are called relational databases, which
is a fancy way of saying tables of columns and rows.
So this is a screenshot of Google Spreadsheets
with which you're likely familiar.
The projector seems to be hiding the fact that there are actually
rows and columns here.
And instead, we're just seeing a big white box.
But there are indeed rows and columns inside of here
as you can see on the screen.
And if I use a [INAUDIBLE] this is a screenshot.
So I can't even enable rows and columns.
So here, we have our columns and rows.
So why is this useful?
Well, if you've ever store data for your student group, or for your customers
or for other applications around the house,
you've probably been in the habit of storing things in rows and columns.
And what does each column typically represent in a spreadsheet?
What kinds of things might it represent?
And here's an actual version of Google Spreadsheets.
And let me actually highlight this and turn on, just because, darker lines.
So there's our rows and columns.
So when you're using a spreadsheet, maybe from personal experience,
like, what do you put in the columns?
I'm sorry?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Categories of what?
Give me a concrete example.
AUDIENCE: [INAUDIBLE] the name of the student and lab that they're in and--
DAVID MALAN: Oh, what a delightful example.
So yes, we can say, like, the student's name and the student's dorm and maybe
their email or if we collected their phone number
and bunches of other things.
And you know, just to be sort of tidy, we might boldface it,
and you can use features like this, which just kind of lock that row there.
But these are just GUI features.
These are not sort of intellectually that interesting.
But what do I typically put in each row thereafter?
The first row is kind of special usually, which
is why you can kind of lock it off.
What do you put in the rows?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: I'm sorry?
AUDIENCE: Elements [INAUDIBLE]
DAVID MALAN: Yeah, the actual data--
so, students' names.
So if, like, David registered from Matthews,
you would put his information in there.
And then, like, if Rob registered from Thayer put his in--
oh, that's not--
Rob's from Thayer, not Rob from Bowden.
And then Maria, who--
damn it, I asked her this the other day.
Where does she live?
[INAUDIBLE] Matthews, just like me, which is why I should remember it.
OK, so here's-- we now might have multiple rows in this database,
each of which represents an entity would be the fancy way of describing where
an entity is like a real world thing or a physical object or whatever it is you
care about.
So this is how we might form at this.
And of course literally, we might format things using Excel or Numbers
or Google Spreadsheet's, like, format column, which kind of
presents the data in a different way.
Maybe it centers it to the left or-- the lines to the left or right.
Maybe it removes leading zeros.
Maybe it adds dollar signs, percentages.
You can kind of present the information in different ways.
But for the most part, at the moment, it's just text that I've typed in.
But if I want to actually find someone in this spreadsheet, what can I do?
Well, I can do Command-F and then type in Rob.
And there we go.
We found Rob.
But what did the spreadsheet do?
Pretty much looked from top to bottom looking for R-O-B and as,
soon as it found it, highlighted that row.
And if I had thousands of rows or tens of thousands
of rows in the spreadsheet, Rob might be all the way
at the bottom at which point it could take longer
to actually find all instances of Rob.
And in fact, programs like Excel and Numbers
and even Google Spreadsheet eventually just start to break down.
In fact, I experienced this in grad school
myself I was doing some research where I wanted people to analyze data
and it was just convenient to be able to double click on my CSV files,
open them in Excel, and look at or search for some of the rows
that I cared about.
But it turns out that because of the long running simulations I was doing,
I had, like, over 65,000 rows of data, and Excel just wouldn't do it.
I would double click and, sorry, this file
is too big because someone at Microsoft, in prior versions of Excel,
had decided, first and foremost, surely no one
needs more than 65,000 rows in this product.
But specifically, they used a number of bits to represent the row number.
If you recall, we tend to deal in powers of 2 or multiples of 8 specifically--
so 8 and 16 and 32 and so forth.
Well, 16 bits, 2 to the 16th, is 65,536.
And indeed, I must have had 65,537 rows or more.
So Excel just kind of wouldn't cooperate,
and I would have to use an actual database like Microsoft Access
or Oracle or, as we'll see today, something like SQLite.
So as soon as you kind of hit some ceiling with the tool
whereby it's slow to search or the program just can't open all of the data
or you can't kind of edit information or delete information any better
than linearly, it's probably time for an actual database.
And one such database is called, indeed, SQLite.
So SQL, Structured Query Language, is the language we're about to see.
SQLite is like a very free and very lightweight
implementation of this database.
And it's a lightweight in the sense that you
don't need special software licenses.
You don't really need to install it.
All you have to do is store data in binary format.
So up until now in CS50, we've pretty much
stored almost everything in ASCI or text format except for P set 4's images.
Well, today we're back to storing things in binary format, zeros and ones.
But again, we are well past the point of caring what those zeros and ones
look like, the patterns there of.
So SQLite is going to handle all of that complexity for us,
and it's going to store in one big binary file containing
zeros and ones the same kind of information as David and Rob
and Maria and others but much more efficiently
because one of the things you get from a database
is the result of a lot of smart people's effort trying
to figure out how can we, on behalf of the world, store data more efficiently?
How can we provide users around the world with searching and deletion
and editing capabilities more efficiently than he or she could
implement themselves?
So again, in the spirit of CS50, we sort of
are standing on the shoulders of people who've come before us.
Now what is this going to mean in real terms?
We're going to use at least one of two tools.
There's a tool called PHP lighted min, happens to be
written in another language called PHP.
But that's incidental.
We're going to use it as a graphical tool with which we
can interact with the database program.
But there turns out there is also a terminal window version of a management
tool called SQLite or technically SQLite 3,
which is just a text based way of interacting with the database as well.
But that's just how we're going to see the database visually
in my terminal window or in my browser.
At the end of the day, we're going to talk to our database using
actual Python code.
But before we get there, let's seed ourselves
with some of the basic building blocks that we'll have
at our disposal with this language.
So unlike C and Python, where we're generally focused
on the logical controls, conditions and loops and functions and so forth--
some of those do exist in SQL.
But we're actually going to focus really on just the operations, so
to speak-- the verbs that SQL supports.
So it's a little different mental model to have than C and Python.
Specifically, perhaps the most common operations
we're going to use when storing data in not a spreadsheet like Google
but rather a binary file like a SQLite file is these five operations.
We can create tables.
We can create-- we can insert rows.
We can select rows from the table.
We can update rows in the table and we can delete rows from the table.
So these are kind of like what you would expect
you would get from a database program.
It's just these happen to be the terms of art as well.
Like, these are the verbs we use.
And it's conventional to capitalize them,
but they don't strictly need to be.
So let's look at a few examples now.
If I wanted to, in an actual database program,
create kind of the equivalent of this idea--
rows and columns and things called tables--
here's how I might do that.
So in my administrative program, which we'll see in a moment, or in code,
I might do lines like this, one at a time.
We'll review.
So in yellow now is CREATE TABLE.
So if you want to create the binary equivalent, the computer equivalent
of like a Google spreadsheet for [INAUDIBLE] registrants,
you would literally do CREATE TABLE quote unquote registrants
where that's the name of the table or the sheet you want to deal with.
And then this didn't fit on one line.
So I wrapped it.
But this is technically one long command.
Then in parentheses, you specify a, separated list of all of the columns
that you want to have in that table, but--
and this is fancier than Excel and Numbers
and Google Spreadsheets-- you specify a few other details as well.
For instance, if my leftmost or first column is going to be called ID
and my second is going to be called Name and my third
is going to be called Dorm--
so similar to what you proposed earlier but
we're not going to worry about email and phone number and the like.
And I have added, for good measure, a unique identifier, an integer
to identify each freshmen registering for sports.
There's a few other details I'm clearly providing.
Like, what type of information is a name going to be clearly?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Text, and indeed, SQLite has a data type
called not string but literally text.
Similarly, as dorm going to be text.
But ID is going to be a little different.
It could be text.
But we get better efficiency if we just use numbers
to uniquely identify humans, right?
Because even in my name, D-A-V-I-D, how many bytes is that minimally?
How many bytes?
D-A-V-I-D. Five-- yes, five bytes total for D-A-V-I-D.
But if we instead represent me in a computer using a number like the number
one or two or even the number 2 billion if we have a lot of users,
how many bytes does it take to store that number?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Well, even fewer.
If I just rep-- if my Harvard ID number is just a single number,
like one or two or 2 billion, how many bits
do you need to represent an int like that?
Typically, we've used four.
You could use eight if you want a bigger integer,
but generally speaking, it's going to be fewer than the number of characters
you're actually carrying about.
And so that's why I might use an integer.
We'll come back to what primary key means in just a moment.
But long story short, that line, those lines in yellow
if executed in a database program are going
to create the equivalent of a spreadsheet containing
three columns that I care about.
Thereafter, if I want to insert someone like myself into that table.
This is how you use the SQL language to do so.
You literally execute insert into registrant,
which is the name of the table, then a, separated list
in parentheses of the columns for which you have values
that you're about to provide.
And this allows you to change the order if you really want,
but I'm just keeping it clean and going left to right.
Then you literally write values.
And then in parentheses again, you list a comma separated list of the values
that you want to plug into those columns.
And again, it's up to you to make sure that if you say ID first,
you put a number first.
Name second, you put the name second, and so forth.
That's where we're in control.
Meanwhile, if we want to search for everyone who has registered for
[INAUDIBLE],, we're going to see syntax like this.
You're going to say to the database select star everyone from the database,
specifically from the table called registrants.
And that's going to return to you literally
every row somehow in code from the table Meanwhile.
if I want to change my name because I wasn't sort of following directions
and they technically wanted my full name for sports, well,
here I can do update registrants, the table,
set the name column equal to David Mailand,
but this last part is pretty key.
If I don't put where ID equals 1, what intuitively do you think
might happen in the database?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: All the names go to David Mailand, which is probably incorrect.
And some of the C course's head teaching fellows
will remember years ago-- we actually had
a teaching fellow named Michael [? Tingly ?]
whose nickname was Tingles.
And we accidentally executed a SQL query omitting
this condition, this predicate.
Where we should have said where ID equals Michael [? Tingly's ?] ID
number, instead we left this off.
And thereafter, we had like 700 Michael [? Tingles ?] registered in the class
until we restored from backup.
So it can happen to you as it did to us.
Finally, if you wanted to delete someone,
same principles, just different verbs-- delete from the table
that you want to delete from where you are
specifying an actual ID in this case.
And actually just for good measure, though it doesn't strictly
matter, let me be consistent and just capitalize that Where as well.
The reason for the capitalization, just to be clear,
is that it kind of helps the human eye to distinguish
what is, like, SQL syntax from words that we humans have invented
for own spreadsheet.
And notice a few characteristics as well.
I have standardized on all lowercase for my column names
ID, name, and dorm, which is common, but not strictly required.
The table as well is all lowercase.
But what is absent from all of these field names is I'm not using spaces.
I'm not using funky characters like exclamation points,
like "registrants!", or anything like that.
I'm just kind of keeping it clean with very simple characters
and that's good for portability.
It's very commonly the case.
And you might have seen or experienced this in the IDE.
Any time you start using spaces or funky characters,
computers sometimes get confused because they're not expecting them.
So the best practice with things like database design
is just avoid those issues altogether and keep things simple.
Use underscores instead of spaces if you really want to put something--
a gap between two things.
Now meanwhile, there are different data types.
There's not terribly many, and the only ones
we have to care about for the coming days are these here, one of which
is kind of strangely named.
But SQLite, the database we're using henceforth, supports text as we've
seen, integers as we've seen, or AKA int-- some of these things
have aliases or nicknames
Null is going to be kind of the opposite of having any value.
You can actually specify, no, there's no value here.
Real is going to be what, perhaps?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, like a real number, like a floating point value,
as we called it in C and in Python.
And then BLOB is binary large object.
So if you actually want to store 0's and 1's in some form in your database,
you store it as a BLOB, which isn't super common
because there are better ways to store binary data like, in files.
But it is possible as well.
And it's worth noting that there's kind of sort, of some other types,
but they're not sort of officially there.
SQLite is nice because it's pretty flexible.
There are a lot of database programs out there.
You've probably heard of some of them like the ones I rattled off--
Microsoft Access, Oracle, MySQL, Postgres SQL, and others.
SQLite is just an alternative that's pretty flexible.
And so if you try to load data from like another database like Oracle
into SQLite, it will actually tolerate unofficial types
like date times, which look like years and time-- or dates and times
literally-- numeric, which is when you're not sure
if it's going to be a floating point number or an integer,
the database can figure it out for you and some others as well.
So don't be confused when reading up on SQLite.
If you see some things that aren't wholly consistent,
it's just consistent with that kind of flexibility.
And in fact, here's an exhaustive chart of all the things from left
that map to the right.
So in the right are all of SQLite's official data types, these five here.
And you might see on the internet in SQL tutorials
and like bunches of other types here.
The most popular might be, like, bigint, which is a 64-bit integer,
which is really big, varchar, which is a variable number of charts--
either fewer or more, depending on the data--
and lots of other things as well.
So just realize they're out there, but the five on the right
are the ones we really care about.
SQL also has some built-in functions.
You can do this in Python code too as we'll see.
But date and time and datetime allow you to get, like, the current time.
So if a user registers for your website and you
want to record when he or she registered,
you can call functions like this and the database
will just figure it out for you.
You don't have to check the computer's own clock.
But most importantly, there's going to be some of these decisions.
But before we forge ahead with too much more, let me pause for any questions.
Yeah.
AUDIENCE: [INAUDIBLE] when you [INAUDIBLE]
DAVID MALAN: Ah, good question.
Typically, no.
But if you want to be really precise, yes.
So I was really precise and I actually read the documentation this morning
to make sure I got all of those quotes right.
The rule of thumb if you're curious--
though for the most part, this won't matter and it shouldn't hang up--
is that any time you are referring to the name of an identifier,
as it's called, like a column name or a table name,
you use double quotes with SQLite.
Anytime you're referring to a literal string
that the human has typed in somewhere like with David Mailand
and like with Matthews, you use single quotes.
So if it's kind of user input, single.
If it's a name in your database, then it's double coated,
but it doesn't typically matter.
So don't fret two much.
So let's consider then for just a moment where we
started this particular conversation.
We were trying to create a database or spreadsheet for [INAUDIBLE] IM's,
but we kind of stopped short of the juicy part.
Like, I typed in some names and some dorms.
But I could have started typing email, which
would have hopefully some standard format-- something at something dot
whatever.
And then phone number is kind of interesting, right?
Like, how should I store a number?
Like, if I just want people calling Harvard's main number,
is that, like, 459-- what is it 5,000?
Well should I store it like this--
10 digits and two hyphens?
Should I instead store it without the hyphens just to kind of save space?
Should I, need to be consistent with international folks,
store the country code as well, which now Google is confusing for math?
So that's what Harvard's phone number is when you add the numbers together
apparently.
How do you store the information?
Or better still, what should you expect the users to type in?
When you visit a website and you are asked for your phone number
to buy something or create an account, like, in what format
are you typically prompted for your number at least in the US?
AUDIENCE: Area code, [INAUDIBLE]
DAVID MALAN: Yeah, you're often prompted [INAUDIBLE] like an example like this.
And then you have to type it in and hit Enter.
And then it's stored, like that.
What else have you seen-- probably the other formats too?
So which way should you do it?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yes, yes, all of these ways are perfectly reasonable,
and any website that forces you to type your data into one format is stupid.
Like, it is poorly designed.
And this is one of my greatest-- well, it's not my greatest pet peeves.
But like, one of my pet peeves is when a web site just completely
technologically unnecessarily puts the burden of the human to format the data.
Like, I should certainly be able to type my information in like this
or with the plus one or without the parentheses
or with an additional hyphen or the like because you know what you
the programmer who made the website can do?
You can just use some kind of pattern matching
in the spirit even of P set six, throw away all of the punctuation,
and then store just the digits if you want or go in
and forcibly add the hyphens where you want them
or the parentheses where you want them.
The burden does not need to be on the human.
And there's actually compelling UX, or User Experience,
reasons for this beyond me going off kind of on a rant.
Like, if you can't copy and paste something into a field,
that is a bad user experience.
And if you've ever tried to just copy one field into another
and the website is disabling that for you, it's not the best practice.
Worse yet are websites that don't let you paste your passwords into the forms
because if you store your passwords, as is good practice, in like a password
manager, encrypted somehow, it's better to have a really long password
and then paste it into the form than memorize a really short password
that someone else could guess.
So in short, as we move forward with web programming,
don't necessarily take as canon, like, what you see out there, but consider
for your own users, like, what would be the best way to do this
and how can I, knowing programming, accommodate different styles
and expectations as well?
But at the end of the day, there's an issue here
because is this phone number now a number, or is it a string or text?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: So OK, so I heard a few strings or text.
So that's good because SQLite supports text.
So I could specify, when I convert the spreadsheet into SQL commands and a SQL
database, I could use text.
And so that's probably a good thing.
What about zip codes, right?
Zip codes, at least if we do the first five digits in the US--
like 02138 to.
Is that a number or is that text?
AUDIENCE: Text.
DAVID MALAN: Text?
Why?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: I'm sorry?
Say again?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, it's more of a category
than it is a quantity, certainly.
Like I don't have zero, one, two, three, eight of things.
That is actually just meant to be a unique symbol.
And in fact, as a side effect of the spreadsheet trying to be helpful, here
is an actual bug, or reason to store it as text.
I typed in 02138, which is Cambridge, Massachusetts,
one of its largest zip codes.
And unfortunately, Google just decided, you don't really
need that leading zero, do you, and it got rid of it.
And so this is because, at least in Google Spreadsheets, my data here,
if I go under Format, it's being stored automatically.
Google, for better or for worse-- often for worse--
is inferring that, if it looks like a number, it is a number, therefore let's
throw away the leading zeros, which humans don't need.
But if I change it to plain text and now do 02138, now it kind of sticks.
And another sort of true story, years ago,
I used to use Microsoft Outlook for my email.
And then eventually, I switched to Gmail.
And before I did that, I wanted to copy my contacts,
like my address book from Outlook into Gmail.
And so Outlook actually has a feature where
you can export a big CSV, literally, from Outlook.
And that contains columns of, like, all your friends' names and numbers
and addresses and zip codes.
And I foolishly didn't realize that I probably
shouldn't open the file first in Excel to look at it,
probably instinctively hit Command S or Control S just to save it,
because that's like a habit.
As soon as I did that, Excel presumptuously
did a Google just did, and dropped all those leading zeros.
And to this day, like 10 years later, I'm
still finding friends whose addresses are Cambridge, Massachusetts
2138 for that very reason.
So data formats matter.
And we actually have a number of decisions to make, and moreover,
if we want this database to be highly performing-- that is, quick--
we actually have a few other design decisions.
And so let's take our five-minute break here, and when we come back,
we'll actually wrestle with how can you store the data most efficiently so
that your web applications are super fast.
So we are back.
And let's see what problems now arise besides the most immediate, which
was the phone number.
So let's consider that these aren't necessarily students registering
for [INAUDIBLE],, and so therefore they don't just
have dorms and so forth, but maybe these are actually
like customers in a database, Amazon customers or the like.
And so people are indeed going to have names, we'll say.
And people are going to have addresses, address.
And people will have phone numbers still, email, and maybe
some other identifiers, as well.
So when it comes to storing actual data here--
and just for clarity, these will be my column headings now, so
special first column--
so what's my name here going to be, David Malan?
Should it be David?
Like, how do you actually think about something as trivial as this,
and what are some of the upsides and downsides?
Like, which should it be?
AUDIENCE: David Malan.
DAVID MALAN: David Malan, OK.
So David Malan is what we store.
OK?
But now, later on, I realize I'd really like
to send customized emails to my customers,
and it feels kind of lame and informal if I say dear David Malan, where
clearly it's not very intimate.
I'd like it to say, dear David comma.
So how do you address that?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: OK, so maybe we should store first and last names separated.
So to do that, maybe I'll just kind of redo this, and actually
call this First, and then call this Last.
And so now, I can keep my two columns separated.
Other thoughts?
That's pretty good.
Or--
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Sorry?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Oh, so we have like the greeting, as well,
which you might be asked by a dropdown or a text box, too.
So maybe we need that, as well.
So technically, I should probably move this over like this.
And this will be like the greeting, and maybe this
will be Mr. or some number of other symbols.
So we might have to deal with that, as well.
What about address?
Let's move on to that, because honestly, with most of these questions,
there's not necessarily a best answer.
There might be some best practices, but even then you're
going to hit corner cases.
For instance now, anytime I do want the user's first name and last name,
I'm gonna have to concatenate them together, which is fine.
It's not a big deal in Python.
It was a bigger deal in C. So there's these tradeoffs.
Like, I just have to now do that, or write more code later.
What about the address?
So it might be 33 Oxford Street, the CS building in Cambridge, Massachusetts,
02138 USA.
Let me even make more room for all of this.
Good?
Bad?
Why?
Let's go here.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Sorry?
AUDIENCE: [INAUDIBLE] separate the data [INAUDIBLE] zip code [INAUDIBLE]..
DAVID MALAN: Well you say need to separate the data into street and city
and country and zip code and state, but why?
AUDIENCE: [INAUDIBLE] right now to sort people by [INAUDIBLE]..
DAVID MALAN: Yeah, that's a great example.
So if everything has just kind of been this long string, which
we humans and the postmaster general in the US can understand,
and computers can parse, it's not that easy to just glean what we care about
from that whole long string.
Like, we're gonna have to use some kind of heuristics,
we're gonna have to look for all numbers to get the zip codes.
And I can, as you say, not sort everyone who's in Massachusetts,
or everyone who's in 02138, because like it's not a separate field.
And so even mechanically in Google Spreadsheets,
I can't sort by anything other than 33 Oxford Street.
And I probably don't really care to sort people
by their street addresses, which really have no notion of ordering
that's interesting.
So we probably want to split this up.
So maybe this is actually Street, and maybe this is actually--
let me shrink this to make a little more room now--
maybe this is City, maybe this is State, maybe this is Zip,
maybe this is now Country.
And then over here, we have Email and Phone and so forth.
And so now, I could do 33 Oxford Street.
And then I could do Cambridge, Mass, 02138, USA, malan@harvard.edu, 6174495.
See, this is why you shouldn't allow user input in different formats.
495, 5000 and so forth.
OK, so suppose we do that.
But you know, Rob has an office in the CS building, as well.
And so he's actually Mr. Rob Bowden.
So now he and I are both there.
And oops, we didn't fix this, so technically let's go back here.
02138.
And now it's like rob@cs.harvard.edu.
And maybe we share the same office number.
And there are certainly, like, dozens of other people in the CS building,
if not a couple of hundred.
So this is actually yielding another problem.
At least if I keep typing, I'm going to dislike this for some other reason.
What's also bad about this?
Yeah.
AUDIENCE: [INAUDIBLE] the office room.
DAVID MALAN: The office room, yeah.
So we didn't specify the room in the building.
So we're missing data all together.
What else really rears its head over time?
Yeah.
AUDIENCE: It's very redundant.
DAVID MALAN: What is redundant?
AUDIENCE: [INAUDIBLE] wasted space [INAUDIBLE]..
DAVID MALAN: Which information is redundant, then?
AUDIENCE: The street [INAUDIBLE].
DAVID MALAN: Yeah, there's a lot of--
I mean, literally I copied and pasted, which in coding in general
should kind of be a bad sign or a bad habit,
certainly if you do it more than just once.
And so Cambridge is identical, Massachusetts is identical,
zip code is identical.
And even though this is gonna be a small white lie,
it's probably the case that every zip code actually maps to a unique city.
So I don't-- if I know 02138, do I really need to store Cambridge, Mass,
Cambridge, Mass, Cambridge, Mass for everyone?
No, it's probably all the same.
So I should be able to delete some of this data.
And this kind of messiness is a problem that databases can now solve.
You could do this in Excel and in Google Spreadsheets and so forth.
You can quote unquote normalize your data by factoring out commonalities.
But to do so, the most common way, then, is
to start assigning unique identifiers to things.
So for instance, if I want to go in here-- let me give myself a new column,
call it ID.
And I'm just going to be user one, Rob will be user two, and then three
and four and so forth.
We're just gonna use an int to keep it simple and efficient,
because now it's only like 32 bits, or maybe 64 bits,
no matter how long the humans' names and street names and so forth are.
Let me just fix the border so that everything looks uniform.
So now I've done this.
And you know what, if you've never noticed,
like Excel and Numbers and Google support
multiple sheets in a spreadsheet.
So just for good measure, let me actually rename this to Customers.
And you know what?
I'm gonna go ahead in Google Spreadsheets,
give myself another one called Cities.
And then here, I'm gonna do something like this.
I'm gonna store over here maybe a let's say city and state.
And actually, you know what, because cities I'm
gonna claim for the moment are uniquely identified by their zip codes,
let's just put zip code there.
And then in this spreadsheet, let's do--
preemptively this time-- plain text--
02138 shall be Cambridge, Massachusetts.
And now, back in my customers database, which
of these columns, A through I or so, do I actually not need anymore?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, I don't need this.
I don't need this.
I probably still need zip.
So let me just go ahead and delete these and start
to assuage your concern, which was this redundancy of data.
Now, I can do a little better, but now at least I'm storing zip code only.
And actually, this is getting a little lazy.
Let me get rid of country, as well, and just put this in here, as well,
for generalizing zip code to be postal code now, say.
And this shall be USA.
So now my Customers table has a lot less redundancy.
So clear upside is that it just takes up less space,
and less space means less money, and just more room for other things.
So that's surely good.
But it probably means too it's easier to update the data, right?
If someone-- if Cambridge for whatever reason
changes its name back to Newtown, as it apparently once was,
we just change it in one place, not for every resident in the Customers table,
and so forth.
So this is just a good principle, right?
Anytime in coding we've done something again and again, or in CSS even,
like we've tried to factor it out.
In HTML, we factored it out with templates or with functions
or with other techniques.
So this normalization process feels like a good thing.
But you know what?
02138, we can do a little better, because that's a string.
We could just store this as a number.
So you know what I'm actually gonna do?
I'm gonna also have in this spreadsheet a column called
ID whose value for this city is gonna be one, and then for subsequent cities
it's just gonna an arbitrary number that's just increasing and unique.
And so now what I can actually do here is I don't strictly need to store even
02138 dash whatever whatever, I can just store the ID.
And I'm gonna go ahead and call that zip ID, just to make super clear to me
that now this is just a number.
And so by starting to assign arbitrary but consistent unique integers to all
of our data, you can kind of relate them back to one another
by looking up one value in the other.
So for instance, if you're looking at this table as a human
and you see that, oh, David and Rob live in zip ID 1,
you do have to ask a follow-up question, which is, well, what is zip ID 1?
You can only answer that by looking at the Cities table, checking it,
and being like, oh, Rob and David both live in Cambridge, Massachusetts,
02138, USA.
So it's added a step, but it's also massively shrunk
the amount of data we need, assuming there's more people in the world
than just the two of us, and there's millions
of users or thousands of users, or just otherwise context
where you have lots and lots of data.
And in fact, as an example of this, what we put together on the course's website
is a Google spreadsheet that has a whole bunch of sheets.
This is based on a freely available open source
database that looks like real data, but it's mostly
taken from one guy's iTunes account and then made up names and company names
and so forth.
And you'll see that this is actually a database for like a digital music
store, a la iTunes or Google Play or Spotify or whatever.
And so there's different entities in the world of digital music.
You have album names, you have artist names,
you have customers who might buy those albums and listen to those artists.
You have employees at the company who is selling the music.
You have genres for those songs.
Because this is actually a very slippery slope whereby, over time,
you're gonna have lots and lots of data that
ideally would be in multiple places at once, but that's just bad design.
Keeping it factored out is ideal.
So if we glance at this album title, or rather album tab,
here are all of the albums that this digital music store sells.
Here are the unique IDs, album ID, that have
been assigned to those albums, arbitrarily but for permanently.
And then here are the artists who published those albums.
So the only way to know the artist's name
is-- at least in Google, I'm gonna actually have
to scroll over to my Artists tab and, OK,
here are the artists and their numbers.
Meanwhile, all of these albums later-- or songs are associated with genres.
So here's all the genres and the IDs associated therewith, and so forth.
So you're welcome to poke around this kind of data later.
Here's even data involving birth dates, so dates and times and so forth.
Here are invoices, like the addresses and the cities
of people who bought this music, supposedly, in the sample database.
But honestly, this is very, very, very quickly becoming
really really annoying, right?
Because if you the human just so much want to know, OK,
who produced Restless and Wild, artist two, who the heck was artist two?
This is not good for humans.
But this is good for computers.
We just need a language with which to reconstruct what we want and query for
and change what we want.
And that is the kind of thing that SQL is going to do for us.
So let me change to another tab here in CS50 IDE.
Let me find the right tab here.
So if I go back to CS50 IDE, let me introduce a tool called phpLiteAdmin.
Name doesn't really matter, but this is a tool that comes with the IDE--
and we downloaded it for free off the web and just installed it for you--
that allows us to do this.
I have among today's distribution code a file called lecture.db.
.db just means it's a database.
It could be another file extension, as well,
but this is one of the standard ones.
And notice that-- let me close the console-- if I double-click this,
this is going to open a special tab in the IDE called phpLiteAdmin.
What I am now seeing is the content of lecture.db.
Lecture.db is a binary file, zeros and ones,
that represents a SQLite database that I made before class.
So I essentially downloaded all this data, I made a few changes to it,
and then saved it in a file called lecture.db.
phpLiteAdmin, this GUI here in the tab, is just
a way of visualizing for us humans what's in this database.
And you can see there's a bunch of tables, I think 11 different tables,
that I named identically to the Google spreadsheet.
So this is the SQLite equivalent of the Google spreadsheet.
But notice just from the links alone-- not that this is a very good design--
the overwhelming amount of functionality that I now get.
I have the ability apparently already to browse the data, search the data,
insert, export the data, empty the data, and so forth
that we just get from a database.
And we'll see how to do that in just a moment.
If I go ahead and click on Album, I'm gonna
see the same data as before within this graphical tool,
but again I just have some handy functionality.
I have the ability to edit each of these albums, delete them.
But I can also see the artist ID.
Now, this isn't all that useful, but let me go ahead and do this.
Let me go ahead and go to the SQL tab.
And this is where this tool is actually handy.
phpLiteAdmin is not something you put on your resume.
This is just like a tool we will use to help visualize what's inside
of our database, and to try out queries before we actually
write them into code.
I recall from before that one of the operations that SQL supports is Select.
So let me try that, select star from album.
And these are the key words from SQL.
This is the name of my table.
I could quote it, but it's not strictly necessary to,
so I'm not gonna bother here.
And if I go ahead now and zoom out-- whoops--
and click Go, this is going to select for me
all of the albums in the album table.
Now, this is not all that interesting, because just a moment ago I
just clicked on the word album and I showed you all of the albums.
But suppose I don't care about the album ID.
I don't care about the artist ID at the moment.
Well, let me go to SQL tab again and do select star from album.
But instead of star, which was the wildcard, what might
I want to select instead?
Just the name of the album?
So let me do that, select--
oh sorry, it was called title, title from album.
Click Go.
And now I get back a set of rows, a result set, as it's called,
of just the titles.
Now again, I'm not programming, I'm not using this data.
But this is the language, the way I can express myself in the SQL language
to get back only the data I care about.
And you know what?
Let me go ahead and do this now.
And this is where it gets more powerful.
If I go back to selecting all the data, there's the artist ID.
How do I get the artist ID?
Well let me go to Artist, and there's that same data, artist ID
and all of the names.
But here's where SQL gets powerful.
I can now take one table, one of whose columns
is these numbers, take another table, one of whose columns
is these numbers, as well, and I can kind of join them cleverly
by lining up the numbers that are identical in both.
The syntax is a little funky at first, but let's try this.
Select star from album join artist.
I want to join the two.
But now I have to tell the database what two columns to join.
It just looks like numbers, but I have to tell the database which
columns should line up.
So artist on album dot artist ID equals artist dot artist ID.
So this is just a predicate or a condition at the end,
but you can kind of read it intuitively.
Select everything from the result of joining album and artist,
provided you join them on album dot artist
ID, which is the artist ID column in the album table on the artist ID
column in the artist table.
So that's kind of doing this digital equivalent.
And so now if I go ahead and click Go, amazing.
Like, now I get back all of the data reconstructed
in a way that's useful, because now every row contains the album
ID and the artist ID, but more importantly the title
of the song and the name of the artist.
And I now have the ability to query for the data I want without inefficiently
storing all of this together.
And I can actually store a join on yet more tables in here, too.
You'll notice that playlist, for instance, is one of the tables in here,
because this digital music store supports the ability to make playlists.
And here we have playlist IDs which map to playlist names.
But how do you associate songs with a playlist?
Well, if a playlist just has a name and an ID,
you need some way of associating songs with a playlist by way of these IDs.
So we actually need kind of a many to many relationship, so to speak.
And that's what a database person would say.
If you want multiple people, playlists, to all
be able to have different songs but maybe overlapping songs,
so that you can't dedicate one song to one person,
you want a many to many relationship.
And so if we look at Playlist Track, notice this column is crazy.
This column-- or rather this table--
has just playlist ID, track ID, whereby track ID maps
to one of the rows in the track table, playlist ID maps to one of the rows
in the playlist table.
And by using this table and joining with multiple tables,
we can map playlist, which is just names of playlists and IDs,
to tracks, which is names of songs on certain albums.
And with the right join commands, I can actually
reconstruct all of the data I care about, and then in one breath
get all of the data back and let the database figure
out how to find all of those people and all of those songs
and all of those playlists efficiently.
A CSV, you would have to read every darn file top to bottom
and then figure it all out yourself.
But there are some key design decisions to be made.
I mentioned earlier not just these data types that SQL supports,
but also the constraints pictured here.
So these are a few new terms that we now need as ingredients
to designing a good database.
And we've seen one of these before, Primary Key.
It turns out that, anytime in one of these tables
we have defined for ourselves an arbitrary but a unique number,
generally in the left-hand column, just by convention, called ID or called
Artist ID or Album ID, whatever you call it,
that we've declared typically as a primary key.
This is a way of telling the database, when you design the database,
give me a table with an ID column that is meant to be the primary key.
That is, database, I promise to you that this column will always be unique.
Therefore, database, use this in the future.
Anytime you want to do something efficient,
rely on that column as uniquely identifying rows.
Don't use albums or artist names and titles,
which can be longer and non-unique.
Instead, use that, as well.
But there's other fields that you might want to flag as being unique,
but not your primary key, the number that uniquely identifies.
Like in the world of customers, which field that we played with a moment ago
might you arguably want to make unique?
And this was our own little database here of customers.
Which are the candidates for unique values here?
In back?
No, here.
AUDIENCE: Phone.
DAVID MALAN: No.
Phone, phone could be unique, certainly in the world of mobile phones.
But some people still have landlines.
That might get a little messy.
So maybe, maybe not.
What else?
AUDIENCE: Names?
DAVID MALAN: Names, maybe.
But if you Google yourself, odds are there's
many people in the world with your same name.
And you don't want to have one of you only buying things from the store.
Yeah.
AUDIENCE: [INAUDIBLE].
DAVID MALAN: What's that?
AUDIENCE: [INAUDIBLE].
DAVID MALAN: Email probably more likely.
Now, we can probably think of some family members
even who are maybe sharing an email address at home,
which complicates even that scenario.
But maybe.
If you do at least want to mandate that email addresses be the same--
I'd probably go with email over name, which
is gonna collide with multiple people--
then maybe we could specify yes, email address for our customers
must be unique, because maybe they're using it to log in.
Right?
If you're using email as a login name, it better
be unique so that there can't be two people with different passwords
but the same username or email address.
But an email address is kind of long.
Malan@harvard.edu, rob@cs.harvard.
Like, that's more bytes than an integer.
So long story short, it might make good sense
to uniquely identify your users at the end of the day
with just an ID, a numeric ID, the primary key.
But one of the features you get from a database is you can tell the database,
hey database, make this other column also unique,
and make sure that I or my colleagues can never accidentally insert data
into the database that is duplicated.
The database will defend against that for you.
You don't need ifs and else ifs in your own code,
you can let the database do that for you.
Index, meanwhile, is really powerful.
You can specify by the keyword index, hey database,
this column is important to me.
I am going to want to do searches on this column.
I am gonna want to look up people by this column.
Therefore, please go ahead and index it by using
your sort of secret sauce using fancy data structures and solid algorithms
to actually find data efficiently.
And here again is where we're standing on the shoulders of others.
Other computer scientists have figured out really good algorithms and data
structures, usually involving trees, which
we did discuss a few weeks back, that allow you to find data generally
in logarithmic time, not linear time.
And so simply by using this keyword index,
we'll see you can tell the database use those years of knowledge
to actually find data for little old me more quickly.
Not null.
You can specify don't let null end up in this column.
I must have a user's name.
I must have their phone number.
It cannot be blank.
Foreign key is even more fancy, but we've seen it already, in fact.
When we looked in just these examples here, rather for the songs database,
notice that in Album we had two types of IDs.
We had album ID in the Album table associated with titles,
and then each of these albums was associated with an artist ID.
And we've got two keywords here to apply.
I claim that this is the primary key for this table
because it just uniquely identifies the albums.
Why is that the primary?
Well, the table itself, the spreadsheet is called Album.
This is the unique ID that's primarily responsible for identifying rows.
That key over there is not primary, because there can only be one,
so it's called a foreign key.
But it's foreign in what sense?
AUDIENCE: [INAUDIBLE].
DAVID MALAN: Say it again?
AUDIENCE: [INAUDIBLE].
DAVID MALAN: To the album.
It's related to the album in that it's the artist,
but it's foreign in what way?
AUDIENCE: [INAUDIBLE].
DAVID MALAN: It's in a different file in a different table.
So it is a primary key in another table, which
is what makes it foreign if we mention it in this table, right?
Strictly speaking, this column doesn't need to be here.
I could just go ahead and forget about all
of the values in this highlighted column.
But once I add it, that allows me to link it primarily to another table,
because in this context it's just called foreign keys.
So that's it for the vocabulary there.
So what does this actually all mean?
Let me go ahead now into phpLiteAdmin, which
is again just our web-based tool for manipulating data.
And let me go ahead actually and do this.
Let me go ahead and close this version of phpLiteAdmin.
Let me go back over to my file browser.
And let me open up a terminal window for just a moment.
And I'm actually gonna go ahead and do this.
If I want to create a new file in Linux, the operating system we're using,
you can actually do this pretty easily by just touching the file,
even if it doesn't exist.
So touch customers.db, and now notice at left customers.db exists.
It's just an empty file.
I can now, wonderfully, double-click that and begin interacting with it
as though it's a database.
There's no zeros and ones in it yet until I start manipulating it,
and that's why it says no tables in database.
It's just an empty file.
So let's create an actual table in this database akin to what I was doing
in Excel, or in Google Spreadsheets.
So here is where in this user interface I
can create a new table on database Customers.
So let me go ahead and call this my Users table, because they-- well,
we're just gonna call it--
yeah, Users is good.
And number of fields, let's just do four this time.
And now I'm gonna go ahead and click Go.
And this is again what you get from phpLiteAdmin.
It's just a nice, user-friendly interface
for, like, answering questions.
So what columns do I want in this table?
Well, the first one is going to be the customer's name,
or maybe it should be first name, so like first underscore name or F Name
or whatever you want to call it.
I'll just call it First.
What should the data type be?
AUDIENCE: Text.
DAVID MALAN: Yeah, text.
And then last name should be text.
And actually, I'm getting ahead of myself.
I don't want to uniquely identify users by their names.
What did we decide?
AUDIENCE: ID.
DAVID MALAN: Yeah, ID.
So just to be kind of tidy, I'm gonna put it first.
It doesn't strictly matter, but it's just kind of the norm.
So I could call this ID, as would be a norm.
You could call it user ID, which might be another norm,
so long as you're consistent is what's important.
This should be not text but integer.
And notice these other features.
I can tell the database this is my primary key.
And now no other field should be primary.
I can tell the database, you know what, you deal with auto incrementing this.
I don't want to keep track of who is what number.
The database can assign an auto-incremented ID,
so one becomes two becomes three over time.
Not null is a given, because if it's a primary key it can't be blank.
And then this is irrelevant here, but you can give default values.
So for instance, if, for whatever reason,
people want to remain anonymous in their name, we could say,
well, the default name for everyone in my database
will be Bob, or something like that.
But that is kind of silly, as well.
We probably want to say that there is no default value, but rather first name
and last name must not be null so that the users must cooperate and provide us
with a value.
As for the last one here, I don't know, we'll just use email for now.
We'll make that text.
And it turns out you can't specify in the database
that an email address must have something at something.
We're gonna have to do that if we ever want in code.
But I can at least make it not null.
Now notice what happens.
Even though this is a GUI, graphical user interface,
when I click Create now, I'm told this-- table Users has been created.
But what's nice about phpLiteAdmin, which is largely why we use it,
is it's a good teaching tool, too.
This is what phpLiteAdmin just executed in order to create this table.
So it's a way of kind of learning SQL as you go by just click, click,
clicking in a user-friendly way, and then seeing immediately feedback
as to what command you could have just typed manually
in order to create that same table.
And it's wrapping because it's a little long, and it's not nicely indented.
But here, we have something very reminiscent
of what I put on the board earlier.
We have an ID field that's an integer, that's a primary key.
And by the way, it should also be auto-incremented.
That's a new feature.
But it cannot be null.
Over here, first name is text, must not be null.
Last is-- should not be an integer.
That's a screw-up.
So I should have made that text, so that's flawed at the moment.
And then email all the way over here is text, not null.
So what does this actually mean?
Well, let me show you one other way before we use this in actual code
as to how else I could be interacting with this database.
If I open up a terminal and I go ahead and type this now, SQLite
3, which is the third version of the program,
I'm gonna make this my alternative database and hit Enter.
This is not as user-friendly.
Like, I have to know what I type, or it seems I can type Help to get some help.
And these are all of the commands that I can type at the command line.
It's not all that user-friendly.
But I do know this.
Once I know SQL, I can just interact with this with SQL commands.
And I'm gonna go ahead and fix this in my alternative database.
I did not mean to say integer, so let me make that text.
I'm gonna put a semicolon at the end, which is typically needed,
though it's not always needed.
So let me go ahead and hit Enter now.
Nothing seems to happen.
But in SQLite, if I type Schema, schema is like my database design.
.schema means show me my database.
And any command that starts with a dot is SQLite specific.
It is not SQL, it's now part of the language, it's just a local command.
Oh, that's kind of cool.
It just spits back out what it is my tables look like.
I can see this in another way.
Let me go ahead and close this for just a moment.
And notice that, in today's files on the website,
I have lecture.db, which was the music database that we opened up earlier.
Let me look at that.
SQLite 2, lecture.db.
Let me do .schema.
Oh.
That is all of the data inside of this database redisplayed to me as text.
So I can kind of back it up if I want, or do other things.
But let's just skim it for something interesting.
Let me go to the top.
And here we have, nicely formatted, these are the commands
that the people who wrote this database online from whom we downloaded it
essentially made.
Create Table, Album.
The square brackets are optional, but they just are alternatives to quotes.
One of the fields will be called Album ID, Title, Artist ID.
We've seen those things before.
One is an integer, one is text, another is integer.
And then we'd see these features here, which we didn't see earlier.
And I'm gonna wave my hand at them for now.
But there is a way to tell the database that there is indeed a foreign key.
So you can say, in this table, there's a number that refers to another table,
and you can let the database make sure that you can never
add an artist ID that doesn't map to an actual artist.
So you can't just insert artist number one, two, three
if artist number one, two, three doesn't even exist.
So again, another feature you get from the database.
And for PSET 7 we're just gonna scratch the surface of some of these.
But again, this is one of the reasons that you sort of graduate from text
files and you graduate from [INAUDIBLE] and start using [? full-pledged ?]
database software like SQLite or Oracle or other such tools.
So let me pause to see if there's any questions before we dive into Python.
All right, so let's now actually use this in code.
The value of seeing, hopefully, SQLite 3 and in seeing phpLiteAdmin
is just so that you have tools with which you can create your databases
and actually see what's inside of them without having
to write code to do both, which just makes it very hard and annoying to get
started.
So phpLiteAdmin is what you'll use for problem set seven.
And we'll end with a look at that in a bit.
But it's just a way of administering your database.
It's not actually the intellectually juicy part.
For that, we're gonna need some actual Python code.
So for that, let me go ahead and open up my console again.
Let me get some tabs open here.
And let me go into let's say Lecture 0 folder from today.
And this is not online now, but soon will be.
Let me open up application.py and see what this app is.
Super simple.
This is just copy paste pretty much from past examples.
And notice that it has one route, slash, that's gonna render index.html.
And then what is that template?
Well, index.html is one of two files.
My layout is big generic HTML structure of the page.
Index.html looks like this.
It's a big to-do.
So let's actually do something with this program as follows.
But first, let's take a stepping stone.
Let me go into New File, save this as lecture.py.
It's just a local script, nothing to do with web programming just yet.
And let me go ahead and do this.
First, for those in the room who've never seen Python before today,
python lecture.py--
that is a program written in Python.
If you're underwhelmed, that's deliberate at this point.
It's a one-line program.
But now we can do something so much more powerful.
Let me go ahead and essentially do this.
In pseudo code, query database for all albums.
Whoops, albums.
For each album in database, print title of album.
Like, suppose this is now the program we want to write.
How do I do this?
Well, in Python, I'm gonna go ahead and do this.
From CS50, import CS50 SQL library, which
will be introduced again in PSET 7.
And it just makes SQL a little easier to get into.
And for your final projects, you're welcome to drop use of the library
all together.
It's just training wheels.
And I'm gonna do this.
DB for database, but I could call that anything I want,
is going to get a SQL database whose file is of type SQLite colon slash
slash slash just because dot DB.
So long story short, this one line of code just says,
hey Python, open a SQLite database called lecture.db
that's in the same directory.
That's all.
So after that, I have a Python variable called DB.
And this is a variable that allows me to talk to the database in code.
And this is the missing piece.
So far the only way we've interacted with lecture.db thus far--
and lecture.db, again, is the musical database--
is by using phpLiteAdmin, which is just this administrative GUI,
or SQLite 3, which is the command line version of the same.
We haven't actually done anything intellectually interesting by querying
for data in a programmatic way.
I've done it all very manually.
But not now.
Now let me go ahead and do this.
Let me say, go ahead and store in a variable called Rows.
All of the rows in the table that are the result of executing,
select star from album.
So I'm writing Python code, but inside of my Python code I have a string,
and that string has SQL in it.
So what do I now want to implement this pseudo code?
So that's what I've just done.
That's my comment.
Now I want to iterate over each album in the database.
How can I do this?
Well, let me go ahead and do that.
So for each album in database, so for row in rows, old syntax
by now, what do I want to do?
I want to just go ahead and print out row--
what were the columns in my album table?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, it was album ID and?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: I think it was title in the album one.
So I'm gonna do this.
Quote unquote-- I think it was called title.
We shall see if I'm remembering correctly.
And now let me just put my comment over here, as well.
So now we have a pretty short program that just executes a query,
and it turns out-- oops, I indented excessively-- this variable Rows
contains what's called a result set.
It is a list, an array, a list of all of the rows that match that query,
which hopefully is all of them.
Then this is just Python from a week or two ago
for each of the row in that Rows list, print out that row's title.
Now what is this exactly?
You can think of this as a hash table, or a dictionary.
Row is one of the rows.
It has a bunch of columns.
And the syntax in Python for indexing into those columns
is just square bracket notation.
This is a dict, or a dictionary object in Python.
OK.
So let's cross my fingers because I'm not sure
if I named everything correctly here and do Python of lecture.py, enter.
Whew.
It worked.
So outputted here is every title from the database.
So let me go ahead and just pick one.
For instance, a soprano inspired.
And just do another proof of concept.
You know what?
I don't want to get all of them.
I want to go ahead and get where title equals.
And now I will use quotes to distinguish it.
So how many rows should this query now return?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Hopefully just one.
So let's see.
Let me go ahead and rerun it, lecture.py.
And very nice.
And now two things printed, though, which
we didn't see before because there was a lot of output, but it was there.
It turns out that the CS50 library, to help you out, especially with problem
set seven, always prints out this debugging information that you
can technically disable.
It's not part of your program, just kind of sent in addition.
This shows you exactly what query was executed,
which is gonna be helpful because, when you start using web programming,
as we're about to, as well, you can then see in your terminal window
what your web app is actually doing to make sure your
can diagnose any bugs in your code.
But we can be fancier than this.
Remember for problem set six, or soon see in problem set six,
that if you import sys for a Python program, this gives us access to what,
among other things?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, command line arguments.
So what if I do this, where title equals something.
I want that to come from the command line.
So how can I go about doing this?
Well, turns out there's one other piece of syntax
that you need to know from the CS50 library.
The Execute method supports what are called name placeholders, where
you can essentially say this, colon t, or colon title,
or colon whatever you want.
It's just a placeholder.
And I'm just gonna call it t for title.
And what you can then do is this.
You can specify that the value of t shall be,
for instance, sys.argv bracket 1, which will
be the first word that the human typed in at the command line.
And what is Python gonna do?
It's gonna plug this into colon t.
That's all.
It's just a placeholder.
And just as an aside, we've seen this before.
Let me delete this for just a moment.
As of like two minutes ago, the way you might have done this
is like this, with a so-called f string or format string, just plug it in.
This is dangerous, for reasons we'll see in just a moment.
So I'm gonna actually do this instead, the colon approach,
and let the CS50 library actually scrub or sanitize my data so
that the user can't accidentally trick me into, like, deleting all of my data.
So now let me go ahead and do this.
Python of lecture.py-- whoops.
Python of lecture.py.
And now let me go ahead and type in that same title, which was--
what was it, a soprano?
AUDIENCE: Soprano inspired.
DAVID MALAN: A soprano inspired?
Inspired?
And now I'm using quotes.
Why?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, it's one phrase.
I want it all in argv1.
I don't want it partly in argv1 and 2 and 3.
It's just gonna be messy.
Enter.
Whew.
It actually worked.
So now I've written an interactive program
that takes user input at the command line, plugs it into the SQL string,
and actually allows me to get the data that I care about.
Now, why did I do this?
As a quick aside, if you do it this way, which would have been our technique
as of, again, like minutes ago, before seeing this, this technique--
and actually, now I need to quote it inside the SQL string,
the library otherwise does that for me--
this is actually very dangerous.
This is violating a principle that henceforth
you should be very mindful of.
Why might this be dangerous, just plugging in the command line
argument to that string?
AUDIENCE: I could put malicious quotes, or I could put, like, [INAUDIBLE]..
DAVID MALAN: Exactly.
You can plug in sort of code that you didn't intend.
So let me go back down here.
What if a really malicious user didn't just type
in the beautiful name of a song, but did something
like haha semicolon delete from album to try to delete your whole database.
This isn't perfectly right.
This won't actually execute properly.
Haha is not a keyword.
But the semicolon means that's it for this query, the delete from album
is a valid query that we saw before, and this is
what's called a SQL injection attack.
And in fact, this is very common even today,
because one of the features you're getting from the CS50 library
is sanitization or scrubbing of the user input.
And the takeaway for today is that you should never
henceforth trust the user's input ever, even if it's you.
Get into the habit now of distrusting even yourself,
lest you somehow inject data into your database that shouldn't be there.
Let me demonstrate this as follows.
So if you're logging in, for instance, to a Yale University website,
your login form looks like this.
Or at Harvard University, it looks like this, Harvard [? key. ?]
Now I'm gonna g sort of reveal what a password prompt actually
looks like without showing the bullets, which is the norm,
but in the context of the web, if a user were
to type in seemingly dangerous instructions, maybe this could happen.
So suppose that my name is me@exampleprovider.com or whatever,
and my password is this funky single quote or quote unquote one equals one.
Like, this is not a complete thought.
But notice it's miss-- it ends a quote here, and it lacks one here.
So it's kind of like this partial thought,
because if an adversary out there on the internet
is trying to hack into a Harvard or a Yale system or the like,
he or she might think what if someone at Harvard or Yale wasn't so bright
and did not sanitize users' input, and just trusted whatever it is.
Maybe I, the malicious adversary, can try
to finish their thought, so to speak, in a malicious way.
And what does this mean?
Well, suppose that the code that someone at Harvard or Yale had written
looks like this.
Give me a user name variable and a password variable
using syntax that we started seeing last week,
and you'll see more of in problem set seven.
This just says get me whatever the user passed into the web application,
as opposed to from the command line.
And then suppose that the Harvard or Yale staffer actually
executed select star from users where username equals this f string syntax.
What's interesting here is that they are correctly
quoting whatever the string is supposed to be for both username and password.
But remember what the user was about to type in a moment ago.
If I kind of close one quote and then like leave one open,
I can kind of inject arbitrary statements in here.
And so what a SQL injection attack might look like is this.
If I type in exactly those values, me@exampleemailprovider.com,
and that funky password, notice if you look past the syntactic weirdness what
really this is saying.
Where password equals quote unquote, which
is kind of silly, if your password is blank, or if one equals one.
Now, which of those statements is obviously true?
One equals one.
And this is arbitrary.
I could've said foo equals foo or anything.
The point is, I sort of finished one thought with that single quote.
Then, I typed whatever I wanted because I'm
presuming a little cleverly, but also kind of guessing,
because the adversary can't see this code,
I just have a hunch that, if I construct the right query,
I can trick the database into returning any user because one always equals one.
So I'm gonna get back a row which is probably gonna let me into the system.
So what the end result is is very bad, unless you sanitize the user's input.
So this other approach with which I started,
actually using these colon-based conventions, which
is very similar to percent [? s ?] in C, but it's
just now SQL-specific, or SQLite-specific,
this now ensures that when the user types
in even the most malicious-looking data, CS50's library, or other libraries
out there in the world that do the exact same thing,
is gonna make sure that any dangerous characters like apostrophes
are escaped with backslash.
So long story short, you don't even have to know
or care what a SQL injection attack is fundamentally
if you just practice good database sanitization techniques, which
doesn't sound very cool.
But it's very easy to do so by using, in our case, the CS50 library,
but in the more general case, actually using a library like it.
And so this is kind of a fun meme that kind of goes around the internet,
supposedly like some actual person's car.
This was meant to do what, do you think?
AUDIENCE: Hide the license plate.
DAVID MALAN: Well, not just hide the license plate.
This is probably one of those cities where you actually
have cameras overhead that are using optical character recognition to bill
the right person for their license plate, or ticket the right person
for their license plate.
So this person, correctly or incorrectly, was presuming,
hey, maybe, you know, the city of Cambridge
has some buggy software that's not sanitizing its inputs like that.
And before we look at an actual web example,
I would be remiss if I didn't teach a generation of aspiring programmers
this comic.
[LAUGHTER]
So henceforth, if you ever hear about Little Bobby Tables,
this is what CS people are actually referring to.
So let's actually now use this in a web context.
So I'm gonna go ahead and do this.
I'm gonna go back to the IDE, close out the command line program
that I was writing a moment ago.
And let's just fill in this to-do.
Let me go ahead and go back to application.py, which
doesn't do much of anything just yet.
But I'm gonna go ahead here and do the following.
I am going to initialize my database.
So this gives SQL of SQLite colon slash slash lecture.db, so just like before.
And then in my index route, I'm gonna do something like this,
rows gets db.execute, which is the only method in there that you need to know.
Select star from album.
And that's gonna get me all of them.
And now I'm gonna pass to this template all of the rows.
And I'm gonna call it albums, and I'm gonna pass in all the rows.
And this is a way now of handing all of those rows and those titles
into my template.
If I now go to my template where I have this to-do,
I can use some Jinja syntax, which honestly is not
going to sink in until you start practicing it some more,
but I do know from experience it's for album in albums.
Now close brace, and then endfor is gonna
be how we end this in my Jinja template.
And in here, I'm gonna go ahead and output each time
the album's quote unquote title, if I didn't get a syntax error.
And so now, let me go ahead into my terminal window
and do in my source directory in my lecture example
flask run to start the web server.
Open up the URL.
Whoa.
So that's kind of a mess.
But it kind of looks like all of the titles, probably.
Let me clean this up.
And here's where HTML now becomes really useful.
In HTML, what's the tag for, like, an unordered list, [INAUDIBLE] list?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, UL for unordered list.
So let me go ahead and make an unordered list tag there,
but close it after this loop.
Let me go ahead and nicely indent everything.
And now inside of the loop, let me go ahead and output
a list item, recall from a few weeks back.
And now it still looks super cryptic, but what is the logic saying?
Within the body block of my page, go ahead and start an unordered list.
Iterate over each of the albums, and then output
its title inside of a list item.
So if I go back here and reload, what should I see?
A whole list of those album titles.
Moreover, I can do a little bit better.
Let me go ahead back now here and do something like this.
Let me go ahead and say q equals request.args.get q.
And q is, again, what Google uses for its queries,
but I could call this anything I want.
Let me now go ahead and pass in the following.
Where title equals colon q, and then let me pass in q equals q.
This feels a little silly, but just think about what it's doing.
I want to substitute for colon q the actual value of q.
And maybe I can be more explicit here.
If I call this query, what I'm really doing here is passing that in to there.
So now that I've done that, what can I do?
This is not a very fancy web application,
but suppose I want to search only for what was it, the soprano song before,
soprano inspired.
Actually, let me search for something simpler.
How about just worlds.
A la hello world.
Let me go to my URL.
And remember, if you understand HTTP, we can
do question mark q slash question mark q equals worlds, and enter.
And dammit.
Where title equals worlds q-- where title equals worlds.
Why did you not come out?
request.args.
request.
Stand by real quick.
We were doing so well there.
All right, let me just reload.
Oh actually, what a good time--
none.
OK, none is bad.
What am I doing wrong?
Did I screw up?
OK, hang on.
Stand by.
What a good time to introduce eprint for just a moment from earlier.
Now let me go ahead and reload the page to trigger the load.
OK, that's bad.
Dammit.
eprint is not defined from CS50.
Oh.
Oh, so close to--
yes!
OK.
[APPLAUSE]
In fairness, I have no idea what just happened.
But it worked.
So let's now, with those inspiring words,
give us a sense of what it is you're going
to be doing with these ingredients.
We focused a lot today on SQL and on database design,
but the ultimate goal at hand is going to be to use these ingredients
and use this new language to query for real information.
So here is what we call CS50 Finance, AKA problem set seven.
And you'll be given a little bit of distribution code
that essentially gives you a user interface similar to this that just
allows users to log in and log out.
But it doesn't allow users to register.
It doesn't allow users to buy or sell stocks yet, as will be the goal.
CS50 Finance is like building your own e-trade website
where you can buy and sell stocks.
But it's going to integrate thanks to code
you write with Yahoo Finance, which happens to provide essentially a very
lightweight API, application programming interface, whereby if you ask Yahoo
for the current stock price of a certain stock ticker, they will respond
and actually give you that answer so you can integrate it into your website.
So what you'll ultimately be able to do--
and this is simply the staff solution--
is build a website like this one here, such that I can go ahead now
and log in to CS50 Finance as a John Harvard with his password.
I can then see that, by default, all of your customers,
wonderfully, for a special deal, are going to get 10,000 virtual dollars.
I can go ahead and get a quote for a stock, like GOOG is G-O-O-G.
Google as of right now is just over $1,000.
That's great, because I have $10,000.
So I'm gonna go ahead and buy one share of GOOG by typing in its stock ticker
symbol and one.
And now I actually have one share of Alphabet AKA Google plus some remaining
cash.
So all of this and more you will implement.
Happy to stick around for questions.
That is it for today.
And thank you so much for today's parents and families for coming.
[APPLAUSE]