
  • [MUSIC PLAYING]

  • DAVID MALAN: All right.

  • So this is CS50, and this day we continue our look at Python,

  • but also introduce another language.

  • This one is called SQL, or Structured Query Language,

  • and it's going to allow us to solve some different problems as well

  • as some old problems better.

  • But to do this, let's try to motivate some of the problems initially.

  • So this, of course, is just an example of a spreadsheet.

  • Odds are you use these pretty frequently,

  • whether it's in the form of Google Spreadsheets or Excel

  • or Apple Numbers and the like.

  • And this is actually the simplest form of what we might call a database,

  • just some kind of application that stores lots of data.

  • And odds are if you've ever taken registrations for a student group

  • or had people sign up or fill out a form or a survey,

  • odds are you've used Google Forms to get your data into a Google Spreadsheet.

  • And so we thought we would begin there today,

  • a representative problem whereby you might want to collect a lot of data

  • from users and then do something with it,

  • but ideally more powerfully than you could do as a human on your own.

  • So we've put into place a Google Form that quite simply

  • asks a couple of questions today.

  • What are your favorite TV shows?

  • And we ask you for the title of your favorite show and the genre

  • or genres into which that show falls.

  • So if you wouldn't mind, go ahead and open up this URL here

  • on your laptop or phone.

  • If you don't have one, that's fine.

  • You can use the person's next to you, if I may.

  • And take just a moment to go to that URL there and answer those two questions.

  • What is the title of your favorite TV show, and into what genre or genres

  • does it fall?

  • Favorite TV shows.

  • And when you go to that URL, you'll find yourself at, again, this form here,

  • asking for the title and the genres of your favorite show.

  • So by the look of my own screen here we've

  • got some 62 responses, 74 responses coming in already,

  • so we can track this live.

  • So 111 responses so far.

  • That's a lot of favorite TV shows.

  • This number's-- it keeps going up.

  • 169, up to, next, 191.

  • Keep them coming.

  • One per person is fine.

  • It will give us more than enough data.

  • 246.

  • We'll give it another few seconds, but suffice it to say,

  • this is kind of a lot of data.

  • And if the next step in this process were

  • for me to analyze the data because I'm curious to know

  • what's the most popular show on campus or if I wanted

  • to do some other analytical work on this,

  • it'll be useful to have all of this data,

  • not in Google Form's proprietary interface,

  • but in Google Spreadsheet form with rows and columns, a format you're probably

  • pretty familiar with.

  • So we'll let this counter keep going up and up and up,

  • and I'm going to go ahead and click on responses.

  • And if we really wanted to, we could look through--

  • The Office is up there, followed by Game of Thrones, Friends, OK.

  • Perhaps pretty predictable here.

  • Let me go ahead and click this icon, which

  • is going to open the Google Spreadsheet version of this.

  • And you'll see that there's three columns by default.

  • [LAUGHTER]

  • Oh, that's funny.

  • [APPLAUSE]

  • Thank you.

  • Thank you.

  • OK.

  • OK.

  • Well-played.

  • Now-- so you'll see that by default we got three columns.

  • Timestamp, Google Forms just does automatically

  • to timestamp the submission of the form.

  • Title comes from the first question.

  • Genres comes from the second question.

  • And if we kept scrolling and scrolling, odds

  • are we're seeing more and more rows.

  • Let's propose that we have enough of these rows already,

  • and how might we go about analyzing the data?

  • Well, if you're pretty expert with Google Spreadsheets or Excel

  • or Numbers, maybe you could use some functions that

  • are built into these programs or you could do some Command-F or Control-F

  • and search for the favorite show that you're

  • looking for to try to run some numbers.

  • But we have programming abilities nowadays.

  • And if you haven't used this feature already,

  • if you go to Google Spreadsheets' File menu,

  • you can go to Download, and you can download your data

  • into a bunch of formats.

  • And the most useful one I'd propose for now

  • is going to be this one, Comma-Separated Values or CSV.

  • And we've seen this before.

  • In fact, you've generated this before.

  • So if I go ahead and do that, it pretty quickly downloaded a CSV file

  • that, if I really wanted to, I could open in Excel or Apple Numbers or the like.

  • But let's do something instead with that, and go ahead and put it into,

  • for instance, my IDE so that I can actually do something with it.

  • I'm going to grab my downloaded file here.

  • And I'm going to go ahead and drag that into the IDE.

  • And so now you'll see a whole bunch of files here.

  • Let me put-- whoops.

  • Let me put this right in the root of my folder.

  • And we'll see this file here.

  • And if I double-click on it, you'll see exactly the same data,

  • albeit in CSV format.

  • So it looks the same, but we have commas separating the values.

  • And Google Spreadsheets did that for us automatically.

  • So let's now write a program that analyzes this data

  • and see if we can't make sense of what some of the most popular values

  • are in there.

  • Let me go ahead and close this, open a new file,

  • and I'm going to go ahead and call it, for instance, favorites.py,

  • since the goal is to figure out everyone's favorites here.

  • And let me go ahead and use some syntax from last week

  • that you might be familiar with now from P set 6.

  • Let me go ahead and import CSV, the module,

  • so that I have access to that library of functionality.

  • Then let me do this with open.

  • And then I'm going to go ahead and open this file, and I could rename it,

  • but you know, I can just grab the whole file name this way, too.

  • It's a little excessive, but I'll go ahead and paste that in in read mode.

  • And that gives me a file.

  • Let me go ahead and zoom out just a little bit there.

  • And once I have this file, I can throw it into a DictReader,

  • for instance, to make it easier to read the CSV.

  • So reader gets csv.DictReader.

  • Pass in that file.

  • And then after this, I have the ability, now,

  • to iterate over each row like this, much like you probably did for DNA

  • or one or more other problems, as well.

  • Now, just as a sanity check, let me just go ahead and print out

  • each of the row's titles.

  • So I'm not going to see anything new here,

  • but this will be a little sanity check that I'm writing correct code.

  • So run Python of favorites.py, and voila, fwoom, there's all the data.

  • So The Office is both at the beginning and at the end here, it seems.
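
At this point, the whole of favorites.py might look roughly like the following minimal sketch, assuming the exported spreadsheet has been saved as favorites.csv and that its header row labels the second column title (the lecture uses the much longer exported file name directly):

    # favorites.py -- a minimal sketch; assumes the export is saved as favorites.csv
    import csv

    with open("favorites.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            # sanity check: print every submitted title, one per line
            print(row["title"])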

  • All right, so that's all fine and good, a nice stepping stone, but let's start

  • to analyze this data, this time using some code.

  • So let me go ahead and say this.

  • Suppose I wanted to keep track and determine

  • what the most popular show is.

  • So I've got this big CSV file with title, title, title, title.

  • Invariably, there's going to be some duplicates.

  • A bunch of you probably like The Office or some other show,

  • and so it'd be nice to kind of aggregate that data.

  • What might be a useful data structure in Python whereby for each row

  • we can use this data structure to just kind of do

  • the equivalent of like 1, 2, 3, and count each

  • of the occurrences of those titles?

  • Any-- OK, I heard a hash table, but more properly in Python?

  • A dictionary.

  • So a dictionary is kind of the Swiss army

  • knife of data structures for a lot of people in the sense

  • that insofar as it allows you to store keys and values,

  • and those keys can be strings, and those values can be anything you want,

  • including, for instance, in this case, numbers, like a counter,

  • it's a really versatile data structure when you just

  • want to throw some data into a data structure and do something with it.

  • So let me go ahead and give myself a dictionary at the top of the file.

  • I can do this in a couple of ways, and if my goal is to store the counts

  • of each of these titles, I could call it counts,

  • and I could set it equal to dict(), which is just a function that hands me

  • an empty dictionary.

  • Or I recall I can use this syntax, which is

  • equivalent to just shorthand notation.

  • But what I want to do now inside of my loop, every time I

  • iterate through this file, I want to check if my--

  • I've seen this title before, I want to add 1 to my counter.

  • But if I've never seen this title before, what do I want to do instead?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Add the title to the dictionary

  • as a key with the value of what, the first time I see it?

  • Well, the first time I see it.

  • AUDIENCE: 1.

  • DAVID MALAN: So we can just initialize it to 1.

  • So there's two scenarios here.

  • Either we've seen this title before and we want to do plus 1,

  • or we've not seen the title before and we just want to set it equal to 1.

  • So how might we express that logic?

  • Well, let me go ahead and say, first of all,

  • let's get the title into a variable of its own,

  • just to shorten our syntax a little bit.

  • And then I can say something like this, if title in counts--

  • well, what does that mean?

  • That means that inside of my counts dictionary,

  • I can index into it at that location, and I can increment whatever is there

  • by 1, assuming I set it on some previous iteration;

  • else, I go ahead and do counts of title equals 1.

  • So these are the two scenarios.

  • If I've seen this title before, go ahead and add 1

  • to its value in the dictionary.

  • Otherwise, just default to I have seen it one time.

  • So some of the shows are going to stay stuck at one if they're not

  • very popular among the audience, or they're

  • going to keep getting plus 1, plus 1, plus 1.
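
In code, that two-scenario logic might look roughly like this sketch, continuing the same assumptions as before:

    import csv

    counts = {}

    with open("favorites.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            title = row["title"]
            if title in counts:
                counts[title] += 1   # seen this title before: add 1 to its counter
            else:
                counts[title] = 1    # first time seeing it: initialize to 1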

  • So what I often do when writing code like this

  • is, as I'm sort of finding my way, after I'm done with this,

  • I'm not going to write the whole thing.

  • Let me go ahead and just, at the end of this, print counts, right?

  • I'm just curious to see what it is.

  • I'm just going to go ahead and add print counts.

  • Let me go ahead and run now Python of favorites.

  • And voila, you see a huge, messy data structure.

  • But let's go to the very top of it.

  • Notice that it starts with curly braces, indicating it's, indeed, a dictionary.

  • It starts with a quoted value, which is one of the titles in it,

  • and it looks like two of you said Dynasty is your favorite show.

  • 25 of you said The Office is your favorite show.

  • One of you said Blindspot is your favorite show.

  • One of you, 24.

  • Friends, 19, and so forth.

  • So the data is kind of all over the place,

  • because it's based on the order in which you all input it.

  • So let's just clean this up a little bit.

  • And let me go ahead and print this out as follows.

  • Instead of just printing a big, messy dictionary,

  • why don't I go ahead and iterate over the dictionary?

  • And we didn't see this necessarily last week,

  • but you can actually do something like this.

  • If your dictionary's storing titles as keys and counts as values,

  • we can do something fancy like for title comma count in counts,

  • but if you want to iterate over not just the keys but all of the items therein,

  • there's this other function in Python associated

  • with a dictionary that will hand you back two things at once,

  • key and value together.

  • And you can call them anything you want.

  • I'm calling them title comma count, respectively.

  • Now, let me go ahead and print out, for instance, something like the title,

  • and then let me go ahead and just print out the count right next to it.

  • So now, if I save this and run favorites.py, OK, it's prettier,

  • but it's still a little messy in that we have--

  • the numbers are just right there after the name.

  • So let's go ahead and tidy this up.

  • Recall that when you print two things with print

  • separating them with a comma, what character

  • did you get automatically in between them?

  • Yeah, just a single space.

  • We saw that at the very beginning of last week.

  • But we can override that separator, otherwise known as sep,

  • which has a default value of quote-unquote with a space.

  • But suppose I just want to give some visual delineation here,

  • maybe like a vertical bar to kind of sort of make it look like a spreadsheet

  • again.

  • Let me go ahead and run this.

  • And now it's not super, super pretty, but you

  • can see title followed by a vertical bar followed by the number 1.
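
A sketch of that tidier printing loop, continuing from the counts dictionary built above:

    # print each title and its count, separated by a vertical bar
    for title, count in counts.items():
        print(title, count, sep=" | ")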

  • So this is all fine and good, but it'd be nice, for instance,

  • if we could sort these, because I don't really

  • want to sift through this visually, looking for all of the possible values.

  • So you know, the easiest way to sort these things might be like this.

  • Instead of passing in counts of items, there's

  • another function in Python where you can literally just sort something

  • with sorted.

  • You don't have to worry about bubble sort, selection sort, insertion sort,

  • merge sort, anything.

  • Just, you sort it, and Python

  • will use one of the fastest implementations

  • of sorting available to it.

  • Now, if I rerun this, we'll see that--

  • [CHUCKLES] OK.

  • We'll see that we have all of these values

  • here starting with 24, 9009, Adventure Time.

  • But that's not what I meant.

  • I wanted to sort not by title, but by what?

  • Count.

  • So this gets a little less obvious.

  • And it turns out Python does this in a couple of possible ways.

  • But let me go ahead and introduce this one first.

  • It turns out that this sorted function takes an optional argument that

  • happens to be called key.

  • And key simply specifies, how do you want to sort

  • the dictionaries that you're being--

  • or how do you want to sort the key value pairs that are being passed?

  • By default, it literally uses the key, so title.

  • But if you want to use the value, you know what you can do, is this.

  • Let me go ahead, and up here, even though this is a little weird,

  • and let me go ahead and say the following.

  • def f(item).

  • So def f.

  • I'm just giving myself a pretty generic name

  • for a function, because I'm going to get rid of it in a moment.

  • I'm going to go ahead and return item bracket 1.

  • Now, why is this?

  • Take a look at this line here.

  • And even though sorted is new today, as is this counts.items(),

  • all items does is return key comma value, and sorted then sorts by key,

  • by default, handing them back as title comma count, or again,

  • I could call these anything I want.

  • But I'm calling them, more semantically usefully, title comma counts.

  • So that's giving me back two things.

  • Well, just like an array in C or a list in Python,

  • if you're getting two things at once, the first of those things,

  • conventionally, is going to be accessible in bracket zero.

  • The second thing is going to be accessible in bracket 1.

  • So this super, super simple function called f takes, as input, an item.

  • And if that item has two values, a key and a value,

  • return item one is just literally going to return

  • the second thing, whatever it is.

  • In this case, it's going to return, of course, the value.

  • Now, why is this useful?

  • Well, because the sorted function takes an additional argument called key,

  • you can actually pass it the name of a function

  • to use in order to determine how to sort the previous items.

  • So this is an example of a really powerful feature

  • of Python and other higher-level languages, in which case

  • you can pass in, as arguments, the names of functions.

  • I'm not calling f.

  • I'm just passing in its name.

  • If I were calling it, I'd have open paren and a closed paren,

  • but I'm just passing it in by name.

  • And now, this is going to, by default, sort things from small to large.

  • Turns out if you read the documentation, there's

  • another argument to sorted called reverse that will invert it so that now

  • you'll get largest to smallest.
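
Putting those pieces together, the sorted call might look like this sketch, where f simply returns the count (index 1) of each (title, count) pair:

    def f(item):
        # item is one (key, value) pair from counts.items(); item[1] is the count
        return item[1]

    # sort by count, largest first
    for title, count in sorted(counts.items(), key=f, reverse=True):
        print(title, count, sep=" | ")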

  • So now let's try this.

  • Let me go ahead and rerun Python of favorites.py

  • enter, and now, if I scroll up to the top,

  • we should see The Office is indeed the most popular show, apparently

  • consistent with Netflix's data these days, too,

  • followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black

  • Mirror, and so forth.

  • And the least popular show among you, although there looks like there's

  • a lot of ties, Demon Slayer.

  • [CHUCKLES] Or Game of Thrones, but GOT.

  • So interestingly, there are some--

  • [LAUGHTER]

  • There's-- OK.

  • OK.

  • So here, we've gone and sorted these values.

  • Now, let me go ahead and clean this up a little bit.

  • Much like in C where it's kind of silly to define a variable

  • and then use it only in one place, similarly in Python,

  • if you're going to define a function, but then only use it in one place,

  • you don't actually need to.

  • There's a fancy one-line equivalent where you can actually say this.

  • And you won't see this too often, but it's a handy thing just

  • to have seen at least once.

  • You can define a function called a lambda function that takes in as input

  • any argument like this, and then you can literally just return that value.

  • So if I go ahead and get rid of this function f and replace it with this,

  • the syntax is super weird-looking at first glance,

  • but lambda is the term of art in computer

  • science for an anonymous function.

  • I don't care that it's called f.

  • I'm never going to use it again.

  • So lambda just means give me a function.

  • Item means that is the input to this function, the argument, and the return

  • value, or output, is going to be item bracket 1.

  • So it's a little arcane.

  • Odds are you'll have to double check the documentation for using this oneself

  • in the future, but it's the same thing.

  • And indeed, if I rerun the program, I'm going

  • to get back the exact same results, this time with the items still sorted

  • like that.
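
And the equivalent version with the anonymous (lambda) function in place of f:

    for title, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(title, count, sep=" | ")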

  • All right, any questions on the code here, the features of the syntax?

  • Anything at all?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yes, if you're writing a lambda function like this,

  • you are indeed limited to one line.

  • That was OK, because f, even though it was two lines, only

  • had one line of actual content.

  • But yes, it has to be one line, indeed.

  • All right.

  • So the GOT thing opens up some interesting messiness, right?

  • Because GOT presumably denoted Game of Thrones, and yet most of you

  • wrote Game of Thrones, capitalized properly,

  • and yet, it'd be nice if we could have some equivalences.

  • And honestly, I'm guessing there's a bunch of capitalization differences,

  • too.

  • So you know what I might be inclined to do?

  • What might be the easiest way for us to standardize

  • what we're calling these shows' titles that would be a little more

  • resilient against some of you typing with capitals,

  • some of you typing with lower case?

  • Let's ignore the GOT problem, though.

  • Yeah, so make them all lowercase, or heck, make them all upper case.

  • It doesn't matter so long as you're consistent.

  • So you know what?

  • Before I get title from the row, let me go ahead

  • and force it all to lower case, and then let me go ahead and--

  • in my-- go ahead and force it to lower case there.

  • So no matter what the input is, I'm going

  • to go ahead and force it on entry to lower case.
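
That normalization is a one-method change when reading each row, something like:

    title = row["title"].lower()   # .strip() could additionally remove stray spaces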

  • Now, if I do this and scroll back up, looks like our counts went up a little,

  • right?

  • The Office is now at 27.

  • Game of Thrones is now at 21.

  • Friends is at 20.

  • So it looks like some of you, kind of, reasonably if sloppily,

  • didn't capitalize it exactly as the show officially is called.

  • But that should be OK, right?

  • This is all about UX or user experience.

  • This should be one stepping stone toward making your program much, much more

  • resilient against user input.

  • But you know what?

  • What if there were a better way to do this, right?

  • Instead of writing code, this feels pretty tedious.

  • Could we not just use a different tool, different language,

  • and achieve these same goals?

  • Today's answer is indeed going to be yes, with SQL.

  • But first a question.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: No?

  • No question.

  • All right, so then the segue is exactly this.

  • I'm going to go ahead now and close favorites.py.

  • And I'm just going to maximize the size of my terminal window here.

  • And I'm going to go ahead and run a program called SQLite3.

  • SQLite3 is a command line program that you can install on your Mac or PC--

  • it's already on CS50 IDE--

  • that allows you to use another language called SQL interactively.

  • And it's got this really cool feature whereby

  • if I have this file called, very verbosely, CS50 2019 Lecture 7,

  • and so forth, I'm going to copy that after highlighting it.

  • I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV.

  • It's a weird syntax, .mode CSV, but you don't have to use these commands very

  • often.

  • Then I'm going to use one other command here, which is called import.

  • And I'm going to go ahead and import this into a database.

  • But what database, exactly?

  • Well, let me go ahead and more specifically-- let me exit out of this

  • here.

  • Whoops.

  • Let me exit out of this here and run SQLite3 in the name of a database.

  • I'm dealing with favorite shows, so maybe I'll

  • call this favorites.db, DB denoting database.

  • I'm going to now go ahead and do .mode CSV.

  • I'm going to now do .import, quoting the name of the file,

  • and quotes are important, because it's got some ugly spaces in it,

  • and I'm going to import this into a table, kind of like a spreadsheet,

  • that I'm just going to call in all lowercase favorites.

  • And hit Enter.

  • Nothing seems to happen now.

  • But if I go back to my terminal window and type ls,

  • notice that in addition to my CSV file and favorites.py and src7,

  • which I downloaded in advance with today's code,

  • I now have this file called favorites.db.
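
If you'd rather stay in Python than use the sqlite3 prompt's .mode and .import commands, the same import could be approximated with the standard csv and sqlite3 modules; this is only a sketch, and the file name and column names here (favorites.csv, Timestamp, title, genres) are assumptions about the export:

    import csv
    import sqlite3

    db = sqlite3.connect("favorites.db")
    db.execute("CREATE TABLE IF NOT EXISTS favorites (Timestamp TEXT, title TEXT, genres TEXT)")

    with open("favorites.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            db.execute("INSERT INTO favorites (Timestamp, title, genres) VALUES (?, ?, ?)",
                       (row["Timestamp"], row["title"], row["genres"]))

    db.commit()
    db.close()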

  • And if I rerun SQLite, it turns out when I pass in the name of that database,

  • now I can ask the same kinds of questions as before,

  • but I can do it with a different syntax.

  • I can go ahead and select all the titles from my favorites

  • by typing this command, SELECT title FROM favorites; Enter, and voila,

  • we get all of the titles there.

  • Suppose I want to sort those titles as before,

  • or I can SELECT title FROM favorites ORDER BY title; ending with--

  • oh, sorry.

  • Semicolons are back.

  • Then I'm going to go ahead and hit here.

  • And you'll see that now it's alphabetized,

  • and there's all those Offices by first letter.

  • Now, there are some bugs.

  • If you scroll down, you'll see lowercase letters as well,

  • so we'd have to deal with that situation, too.

  • But this seems to be a user-friendly way of just selecting data

  • that I care about.

  • And watch this.

  • Let me try to group things together.

  • I'm going to go ahead and select title, and I'm

  • going to go ahead and say count the number of titles

  • from this table called favorites.

  • But I want to group those things by title.

  • Now, we'll see in a moment what this means, but the effect is what's cool.

  • Voila, I now have output much like my Python program.

  • It's not really as orderly as I would like, but notice

  • what it's done at the very top.

  • The very first line of output, if I keep scrolling is this here.

  • On the left is my title, then a vertical bar.

  • On the right is literally count of title, how many titles match that.

  • So let me go ahead and do this.

  • Let me go ahead and first of all limit this to the top 10 titles,

  • so I can just deal with this and see all this at once.

  • So here we have some alphabetized list of titles.

  • But now, suppose I want to go ahead and rename this thing.

  • Let me go ahead and say that.

  • Let me go ahead and say, as--

  • COUNT(title) AS n.

  • Now notice that the title of this output is title, and then n.

  • But the reason I did that was so that I can do this.

  • I'm going to now do ORDER BY n, descending order,

  • and then limit to the top 10.

  • Enter, and voila.

  • Now, using a completely different language,

  • I think I've gotten the same results, except for the lower casing issue,

  • which we can come back to another time.

  • But The Office is here as the number one hit.

  • Friends then, 19.

  • Game of Thrones, 18, and so forth.
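
Run from Python instead of the sqlite3 prompt, that same query might look like this sketch:

    import sqlite3

    db = sqlite3.connect("favorites.db")
    rows = db.execute(
        "SELECT title, COUNT(title) AS n FROM favorites "
        "GROUP BY title ORDER BY n DESC LIMIT 10"
    )
    for title, n in rows:
        print(title, n, sep=" | ")
    db.close()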

  • So what's then the motivation in part for this new language called SQL?

  • And you've seen just snippets of it here.

  • It allows us to solve the same kind of problems

  • much more easily, because I don't have to sit down and write

  • a whole darn Python program, faster though it is to write than

  • C. I don't have to write custom code just to answer questions about my data.

  • And in fact, if you pursue the web track in a couple

  • of weeks' time or the mobile track, both of those

  • are going to offer support for SQL so that if you want to store data

  • on users or interactive information on an app or a website application,

  • SQL is going to be ultimately where you can put that data and access it later,

  • thereby writing even less code than today.

  • So let's go ahead and take a look at some of the features of these things.

  • What we've really just done is introduce this,

  • something called a relational database.

  • The goal at hand is not to rely on CSV files anymore.

  • CSVs are like the weakest form of databases you can use.

  • Why?

  • Well, every time you want to look in a CSV file for data, you have to open it.

  • You have to read it in.

  • You have to iterate over it, line by line by line,

  • looking for the data you care about, and then you're done.

  • That's going to be super slow when you have a lot of data in a CSV file.

  • And in fact, in a little bit today, we'll

  • download a really big database in text form

  • that we're going to then see 500-plus megabytes later is much better handled

  • by what are called relational databases, pieces

  • of software you can run on your Mac or PC or servers in a company.

  • All they do is store data and provide you with faster access to it.

  • So SQLite3 is just a command line program

  • via which we can interact with that data.

  • It's not terribly compelling.

  • It's, again, a black and white blinking prompt,

  • but we'll see a graphical user interface, or GUI, in a bit,

  • too, that'll allow us to navigate that same data.

  • Here's how we went ahead and imported the data.

  • And .schema is the last of the commands.

  • In fact, let me go ahead and type this at the prompt.

  • Not .mode or .import but .schema.

  • And you'll see what automatically happened for me is something

  • like this, CREATE TABLE favorites.

  • This all happened automatically when I imported my data, but in a bit,

  • we'll be doing this more manually ourselves.

  • So where does this then bring us in terms of features?

  • Well, in the world of storing data, whether it's

  • storing users who've registered for your websites

  • or the number of likes a post on Instagram has gotten

  • or any form of data that you might want to read or write,

  • there are four fundamental operations in the world.

  • CRUD is the sort of crude way of remembering this.

  • These stand for Create, Read, Update, Delete.

  • And I dare say that with these four operations

  • you can do anything you want when it comes to data.

  • Creating that data, reading that data, updating it, or deleting it.

  • Now, in the world of SQL, or Structured Query Language, or S-Q-L,

  • this is just another programming language.

  • It tends to be used only in the context of databases,

  • but you can use it to solve problems not unlike Python, but in the context

  • of the data you care about.

  • So with SQL, these four CRUD operations actually

  • have different keywords mapped to them.

  • They happen to be these, INSERT, SELECT, UPDATE, DELETE.

  • That is to say, unlike Python, which has hundreds of functions available to you,

  • today is kind of neat insofar as SQL has maybe a few dozen functions

  • total that you can use.

  • Much more manageable, and these are the four that we'll use most frequently.

  • So beyond this, how do you go about creating a table?

  • Well, what does that mean?

  • Well, relational database is literally that.

  • It's like a fancier version of Google Spreadsheets or Microsoft Excel

  • or Apple Numbers that allows you to store all of your data

  • in rows and columns.

  • But it gives you these four functions and more

  • via which to select data you care about, or delete or update or insert data

  • that you care about as well.

  • So we'll see in a bit that this is going to be the syntax via which in SQL you

  • can create a new table.

  • In Google Spreadsheets you would literally go to the plus icon

  • and say New Sheet.

  • In SQL, you would express this more programmatically with a line of code

  • like this.
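
As a rough illustration only (this table and its columns are made up, not the lecture's), creating a table from Python might look like:

    import sqlite3

    db = sqlite3.connect("favorites.db")
    # hypothetical table with a few columns of different SQLite-friendly types
    db.execute("CREATE TABLE IF NOT EXISTS shows (id INTEGER, title TEXT, year NUMERIC)")
    db.close()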

  • But you're going to have to make a few decisions along the way,

  • and that's because SQL has its own data types.

  • Now, fortunately, they're pretty reminiscent of C and Python.

  • But there's five main ones here.

  • Let's just walk through a few examples thereof.

  • So first, INTEGER.

  • If you know in advance that you want to store a whole bunch of information

  • in a database like someone's age or the number of likes a post has

  • or anything that lends itself to an integer,

  • SQL supports an integer data type.

  • But in some versions of SQL, you actually have a decision to make.

  • Do you want an integer, a smallint, or a bigint?

  • And these speak to, as in C, the size, the number

  • of bits that are used to store it.

  • An integer is 32 bits.

  • A bigint is 64 bits.

  • A smallint is fewer bits than both of those.

  • Now, we're using, for class, something called SQLite.

  • It's a free and open source, very user friendly version of SQL.

  • But there's others you might have heard about.

  • So if you've ever heard of the company Oracle,

  • their bread and butter is a database program

  • called Oracle that is a SQL database that

  • allows companies to store huge amounts of data

  • and select and create and update and delete data

  • using software they've written.

  • MySQL, Postgres, MariaDB, Microsoft Access,

  • SQL Server, and bunches of others, are all examples of SQL databases.

  • And they'll each support data types, typically, like this.

  • We happen to be showing you ones from Postgres,

  • which is a super popular open source one that you can use on the internet,

  • for instance for final projects.

  • Real numbers, just like in Python or C, are similar in spirit to floats.

  • However, you have a couple of options.

  • One is a real, which uses 32 bits.

  • Another is called double precision.

  • It's actually two words, but it gives you 64 bits as well,

  • to give you more precision numerically.

  • Then you've got this catch-all known as numeric in SQL,

  • whereby if you have a number that's number-like, essentially,

  • but isn't just an integer--

  • this category encompasses things like Boolean values, dates,

  • which have a predefined format like YYYY-MM-DD,

  • where four digits typically represent the year.

  • That would be considered numeric here, as would be a date and a date time,

  • which has something like 00:00:00 for hours, minutes, seconds.

  • And then you have other values as well.

  • What's nice about this numeric type specifically

  • is you can actually solve the floating point problem in both Python

  • and C. Recall that floats or even doubles in C

  • ultimately have finite precision, which is bad in terms of values

  • not quite adding up as you intend.

  • But with numeric data types and databases,

  • you can specify exactly how many digits you want before the decimal point

  • and after, maximally.

  • So if you're dealing with financial information or scientific information,

  • you can be super, super precise thanks to this data type in the database.

  • So finally, a problem that we've solved, versus those other languages.

  • Text.

  • When it comes to storing data in a database,

  • you can specify that your columns can either

  • be char, that is, a fixed number of characters.

  • You literally specify n, the number of characters

  • that every cell in that column will be.

  • You never do this in Excel.

  • You never do this in Google Spreadsheets or Apple Numbers.

  • In a database, though, you can tell the database ever more precisely,

  • I want to store two characters in every cell in some column.

  • Why?

  • Well, maybe like US state codes.

  • If you have MA for Massachusetts or CA for California,

  • it might be nice to tell the database that you can store, minimally

  • and maximally, two characters for every value I'm going to give you.

  • Varchar is a little different.

  • This is a variable number of characters.

  • If your column might have a few characters or a lot of characters based

  • on what some human types in, you can give an upper bound

  • n on how many characters will be stored in every cell.

  • So if you don't know in advance how long a user's email address is going to be

  • or how long my name or your name is going to be in advance,

  • you can specify varchar and then pick some upper bound, some value that, ugh,

  • no one's going to have more than 20 characters in their name or 200

  • characters in their name, whatever line you decide to draw in the sand.

  • As an aside, does anyone want to conjecture

  • what a good upper bound is for the number of characters

  • in a human name that might register for a website you're making?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 25?

  • Pretty good.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 30, I heard.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 32, OK.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 45?

  • Higher.

  • 60?

  • AUDIENCE: 64.

  • DAVID MALAN: 64?

  • This is the kind of thing that Google, I dare say, is good at.

  • So let me go ahead and open up Google real fast.

  • Suppose the goal at hand is to store, in a database

  • table's column, which is the equivalent of a spreadsheet's column, the longest

  • name in the world.

  • All right.

  • Looks like we had 988 characters in a fellow's name, Hubert.

  • And I won't even try to pronounce his last name, but there it is.

  • So if you want to fill--

  • if you want to be able to fit everyone's name--

  • oh, and note, senior.

  • [LAUGHTER]

  • You might need even more characters than that.

  • So this is a rabbit hole we won't really go down today,

  • but suffice it to say that the answers to these questions

  • are often not obvious.

  • And if you think about websites you've visited in the real world,

  • whether on campus or off, odds are, you have tried typing in some input

  • to some web-based form or even some mobile application,

  • where all of a sudden your keyboard stops working, right?

  • They won't tolerate any more characters.

  • Maybe it's the Common Application and the length of your essay.

  • Maybe it's the length that your resume can be in a field.

  • Maybe it's the length of your name when registering for a website.

  • Well, why is that?

  • Those applications are probably using a database, probably using SQL,

  • and they had to decide, in advance, what is the maximum length of an input

  • that we're going to tolerate from a human?

  • And then the computer ultimately enforces that.

  • Why?

  • Well that would be a very clever attack, for some bad actor out there

  • to exploit.

  • If you had no bounds on how big the input could

  • be that a human could type in, they could-- much like I

  • tried the other day with Emma's name--

  • just keep pasting, pasting, pasting a massive input,

  • hoping to overwhelm your computer's memory, maybe make your server crash,

  • and therefore deny service to other people.

  • So there are reasons for these kinds of defenses.

  • Lastly is BLOB.

  • At the very top, Binary Large Object.

  • Don't need it typically that often, but if you

  • want to store binary data, raw zeros and ones that represent files or the like,

  • you can store it in a BLOB format as well.

  • Now, just to give you a whirlwind tour of other syntax

  • before we start using it, you can insert data

  • into a database using syntax quite like this.

  • You can select data from a database using syntax like I already did.

  • You can use functions as well.

  • In fact, in just a bit, when we load more data into the database,

  • I could actually select all of the distinct names from a database--

  • and in fact, I could have done this a moment ago.

  • Rather than SELECT title FROM favorites, as I

  • did earlier-- that's going to give me all of the titles in that database.

  • I could instead do something like SELECT DISTINCT title FROM favorites,

  • and that's going to filter out all of the duplicates

  • without me having to write a single line of code to do that.

  • You can also count, as we did before.

  • You can compute averages, mins, or max, which

  • is really useful for analytical work that you might want

  • to do as a data scientist or the like.

  • And there's a whole bunch of other functions available to you as well.
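
Against the favorites table from earlier, a couple of those might look like this sketch:

    import sqlite3

    db = sqlite3.connect("favorites.db")

    # every unique title, with duplicates filtered out by the database itself
    for (title,) in db.execute("SELECT DISTINCT title FROM favorites"):
        print(title)

    # how many rows the table holds altogether
    (total,) = db.execute("SELECT COUNT(*) FROM favorites").fetchone()
    print(total)

    db.close()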

  • But there's some more powerful features.

  • And we'll begin to use these in just a little bit.

  • We used WHERE before.

  • I used LIMIT before.

  • I used GROUP BY before.

  • And we'll see a bunch of examples that reinforce all of these.

  • But there's keywords like LIKE, whereby if you

  • want to search for everyone's input who mentions The Office, you know what?

  • I bet we could do something like this instead.

  • I know that I can SELECT star for-- oh, let's go ahead

  • and SELECT title FROM favorites WHERE, and this is kind of nonsensical,

  • but title equals "The Office," quote unquote.

  • And I get back everyone who typed in literally The Office.

  • However, if I instead did something like this, "the office" in lower case,

  • I'm going to get one of you who typed in "the office" in lowercase.

  • And the thing here is just the title of this output.

  • But what if I want to do anything like The Office?

  • I could say something like LIKE "%office%", and it turns out in SQL,

  • this new language, percent is a placeholder.

  • It's a wild card that just says, eh, anything can go before,

  • and, eh, anything can go after.

  • So now I'm searching for any titles that have the word The Office.

  • We picked up another entry here.

  • So one of you said Office, without the word "the," also pretty reasonable.

  • Some of the uppercase and lowercase is all over the place.

  • Now we've caught those.

  • And if you want to count the total number of offices now,

  • now I can do something like this.

  • COUNT title-- so I can combine all of these building

  • blocks, not unlike Scratch, and get back the total number

  • of offices, which looks now to be 33 when we tolerate a lot more

  • variability in our users' input.
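
That fuzzier count, expressed from Python with a parameterized query (again, a sketch):

    import sqlite3

    db = sqlite3.connect("favorites.db")
    (n,) = db.execute(
        "SELECT COUNT(title) FROM favorites WHERE title LIKE ?", ("%office%",)
    ).fetchone()
    print(n)   # SQLite's LIKE is case-insensitive for ASCII, so this catches office/Office
    db.close()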

  • Most powerfully of all, we'll see something

  • like JOIN in just a little bit, and that'll

  • be all toward the end of using databases better with better design.

  • But with SELECTs, we, of course, don't just want to select all of the titles.

  • We might select where some condition is true.

  • So conditions, just like in Python and C,

  • exist in SQL, but at the end of our queries, as we've seen.

  • If you want to update something, for instance,

  • you can update data in a database by saying

  • UPDATE table name SET the column name equal to some value,

  • WHERE condition is true.

  • So how might I use this?

  • Well, let's again look at the data where it's like The Office.

  • And you know what?

  • I'm going to go ahead and do this.

  • If I want to clean all of this data up-- so data cleaning is

  • a very common approach, whether you're doing

  • research or analytical work, often when you have messy data,

  • the first pass is honestly grunt work.

  • Like, writing some code, or god forbid, manually copying

  • and pasting in an Excel file, just to clean up your data.

  • You can do this a lot more powerfully in SQL.

  • So for instance, let me go ahead and say UPDATE my favorites SET

  • title equal to "The Office," the canonical form, capital

  • T, capital O, WHERE title LIKE--

  • and now let me do this wildcard search, knowing

  • that that's going to slurp in all 30-plus of those rows.

  • But now, when I hit Enter, nothing seems to happen.

  • But if I then SELECT title again WHERE title LIKE "%office%",

  • now they all look the same, because I've updated my data.

  • And so in fact, now I could go back to The Office,

  • because I've canonicalized my data, and everything now looks the same.
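
The same cleanup as a sketch from Python:

    import sqlite3

    db = sqlite3.connect("favorites.db")
    # canonicalize anything resembling The Office to one spelling
    db.execute("UPDATE favorites SET title = 'The Office' WHERE title LIKE '%office%'")
    db.commit()
    db.close()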

  • So with SQL, you can do that as well.

  • Previously, let's go ahead and SELECT a title FROM favorites.

  • And let's just limit this to 10 of them this time, so they don't all fly by.

  • Suppose-- VEEP is a good show.

  • Maybe I'm not a fan of, say--

  • let's give ourselves 20 here, see what comes up next.

  • OK, I actually don't like Friends, even though everyone in the world seems to.

  • No problem.

  • DELETE FROM favorites--

  • [LAUGHTER]

  • --WHERE title = "Friends";.

  • Now, no more Friends.

  • And so you can delete your data as well.

  • [LAUGHTER]

  • You can delete your data as well by using a query of this form.

  • This one is perhaps the most dangerous of all.

  • Consider what might happen if you're a little sloppy, a little hasty,

  • and omit the WHERE condition.

  • What happens if you go ahead and DELETE FROM favorites;?

  • Everything goes away.

  • And slightly worse than that, though we won't have occasion to use it,

  • there's another command called DROP, where if you just

  • want to delete the table altogether, delete all of your rows and columns,

  • you can say DROP TABLE favorites;, and that will delete all of the data

  • as well.
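
And the deletion examples, sketched the same way; the commented-out lines show how dangerous omitting a WHERE clause (or dropping the table) would be:

    import sqlite3

    db = sqlite3.connect("favorites.db")
    db.execute("DELETE FROM favorites WHERE title = 'Friends'")   # delete only matching rows
    # db.execute("DELETE FROM favorites")    # no WHERE clause: deletes every row
    # db.execute("DROP TABLE favorites")     # removes the table itself
    db.commit()
    db.close()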

  • And we'll come back to that, because the mere fact that you can scarily

  • delete lots of data at once makes you super

  • vulnerable to bad actors on the internet or in the office who

  • want to potentially wreak havoc on your data.

  • So we'll come back today, talking about matters of security, as well.

  • So any questions before, now, we take things up a notch

  • and look at a data set much more massive than our hundreds of favorites here?

  • Is that a question there?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Did not look like Friends got removed.

  • Oh, interesting.

  • You are right.

  • Let's try a little harder.

  • LIKE "%friends%".

  • Better?

  • Now they're gone.

  • So I'm guessing-- I would have to look closer at the data,

  • but I'm guessing one or more people typed it

  • with a slightly different capitalization,

  • or maybe they hit a spacebar character or something, where

  • I wasn't doing a precise match, thereby making

  • the LIKE operator a more robust fix.

  • However, let's consider for just a moment.

  • What might the gotcha here be?

  • Like, this feels like a nice quick and dirty fix,

  • but it really is quick and dirty in what sense?

  • AUDIENCE: Like, if you have another TV show

  • that has the word friend in it, but not Friends, [INAUDIBLE]..

  • DAVID MALAN: Yeah, there might be another show

  • with the word "friends" in it.

  • Quite reasonable.

  • And I just blew those away as well from my database.

  • And in fact, we can probably see this for real.

  • If I go ahead and look at, for instance--

  • let me go into a browser here, IMDB.com, super popular website for just getting

  • movie reviews or information about movies.

  • Let me go ahead and search for friends.

  • And yes, the most popular hit seems to be the 1994 version of Friends,

  • but there's Friends with Benefits.

  • There's Just Friends, We Are Your Friends, My Best Friend's Wedding,

  • Fisherman's Friends, and so forth.

  • Some of those movies, some of those TV shows.

  • And had we poked a little further among our own data of hundreds of rows, maybe

  • some of you did have some of those shows as your favorites as well.

  • So beware that, too.

  • So in preparing for today, we wanted to play

  • with an even juicier data set then one we could generate for ourselves.

  • And wonderfully, IMDB makes a lot of their database

  • available for download as TSV files, Tab-Separated Values.

  • It's really not fundamentally different from Comma-Separated Values.

  • One human at IMDB just decided to use tabs to separate

  • their values instead of commas.

  • And if you read through this page here, you'll

  • see that they let you download a bunch of different files, one of which

  • we'll focus on initially, specifically title.basics.tsv.gz.

  • So that's a mouthful, but this is just saying

  • this is basic information about titles in their database.

  • Gz means it's G-zipped, which is like Zip.

  • If you're familiar, on a Mac or PC, it's just

  • another way of compressing information, but you can unzip that on a computer

  • as well.

  • Dot TSV means Tab-Separated Values.

  • Now, we read in advance the information here,

  • but let me summarize it as follows.

  • There's some juicy information in that file.

  • And I'm going to go ahead and download this file over here.

  • If I click a link on IMDB's website, I'll

  • see a file called title.basics.tsv.gz.

  • I'm going to go ahead and let that download in the background.

  • It looks like it is 100-plus megabytes.

  • If I click on it on my Mac, it's going to decompress it.

  • And you'll see here, this apparently is 532 megabytes once it's decompressed.

  • So that's way bigger than the CSV file I downloaded from Google

  • a few minutes ago, which is 24 kilobytes, 24,000 bytes.

  • This is 532 million bytes.

  • So it's orders of magnitude bigger, which means our design decisions today

  • are really going to matter.

  • Now, if we take a look inside of this file,

  • we'll see a bunch of interesting fields.

  • And I've essentially summarized this from their own online documentation.

  • There's these five fields that are of interest right now,

  • and we'll tease these apart now.

  • So tconst, I didn't recognize that.

  • It's an IMDB-specific term, it seems, but it means some kind of title

  • constant.

  • It's a unique identifier for every movie in the database.

  • So for instance, one of the movies in that TSV file,

  • which I'll open up in a moment, has the unique value tt4786824.

  • Why?

  • Just because.

  • But it turns out we can actually see what this is.

  • Let me actually grab that value and go to IMDB.com/title/tt4786824.

  • Enter.

  • These actually do have significance.

  • That is the unique identifier for The Crown,

  • popular hit TV show that you can now watch on Netflix.

  • So if you go back to the actual field--

  • that's how I spent the summer.

  • So if you go back to the fields here, we have not only tconst,

  • which uniquely identifies the film, we also have title type.

  • So it turns out in this 500 megabyte file,

  • there's huge amounts of information on movies, TV series, documentaries,

  • shorts, way more data than we could possibly play with in one day.

  • So we'll focus just on TV series, like this, the show The Crown.

  • Primary title is the name in the file that is

  • given to the primary title of the show.

  • Sometimes shows apparently have different titles

  • in different geographies or communities, so the primary title

  • is the one that most people know it by.

  • Start year is the year for a TV show that-- in which that TV show began.

  • The Crown came out first in 2016.

  • And genres, a little weirdly, is a comma-separated list

  • of genres into which that show falls.

  • And I say it's a little weird, because we do have a TSV file,

  • but they're using commas inside.

  • So this is probably among the motivations for having used tabs in--

  • to separate all of the columns, because they

  • want to use commas for some other purpose here.

  • But it's a little messy.

  • This is actually not necessarily the best design decision here.

  • So what can we go ahead and do with all of this data here?

  • So let's go ahead and start to grab some of this data

  • and write some scripts involving it.

  • But let me issue a disclaimer.

  • It turns out when you have a lot of data in a database,

  • it can take quite a long time to analyze it or parse it.

  • And so if you ever watch certain baking shows like Julia Child

  • from down the road here in Cambridge, very often there

  • would be some TV magic whereby someone mixes the cake mix

  • and puts it into the oven, and then two minutes later, voila, it comes out.

  • That's because some of the scripts we'll write today

  • might actually take many minutes to run, but we'll

  • go ahead and run them in the backgrounds,

  • and I'll use the outputs that I've created here in advance.

  • So how do we go about beginning to navigate this data set?

  • Well, first of all, let me go ahead and just open this file.

  • And I'm going to use my own terminal window here.

  • This file was called title.basics.tsv.

  • Notice that even in my program here, it took a moment to open,

  • but this is what it looks like.

  • It's a little cryptic at first glance, but notice the first row

  • is the headers of the file.

  • So we've got tconst and title type and primary title and so forth.

  • Then every row thereafter seems to start with one of those unique identifiers,

  • and they seem to be monotonically increasing, 1, 2, 3, 4,

  • but they start with tt then some number of digits.

  • Then you see that these are a bunch of shorts, so short films or the like.

  • And if we kept scrolling, we'd see some other values.

  • Here's the first movie in the database.

  • And if we keep scrolling and scrolling and scrolling, we'll see more.

  • So if actually I use my text editor here and search for literally the word

  • The Crown, it's going to take a little while

  • to get to, because there's a bunch of shows apparently called

  • The Crown or Crowning or something.

  • But you know what'll be faster?

  • Let me go ahead and search for--

  • what was it?

  • If I go back to tt4786824, Enter, it takes a moment

  • to find this because it's searching through hundreds of megabytes,

  • but voila, down here is this tconst, TV series type.

  • It's called The Crown.

  • In another geography it's called also The Crown, 2016, and so forth.

  • So that's a huge amount of data.

  • In fact, how much data?

  • Well, if I actually analyze this with a command on my computer,

  • there are 6,267,469 shows in the world, be it TV

  • or movies or documentaries or the like.

  • That's a huge data set.

  • So suppose I want to focus really just on a subset of that data.

  • What can I go about using instead?

  • Well, let me go ahead and open up, for instance, the same file,

  • but let me try putting it in CS50 IDE.

  • I'm going to go ahead and download this, as you might have downloaded

  • past problem sets, with wget.

  • That's going to give me a pretty big file.

  • There's a command now that you've not had occasion

  • to use in the IDE called gunzip, which will unzip a file that starts with--

  • that ends with .gz.

  • That's going to take a moment to unzip the whole file.

  • A moment, a moment, a moment, OK.

  • [CHUCKLES] And no space left on disk.

  • But that's OK, because it turns out what I'm also

  • going to start doing today is using my own Mac a bit more.

  • And you'll recall last week, I did that a little bit,

  • because when I wanted to play with speech recognition in my microphone,

  • it's a lot easier to have Python running on my Mac or your PC

  • if I want to use hardware built into my computer.

  • Plus my MacBook is a lot faster, for instance, than a single cloud

  • account on a shared server.

  • So I'll go ahead and write some of this code here on my own machine instead.

  • So let me go ahead and open a file called import.py.

  • And I'm going to go ahead and import the CSV module, as always,

  • and I'm going to go ahead and open this file, which is called title.basics.tsv

  • in read-only mode, and I'm going to call this variable titles.

  • What am I next going to do?

  • Let me give myself a reader using csv.DictReader,

  • as before, reading in those titles.

  • But csv.DictReader feels like the wrong reader.

  • Why?

  • Just to be clear.

  • Yeah, I mean, maybe I want TSV.

  • But it turns out TSV doesn't exist.

  • So even though I might be inclined, for instance, to change this

  • to a T and this to a T, the CSV module does enough for me,

  • but I need to tell it that I want to use a different delimiter.

  • Instead of the default, which looks like this,

  • I can actually override that and say, you know what?

  • Use a tab.

  • And just like backslash n in C and in Python, it's a new line.

  • Backslash t in both languages is a tab character.

  • All right, so once I've done this, let me go ahead and open up, now--

  • let me go ahead and open up shows0.csv.

  • My goal in life now is to make this file a lot more manageable for myself.

  • I want to take a 500-megabyte file and extract only the TV shows therein.

  • Moreover, how about only the TV shows from 1970 onward?

  • We won't go even further back than that.

  • So let me go ahead and open up, in write mode, a file called shows0.csv.

  • And I'm just going to call that variable shows.

  • Then I'm going to go ahead and give myself a writer.

  • And you might recall using this in the past.

  • This is simply going to give me a variable called writer, via which

  • I can write to a new file.

  • Because again, the goal is to read this file and write to this file

  • a subset of the data therein.

  • So let me go ahead and write one row first.

  • Write a row, passing in a list of values, specifically tconst,

  • which is the title--

  • which is the ID field; primaryTitle, which

  • is the title field; startYear, which is the year field; and genres,

  • which was one more field that I mentioned earlier.

  • So my goal is to export only those four columns that I care about for now.

  • So I'm going to go ahead and do this.

  • For each row in my reader, I'm going to go ahead and say if that

  • row's titleType == tvSeries, which, recall,

  • The Crown was an example of, then I'm going to go ahead and write

  • to the writer--

  • whoops-- writer.writeRow, a list containing what?

  • The row's tconst value, the row's primary title value, the row--

  • whoops-- the row's start year, and lastly, the row's--

  • [CHUCKLES] the row's genres.

  • So what am I doing here, just to be clear?

  • So what I have done is I've first written out-- once and only

  • once-- literally these values, because I want

  • headers in the first line of my file.

  • After that, for each row in the reader, which is currently

  • iterating over this file in read-only mode,

  • I want to print out the current row's tconst, current row's primary title,

  • the current row's start year, and the current row's genres.

  • But notice, I'm ignoring movies and shorts and documentaries and bunches

  • of other values as well.

  • And you know what?

  • Just for good measure, let's shrink this a little bit.

  • And how about the row's isAdult == "0", for today's purposes?

  • So that'll filter the list further.

  • If you'd like to flip that later, that's fine.
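
  • As a rough sketch, assuming the filenames and columns described above, this first version of import.py might look something like the following.

        import csv

        # Open IMDb's TSV for reading and a smaller CSV for writing
        with open("title.basics.tsv", "r") as titles, open("shows0.csv", "w") as shows:

            # Tell DictReader that the delimiter is a tab, not the default comma
            reader = csv.DictReader(titles, delimiter="\t")
            writer = csv.writer(shows)

            # Write the header row once and only once
            writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])

            # Keep only non-adult TV series, and only the four columns we care about
            for row in reader:
                if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
                    writer.writerow([row["tconst"], row["primaryTitle"],
                                     row["startYear"], row["genres"]])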

  • So let me go ahead and do Python 3, for version 3 on my Mac-- actually,

  • no, let's do Python of import.py, all right?

  • I'm going to cross my fingers here.

  • Hopefully the file is working, working, working.

  • But it's taking a decent amount of time.

  • Like, this is how much time it takes, apparently, to iterate over

  • millions of rows of data.

  • Still running.

  • Still running.

  • But the goal, again, is to shrink the amount

  • of data I have to ultimately care about so that we can actually

  • search it much more effectively.

  • So, OK, it actually finished.

  • So let me go ahead and open up shows0.csv.

  • Notice now, in my text editor, I've got a lot less data.

  • I've thrown away everything I don't care about,

  • but I've been left with tconst, primary title, start year, and genres,

  • and everything herein is now consistent with that filtration.

  • But I haven't filtered out everything I said I would.

  • What did I say I wanted to get rid of earlier?

  • Yeah, the shows before 1970.

  • And clearly some of these are coming from the 1940s and so forth.

  • So let's go ahead and get rid of those, but see how.

  • This is the CSV file that we just opened, but in Google Spreadsheet form.

  • So I literally just imported it into Google Spreadsheets

  • so that we could see it.

  • Literally the same data as before, and there are those 1940s movies.

  • But there's something curious that I wanted to be mindful of.

  • If I scroll down in the start years and keep going and going and going,

  • huh, those seem OK.

  • Those are increasing in order.

  • But let me try this.

  • Let me just poke around my data, sorting in reverse order.

  • It's going to take a while, because even this is a decent number of TV shows.

  • Notice this weirdness.

  • At the top of start year now, once I've reversed sorted them,

  • there's a whole bunch of backslash capital N's.

  • Now, this has nothing to do with C, and nothing to do with Python.

  • It has everything to do with the documentation.

  • If you read IMDB's documentation, as I only did carefully eventually, you'll see this.

  • A backslash followed by a capital N is used to denote that a particular field is

  • missing or null for that title or name.

  • Now, this is important, because if I want

  • to filter out shows from before 1970,

  • I need to be resilient against that.

  • So let me go ahead and do this.

  • So if the current row's startYear does not equal backslash n,

  • then I'm going to go ahead and check that it's a TV series

  • and that it's not an adult show.

  • So that would help.

  • But furthermore, how can I check this here?

  • Let me give myself a year variable.

  • And how can I convert the row's startYear to an integer?

  • Well, everything in a spreadsheet, or a TSV file or a CSV file,

  • by definition of having been stored that way, is text.

  • But start year looks like years, so what Python

  • function can I use to actually convert text that

  • resembles a number to an actual number?

  • Yeah.

  • So we can do something like this.

  • So I can convert year to an int, and now I

  • can say, if year greater than or equal to 1970,

  • now I'm going to go ahead and do those lines instead.

  • Now, there's an opportunity, surely, for better design,

  • because once your code starts doing this,

  • you've done something suboptimally, right?

  • This is not going to end well if all of my code starts wrapping.

  • So I could clean up the logic in a little bit, but let's go ahead

  • and run this just one more time.

  • This time, changing this to shows1.csv so we can see slightly different

  • outputs.

  • Let me go ahead and run Python import.py.

  • Huh, syntax error, Unicode error, codec--

  • that's a weird one.

  • But this is because backslashes, recall, in C and Python, have special meaning.

  • So when you do something like this, backslash capital n,

  • even though it's not a lowercase n, the backslash, recall,

  • is the escape character in C and Python.

  • So this is like telling Python, this is a special character.

  • But it's not.

  • And we've never really had occasion to do this,

  • but how do you think we could output

  • a literal backslash before a capital N?

  • Yeah.

  • So it turns out the solution to this problem, usually,

  • no matter the language, is that if you want a literal character, not

  • an escape character, you literally put another one of it before.

  • So even though this looks a little funky now,

  • this backslash backslash capital N literally will mean backslash N.

  • All right.

  • So now, let me go ahead and run this on import.py.

  • This time, I'm hopefully going to actually generate

  • a new file called shows1.csv that has even less data that

  • actually is going to contain my shows, but only a subset of them.

  • And let's go ahead and pull the cake out of the oven this way.

  • This is what I get now this time.

  • So if I actually load the CSV, shows1.csv, into Google Spreadsheet,

  • just because it's easier to look at than the black-and-white terminal window,

  • now you can see that I apparently am only getting shows 1970 and onward.

  • And indeed, if I sorted them, I would see no backslash N's.

  • I would have thrown away everything that doesn't meet those criteria.

  • Well, let me go ahead and do one last thing here.

  • I'm going to go ahead and make one more change.

  • And first, let's improve the design here.

  • This indentation is the result of my asking questions again and again

  • and again and indenting if and only if those things are true.

  • But notice, you can start to flip your logic here, right?

  • Instead of saying if the start year does not equal backslash N,

  • what if I just do this and say continue?

  • I can then unindent this--

  • because if you've not used it before in Python and in C,

  • if you say continue inside of a loop, it's not going to continue down there,

  • it's going to [WHOOSH] continue to the start of the loop again.

  • So via this logic, we can actually keep wrapping around again and again.

  • And here, too, we could say, if year less than 1970, I can go ahead

  • and say continue, which would then allow me to unindent this as well.

  • So there are solutions, design-wise, to actually avoiding

  • that infinite indentation.
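
  • As a sketch, the refactored loop, with continue and the escaped backslash, might look like this:

        import csv

        with open("title.basics.tsv", "r") as titles, open("shows1.csv", "w") as shows:
            reader = csv.DictReader(titles, delimiter="\t")
            writer = csv.writer(shows)
            writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])

            for row in reader:

                # Skip anything that isn't a non-adult TV series
                if row["titleType"] != "tvSeries" or row["isAdult"] != "0":
                    continue

                # "\N" (a literal backslash plus capital N) is IMDb's marker for missing data
                if row["startYear"] == "\\N":
                    continue

                # Convert the year from text to an int and keep only 1970 onward
                year = int(row["startYear"])
                if year < 1970:
                    continue

                writer.writerow([row["tconst"], row["primaryTitle"], year, row["genres"]])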

  • All right.

  • Let's go ahead and do one last version.

  • Then I'm going to go ahead and pull one out of the oven--

  • premade.

  • So the last thing I didn't load before was this.

  • Suppose that I want to load into the CSV file, all of the genres

  • associated with the show.

  • It looks like all of these shows have one or more genres, just like you

  • were asked for your favorite shows.

  • And so now, we have a CSV file with tconst, primary titles, start year,

  • and genres, where genres is itself a comma-separated list.

  • But there's a fundamental problem here.

  • Even though I have all of this data here, the best I can do

  • is a program like this.

  • Let me go ahead and search for, for instance, the following.

  • Let me go ahead and grab a file real fast.

  • Let me go ahead and grab a copy of shows2.csv

  • and write one final program here.

  • If I want to go ahead now and search this very large, still, data set,

  • in shows2.csv, well, let me go ahead and do this.

  • Import CSV.

  • Let me go ahead now and ask the user for a title.

  • I could use CS50's get string, but there's really no need for that

  • anymore now that we have the input function, recall.

  • So I'll just use that.

  • And then I'm going to go ahead and open up shows2.csv in read-only mode.

  • And I'm going to call that my file.

  • Then I'm going to go ahead and give myself a reader from csv.DictReader,

  • passing in that file.

  • And now I'm going to go ahead and, for row in reader, do the following.

  • The goal now is to write a Python program that allows

  • me to search only those TV shows.

  • So I could say something like this, if title == row "primaryTitle",

  • then I can go ahead, for instance, and print out row "primaryTitle,"

  • started in row "startYear."

  • So what is the goal of this program?

  • It's going to ask the user for input.

  • It's going to open this big CSV that I've created.

  • But that's still smaller than the 500-megabyte version.

  • It's going to iterate over every row in that file via DictReader,

  • checking if the title the human typed in equals the current row's primary title.

  • And if so, it's going to print the title and year of that show.
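
  • A minimal sketch of search.py, then (the prompt text is just illustrative):

        import csv

        # Ask the human which title to search for
        title = input("Title: ")

        with open("shows2.csv", "r") as file:
            reader = csv.DictReader(file)

            # Linear search: compare every row's title against the user's input
            for row in reader:
                if title == row["primaryTitle"]:
                    print(row["primaryTitle"], "started in", row["startYear"])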

  • So if I go ahead and run Python of search.py,

  • typing in something like The Crown, Enter, voila, I get that answer.

  • If I go ahead and do The Office, Enter, there's a bunch of Offices.

  • And in fact, if you haven't seen more than the American one,

  • there's the UK one, and apparently several others.

  • And we can actually corroborate this now.

  • If you go to IMDB, where all of this data originally came from,

  • and type The Office, there are all of The Offices that actually line up

  • with our very own data set.

  • It's going to be so damn tedious to write a Python program every time

  • you want to search for data or update data

  • or insert new data,

  • so we need a few more features of SQL.

  • But I think first, we need some Halloween candy

  • for our five-minute break out back.

  • All right, we are back.

  • So we have a whole bunch more data now, because we've downloaded

  • that really big TSV file from IMDB.

  • I've simplified it into some CSV files, but that really only gets me half of the way

  • there, because now if I want to search the data, as with search.py,

  • I still have to look over thousands of rows of TV shows,

  • and it's only going to be linear search by nature of just how files are read,

  • top to bottom, left to right.

  • But it turns out in Python that you can actually write SQL code itself.

  • And CS50 has a Python library that has not only get string and get int

  • and so forth, but it also has a SQL function built in that allows you

  • to connect to, so to speak, a file that ends in something like .db,

  • which is to say, in a moment, we'll start to write some Python code now

  • toward an end of loading a really large data set like IMDB's into a proper SQL

  • database, thereby allowing us hereafter to use all of the power

  • and expressiveness of SQL, and more examples of that in just a bit like

  • SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch

  • of Python code.

  • So to be clear, instead of using Python to search and manipulate our data,

  • we're going to write a script, a program in Python, whose sole purpose in life

  • is to get data from one format into another.

  • And you can imagine this being generally useful, whether it's

  • a Google Spreadsheet you've downloaded or a large data

  • set you found on the internet that you want to use for a final project

  • or for some other class.

  • Python can be a really powerful way of taking one data source as input

  • and producing, as of today, SQL as its output instead.

  • So let's go ahead and iterate one final time over our title.basics.tsv.

  • But this time, not just save it into a CSV file.

  • Let's put it into a proper SQL database on my own Mac or your PC.

  • So let me go ahead and do this.

  • First, let me go ahead and say, just like on the slide, db = cs50.SQL,

  • and then quote unquote, "sqlite:///--" so the third slash is not a typo.

  • It should indeed be there.

  • And I'm going to say shows3.db, just because this

  • is version 3 now of my import script.

  • I'm going to go ahead, just as last week, and now import csv--

  • and CS50's library as well.

  • But for this to work, the file shows3.db needs to exist first.

  • And there's a couple of ways on a Mac or a PC or a Linux computer, typically,

  • to create an empty file that's ready to receive data.

  • You can literally use the command touch, which

  • just will create an empty file by whatever name you type at the prompt.

  • Or we can do this programmatically.

  • And I'm going to do it programmatically, because I

  • bet I'm going to screw up one or more times here,

  • and it's going to be useful to let my Python program create and recreate

  • the database again and again and again until I get it right.

  • So let me go ahead and open a file called shows3.db in write mode.

  • And recall, from Python and from C's fopen,

  • anytime you open a file in write mode, it will overwrite any file that exists

  • or create any file that doesn't.

  • That's all I needed to do.

  • So in Python, too, recall that we were able to use this dot notation.

  • And it turns out here, when you open a file,

  • if you want to immediately close it, because your only goal was to create

  • it, you can just do .close on the very thing you just opened.

  • That is equivalent, just to be clear, to doing

  • something a little more pedantic like this, file = open, and then file.close.

  • But we can collapse this into one slick one-liner, so to speak,

  • by just doing instead what I did a moment ago.

  • All that does is create an empty shows3.db file.
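
  • That slick one-liner, for reference, is simply:

        # Create (or truncate) an empty shows3.db, then close it right away
        open("shows3.db", "w").close()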

  • Now, open that file for SQLite.

  • And again, SQLite is the light version of SQL

  • that anyone can use on their own Mac or PC.

  • You don't need a special server to get up and running with it.

  • So now let me go ahead and open up title.basics.tsv.

  • And then here, let me go ahead and create myself a DictReader so

  • that I can iterate over the lines in that TSV file.

  • And now, let me go ahead, and for row in reader, do the following.

  • I first want to filter out stuff just as before.

  • So I'm going to say if row bracket "titleType" == "tvSeries"

  • and row "isAdult" == quote unquote "0", then I'm going to go ahead and check

  • one other thing.

  • I'm going to go ahead and give myself a start year variable, similar to before,

  • although I called it year earlier.

  • Then let me go ahead and do row "startYear",

  • just so I can cast that to an int.

  • But I only want to do that if row "startYear" does not

  • equal that special backslash N that IMDB told me to watch out for.

  • So I only want to do that if it's not that.

  • And then if startYear is greater than or equal to 1970,

  • let's go ahead and do the following.

  • Let's go ahead and do genres, gets row "genres".

  • Let's go ahead and get tconst, gets row "tconst,"

  • just so I can put these in some slightly shorter variable names,

  • just to keep myself sane.

  • primaryTitle is going to be from row "primaryTitle."

  • and then let me go ahead and give myself--

  • we already have startYear, so those are the only other three fields I need.

  • So now, I want to go ahead and insert this row from my TSV

  • into a SQLite database.

  • And the operative word that we saw earlier but haven't used yet

  • is INSERT.

  • We did use SELECT.

  • We did use UPDATE.

  • We did use DELETE.

  • We haven't used INSERT yet.

  • So I'm going to do that in a moment.

  • But first, I need my database to actually exist,

  • so I need to create an actual table.

  • So I'm going to go up here first and do this: db is a reference now,

  • a variable representing my database, and I'm

  • going to call the only function inside of it that's

  • useful for our purposes, called execute.

  • What I can now do is execute any SQL I want.

  • So what do I want to load into this database?

  • I think I want to load in the tconst, the primaryTitle, the startYear,

  • and the genres, just like we had earlier from title.basics.tsv.

  • I want to load rows that represent this kind of data, all right?

  • So how am I going to do this?

  • Well, let me go ahead and create a table.

  • I'm going to call it shows, because that seems nice and conceptually consistent.

  • I'm going to go ahead and create a list of columns now.

  • tconst is going to be one column.

  • primaryTitle is going to be another.

  • startYear is going to be another.

  • And genres is going to be the last.

  • I can literally, that is to say, write SQL inside of a string

  • that I pass to a Python function called db.execute.

  • And because db.execute-- or rather, because db

  • was configured with shows3.db, when I execute this string in Python,

  • it's going to get executed on that database file, shows3.db.

  • So it's a nice way of bridging these two worlds.

  • So I'm going to have to be a little more specific, though.

  • Recall that SQL has a bunch of types.

  • And I'm going to keep it simple.

  • I'm going to go ahead and say that the type of that tconst value is text.

  • The type of the primaryTitle is text.

  • The type of startYear is going to be numeric, kind of a catch-all

  • for dates and datetimes.

  • And then genres is going to be text as well.

  • So the syntax is a little funky.

  • You actually specify the name of the column and then the type,

  • as opposed to the opposite, which we did in C. But that's the way SQL is.

  • So I'm going to go ahead and save that now.

  • And just to comment this, this is going to create

  • a table called shows in database file called shows3.db,

  • just to be super explicit.

  • So what am I going to do down here?

  • It looks like I have the ability with CS50's library

  • to execute any SQL I want.

  • So let me go ahead and insert into shows the following values,

  • a tconst, a primaryTitle, a startYear, and a genre--

  • and genres.

  • What values?

  • I want to insert these values.

  • Now, I don't know in advance, so I'm literally

  • going to put some question marks here.

  • And it turns out in SQL, this is valid syntax for the library we're using.

  • This is this INSERT query.

  • INSERT INTO the table name a parenthesized list

  • of the columns you want to insert data into.

  • Then, a set of values in separate parentheses.

  • And for now, I'm using question marks for placeholders,

  • for reasons we'll come back to.

  • But I'm going to go ahead and plug the following values

  • into those placeholders, tconst, primaryTitle, startYear, and genres.

  • And what the db.execute function is going

  • to do for me automatically is it's going to look at this SQL query.

  • Notice that, oh, it's got four question marks in it, or placeholders.

  • Those, in SQL, are like the %s in C or like the curly braces

  • in Python f-strings.

  • So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately,

  • the following four values, tconst, which is just a variable;

  • primaryTitle, which is the same; startYear, which is the same;

  • and genres, which is the same.

  • So what am I going to do?

  • For every TV series in this file that's not an adult series

  • and that started in 1970 or later, insert it into my database.
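
  • Putting it all together, here is a sketch of this third version of import.py, using CS50's library as described (the staff's own version may differ in its details):

        import csv
        from cs50 import SQL

        # Create (or recreate) an empty database file, then connect to it
        open("shows3.db", "w").close()
        db = SQL("sqlite:///shows3.db")

        # One table, with just the four columns we care about
        db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)")

        with open("title.basics.tsv", "r") as titles:
            reader = csv.DictReader(titles, delimiter="\t")
            for row in reader:
                if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
                    if row["startYear"] != "\\N":
                        startYear = int(row["startYear"])
                        if startYear >= 1970:
                            db.execute("INSERT INTO shows (tconst, primaryTitle, startYear, genres) VALUES(?, ?, ?, ?)",
                                       row["tconst"], row["primaryTitle"], startYear, row["genres"])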

  • If I've made no typos, I'm going to go ahead and run this, cross my fingers,

  • and enter.

  • This one is going to take more time, because it turns out writing

  • to a CSV file is actually pretty quick.

  • You can just write row, write row, write row, write row,

  • but inserting into a SQLite database is going to take more time.

  • More time upfront, but it's going to be a lot faster to search thereafter.

  • So let me go ahead and do the cake in the oven thing

  • and go ahead and now open up a file I made

  • in advance in today's src3 directory called shows3.db

  • using SQLite3, that command line program we used earlier.

  • Recall that I can say .schema to see the types of data in the database.

  • And indeed, look what I've done in advance.

  • I created a table called shows with exactly those columns in a--

  • with exactly these four columns, tconst, primaryTitle, startYear, and genres.

  • But I did this in advance to save us time so that I can now do SELECT

  • * FROM shows.

  • And let me not get all of them.

  • Let me do the first 10, semicolon.

  • All right.

  • So we see the first 10 shows from IMDB in whatever order

  • IMDB distributes them.

  • You can see their ID numbers are incrementing.

  • And All My Children.

  • I remember growing up with that show years ago.

  • And it seems that that's a drama--

  • it's not really mystery, but so be it, and a romance show there.

  • But it's indeed 1970, as is every show thereafter, 1970 or onward.

  • I can go ahead and search for more like this.

  • Let me give myself the first 100 shows.

  • All right, so this is a pretty large data set.

  • And let me go ahead and count them all.

  • Recall that you can use a COUNT function so that we don't

  • have to print them all on my screen.

  • It looks like there are 153,331 TV series in IMDB.

  • Crazier than that, you want to know how many of them came out this year?

  • WHERE startYear = 2019;.

  • There were 6,099 new TV shows in the world this year according to IMDB.
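
  • Those counting queries, at the sqlite3 prompt, were along these lines:

        -- How many TV series are in the table?
        SELECT COUNT(*) FROM shows;

        -- And how many of them came out in 2019?
        SELECT COUNT(*) FROM shows WHERE startYear = 2019;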

  • Just one of those is The Office that won our--

  • won our vote earlier.

  • So SELECT *, star denoting wild card, in this case, everything.

  • And unfortunately, SQL has two wild cards.

  • Star means select all of the columns in the table.

  • Percent means let any characters come before or after a quoted

  • string in a WHERE clause.

  • So let me go ahead and SELECT * FROM shows WHERE title = The Office,

  • and we'll actually see-- whoops.

  • primaryTitle, sorry.

  • Let me fix that. primaryTitle = The Office.

  • There are all of those Offices.

  • And indeed, 2005 is probably the one we know and love.

  • If I go to a browser and go to IMDB slash title slash that ID,

  • indeed, that's probably the one we're all thinking of,

  • unless you voted for the UK version instead.

  • So again, this is actually real, live data that we're now playing with.

  • Well, what more can we do?

  • Well, there's one thing that I don't really

  • like about this, which is that when we select all of the columns--

  • and let's go ahead and do this.

  • Let's select another hundred of them before.

  • This feels a little messy that we have all of these nice, clean columns except

  • for when we get to genres.

  • Then we just have this arbitrary comma-separated list.

  • Suppose, for instance, I want to search for all

  • of the comedies that came out in 2019.

  • I could say SELECT * FROM shows where genres = "Comedy" AND--

  • turns out you can use conjunctions like this--

  • startYear = 2019.

  • So that gives me a whole bunch.

  • Let's count them.

  • So COUNT this here, Enter.

  • OK, a thousand of those 6,000 shows are comedies.

  • But I think that's an underestimate.

  • Why is this query buggy at the moment?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, some of them had more than one genre,

  • so comedy is somewhere in that comma-separated list.

  • And so what I should probably do instead is not say genre = "Comedy,"

  • but maybe genres LIKE "Comedy," and allow something maybe to appear before,

  • something maybe to appear after.

  • And that's going to give me 1,593 comedies that

  • came out this year in 2019.

  • So that seems a little better.
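
  • Side by side, the exact-match query and the LIKE query look something like this:

        -- Undercounts: matches only shows whose entire genres field is exactly "Comedy"
        SELECT COUNT(*) FROM shows WHERE genres = 'Comedy' AND startYear = 2019;

        -- Matches "Comedy" anywhere in the comma-separated list, thanks to the % wildcards
        SELECT COUNT(*) FROM shows WHERE genres LIKE '%Comedy%' AND startYear = 2019;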

  • But this is not very robust, right?

  • Once you start resorting to techniques like this,

  • it should start, as a programmer, to rub you the wrong way.

  • It's kind of a hack, right?

  • Like, you're searching for comedy, but there could be something before it

  • or something after it, and odds are, there

  • is no other word I can think of in the world of genres

  • that starts with or ends with comedy, so we're probably OK.

  • But this is kind of hack-ish, that you're just kind of searching.

  • It would be nice if we could just search for a specific column called genre.

  • So how can we go about doing that?

  • Well, let me go ahead and do the following instead.

  • Let me go ahead and open up a final version of my import script,

  • this one that does two things up top.

  • At the top, I'm going to create two tables, one called shows, which has--

  • I'm going to clean up the column names, too.

  • IMDB is a little nonconventional.

  • What most people would do when describing a unique identifier

  • is call it ID, not tconst.

  • So we're going to rename it to ID.

  • They're not going to call their title primaryTitle.

  • They're going to call it title, so we're going to rename it title.

  • They're not going to name it startYear.

  • We're going to call it year.

  • And then that's it.

  • We'll come back to primary key in just a moment.

  • But notice this.

  • In my new and final version of this script, I'm creating,

  • I propose, a second table called genres whose purpose in life

  • is to contain a value called show_id and another one called genre.
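
  • A sketch of what those two CREATE TABLE statements might look like (the exact constraints in the staff's script may differ):

        CREATE TABLE shows (
            id INTEGER,
            title TEXT NOT NULL,
            year NUMERIC,
            PRIMARY KEY(id)
        );

        CREATE TABLE genres (
            show_id INTEGER,
            genre TEXT NOT NULL,
            FOREIGN KEY(show_id) REFERENCES shows(id)
        );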

  • So what's going on?

  • Well, let me go ahead and show this.

  • If I load the resulting database from this one, shows4.db, and I do .schema,

  • you'll see that I indeed have two tables.

  • Let me go ahead and SELECT * FROM shows WHERE title, this time,

  • because I've renamed it from primaryTitle, = The Office.

  • OK.

  • That's a lot of Offices.

  • But let's go, AND year = 2005, which is the one we're all thinking about.

  • And it's that one.

  • And now, notice this.

  • Notice that I'm getting back what?

  • An ID, I'm getting back a title, and I'm getting back a year, but no genres.

  • That's because there's another table now called genres that's separate.

  • And you know, I'm kind of curious.

  • I see that the genres table has a show_id.

  • Let me go ahead and do this.

  • SELECT * FROM genres WHERE show_id =, and let

  • me do a little copy paste here, = this show_id.

  • And what might I see?

  • Comedy.

  • So what have we done now?

  • For any TV show that was in IMDB's database that

  • was a comma-separated list of genres, I've exploded it, so to speak.

  • I've split that value on the commas.

  • And if the show is a comedy, I've added a row in this genres table,

  • but then I've jotted down the show's ID next to that genre

  • so I remember that that show was of that genre.
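
  • Inside the import script, that splitting might look something like this, where show_id and genres are variables from the surrounding loop:

        # Explode the comma-separated genres string into one row per genre
        for genre in genres.split(","):
            db.execute("INSERT INTO genres (show_id, genre) VALUES(?, ?)", show_id, genre)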

  • But if another show has multiple genres-- for instance,

  • let's go ahead and search for not The Office but, say, The Crown.

  • And there's only one of those.

  • And now I do SELECT * FROM genres WHERE show_id = this number,

  • we'll see that, oh, The Crown now has drama and history as a genre.

  • And so therefore, in the genres table, notice

  • that there's two IDs and two genres.

  • But now we can use equality.

  • I can now search for all of the comedies from 2019 in kind of a powerful way,

  • SELECT * FROM shows WHERE id IN--

  • and here's the cool part--

  • SELECT show_id FROM genres WHERE genre = "Comedy" AND year = 2019.

  • So this is admittedly a mouthful, but let's consider what's happening.

  • First, I'm selecting star from shows.

  • That means give me all the shows in all of the columns.

  • But filter as follows, only show me those shows where the ID of the show

  • is in the following list of IDs.

  • Now, you can look at a nested query inside the parentheses here.

  • This list here selects all of the show IDs from the genres table

  • where genre = "Comedy".

  • So the highlighted parenthetical right now returns essentially a list

  • of all of the ID numbers of shows that are associated with comedy,

  • even if they're associated with other things, too.

  • And we're making sure that the year equals 2019.

  • So if I now hit Enter, we'll see a whole bunch of results, but we should see,

  • if I count these by using my usual syntax,

  • there were 1,593 shows that are comedies in 2019.

  • That does happen to equal the same count we did earlier by using like,

  • but this is better designed in the sense that there's no ambiguity.

  • You're not just hackishly looking for a substring,

  • so to speak, in a comma-separated list.

  • You can actually now search more robustly

  • by having redesigned your data.
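
  • Written out in full, that nested query is:

        SELECT * FROM shows
        WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy')
        AND year = 2019;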

  • And what we've done really is something like this.

  • Instead of storing our data in just one table called shows,

  • and every show has an ID, a title, a year, and genres,

  • we've instead exploded those genres into two separate tables, such

  • that now our shows table looks like this.

  • We have an ID, a title, and a year.

  • And notice that the ID of a show can also appear over here

  • in another table called genres, and just by convention, to keep us sane,

  • instead of calling it ID here, we've called

  • it show_id to connote that it came from a table called shows, plural,

  • but it's a single ID from a show.

  • So by convention, humans often call them table name, minus the S, underscore ID.

  • And then the keyword here or genre is comedy or drama or documentary

  • or some other genre as well.

  • And now, this is a little subtle, but the fact

  • that this little symbol here, drawn from the database world,

  • fans out into three separate places, this

  • is a one-to-many relationship, so to speak.

  • You can have one show over here mapping to many genres over here,

  • or maybe zero, but it's zero or more possible genres.

  • All right.

  • Any questions just yet?

  • All right.

  • So the real power then, to be clear, is coming, now,

  • from this kind of expressiveness.

  • So now, let's play around with some other queries and features.

  • But first, let's give ourselves a bit more data.

  • It turns out besides the file called title.basics.tsv,

  • IMDB.com makes a bunch of others available to us as well.

  • There is one called name.basics.tsv.

  • And this is one that has information on all

  • of the names of actors and actresses and directors and writers

  • and other people in the world.

  • So for instance, there is an nconst, which is like a name constant

  • or an ID, nm2946516, which happens to belong

  • to the actor whose primary name is Claire Foy, the star of The Crown.

  • She was born in 1984, and there are some other fields in that file as well.

  • But also juicy is this file, title.principals.tsv,

  • and this is where it gets interesting, too.

  • In this file, notice there are no actual titles.

  • There's no primary titles.

  • There's no actual human names.

  • Instead, there's just two unique identifiers, a tconst and an nconst,

  • which IMDB speak for a title identifier and a name identifier.

  • So for instance, in one of the rows in this TSV file called

  • title.principals.tsv, there is a row that starts with tt4786824;

  • also has nm2946516, and has the word actress,

  • thereby implying that if you look up the nm--

  • the nconst in the names file and you look up the tconst in the titles file,

  • you will be able to, by transitivity, infer that Claire Foy is in The Crown.

  • This allows us to have a many-to-many relationship.

  • One movie or show can have many actors,

  • and one actor can be in many shows, so we're using a sort of join file here,

  • a join table, that's going to somehow allow

  • us to link two different data sets together, and more on that in a moment.

  • But what's really fun in IMDB is that it also has a bunch of ratings

  • that humans have typed in saying, I get this 10 out of 10, a 0 out of 10,

  • and so forth.

  • And they keep track of the number of votes that shows have gotten.

  • And so in title.ratings.tsv, yet another file you can download from IMDB,

  • you can look up a given tconst, the unique identifier for a title,

  • what its average rating is and the number of votes.

  • And in fact, if I pull this up, for instance, on The Crown, if I go back

  • to IMDB itself--

  • IMDB, search for The Crown, the 2016 version,

  • you'll see that indeed, it is an 8.7, which lines up with exactly what

  • we have here.

  • But over time, that number is going to go up or down, because IMDB is updating

  • their data set every day as well.

  • So besides this data, we also then have the ability

  • to consider what this all looks like collectively.

  • So in this case here, here is another diagram.

  • It's more complicated now, but it just captures the intuition

  • that you would derive by just reading through IMDB's documentation, which

  • defines the following.

  • It turns out if you read closely among those files,

  • you'll see that, oh, we can glean a whole bunch of shows that are going

  • to have IDs, title, year, and episodes.

  • I want to go ahead and associate those shows with a whole bunch of stars.

  • But people are the entities in the world that have IDs, names, and birth years.

  • So now things get a little weird.

  • Let's focus only on these two tables here.

  • This is a diagram of the goal at hand, and this

  • is a script I wrote in advance.

  • And in a moment, we'll open up a SQL database that represents this.

  • There's going to be a table called shows, every row of which

  • has an ID, title, year, and some number of episodes,

  • so you can see how long-running it is.

  • There's also going to be a table called people.

  • Claire Foy is going to be among them.

  • She and the other humans will have an ID, name, and birth year associated

  • with them as well.

  • But there's going to be this other table here called stars.

  • And you'll notice there's a line that links shows to stars to people,

  • again by transitivity.

  • If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID,

  • a.k.a. nconst, that links, for instance, Claire Foy to The Crown.

  • It's going to link Steve Carell to The Office.

  • It's going to link every other actor to their show as well.

  • Similarly for writers, we won't play too much with the writers today,

  • but writers are people, too, just as stars are people.

  • And so here's another feature or design goal of SQL.

  • You want to ideally factor out the commonalities, so

  • that Claire Foy appears only in one place by name,

  • but her unique identifier might appear in bunches of places.

  • There's a lot of actors in the world who are also writers themselves.

  • They are, at the end of the day, people, but they

  • might appear both in the stars table and in the writers table

  • by way of their person ID.

  • So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire

  • Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place.

  • You want to have one authoritative place for all

  • of your people, one authoritative place for all of your shows,

  • and then you have these other tables called

  • join tables, which are similar in spirit to the TSV files

  • you can download that somehow link these identifiers together.

  • So if I, on my computer here, open a file that came prepared in advance,

  • that you can also play with online as well, called, finally, shows.db.

  • And I do .schema on this one.

  • This is the largest of the databases, and this is the one that we, the staff,

  • actually generated from all of the TSV files online,

  • but we threw away a lot of information.

  • We threw away anything before 1970, and we filtered out

  • everything except TV series, so that we can actually

  • play with the data in this data set.

  • But honestly, using a black-and-white prompt in a terminal window

  • tends not to be fun, especially when the data flies across the screen.

  • So there are also things called GUIs, Graphical User Interfaces,

  • and indeed, there is a program that's freely available for Macs,

  • PCs, and other types of operating systems, called DB Browser.

  • And indeed, we'll point you at this online.

  • This is just a program that allows you to explore SQL files on your own Mac

  • or PC much more visibly--

  • or much more visually and much more pleasantly than maybe just

  • a command line interface allows.

  • So let me go ahead and open up, for instance, shows.db.

  • And we'll see a whole bunch of things.

  • First of all, the graphical user interface

  • here shows me the same information, just in slightly prettier format.

  • Shows.db, per my diagram a moment ago, has six tables, people,

  • like I mentioned; shows, like I mentioned;

  • also, stars and writers, ratings, and then that separate table called genres.

  • And if you look over here on the right, you

  • can actually see the SQL code we wrote to create those tables.

  • But cooler than that is that notice these tabs up top here?

  • I am currently on Database Structure, and if you're using Windows,

  • your interface will look a little different,

  • but the options are still there.

  • I'm going to go ahead and click on Browse Data.

  • And now you'll see a little dropdown of all of the tables.

  • So if I want to go ahead and look at a whole bunch of shows,

  • I can actually see all of my data here, not unlike Google Spreadsheets.

  • But notice, this is 153,331 movies--

  • or shows, rather, that I can see altogether here.

  • And lastly, what's cool is that if I go over to the SQL tab,

  • I can now execute some sample queries.

  • So let me go ahead and do this.

  • SELECT * FROM shows;.

  • This is going to give me a whole lot of shows.

  • Enter.

  • So I hit the little Play button that just executes that query,

  • and you see the resulting rows that have come back.

  • So again, how did we get to this point?

  • We, the staff, downloaded all of those TSV files in advance.

  • We wrote a Python script that imported all of the data from those files

  • into memory, threw away the stuff we didn't care about,

  • and then inserted it into tables like this table called shows.

  • And what's nice about this tool is, because if you are an aspiring data

  • scientist or you're just trying to get some analytical work done

  • for this class or any other, or any other project,

  • a graphical tool lends itself to just kind of poking around.

  • So for instance, you'll see that the shows table no longer has any genres.

  • But that's OK.

  • We can reconstitute that data.

  • I can go ahead and SELECT * FROM shows--

  • but you know what I can do?

  • I can actually join in the genres table.

  • So I can take the shows table here, the genres table here,

  • and essentially link them together by way of the ID.

  • How is that possible?

  • Well, if you look at shows, a show has an ID title, year, and episodes.

  • Genres has a show ID.

  • So if you think of ID as representing the fingers on my left hand here,

  • and show_id in genres as representing my right hand here,

  • what we want to do is line up ID with show_id

  • to make one larger, wider table that constitutes all of that data together.

  • So how do I do this?

  • Well, in SQL, you can join two tables.

  • You say what table you want to join with what other table and

  • how you want to do it.

  • Well, I want to go ahead and join it on shows.id = genres.show_id;.

  • And now, when I hit execute, it took a moment--

  • indeed, it took 408 milliseconds.

  • But my god, that's a lot faster than writing a whole Python

  • script to do this.

  • Now I have a table with all of the shows as before.

  • But notice the table got wider.

  • This is a temporary table that SQL has returned to me.

  • This one now has genre and show ID.

  • So in fact, let me go ahead and filter this.

  • WHERE title = "The Office" AND year = 2005, Play, we'll just get this.
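
  • That query, written out in full:

        SELECT * FROM shows
        JOIN genres ON shows.id = genres.show_id
        WHERE title = 'The Office' AND year = 2005;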

  • So notice I have a wider table containing all of the columns

  • from both of those joined tables.

  • But if I change this now to The Crown, and that was 2016, and hit play,

  • notice I get back seemingly redundant information.

  • But this might be useful nonetheless, because now I

  • can iterate over all of the rows knowing that every row has not only a title

  • and a year, but also a genre as well.

  • So I can reconstitute the table like that.

  • Well, how can I ask other questions as well,

  • like what are actors that are in other shows as well?

  • Well, let me go ahead and do something like this.

  • For instance, let's select all of Steve Carell's movies.

  • Or-- let's select Steve Carell himself first.

  • So SELECT * FROM--

  • let's see-- no.

  • Let's do this.

  • How about Ellen?

  • So SELECT * from people where name = Ellen DeGeneres.

  • I spelled that right.

  • Semicolon, Play.

  • All right.

  • So this is handy, because I now know that Ellen's birth year is 1958,

  • but her ID is 1122.

  • That's the same thing as that nconst, but we threw away the nm

  • and we got rid of all the zeros and made it into a proper number for efficiency;

  • it's better to uniquely identify humans, typically-- or anything--

  • by numbers, which fit in 32 or 64 bits, rather than by longer strings.

  • So now that I know Ellen's ID is 1122, what can I do?

  • Well, let me go ahead and SELECT * FROM stars WHERE person_id = this.

  • This will tell me what?

  • This will tell me all of the information about shows that Ellen starred in,

  • including, presumably, her own.

  • OK, so I now see person_id is the same, the same, the same,

  • but Ellen is apparently in all of those shows.

  • But that's not that helpful, and also, this

  • is kind of lame that I've just hardcoded Ellen's ID.

  • But I don't have to do that.

  • I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres",

  • closed parenthesis, Play.

  • Now it's dynamic, so now I've not hardcoded anything.

  • But this isn't that useful.

  • Let me go ahead and just select show_id here.

  • So I now have SELECT show_id FROM stars WHERE the person_id

  • = whatever Ellen's ID is here.

  • How can I take this one step further?

  • Well, what if I do SELECT * FROM shows WHERE the ID of the show

  • is in the following list of values?

  • So not only does SQL support equals, when

  • you want to compare one value against the next, much like in Python,

  • you have the keyword IN, where you can say select everything from shows where

  • the ID of the show is in the following list of shows,

  • which happens to represent the list of show IDs that Ellen is in.
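
  • Built up step by step, the full query is:

        SELECT * FROM shows
        WHERE id IN (
            SELECT show_id FROM stars
            WHERE person_id = (SELECT id FROM people WHERE name = 'Ellen DeGeneres')
        );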

  • Phew.

  • Let's hit Play on this.

  • It took a moment, but it looks like these are all of the shows

  • that Ellen has been in, according to IMDB.

  • And it looks like The Ellen DeGeneres Show is one.

  • She's been on the air for 2,865 episodes.

  • There is Ellen, her original TV show, which was on for a few seasons back

  • then.

  • Looks like she's doing some game shows these days as of 2017, and so forth.

  • So using these step-by-step thought processes

  • can we actually build up more interesting queries

  • to get back information like that.

  • All right, any questions before we try a few others out as well?

  • No?

  • All right.

  • Well, let me show one other approach to this same problem.

  • It turns out those are what are known as nested queries.

  • You keep using the parenthesization, much like in math,

  • where you can nest arithmetic expressions in parentheses.

  • You can do the same in SQL.

  • But you can also join information in slightly different ways as well.

  • I can actually do something like this.

  • Let me go ahead and SELECT Title FROM the following tables,

  • people JOIN stars ON people.id = stars.person_id--

  • and we'll walk through this in just a second--

  • JOIN-- sorry--

  • JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres."

  • This, while more of a mouthful, is equivalent to what I've just done.

  • Notice that I've selected the title of the show

  • from the following: the result of joining people with stars.

  • How do you join the people table and the stars table?

  • Well, people have IDs.

  • Stars have person IDs, according to the diagram.

  • What else do you want to join together?

  • Let's join all of that with shows.

  • How?

  • Well, let's go ahead and say stars.show_id = shows.id.

  • So it's a way of linking, transitively, multiple tables together,

  • and then filter all of that with the name of Ellen DeGeneres.

  • This is what we would call an explicit join.
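
  • Here is that explicit JOIN written out:

        SELECT title FROM people
        JOIN stars ON people.id = stars.person_id
        JOIN shows ON stars.show_id = shows.id
        WHERE name = 'Ellen DeGeneres';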

  • And if I click Play there, notice it's taking a moment.

  • Took, in fact, 1,990 milliseconds, almost two human seconds.

  • So it got slower, but it does give me another syntax

  • via which to select data.

  • And honestly, what's cool about this is I can select other fields as well.

  • They don't have to come from the original query in my nested selects.

  • But let me go ahead and do this, just as a teaser, and then

  • we'll consider some problems before wrapping up.

  • So it turns out 2 millisecond--

  • 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds.

  • If I do it again, it took 1,963 milliseconds,

  • and I'm looking at the time here.

  • That's slow, right?

  • That means if you have a finite number of servers in the world

  • and each of them can only handle some number of users at a time,

  • it's kind of a waste of hardware and of money, arguably, to spend more seconds

  • or milliseconds servicing one query from a given user on your app or your web

  • application.

  • But it turns out we can do something kind of smart here.

  • Notice that in our database structure, we've done a few things here.

  • Let me open up, for instance, people.

  • Rather, let me go over to people, and you'll see a few keywords

  • that I ignored earlier.

  • When you're defining a table, you can specify that one of your columns

  • is what's called a primary key.

  • That is the column by which you're going to uniquely identify all your data,

  • so it's like those numeric IDs we've given every person and show.

  • A foreign key is what we call the same number

  • when it appears in some other table.

  • So when we saw a person_id or show_id, those are what are called foreign keys,

  • because it's the same numbers, but in another table

  • that we're using to join things together, eventually.

  • And you can also define columns to be unique,

  • or you can index them for efficiency.

  • And that's, in fact, where we're going right now.

  • If you look at the same query, you'll notice that I'm

  • searching on a bunch of columns.

  • I'm clearly searching on the name field.

  • I'm also searching on-- that is, joining on-- show_id,

  • and I'm searching on person_id, not to mention the individual ID fields.

  • But the cool thing about a primary key-- and we defined, in advance, all of our ID columns,

  • all of the columns called ID, to be primary keys-- is that you get a feature for free with SQL.

  • SQL builds up what's called an index, a very fancy data

  • structure, a tree-like structure, that actually allows you to search

  • for information quite efficiently.

  • So when you define a column to be a primary key,

  • you get what's called an index.

  • Specifically, if we go back a few weeks to our discussion of trees,

  • it looks like this.

  • This is what's called the B-tree, which is not a binary tree.

  • It's a B-tree in the sense that it's got a lot of nodes

  • that might have one or two or more children each,

  • but it's very short, with a lot of nodes-- a lot of width.

  • So that means when you store your data in a tree-like structure,

  • long story short, it just tends to be very efficiently searchable.

  • So when you define a column to be a primary key,

  • you get that speed for free.

  • But we can also tell our SQL database, you know what?

  • I plan on searching on person IDs and show IDs and names, also,

  • so let's go ahead and create myself some indexes.

  • Let me go ahead and execute the following queries just once.

  • Let me create something called an index called person_index on the stars table

  • using its person_id column.

  • So pretty quick syntax.

  • This means create a B-tree, create a fancy tree structure,

  • called person index, on the stars table by person--

  • on the person_id column.

  • That is, make a tree like this in memory to store all of the person IDs.

  • Why?

  • So I can find them faster than linear search in the column itself.

  • Let me go ahead and execute this.

  • And you'll see it takes a moment.

  • It's thinking.

  • It took me 1.6 seconds.

  • So a little slow, but I only have to do this once.

  • Let me create two more.

  • CREATE INDEX called show_index ON stars ON the show_id column.

  • So almost the same, but this just means give me

  • a tree that looks like this in memory, so

  • that when I query for data like Ellen DeGeneres,

  • it searches a tree-like structure instead of a list-like structure

  • in a column alone.

  • Let me execute that.

  • And I screwed up ON stars show_ID.

  • There we go.

  • That took 1 second.

  • And lastly, let's create an index called name index on people on the name column

  • so that I can search for people by name more efficiently as well.
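
  • Those three statements, for reference:

        CREATE INDEX person_index ON stars (person_id);
        CREATE INDEX show_index ON stars (show_id);
        CREATE INDEX name_index ON people (name);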

  • Otherwise, without these indexes on a column like name,

  • it is going to check every damn cell in the column looking

  • for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N,

  • running time, or linear search.

  • So now, let me go back to my query here.

  • I've not made any changes to the query.

  • The last time I ran this, it took almost 2 seconds.

  • Now, after creating these indexes and telling my SQL database

  • I plan to search on those columns, watch this,

  • and watch the blue highlighted number.

  • 8 milliseconds.

  • 8 millisecond.

  • 7 milliseconds.

  • 7, 7, 7.

  • Looks like 4 milliseconds that time.

  • So this is what the Googles of the world, the Facebooks of the world,

  • the Microsofts, who have very large data,

  • they not only store their data in databases

  • like we are here, but they also index their tables intelligently,

  • drawing on ideas from weeks ago, so that the database, for them, for free,

  • and sort of magically, creates these kinds of structures in memory,

  • and does this so that you can search and insert and update

  • your data all the more efficiently.

  • So there's got to be some problems or some prices paid.

  • So let's consider just a couple of those.

  • So what problems actually arise when using a SQL database like this?

  • Well, the first is what's called a race condition.

  • And perhaps the best way to explain this is by way of a--

  • sort of a story that I was told when I took a class like this years

  • ago in a class called CS161, Operating Systems.

  • So contrived scenario, but consider this.

  • You and your roommates have a fridge in your dorm room,

  • and you and your roommate really like milk.

  • And one of you comes home one day, opens the fridge, and sees, oh, dammit,

  • we're out of milk.

  • And so you close the fridge and you walk into Harvard Square

  • and you head to CVS or some other such store.

  • Meanwhile, your other roommate comes home, opens the fridge,

  • really wants some milk, but, oh, darn it, we're out of milk.

  • So they close the fridge, walk outside, and head

  • to some other store that sells milk, and for reasons that we're contriving,

  • don't bump into you at CVS.

  • A few minutes later, you both get home, of course, having bought some milk.

  • But of course, milk doesn't last that long,

  • so now you've got twice as much milk, and surely some of it's

  • going to go bad.

  • Like horrible, horrible problem.

  • Why, right?

  • It's contrived, but this-- fond memories of when it was taught to me.

  • So what is the fundamental problem here, if you put your finger on it?

  • Both you and your roommate were allowed to inspect

  • the state of that refrigerator without knowing that the other was

  • about to do the same.

  • Better would have been for one of you to leave a note on the fridge saying, gone

  • for milk, so that the other person doesn't do it as well.

  • Or more stringently, just lock the refrigerator

  • so that no one else besides you can see inside of it until you have

  • updated the state of that refrigerator.

  • So this comes into play all too often these days

  • when dealing with really large data sets.

  • The Twitters of the world, the Instagrams of the world, all of you

  • are probably in the habit of liking or uploading posts

  • on those sites or some other.

  • But if a lot of other people are, too, especially when things go viral,

  • there might be code like this underneath the hood that

  • essentially does the following.

  • Here's three lines of representative Python code that use SQL.

  • The first line here selects the number of likes from a table

  • called posts where the ID of the post is something,

  • a variable called ID, 1, 2, 3, 4, 5.

  • That just tells me, what is the total number of likes this post has?

  • Then, I store that answer in a variable called likes.

  • And you'll see this syntax in problem set 7

  • as you begin to play with SQL yourself in the CS50 library.

  • Suppose that your third line of code is to then update posts,

  • setting likes equal to some placeholder, where that placeholder

  • is going to be likes plus 1.
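
  • As a sketch, assuming CS50's db.execute and a posts table with id and likes columns, those lines are along these lines:

        # Read the post's current number of likes (like checking the fridge)
        rows = db.execute("SELECT likes FROM posts WHERE id = ?", id)
        likes = rows[0]["likes"]

        # Write back likes + 1 (like buying the milk)-- but another server running
        # this same code may have read the old value in between, so an increment can be lost
        db.execute("UPDATE posts SET likes = ? WHERE id = ?", likes + 1, id)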

  • This is the same problem as the milk example,

  • because if you check how many likes there

  • are on this post-- just like checking how much milk is left in the fridge

  • and then going off to make a decision like buying milk--

  • someone else, on a server that's running

  • the same code to handle thousands and thousands of people,

  • might ask that same question.

  • How many likes does this current post have?

  • Suppose the post has a million likes.

  • Both of you execute that first line of code,

  • see that, oh, I have a million likes.

  • I've just clicked plus 1 on the site.

  • And so you try changing the number of likes with this second update query

  • to 1 million plus 1.

  • But if both of you have been making a decision that

  • gets interleaved with the other person, as will absolutely

  • happen with Instagram and Twitter, who have thousands

  • of servers all operating in parallel, instead of changing the number of likes

  • from 1,000,000 to 1,000,002, both of you might change it to 1,000,001,

  • thereby wasting one of the counts, because it gets lost,

  • because you inspected the state of the fridge or the likes.

  • You made a decision.

  • And then you updated it based on that decision

  • without realizing that life changed in between.

  • Your roommate got home and checked the fridge,

  • too, or someone else clicked the upvote.

  • So this is a bad situation, but there are solutions in SQL

  • that we won't look at this week, known as transactions.

  • That is a solvable problem, and more on that in a higher-level database class.

  • And one final example of this, because this one you

  • can avoid throughout the coming weeks in CS50 and beyond.

  • A SQL injection attack is when you write bad code that

  • somehow allows some bad actor on the internet or an application

  • to trick your code into running SQL code that you did not intend.

  • For instance, this is how Yale students log on in New Haven.

  • This, of course, is how Harvard students log on here in Cambridge.

  • You're asked for a username and password, or a Harvard key

  • and a password.

  • Suppose, though, you as a user don't cooperate,

  • and instead of typing in your email address and password,

  • suppose you type something like this, like your email address,

  • Malan@Harvard.edu, then maybe a single quote, and then a dash dash.

  • Well, it turns out in SQL, dash dash is how

  • you start a comment, similar to hash in Python or slash slash in C.

  • But suppose that Harvard had implemented its website such

  • that there's a query like this going on underneath the hood,

  • if the Harvard key is implemented in Python, that says this, SELECT * FROM

  • users WHERE username = placeholder AND password = placeholder,

  • passing in username and password.

  • That is good.

  • That is correct code, because those place holders are important.

  • What the CS50 library does, and other libraries in the world like it, is

  • it escapes user input.

  • If a user tries to trick your code into inputting comments or single quotes,

  • db execute and other libraries will sanitize the user's input

  • and prevent the following from happening.

  • If, however, you do this-- not good, but bad.

  • Suppose that you really practice what we preached last week using f-strings--

  • this is now bad in the context of SQL.

  • If you create a format string like this with a little f

  • that is literally SELECT * from users WHERE username =,

  • and then use the fancy Python notation for curly braces,

  • and password = curly braces, this is a correct SQL construction

  • so long as the human cooperates and puts their username here and password here.

  • But what if the user is malicious, like me,

  • and actually includes a single quote in their input, and then dash dash?

  • So Malan@Harvard.edu quote dash dash would

  • have the effect of injecting that input into your SQL query.

  • But dash dash, I claimed, starts a comment, so it's

  • a way of tricking the computer into executing not SELECT * WHERE user--

  • SELECT * FROM users where username = this AND password = that,

  • it has the effect of just saying SELECT * FROM users where

  • username equals Malan@Harvard.edu.

  • And if-- and you'll see this in p set 7--

  • db.execute returns one or more rows from the table--

  • that's how you can actually select data using Python code--

  • and the number of rows equals 1, because it's

  • selected Malan@Harvard.edu, you could log in

  • as me because you very simply tricked the computer into executing code

  • that you injected.

  • And the biggest takeaway here is that this is a 100% solvable problem.

  • You just need to know that it is solved and not

  • write code like this using f-strings in Python, but instead

  • to use library code like db.execute, or after this course, anything like it.
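
  • To contrast the two patterns as a sketch (again assuming CS50's db.execute and a users table):

        # Vulnerable: the user's input is formatted directly into the SQL string,
        # so input like "malan@harvard.edu'--" comments out the password check
        rows = db.execute(f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'")

        # Safe: placeholders let the library escape (sanitize) the user's input
        rows = db.execute("SELECT * FROM users WHERE username = ? AND password = ?", username, password)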

  • With that said, we'll end with a joke.

  • OK.

  • That's it for CS50.

  • We'll see you next time.

[MUSIC PLAYING]
