
  • [MUSIC PLAYING]

  • DAVID MALAN: All right.

  • So this is CS50, and this day we continue our look at Python,

  • but also introduce another language.

  • This one is called SQL, or Structured Query Language,

  • and it's going to allow us to solve some different problems as well

  • as some old problems better.

  • But to do this, let's try to motivate some of the problems initially.

  • So this, of course, is just an example of a spreadsheet.

  • Odds are you use these pretty frequently,

  • whether it's in the form of Google Spreadsheets or Excel

  • or Apple Numbers and the like.

  • And this is actually the simplest form of what we might call a database,

  • just some kind of application that stores lots of data.

  • And odds are if you've ever taken registrations for a student group

  • or had people sign up or fill out a form or a survey,

  • odds are you've used Google Forms to get your data into a Google Spreadsheet.

  • And so we thought we would begin there today,

  • a representative problem whereby you might want to collect a lot of data

  • from users and then do something with it,

  • but ideally more powerfully than you could do as a human on your own.

  • So we've put into place a Google Form that quite simply

  • asks a couple of questions today.

  • What are your favorite TV shows?

  • And we ask you for the title of your favorite show and the genre

  • or genres into which that show falls.

  • So if you wouldn't mind, go ahead and open up this URL here

  • on your laptop or phone.

  • If you don't have one, that's fine.

  • You can use the person's next to you, if I may.

  • And take just a moment to go to that URL there and answer those two questions.

  • What is the title of your favorite TV show, and into what genre or genres

  • does it fall?

  • Favorite TV shows.

  • And when you go to that URL, you'll find yourself at, again, this form here,

  • asking for your title and for your favorite show.

  • So by the look of my own screen here we've

  • got some 62 responses, 74 responses coming in already,

  • so we can track this live.

  • So 111 responses so far.

  • That's a lot of favorite TV shows.

  • This number's-- keep going up.

  • 169, up to, next, 191.

  • Keep them coming.

  • One per person is fine.

  • It will give us more than enough data.

  • 246.

  • We'll give it another few seconds, but suffice it to say,

  • this is kind of a lot of data.

  • And if the next step in this process were

  • for me to analyze the data because I'm curious to know

  • what's the most popular show on campus or if I wanted

  • to do some other analytical work on this,

  • it'll be useful to have all of this data,

  • not in Google Form's proprietary interface,

  • but in Google Spreadsheet form with rows and columns, a format you're probably

  • pretty familiar with.

  • So we'll let this counter keep going up and up and up,

  • and I'm going to go ahead and click on responses.

  • And if we really wanted to, we could look through--

  • The Office is up there, followed by Game of Thrones, Friends, OK.

  • Perhaps pretty predictable here.

  • Let me go ahead and click this icon, which

  • is going to open the Google Spreadsheet version of this.

  • And you'll see that there's three columns by default.

  • [LAUGHTER]

  • Oh, that's funny.

  • [APPLAUSE]

  • Thank you.

  • Thank you.

  • OK.

  • OK.

  • Well-played.

  • Now-- so you'll see that by default we got three columns.

  • Timestamp is something Google Forms just does automatically

  • to timestamp the submission of the form.

  • Title comes from the first question.

  • Genres comes from the second question.

  • And if we kept scrolling and scrolling, odds

  • are we're seeing more and more rows.

  • Let's propose that we have enough of these rows already,

  • and how might we go about analyzing the data?

  • Well, if you're pretty expert with Google Spreadsheets or Excel

  • or Numbers, maybe you could use some functions that

  • are built into these programs or you could do some Command-F or Control-F

  • and search for the favorite show that you're

  • looking for to try to run some numbers.

  • But we have programming abilities nowadays.

  • And if you haven't used this feature already,

  • if you go to Google Spreadsheet's File menu,

  • you can go to Download, and you can download your data

  • into a bunch of formats.

  • And the most useful one I'd propose for now

  • is going to be this one, Comma-Separated Values or CSV.

  • And we've seen this before.

  • In fact, you've generated this before.

  • So if I go ahead and do that, it pretty quickly downloaded a CSV file

  • that, if I really wanted to, I could open in Excel or Apple Numbers or the like.

  • But let's do something instead with that, and go ahead and put it into,

  • for instance, my IDE so that I can actually do something with it.

  • I'm going to grab my downloaded file here.

  • And I'm going to go ahead and drag that into the IDE.

  • And so now you'll see a whole bunch of files here.

  • Let me put-- whoops.

  • Let me put this right in the root of my folder.

  • And we'll see this file here.

  • And if I double-click on it, you'll see exactly the same data,

  • albeit in CSV format.

  • So it looks the same, but we have commas separating the values.

  • And Google Spreadsheets did that for us automatically.

  • So let's now write a program that analyzes this data

  • and see if we can't make sense of what some of the most popular values

  • are in there.

  • Let me go ahead and close this, open a new file,

  • and I'm going to go ahead and call it, for instance, favorites.py,

  • since the goal is to figure out everyone's favorites here.

  • And let me go ahead and use some syntax from last week

  • that you might be familiar with now from P set 6.

  • Let me go ahead and import CSV, the module,

  • so that I have access to that library of functionality.

  • Then let me do this with open.

  • And then I'm going to go ahead and open this file, and I could rename it,

  • but you know, I can just grab the whole file name this way, too.

  • It's a little excessive, but I'll go ahead and paste that in in read mode.

  • And that gives me a file.

  • Let me go ahead and zoom out just a little bit there.

  • And once I have this file, I can throw it in to a dict reader,

  • for instance, to make it easier to read the CSV.

  • So reader gets csv.DictReader.

  • Pass in that file.

  • And then after this, I have the ability, now,

  • to iterate over each row like this, much like you probably did for DNA

  • or one or more other problems, as well.

  • Now, just as a sanity check, let me just go ahead and print out

  • each of the row's titles.

  • So I'm not going to see anything new here,

  • but this will be a little sanity check that I'm writing correct code.

  • So run Python of favorites.py, and voila, fwoom, there's all the data.

  • So The Office is both at the beginning and at the end here, it seems.

  • All right, so that's all fine and good, a nice stepping stone, but let's start

  • to analyze this data, this time using some code.

  • So let me go ahead and say this.

  • Suppose I wanted to keep track and determine

  • what the most popular show is.

  • So I've got this big CSV file with title, title, title, title.

  • Invariably, there's going to be some duplicates.

  • A bunch of you probably like The Office or some other show,

  • and so it'd be nice to kind of aggregate that data.

  • What might be a useful data structure in Python whereby for each row

  • we can use this data structure to just kind of do

  • the equivalent of like 1, 2, 3, and count each

  • of the occurrences of those titles?

  • Any-- OK, I heard a hash table, but more properly in Python?

  • A dictionary.

  • So a dictionary is kind of the Swiss army

  • knife of data structures for a lot of people in the sense

  • that insofar as it allows you to store keys and values,

  • and those keys can be strings, and those values can be anything you want,

  • including, for instance, in this case, numbers, like a counter,

  • it's a really versatile data structure when you just

  • want to throw some data into a data structure and do something with it.

  • So let me go ahead and give myself a dictionary at the top of the file.

  • I can do this in a couple of ways, and if my goal is to store the counts

  • of each of these titles, I could call it counts,

  • and I could set it equal to dict(), which is just a function that hands me

  • an empty dictionary.

  • Or I recall I can use this syntax, which is

  • equivalent to just shorthand notation.

  • But what I want to do now inside of my loop, every time I

  • iterate through this file, I want to check if my--

  • I've seen this title before, I want to add 1 to my counter.

  • But if I've never seen this title before, what do I want to do instead?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Add the title to the dictionary

  • as a key with the value of what, the first time I see it?

  • Well, the first time I see it.

  • AUDIENCE: 1.

  • DAVID MALAN: So we can just initialize it to 1.

  • So there's two scenarios here.

  • Either we've seen this title before and we want to do plus 1,

  • or we've not seen the title before and we just want to set it equal to 1.

  • So how might we express that logic?

  • Well, let me go ahead and say, first of all,

  • let's get the title into a variable of its own,

  • just to shorten our syntax a little bit.

  • And then I can say something like this, if title in counts--

  • well, what does that mean?

  • That means that inside of my counts dictionary,

  • I can index into it at that location, and I can increment whatever is there

  • by 1, assuming I've seen it on some previous iteration;

  • else, I go ahead and do counts of title equals 1.

  • So these are the two scenarios.

  • If I've seen this title before, go ahead and add 1

  • to its value in the dictionary.

  • Otherwise, just default to I have seen it one time.

  • So some of the shows are going to stay stuck at one if they're not

  • very popular among the audience, or they're

  • going to keep getting plus 1, plus 1, plus 1.

  • So what I often do when writing code like this

  • is, as I'm sort of finding my way, after I'm done with this,

  • I'm not going to write the whole thing.

  • Let me go ahead and just, at the end of this, print counts, right?

  • I'm just curious to see what it is.

  • I'm just going to go ahead and add print counts.
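
  At this point, favorites.py looks roughly like the sketch below; "favorites.csv" stands in for the much longer filename that Google actually generated, and the column header is assumed to literally be "title".

      import csv

      # Map each title to the number of times it has been submitted.
      counts = {}

      # Open the CSV downloaded from Google Forms in read mode.
      with open("favorites.csv", "r") as file:

          # A DictReader hands back each row as a dictionary keyed by the header row.
          reader = csv.DictReader(file)

          for row in reader:
              title = row["title"]
              if title in counts:
                  counts[title] += 1
              else:
                  counts[title] = 1

      # Sanity check: print the whole (messy) dictionary.
      print(counts)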

  • Let me go ahead and run now Python of favorites.

  • And voila, you see a huge, messy data structure.

  • But let's go to the very top of it.

  • Notice that it starts with curly braces, indicating it's, indeed, a dictionary.

  • It starts with a quoted value, which is one of the titles in it,

  • and it looks like two of you said Dynasty is your favorite show.

  • 25 of you said The Office is your favorite show.

  • One of you said Blindspot is your favorite show.

  • One of you, 24.

  • Friends, 19, and so forth.

  • So the data is kind of all over the place,

  • because it's based on the order in which you all input it.

  • So let's just clean this up a little bit.

  • And let me go ahead and print this out as follows.

  • Instead of just printing a big, messy dictionary,

  • why don't I go ahead and iterate over the dictionary?

  • And we didn't see this necessarily last week,

  • but you can actually do something like this.

  • If your dictionary's storing titles as keys and counts as values,

  • we can do something fancy like for title comma count in counts,

  • but if you want to iterate over not just the keys but all of the items therein,

  • there's this other function in Python associated

  • with a dictionary that will hand you back two things at once,

  • key and value together.

  • And you can call them anything you want.

  • I'm calling them title comma count, respectively.

  • Now, let me go ahead and print out, for instance, something like the title,

  • and then let me go ahead and just print out the count right next to it.

  • So now, if I save this and run favorites.py, OK, it's prettier,

  • but it's still a little messy in that we have--

  • the numbers are just right there after the name.

  • So let's go ahead and tidy this up.

  • Recall that when you print two things with print

  • separating them with a comma, what character

  • did you get automatically in between them?

  • Yeah, just a single space.

  • We saw that at the very beginning of last week.

  • But we can override that separator, otherwise known as sep,

  • which has a default value of quote-unquote with a space.

  • But suppose I just want to give some visual delineation here,

  • maybe like a vertical bar to kind of sort of make it look like a spreadsheet

  • again.

  • Let me go ahead and run this.

  • And now it's not super, super pretty, but you

  • can see title followed by a vertical bar followed by the number 1.

  • So this is all fine and good, but it'd be nice, for instance,

  • if we could sort these, because I don't really

  • want to sift through this visually, looking for all of the possible values.

  • So you know, the easiest way to sort these things might be like this.

  • Instead of passing in counts of items, there's

  • another function in Python where you can literally just sort something

  • with sorted.

  • You don't have to worry about bubble sort, selection sort, insertion sort,

  • merge sort, anything.

  • Just, you sort it, and Python

  • will use one of the fastest implementations

  • of sorting available to it.

  • Now, if I rerun this, we'll see that--

  • [CHUCKLES] OK.

  • We'll see that we have all of these values

  • here starting with 24, 9009, Adventure Time.

  • But that's not what I meant.

  • I wanted to sort not by title, but by what?

  • Count.

  • So this gets a little less obvious.

  • And it turns out Python does this in a couple of possible ways.

  • But let me go ahead and introduce this one first.

  • It turns out that this sorted function takes an optional argument that

  • happens to be called key.

  • And key simply specifies, how do you want to sort

  • the dictionaries that you're being--

  • or how do you want to sort the key value pairs that are being passed?

  • By default, it literally uses the key, so title.

  • But if you want to use the value, you know what you can do, is this.

  • Let me go ahead, and up here, even though this is a little weird,

  • and let me go ahead and say the following.

  • Define a function that takes an item.

  • So def f of item.

  • I'm just giving myself a pretty generic name

  • for a function, because I'm going to get rid of it in a moment.

  • I'm going to go ahead and return item bracket 1.

  • Now, why is this?

  • Take a look at this line here.

  • And even though sorted is new today, as this counts.items(),

  • all items does is return key comma value, and sorted sorts then by key,

  • by default, handing them back as title comma count, or again,

  • I could call these anything I want.

  • But I'm calling them, more semantically usefully, title comma count.

  • So that's giving me back two things.

  • Well, just like an array in C or a list in Python,

  • if you're getting two things at once, the first of those things,

  • conventionally, is going to be accessible in bracket zero.

  • The second thing is going to be accessible in bracket 1.

  • So this super, super simple function called f takes, as input, an item.

  • And if that item has two values, a key and a value,

  • return item one is just literally going to return

  • the second thing, whatever it is.

  • In this case, it's going to return, of course, the value.

  • Now, why is this useful?

  • Well, because the sorted function takes an additional argument called key,

  • you can actually pass it the name of a function

  • to use in order to determine how to sort the previous items.

  • So this is an example of a really powerful feature

  • of Python and other higher-level languages, in which case

  • you can pass in, as arguments, the names of functions.

  • I'm not calling f.

  • I'm just passing in its name.

  • If I were calling it, I'd have open paren and a closed paren,

  • but I'm just passing it in by name.

  • And now, this is going to, by default, sort things from small to large.

  • Turns out if you read the documentation, there's

  • another argument to sorted called reverse that will invert it so that now

  • you'll get largest to smallest.
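
  Put together, the sorting portion of the script looks something along these lines, still using the helper function f:

      def f(item):
          # Each item is a (key, value) pair, so item[1] is the count.
          return item[1]

      # Sort the (title, count) pairs by count, largest first.
      for title, count in sorted(counts.items(), key=f, reverse=True):
          print(title, count, sep=" | ")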

  • So now let's try this.

  • Let me go ahead and rerun Python of favorites.py

  • enter, and now, if I scroll up to the top,

  • we should see The Office is indeed the most popular show, apparently

  • consistent with Netflix's data these days, too,

  • followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black

  • Mirror, and so forth.

  • And the least popular show among you, although there looks like there's

  • a lot of ties, Demon Slayer.

  • [CHUCKLES] Or Game of Thrones, but GOT.

  • So interestingly, there are some--

  • [LAUGHTER]

  • There's-- OK.

  • OK.

  • So here, we've gone and sorted these values.

  • Now, let me go ahead and clean this up a little bit.

  • Much like in C where it's kind of silly to define a variable

  • and then use it only in one place, similarly in Python,

  • if you're going to define a function, but then only use it in one place,

  • you don't actually need to.

  • There's a fancy one-line equivalent where you can actually say this.

  • And you won't see this too often, but it's a handy thing just

  • to have seen at least once.

  • You can define a function called a lambda function that takes in as input

  • any argument like this, and then you can literally just return that value.

  • So if I go ahead and get rid of this function f and replace it with this,

  • the syntax is super weird-looking at first glance,

  • but lambda is the term of art in computer

  • science for an anonymous function.

  • I don't care that it's called f.

  • I'm never going to use it again.

  • So lambda just means give me a function.

  • Item means that is the input to this function, the argument, and the return

  • value, or output, is going to be item bracket 1.

  • So it's a little arcane.

  • Odds are you'll have to double check the documentation for using this yourself

  • in the future, but it's the same thing.
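
  In other words, the loop can be rewritten without f as roughly:

      # Same sort as before, but with an anonymous (lambda) function as the key.
      for title, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
          print(title, count, sep=" | ")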

  • And indeed, if I rerun the program, I'm going

  • to get back the exact same results, this time with the items still sorted

  • like that.

  • All right, any questions on the code here, the features of the syntax?

  • Anything at all?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yes, if you're writing a lambda function like this,

  • you are indeed limited to one line.

  • That was OK, because f, even though it was two lines, only

  • had one line of actual content.

  • But yes, it has to be one line, indeed.

  • All right.

  • So the GOT thing opens up some interesting messiness, right?

  • Because GOT presumably denoted Game of Thrones, and yet most of you

  • wrote Game of Thrones, capitalized properly,

  • and yet, it'd be nice if we could have some equivalences.

  • And honestly, I'm guessing there's a bunch of capitalization differences,

  • too.

  • So you know what I might be inclined to do?

  • What might be the easiest way for us to standardize

  • what we're calling these shows' titles that would be a little more

  • resilient against some of you typing with capitals,

  • some of you typing with lower case?

  • Let's ignore the GOT problem, though.

  • Yeah, so make them all lowercase, or heck, make them all upper case.

  • It doesn't matter so long as you're consistent.

  • So you know what?

  • Before I get title from the row, let me go ahead

  • and force it all to lower case, and then let me go ahead and--

  • in my-- go ahead and force it to lower case there.

  • So no matter what the input is, I'm going

  • to go ahead and force it on entry to lower case.
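
  That is, the one change inside the loop is just something like this, again assuming the header is literally "title":

      # Normalize capitalization on the way in, so "The Office" and "the office" are counted together.
      title = row["title"].lower()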

  • Now, if I do this and scroll back up, looks like our counts went up a little,

  • right?

  • The Office is now at 27.

  • Game of Thrones is now at 21.

  • Friends is at 20.

  • So it looks like some of you, kind of, reasonably if sloppily,

  • didn't capitalize it exactly as the show officially is called.

  • But that should be OK, right?

  • This is all about UX or user experience.

  • This should be one stepping stone toward making your program much, much more

  • resilient against user input.

  • But you know what?

  • What if there were a better way to do this, right?

  • Instead of writing code, this feels pretty tedious.

  • Could we not just use a different tool, different language,

  • and achieve these same goals?

  • Today's answer is indeed going to be yes, with SQL.

  • But first a question.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: No?

  • No question.

  • All right, so then the segue is exactly this.

  • I'm going to go ahead now and close favorites.py.

  • And I'm just going to maximize the size of my terminal window here.

  • And I'm going to go ahead and run a program called SQLite3.

  • SQLite3 is a command line program that you can install on your Mac or PC--

  • it's already on CS50 IDE--

  • that allows you to use another language called SQL interactively.

  • And it's got this really cool feature whereby

  • if I have this file called, very verbosely, CS50 2019 Lecture 7,

  • and so forth, I'm going to copy that after highlighting it.

  • I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV.

  • It's a weird syntax, .mode CSV, but you don't have to use these commands very

  • often.

  • Then I'm going to use one other command here, which is called import.

  • And I'm going to go ahead and import this into a database.

  • But what database, exactly?

  • Well, let me go ahead and more specifically-- let me exit out of this

  • here.

  • Whoops.

  • Let me exit out of this here and run SQLite3 with the name of a database.

  • I'm dealing with favorite shows, so maybe I'll

  • call this favorites.db, DB denoting database.

  • I'm going to now go ahead and do .mode CSV.

  • I'm going to now do .import, quoting the name of the file,

  • and quotes are important, because it's got some ugly spaces in it,

  • and I'm going to import this into a table, kind of like a spreadsheet,

  • that I'm just going to call in all lowercase favorites.

  • And hit Enter.
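
  The whole exchange at the terminal, then, looks roughly like this, where favorites.csv again stands in for the long filename that Google generated:

      $ sqlite3 favorites.db
      sqlite> .mode csv
      sqlite> .import "favorites.csv" favorites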

  • Nothing seems to happen now.

  • But if I go back to my terminal window and type ls,

  • notice that in addition to my CSV file and favorites.py and src7,

  • which I downloaded in advance with today's code,

  • I now have this file called favorites.db.

  • And if I rerun SQLite, it turns out when I pass in the name of that database,

  • now I can ask the same kinds of questions as before,

  • but I can do it with a different syntax.

  • I can go ahead and select all the titles from my favorites

  • by typing this command, SELECT title FROM favorites; Enter, and voila,

  • we get all of the titles there.

  • Suppose I want to sort those titles as before,

  • or I can SELECT title FROM favorites, ORDER BY title; ending with--

  • oh, sorry.

  • Semicolons are back.

  • Then I'm going to go ahead and hit here.

  • And you'll see that now it's alphabetized,

  • and there's all those Offices by first letter.

  • Now, there are some bugs.

  • If you scroll down, you'll see lowercase letters as well,

  • so we'd have to deal with that situation, too.

  • But this seems to be a user-friendly way of just selecting data

  • that I care about.

  • And watch this.

  • Let me try to group things together.

  • I'm going to go ahead and select title, and I'm

  • going to go ahead and say count the number of titles

  • from this table called favorites.

  • But I want to group those things by title.

  • Now, we'll see in a moment what this means, but the effect is what's cool.

  • Voila, I now have output much like my Python program.

  • It's not really as orderly as I would like, but notice

  • what it's done at the very top.

  • The very first line of output, if I keep scrolling is this here.

  • On the left is my title, then a vertical bar.

  • On the right is literally count of title, how many titles match that.

  • So let me go ahead and do this.

  • Let me go ahead and first of all limit this to the top 10 titles,

  • so I can just deal with this and see all this at once.

  • So here we have some alphabetized list of titles.

  • But now, suppose I want to go ahead and rename this thing.

  • Let me go ahead and say that.

  • Let me go ahead and say, as--

  • COUNT(title) AS n.

  • Now notice that the title of this output is title, and then n.

  • But the reason I did that was so that I can do this.

  • I'm going to now do ORDER BY n, descending order,

  • and then limit to the top 10.

  • Enter, and voila.
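
  Reconstructed, the full query at this point is roughly:

      SELECT title, COUNT(title) AS n FROM favorites GROUP BY title ORDER BY n DESC LIMIT 10;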

  • Now, using a completely different language,

  • I think I've gotten the same results, except for the lower casing issue,

  • which we can come back to another time.

  • But The Office is here as the number one hit.

  • Friends then, 19.

  • Game of Thrones, 18, and so forth.

  • So what's then the motivation in part for this new language called SQL?

  • And you've seen just snippets of it here.

  • It allows us to solve the same kind of problems

  • much more easily, because I don't have to sit down and write

  • a whole darn Python program, faster though it is than

  • C. I don't have to write custom code just to answer questions about my data.

  • And in fact, if you pursue the web track in a couple

  • of weeks' time or the mobile track, both of those

  • are going to offer support for SQL so that if you want to store data

  • on users or interactive information on an app or a website application,

  • SQL is going to be ultimately where you can put that data and access it later,

  • thereby writing even less code than today.

  • So let's go ahead and take a look at some of the features of these things.

  • What we've really just done is introduce this,

  • something called a relational database.

  • The goal at hand is not to rely on CSV files anymore.

  • CSVs are like the weakest form of databases you can use.

  • Why?

  • Well, every time you want to look in a CSV file for data, you have to open it.

  • You have to read it in.

  • You have to iterate over it, line by line by line,

  • looking for the data you care about, and then you're done.

  • That's going to be super slow when you have a lot of data in a CSV file.

  • And in fact, in a little bit today, we'll

  • download a really big database in text form

  • that we're going to then see 500-plus megabytes later is much better handled

  • by what are called relational databases, pieces

  • of software you can run on your Mac or PC or servers in a company.

  • All they do is store data and provide you with faster access to it.

  • So SQLite3 is just a command line program

  • via which we can interact with that data.

  • It's not terribly compelling.

  • It's, again, a black and white blinking prompt,

  • but we'll see a graphical user interface, or GUI, in a bit,

  • too, that'll allow us to navigate that same data.

  • Here's how we went ahead and imported the data.

  • And .schema is the last of the commands.

  • In fact, let me go ahead and type this at the prompt.

  • Not .mode or .import but .schema.

  • And you'll see what automatically happened for me is something

  • like this, CREATE TABLE favorites.

  • This all happened automatically when I imported my data, but in a bit,

  • we'll be doing this more manually ourselves.

  • So where does this then bring us in terms of features?

  • Well, in the world of storing data, whether it's

  • storing users who've registered for your websites

  • or the number of likes a post on Instagram has gotten

  • or any form of data that you might want to read or write,

  • there are four fundamental operations in the world.

  • CRUD is the sort of crude way of remembering this.

  • These stand for Create, Read, Update, Delete.

  • And I dare say that with these four operations

  • you can do anything you want when it comes to data.

  • Creating that data, reading that data, updating it, or deleting it.

  • Now, in the world of SQL, or Structured Query Language, or S-Q-L,

  • this is just another programming language.

  • It tends to be used only in the context of databases,

  • but you can use it to solve problems not unlike Python, but in the context

  • of the data you care about.

  • So with SQL, these four CRUD operations actually

  • have different keywords mapped to them.

  • They happen to be these, INSERT, SELECT, UPDATE, DELETE.

  • That is to say, unlike Python, which has hundreds of functions available to you,

  • today is kind of neat insofar as SQL has maybe a few dozen

  • functions total that you can use.

  • Much more manageable, and these are the four that we'll use most frequently.

  • So beyond this, how do you go about creating a table?

  • Well, what does that mean?

  • Well, relational database is literally that.

  • It's like a fancier version of Google Spreadsheets or Microsoft Excel

  • or Apple Numbers that allows you to store all of your data

  • in rows and columns.

  • But it gives you these four functions and more

  • via which to select data you care about, or delete or update or insert data

  • that you care about as well.

  • So we'll see in a bit that this is going to be the syntax via which in SQL you

  • can create a new table.

  • In Google Spreadsheets you would literally go to the plus icon

  • and say New Sheet.

  • In SQL, you would express this more programmatically with a line of code

  • like this.
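
  The general shape of that statement, with placeholders rather than the exact slide, is roughly:

      CREATE TABLE table (column type, ...);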

  • But you're going to have to make a few decisions along the way,

  • and that's because SQL has its own data types.

  • Now, fortunately, they're pretty reminiscent of C and Python.

  • But there's five main ones here.

  • Let's just walk through a few examples thereof.

  • So first, INTEGER.

  • If you know in advance that you want to store a whole bunch of information

  • in a database like someone's age or the number of likes a post has

  • or anything that lends itself to an integer,

  • SQL supports an integer data type.

  • But in some versions of SQL, you actually have a decision to make.

  • Do you want an integer, a smallint, or a bigint?

  • And these speak to, as in C, the size, the number

  • of bits that are used to store it.

  • An integer is 32 bits.

  • A bigint is 64 bits.

  • A smallint is fewer than both those.

  • Now, we're using, for class, something called SQLite.

  • It's a free and open source, very user friendly version of SQL.

  • But there's others you might have heard about.

  • So if you've ever heard of the company Oracle,

  • their bread and butter is a database program

  • called Oracle that is a SQL database that

  • allows companies to store huge amounts of data

  • and select and create and update and delete data

  • using software they've written.

  • MySQL, Postgres, MariaDB, Microsoft Access Server,

  • SQL Server, and bunches of others, are all examples of SQL servers.

  • And they'll each support data types, typically, like this.

  • We happen to be showing you ones from Postgres,

  • which is a super popular open source one that you can use on the internet,

  • for instance for final projects.

  • Real numbers, just like in Python or C, are similar in spirit to floats.

  • However, you have a couple of options.

  • One is a real, which uses 32 bits.

  • Another is called double precision.

  • It's actually two words, but it gives you 64 bits as well,

  • to give you more precision numerically.

  • Then you've got this catch-all known as numeric in SQL,

  • whereby if you have a number that's number-like, essentially,

  • but isn't just an integer, categories--

  • this category encompasses things like Boolean values, dates,

  • which have a predefined format like YYYY-MM-DD,

  • where the four digits typically represent a year.

  • That would be considered numeric here, as would be a date and a date time,

  • which has something like 00:00:00 for hours, minutes, seconds.

  • And then you have other values as well.

  • What's nice about this numeric type specifically

  • is you can actually solve the floating point problem in both Python

  • and C. Recall that floats or even doubles in C

  • ultimately have finite precision, which is bad in terms of values

  • not quite adding up as you intend.

  • But with numeric data types and databases,

  • you can specify exactly how many digits you want before the decimal point

  • and after, maximally.

  • So if you're dealing with financial information or scientific information,

  • you can be super, super precise thanks to this data type in the database.

  • So finally, a problem that we've solved, versus those other languages.

  • Text.

  • When it comes to storing data in a database,

  • you can specify that your columns can either

  • be char(n), which isn't an individual character or char.

  • You literally specify n, the number of characters

  • that every cell in that column will be.

  • You never do this in Excel.

  • You never do this in Google Spreadsheets or Apple Numbers.

  • In a database, though, you can tell the database ever more precisely,

  • I want to store two characters in every cell in some column.

  • Why?

  • Well, maybe like US state codes.

  • If you have MA for Massachusetts or CA for California,

  • it might be nice to tell the database that you can store, minimally

  • and maximally, two characters for every value I'm going to give you.

  • Varchar is a little different.

  • This is a variable number of characters.

  • If your column might have a few characters or a lot of characters based

  • on what some human types in, you can give an upper bound

  • n on how many characters will be stored in every cell.

  • So if you don't know in advance how long a user's email address is going to be

  • or how long my name or your name is going to be in advance,

  • you can specify varchar and then pick some upper bound, some value that, ugh,

  • no one's going to have more than 20 characters in their name or 200

  • characters in their name, whatever line you decide to draw in the sand.

  • As an aside, does anyone want to conjecture

  • what a good upper bound is for the number of characters

  • in a human name that might register for a website you're making?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 25?

  • Pretty good.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 30, I heard.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 32, OK.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: 45?

  • Higher.

  • 60?

  • AUDIENCE: 64.

  • DAVID MALAN: 64?

  • This is the kind of thing that Google, I dare say, is good at.

  • So let me go ahead and open up Google real fast.

  • Suppose the goal at hand is to store, in a database

  • table, in what is the equivalent of a spreadsheet's column, the longest

  • name in the world.

  • All right.

  • Looks like we had 988 characters in a fellow's name, Hubert.

  • And I won't even try to pronounce his last name, but there it is.

  • So if you want to fill--

  • if you want to be able to fit everyone's name--

  • oh, and note, senior.

  • [LAUGHTER]

  • You might need even more characters than that.

  • So this is a rabbit hole we won't really go down today,

  • but suffice it to say that the answers to these questions

  • are often not obvious.

  • And if you think about websites you've visited in the real world,

  • whether on campus or off, odds are, you have tried typing in some input

  • to some web-based form or even some mobile application,

  • where all of a sudden your keyboard stops working, right?

  • They won't tolerate any more characters.

  • Maybe it's the Common Application and the length of your essay.

  • Maybe it's the length that your resume can be in a field.

  • Maybe it's the length of your name when registering for a website.

  • Well, why is that?

  • Those applications are probably using a database, probably using SQL,

  • and they had to decide, in advance, what is the maximum length of an input

  • that we're going to tolerate from a human?

  • And then the computer ultimately enforces that.

  • Why?

  • Well that would be a very clever attack, for some bad actor out there

  • to exploit.

  • If you had no bounds on how big the input could

  • be that a human could type in, they could-- much like I

  • tried to do the other day with Emma's name--

  • just keep pasting, pasting, pasting a massive input,

  • hoping to overwhelm your computer's memory, maybe make your server crash,

  • and therefore deny service to other people.

  • So there are reasons for these kinds of defenses.

  • Lastly is BLOB.

  • At the very top, Binary Large Object.

  • Don't need it typically that often, but if you

  • want to store binary data, raw zeros and ones that represent files or the like,

  • you can store it in a BLOB format as well.

  • Now, just to give you a whirlwind tour of other syntax

  • before we start using it, you can insert data

  • into a database using syntax quite like this.
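
  That syntax follows this general shape; the second statement is just a hypothetical example, assuming the imported table's columns are named Timestamp, title, and genres:

      INSERT INTO table (column, ...) VALUES(value, ...);

      -- For instance, adding one more favorite by hand might look like this.
      INSERT INTO favorites (Timestamp, title, genres) VALUES('10/1/2019', 'The Office', 'Comedy');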

  • You can select data from a database using syntax like I already did.

  • You can use functions as well.

  • In fact, in just a bit, when we load more data into the database,

  • I could actually select all of the distinct names from a database--

  • and in fact, I could have done this a moment ago.

  • Rather than SELECT title FROM favorites, as I

  • did earlier-- that's going to give me all of the titles in that database.

  • I could instead do something like SELECT DISTINCT title FROM favorites,

  • and that's going to filter out all of the duplicates

  • without me having to write a single line of code to do that.

  • You can also count, as we did before.

  • You can compute averages, mins, or max, which

  • is really useful for analytical work that you might want

  • to do as a data scientist or the like.

  • And there's a whole bunch of other functions available to you as well.

  • But there's some more powerful features.

  • And we'll begin to use these in just a little bit.

  • We used WHERE before.

  • I used LIMIT before.

  • I used GROUP BY before.

  • And we'll see a bunch of examples that reinforce all of these.

  • But there's keywords like LIKE, whereby if you

  • want to search for everyone's input who mentions The Office, you know what?

  • I bet we could do something like this instead.

  • I know that I can SELECT star for-- oh, let's go ahead

  • and SELECT title FROM favorites WHERE, and this is kind of nonsensical,

  • but title equals "The Office," quote unquote.

  • And I get back everyone who typed in literally The Office.

  • However, if I instead did something like this, "the office" in lower case,

  • I'm going to get one of you who typed in "the office" in lowercase.

  • And the thing here is just the title of this output.

  • But what if I want to do anything like The Office?

  • I could say something like LIKE "%office%", and it turns out in SQL,

  • this new language, percent is a placeholder.

  • It's a wild card that just says, eh, anything can go before,

  • and, eh, anything can go after.

  • So now I'm searching for any titles that have the word The Office.

  • We picked up another entry here.

  • So one of you said Office, without the word "the," also pretty reasonable.

  • Some of the uppercase and lowercase is all over the place.

  • Now we've caught those.

  • And if you want to count the total number of offices now,

  • now I can do something like this.

  • COUNT title-- so I can combine all of these building

  • blocks, not unlike Scratch, and get back the total number

  • of offices, which looks now to be 33 when we tolerate a lot more

  • variability in our users' input.
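
  That combined query looks roughly like this:

      SELECT COUNT(title) FROM favorites WHERE title LIKE "%office%";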

  • Most powerfully of all, we'll see something

  • like JOIN in just a little bit, and that'll

  • be all toward the end of using databases better with better design.

  • But with SELECTs, we, of course, don't just want to select all of the titles.

  • We might select where some condition is true.

  • So conditions, just like in Python and C,

  • exist in SQL, but at the end of our queries, as we've seen.

  • If you want to update something, for instance,

  • you can update data in a database by saying

  • UPDATE table name SET the column name equal to some value,

  • WHERE condition is true.

  • So how might I use this?

  • Well, let's again look at the data where it's like The Office.

  • And you know what?

  • I'm going to go ahead and do this.

  • If I want to clean all of this data up-- so data cleaning is

  • a very common approach, whether you're doing

  • research or analytical work, often when you have messy data,

  • the first pass is honestly grunt work.

  • Like, writing some code, or god forbid, manually copying

  • and pasting in an Excel file, just to clean up your data.

  • You can do this a lot more powerfully in SQL.

  • So for instance, let me go ahead and say UPDATE my favorites SET

  • title equal to "The Office," the canonical form, capital

  • T, capital O, WHERE title LIKE--

  • and now let me do this wildcard search, knowing

  • that that's going to slurp in all 30-plus of those rows.

  • But now, when I hit Enter, nothing seems to happen.

  • But if I then SELECT title again WHERE LIKE "%office%",

  • now they all look the same, because I've updated my data.

  • And so in fact, now I could go back to The Office,

  • because I've canonicalized my data, and everything now looks the same.
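
  Reconstructed, that cleanup and the follow-up check were roughly:

      UPDATE favorites SET title = "The Office" WHERE title LIKE "%office%";

      -- Every matching row should now look identical.
      SELECT title FROM favorites WHERE title LIKE "%office%";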

  • So with SQL, you can do that as well.

  • Now, let's go ahead and SELECT title FROM favorites.

  • And let's just limit this to 10 of them this time, so they don't all fly by.

  • Suppose-- VEEP is a good show.

  • Maybe I'm not a fan of, say--

  • let's give ourselves 20 here, see what comes up next.

  • OK, I actually don't like Friends, even though everyone in the world seems to.

  • No problem.

  • DELETE FROM favorites--

  • [LAUGHTER]

  • --WHERE title = "Friends";.

  • Now, no more Friends.

  • And so you can delete your data as well.

  • [LAUGHTER]

  • You can delete your data as well by using a query of this form.

  • This one is perhaps the most dangerous of all.

  • Consider what might happen if you're a little sloppy, a little hasty,

  • and omit the WHERE condition.

  • What happens if you go ahead and DELETE FROM favorites;?

  • Everything goes away.

  • And slightly worse than that, though we won't have occasion to use it,

  • there's another function called DROP, where if you just

  • want to delete the table altogether, delete all of your rows and columns,

  • you can say DROP TABLE favorites;, and that will delete all of the data

  • as well.

  • And we'll come back to that, because the mere fact that you can scarily

  • delete lots of data at once makes you super

  • vulnerable to bad actors on the internet or in the office who

  • want to potentially wreak havoc on your data.

  • So we'll come back today, talking about matters of security, as well.

  • So any questions before, now, we take things up a notch

  • and look at a much more massive data set than our hundreds of favorites here?

  • Is that a question there?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Did not look like Friends got removed.

  • Oh, interesting.

  • You are right.

  • Let's try a little harder.

  • LIKE "%friends%".

  • Better?

  • Now they're gone.

  • So I'm guessing-- I would have to look closer at the data,

  • but I'm guessing one or more people typed it

  • with a slightly different capitalization,

  • or maybe they hit a spacebar character or something, where

  • I wasn't doing a precise match, thereby making

  • the LIKE operator a more robust fix.

  • However, let's consider for just a moment.

  • What might the gotcha here be?

  • Like, this feels like a nice quick and dirty fix,

  • but it really is quick and dirty in what sense?

  • AUDIENCE: Like, if you have another TV show

  • that has the word friend in it, but not Friends, [INAUDIBLE]..

  • DAVID MALAN: Yeah, there might be another show

  • with the word "friends" in it.

  • Quite reasonable.

  • And I just blew those away as well from my database.

  • And in fact, we can probably see this for real.

  • If I go ahead and look at, for instance--

  • let me go into a browser here, IMDB.com, super popular website for just getting

  • movie reviews or information about movies.

  • Let me go ahead and search for friends.

  • And yes, the most popular hit seems to be the 1994 version of Friends,

  • but there's Friends with Benefits.

  • There's Just Friends, We Are Your Friends, My Best Friend's Wedding,

  • Fisherman's Friends, and so forth.

  • Some of those movies, some of those TV shows.

  • And had we poked a little further among our own data of hundreds of rows, maybe

  • some of you did have some of those shows as your favorites as well.

  • So beware that, too.

  • So in preparing for today, we wanted to play

  • with an even juicier data set than one we could generate for ourselves.

  • And wonderfully, IMDB makes a lot of their database

  • available for download as TSV files, Tab-Separated Values.

  • It's really not fundamentally different from Comma-Separated Values.

  • One human at IMDB just decided to use tabs to separate

  • their values instead of commas.

  • And if you read through this page here, you'll

  • see that they let you download a bunch of different files, one of which

  • we'll focus on initially, specifically title.basics.tsv.gz.

  • So that's a mouthful, but this is just saying

  • this is basic information about titles in their database.

  • Gz means it's G-zipped, which is like Zip.

  • If you're familiar, on a Mac or PC, it's just

  • another way of compressing information, but you can unzip that on a computer

  • as well.

  • Dot TSV means Tab-Separated Values.

  • Now, we read in advance the information here,

  • but let me summarize it as follows.

  • There's some juicy information in that file.

  • And I'm going to go ahead and download this file over here.

  • If I click a link on IMDB's website, I'll

  • see a file called title.basics.tsv.gz.

  • I'm going to go ahead and let that download in the background.

  • It looks like it is 100-plus megabytes.

  • If I click on it on my Mac, it's going to decompress it.

  • And you'll see here, this apparently is 532 megabytes once it's decompressed.

  • So that's way bigger than the CSV file I downloaded from Google

  • a few minutes ago, which is 24 kilobytes, 24,000 bytes.

  • This is 532 million bytes.

  • So it's orders of magnitude bigger, which means our design decisions today

  • are really going to matter.

  • Now, if we take a look inside of this file,

  • we'll see a bunch of interesting fields.

  • And I've essentially summarized this from their own online documentation.

  • There's these five fields that are of interest right now,

  • and we'll tease these apart now.

  • So tconst, I didn't recognize that.

  • It's an IMDB-specific term, it seems, but it means some kind of title

  • constant.

  • It's a unique identifier for every movie in the database.

  • So for instance, one of the movies in that TSV file,

  • which I'll open up in a moment, has the unique value tt4786824.

  • Why?

  • Just because.

  • But it turns out we can actually see what this is.

  • Let me actually grab that value and go to IMDB.com/title/tt4786824.

  • Enter.

  • These actually do have significance.

  • That is the unique identifier for The Crown,

  • popular hit TV show that you can now watch on Netflix.

  • So if you go back to the actual field--

  • that's how I spent the summer.

  • So if you go back to the fields here, we have not only tconst,

  • which uniquely identifies the film, we also have title type.

  • So it turns out in this 500 megabyte file,

  • there's huge amounts of information on movies, TV series, documentaries,

  • shorts, way more data than we could possibly play with in one day.

  • So we'll focus just on TV series, like this, the show The Crown.

  • Primary title is the name of the field in the file that is

  • given to the primary title of the show.

  • Sometimes shows apparently have different titles

  • in different geographies or communities, so the primary title

  • is the one that most people know it by.

  • Start year is the year for a TV show that-- in which that TV show began.

  • The Crown came out first in 2016.

  • And genres, a little weirdly, is a comma-separated list

  • of genres into which that show falls.

  • And I say it's a little weird, because we do have a TSV file,

  • but they're using commas inside.

  • So this is probably among the motivations for having used tabs in--

  • to separate all of the columns, because they

  • want to use commas for some other purpose here.

  • But it's a little messy.

  • This is actually not necessarily the best design decision here.

  • So what can we go ahead and do with all of this data here?

  • So let's go ahead and start to grab some of this data

  • and write some scripts involving it.

  • But let me issue a disclaimer.

  • It turns out when you have a lot of data in a database,

  • it can take quite a long time to analyze it or parse it.

  • And so if you ever watch certain baking shows like Julia Child

  • from down the road here in Cambridge, very often there

  • would be some TV magic whereby someone mixes the cake mix

  • and puts it into the oven, and then two minutes later, voila, it comes out.

  • That's because some of the scripts we'll write today

  • might actually take many minutes to run, but we'll

  • go ahead and run them in the backgrounds,

  • and I'll use the outputs that I've created here in advance.

  • So how do we go about beginning to navigate this data set?

  • Well, first of all, let me go ahead and just open this file.

  • And I'm going to use my own terminal window here.

  • This file was called title.basics.tsv.

  • Notice that even in my program here, it took a moment to open,

  • but this is what it looks like.

  • It's a little cryptic at first glance, but notice the first row

  • is the headers of the file.

  • So we've got tconst and title type and primary title and so forth.

  • Then every row thereafter seems to start with one of those unique identifiers,

  • and they seem to be monotonically increasing, 1, 2, 3, 4,

  • but they start with tt then some number of digits.

  • Then you see that these are a bunch of shorts, so short films or the like.

  • And if we kept scrolling, we'd see some other values.

  • Here's the first movie in the database.

  • And if we keep scrolling and scrolling and scrolling, we'll see more.

  • So if actually I use my text editor here and search for literally the word

  • The Crown, it's going to take a little while

  • to get to, because there's a bunch of shows apparently called

  • The Crown or Crowning or something.

  • But you know what'll be faster?

  • Let me go ahead and search for--

  • what was it?

  • If I go back to tt4786824, Enter, it takes a moment

  • to find this because it's searching through hundreds of megabytes,

  • but voila, down here is this tconst, TV series type.

  • It's called The Crown.

  • In another geography it's called also The Crown, 2016, and so forth.

  • So that's a huge amount of data.

  • In fact, how much data?

  • Well, if I actually analyze this with a command on my computer,

  • there are 6,267,469 shows in the world, be it TV

  • or movies or documentaries or the like.

  • That's a huge data set.

  • So suppose I want to focus really just on a subset of that data.

  • What can I go about using instead?

  • Well, let me go ahead and open up, for instance, the same file,

  • but let me try putting it in CS50 IDE.

  • I'm going to go ahead and download this, as you might have downloaded

  • past problem sets, with wget.

  • That's going to give me a pretty big file.

  • There's a command now that you've not had occasion

  • to use in the IDE called gunzip, which will unzip a file that starts with--

  • that ends with .gz.

  • That's going to take a moment to unzip the whole file.

  • A moment, a moment, a moment, OK.

  • [CHUCKLES] And no space left on disk.

  • But that's OK, because it turns out what I'm also

  • going to start doing today is using my own Mac a bit more.

  • And you'll recall last week, I did that a little bit,

  • because when I wanted to play with speech recognition in my microphone,

  • it's a lot easier to have Python running on my Mac or your PC

  • if I want to use hardware built into my computer.

  • Plus my MacBook is a lot faster, for instance, than a single cloud

  • account on a shared server.

  • So I'll go ahead and write some of this code here on my own machine instead.

  • So let me go ahead and open a file called import.py.

  • And I'm going to go ahead and import the CSV module, as always,

  • and I'm going to go ahead and open this file, which is called title.basics.tsv

  • in read-only mode, and I'm going to call this variable titles.

  • What am I next going to do?

  • Let me give myself a reader using csv.DictReader,

  • as before, reading in those titles.

  • But csv.DictReader feels like the wrong reader.

  • Why?

  • Just to be clear.

  • Yeah, I mean, maybe I want TSV.

  • But it turns out TSV doesn't exist.

  • So even though I might be inclined, for instance, to change this

  • to a T and this to a T, the CSV module does enough for me,

  • but I need to tell it that I want to use a different delimiter.

  • Instead of the default, which looks like this,

  • I can actually override that and say, you know what?

  • Use a tab.

  • And just like backslash n in C and in Python, it's a new line.

  • Backslash t in both languages is a tab character.

  • All right, so once I've done this, let me go ahead and open up, now--

  • let me go ahead and open up shows0.csv.

  • My goal in life now is to make this file a lot more manageable for myself.

  • I want to take a 500-megabyte file and extract only the TV shows therein.

  • Moreover, how about only the TV shows from 1970 onward?

  • We won't go even further back than that.

  • So let me go ahead and open up, in write mode, a file called shows0.csv.

  • And I'm just going to call that variable shows.

  • Then I'm going to go ahead and give myself a writer.

  • And you might recall using this in the past.

  • This is simply going to give me a variable called writer, via which

  • I can write to a new file.

  • Because again, the goal is to read this file and write to this file

  • a subset of the data therein.

  • So let me go ahead and write one row first.

  • Write a row, passing in a list of values, specifically tconst,

  • which is the title--

  • which is the ID field; primaryTitle, which

  • is the title field; startYear, which is the year field; and genres,

  • which was one more field that I mentioned earlier.

  • So my goal is to export only those four columns that I care about for now.

  • So I'm going to go ahead and do this.

  • For each row in my reader, I'm going to go ahead and say if that

  • row's titleType == tvSeries, which, recall,

  • The Crown was an example of, then I'm going to go ahead and write

  • to the writer--

  • whoops-- writer.writeRow, a list containing what?

  • The row's tconst value, the row's primary title value, the row--

  • whoops-- the row's start year, and lastly, the row's--

  • [CHUCKLES] the row's genres.

  • So what am I doing here, just to be clear?

  • So what I have done is I've first written out-- once and only

  • once-- literally these values, because I want

  • headers in the first line of my file.

  • After that, for each row in the reader, which is currently

  • iterating over this file in read-only mode,

  • I want to print out the current row's tconst, current row's primary title,

  • the current row's start year, and the current row's genres.

  • But notice, I'm ignoring movies and shorts and documentaries and bunches

  • of other values as well.

  • And you know what?

  • Just for good measure, let's shrink this a little bit.

  • And how about the row's isAdult == "0", for today's purposes?

  • So that'll filter the list further.

  • If you'd like to flip that later, that's fine.
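
  • Put together, a minimal sketch of this first pass at import.py-- assuming the TSV's column names are exactly as shown-- might look like this:

        import csv

        # Read IMDb's tab-separated file and write a smaller comma-separated one
        with open("title.basics.tsv", "r") as titles:
            reader = csv.DictReader(titles, delimiter="\t")
            with open("shows0.csv", "w") as shows:
                writer = csv.writer(shows)
                writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])
                for row in reader:
                    if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
                        writer.writerow([row["tconst"], row["primaryTitle"],
                                         row["startYear"], row["genres"]])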

  • So let me go ahead and do Python 3, for version 3 on my Mac-- actually,

  • no, let's do Python of import.py, all right?

  • I'm going to cross my fingers here.

  • Hopefully the file is working, working, working.

  • But it's taking a decent amount of time.

  • Like, this is how much time it takes, apparently, to process

  • millions of rows of data.

  • Still running.

  • Still running.

  • But the goal, again, is to shrink the amount

  • of data I have to ultimately care about so that we can actually

  • search it much more effectively.

  • So, OK, it actually finished.

  • So let me go ahead and open up shows0.csv.

  • Notice now, in my text editor, I've got a lot less data.

  • I've thrown away everything I don't care about,

  • but I've been left with tconst, primary title, start year, and genres,

  • and everything herein is now consistent with that filtration.

  • But I haven't filtered everything I said.

  • What did I say I wanted to get rid of earlier?

  • Yeah, the shows before 1970.

  • And clearly some of these are coming from 1940s and so forth.

  • So let's go ahead and get rid of those, but see how.

  • This is the CSV file that we just opened, but in Google Spreadsheet form.

  • So I literally just imported it into Google Spreadsheets

  • so that we could see it.

  • Literally the same data as before, and there are those 1940s movies.

  • But there's something curious that I wanted to be mindful of.

  • If I scroll down in the start years and keep going and going and going,

  • huh, those seem OK.

  • Those are increasing in order.

  • But let me try this.

  • Let me just poke around my data, sorting in reverse order.

  • It's going to take a while, because even this is a decent number of TV shows.

  • Notice this weirdness.

  • At the top of start year now, once I've reversed sorted them,

  • there's a whole bunch of backslash capital N's.

  • Now, this has nothing to do with C, and nothing to do with Python.

  • It has everything to do with the documentation.

  • If you read IMDB's documentation, as I only eventually did carefully, you'll see this.

  • A backslash and capital N is used to denote that a particular field is

  • missing or null for that title or name.

  • Now, this is important, because if I want

  • to filter my shows down to those from 1970 onward,

  • I need to be resilient against that.

  • So let me go ahead and do this.

  • So if the current row's startYear does not equal backslash n,

  • then I'm going to go ahead and check that it's a TV series

  • and that it's not an adult show.

  • So that would help.

  • But furthermore, how can I check this here?

  • Let me do year.

  • And how can I convert row startYear to an integer?

  • Well, everything in a spreadsheet, by definition

  • of it having been in a spreadsheet, or a TSV file or a CSV file, is text.

  • But start year looks like years, so what Python

  • function can I use to actually convert text that

  • resembles a number to an actual number?

  • Yeah.

  • So we can do something like this.

  • So I can convert year to an int, and now I

  • can say, if year greater than or equal to 1970,

  • now I'm going to go ahead and do those lines instead.

  • Now, there's an opportunity, surely, for better design,

  • because once your code starts doing this,

  • you've done something suboptimally, right?

  • This is not going to end well if all of my code starts wrapping.

  • So I could clean up the logic in a little bit, but let's go ahead

  • and run this just one more time.

  • This time, changing this to shows1.csv so we can see slightly different

  • outputs.

  • Let me go ahead and run Python import.py.

  • Huh, syntax error, Unicode error, codec--

  • that's a weird one.

  • But this is because backslashes, recall, in C and Python, have special meaning.

  • So when you do something like this, backslash capital n,

  • even though it's not a lowercase n, backslash, recall,

  • is the escape character in C and Python.

  • So this is like telling Python, this is a special character.

  • But it's not.

  • And we've never really had occasion to do this,

  • but how do you think we could output

  • a literal backslash before a capital N?

  • Yeah.

  • So it turns out the solution to this problem, usually,

  • no matter the language, is that if you want a literal character, not

  • an escape character, you literally put another one of it before.

  • So even though this looks a little funky now,

  • this backslash backslash capital N literally will mean backslash N.
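
  • In code, the extra check might look roughly like this, where the doubled backslash yields a literal backslash followed by a capital N:

        if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
            # "\N" is IMDb's sentinel for a missing value, so skip those rows
            if row["startYear"] != "\\N":
                year = int(row["startYear"])
                if year >= 1970:
                    writer.writerow([row["tconst"], row["primaryTitle"],
                                     year, row["genres"]])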

  • All right.

  • So now, let me go ahead and run this on import.py.

  • This time, I'm hopefully going to actually generate

  • a new file called shows1.csv that has even less data that

  • actually is going to contain my shows, but only a subset of them.

  • And let's go ahead and pull the cake out of the oven this way.

  • This is what I get now this time.

  • So if I actually load the CSV, shows1.csv, into Google Spreadsheet,

  • just because it's easier to look at than the black-and-white window,

  • now you can see that I apparently am only getting shows 1970 and onward.

  • And indeed, if I sorted them, I would see no backslash N's.

  • I would have thrown away everything that doesn't meet those criteria.

  • Well, let me go ahead and do one last thing here.

  • I'm going to go ahead and make one more change.

  • And first, let's improve the design here.

  • This indentation is the result of my asking questions again and again

  • and again and indenting if and only if those things are true.

  • But notice, you can start to flip your logic here, right?

  • Instead of saying if the start year does not equal backslash N,

  • what if I just do this and say continue?

  • I can then unindent this--

  • because if you've not used it before in Python and in C,

  • if you say continue inside of a loop, it's not going to continue down there,

  • it's going to [WHOOSH] continue to the start of the loop again.

  • So via this logic, we can actually keep wrapping around again and again.

  • And here, too, we could say, if year less than 1970, I can go ahead

  • and say continue, which would then allow me to unindent this as well.

  • So there are solutions, design-wise, to actually avoiding

  • that infinite indentation.
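
  • A sketch of the flattened loop, then, might read like this, with each continue skipping straight to the next row:

        for row in reader:
            if row["titleType"] != "tvSeries" or row["isAdult"] != "0":
                continue
            if row["startYear"] == "\\N":
                continue
            year = int(row["startYear"])
            if year < 1970:
                continue
            writer.writerow([row["tconst"], row["primaryTitle"], year, row["genres"]])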

  • All right.

  • Let's go ahead and do one last version.

  • Then I'm going to go ahead and pull one out of the oven--

  • premade.

  • So the last thing I didn't load before was this.

  • Suppose that I want to load into the CSV file, all of the genres

  • associated with the show.

  • It looks like all of these shows have one or more genres, just like you

  • were asked for your favorite shows.

  • And so now, we have a CSV file with tconst, primary titles, start year,

  • and genres, where genres is itself a comma-separated list.

  • But there's a fundamental problem here.

  • Even though I have all of this data here, the best I can do

  • is a program like this.

  • Let me go ahead and search for, for instance, the following.

  • Let me go ahead and grab a file real fast.

  • Let me go ahead and grab a copy of shows2.csv

  • and write one final program here.

  • If I want to go ahead now and search this very large, still, data set,

  • in shows2.csv, well, let me go ahead and do this.

  • Import CSV.

  • Let me go ahead now and ask the user for a title.

  • I could use CS50's get string, but there's really no need for that

  • anymore now that we have the title function-- the input function, recall.

  • So I'll just use that.

  • And then I'm going to go ahead and open up shows2.csv in read-only mode.

  • And I'm going to call that my file.

  • Then I'm going to go ahead and give myself a reader from csv.DictReader,

  • passing in that file.

  • And now I'm going to go ahead and, for row in reader, do the following.

  • The goal now is to write a Python program that allows

  • me to search only those TV shows.

  • So I could say something like this, if title == row "primaryTitle",

  • then I can go ahead, for instance, and print out row "primaryTitle,"

  • started in row "startYear."

  • So what is the goal of this program?

  • It's going to ask the user for input.

  • It's going to open this big CSV that I've created.

  • But that's still smaller than the 500-megabyte version.

  • It's going to iterate over every row in that file via DictReader,

  • checking if the title the human typed in equals the current row's primary title.

  • And if so, it's going to print the title and year of that show.
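
  • In its entirety, a sketch of this search.py-- the exact prompt and output wording here are just one possibility-- might be:

        import csv

        title = input("Title: ")

        with open("shows2.csv", "r") as file:
            reader = csv.DictReader(file)
            for row in reader:
                # Linear search: compare the user's input against every row's title
                if title == row["primaryTitle"]:
                    print(row["primaryTitle"], "started in", row["startYear"])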

  • So if I go ahead and run Python of search.py,

  • typing in something like The Crown, Enter, voila, I get that answer.

  • If I go ahead and do The Office, Enter, there's a bunch of Offices.

  • And in fact, if you haven't seen more than the American one,

  • there's the UK one, and apparently several others.

  • And we can actually corroborate this now.

  • If you go to IMDB, where all of this data originally came from,

  • and type The Office, there are all of The Offices that actually line up

  • with our very own data set.

  • It's going to be so damn tedious every time

  • you want to search for data or update data

  • or insert new data to write a Python program to do it,

  • so we need a few more features of SQL.

  • But I think first, we need some Halloween candy

  • for our five-minute break out back.

  • All right, we are back.

  • So we have a whole bunch more data now, because we've downloaded

  • that really big TSV file from IMDB.

  • I've simplified it into some CSV files, but that really only gets me half of the way

  • there, because now if I want to search the data, as with search.py,

  • I still have to look over thousands of rows of TV shows,

  • and it's only going to be linear search by nature of just how files are read,

  • top to bottom, left to right.

  • But it turns out in Python that you can actually write SQL code itself.

  • And CS50 has a Python library that has not only get string and get int

  • and so forth, but it also has a SQL function built in that allows you

  • to connect to, so to speak, a file that ends in something like .db,

  • which is to say, in a moment, we'll start to write some Python code now

  • toward an end of loading a really large data set like IMDB's into a proper SQL

  • database, thereby allowing us hereafter to use all of the power

  • and expressiveness of SQL, and more examples of that in just a bit like

  • SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch

  • of Python code.

  • So to be clear, instead of using Python to search and manipulate our data,

  • we're going to write a script, a program in Python, whose sole purpose in life

  • is to get data from one format into another.

  • And you can imagine this being generally useful, whether it's

  • a Google Spreadsheet you've downloaded or a large data

  • set you found on the internet that you want to use for a final project

  • or for some other class.

  • Python can be a really powerful way of taking one data source as input

  • and producing as its output now, as of today, SQL instead.

  • So let's go ahead and iterate one final time over our title.basics.tsv.

  • But this time, not just save it into a CSV file.

  • Let's put it into a proper SQL database on my own Mac or your PC.

  • So let me go ahead and do this.

  • First, let me go ahead and say, just like on the slide, db = cs50.SQL,

  • and then quote unquote, "sqlite:///--" so the third slash is not a typo.

  • It should indeed be there.

  • And I'm going to say shows3.db, just because this

  • is version 3 now of my import script.

  • I'm going to go ahead, just as last week, and now import CSV--

  • CS50's library as well.

  • But for this to work, the file shows3.db needs to exist first.

  • And there's a couple of ways on a Mac or a PC or a Linux computer, typically,

  • to create an empty file that's ready to receive data.

  • You can literally use the command touch, which

  • just will create an empty file by whatever name you type at the prompt.

  • Or we can do this programmatically.

  • And I'm going to do it programmatically, because I

  • bet I'm going to screw up one or more times here,

  • and it's going to be useful to let my Python program create and recreate

  • the database again and again and again until I get it right.

  • So let me go ahead and open a file called shows3.db in write mode.

  • And recall from Python and C, using fopen,

  • anytime you open a file in write mode, it will overwrite any file that exists

  • or create any file that doesn't.

  • That's all I needed to do.

  • So in Python, too, recall that we were able to use this dot notation.

  • And it turns out here, when you open a file,

  • if you want to immediately close it, because your only goal was to create

  • it, you can just do .close on the very thing you just opened.

  • That is equivalent, just to be clear, to doing

  • something a little more pedantic like this, file = open, and then file.close.

  • But we can collapse this into one slick one-liner, so to speak,

  • by just doing instead what I did a moment ago.

  • All that does is create an empty shows3.db file.
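
  • That one-liner and its more pedantic equivalent, for reference:

        open("shows3.db", "w").close()   # create (or truncate) the file, then close it

        # Equivalently, but more verbosely:
        # file = open("shows3.db", "w")
        # file.close()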

  • Now, open that file for SQLite.

  • And again, SQLite is the light version of SQL

  • that anyone can use on their own Mac or PC.

  • You don't need a special server to get up and running with it.

  • So now let me go ahead and open up title.basics.tsv.

  • And then here, let me go ahead and create myself a DictReader so

  • that I can iterate over the lines in that TSV file.

  • And now, let me go ahead, and for row in reader, do the following.

  • I first want to filter out stuff just as before.

  • So I'm going to say if row bracket "titleType" == "tvSeries"

  • and row "isAdult" == quote unquote "0", then I'm going to go ahead and check

  • one other thing.

  • I'm going to go ahead and give myself a start year variable, similar to before,

  • although I called it year earlier.

  • Then let me go ahead and do row "startYear",

  • just so I can cast that to an int.

  • But I only want to do that if row "startYear" does not

  • equal that special backslash N that IMDB told me to watch out for.

  • So I only want to do that if it's not that.

  • And then if startYear is greater than or equal to 1970,

  • let's go ahead and do the following.

  • Let's go ahead and do genres, gets row "genres".

  • Let's go ahead and get tconst, gets row "tconst,"

  • just so I can put these in some slightly shorter variable names,

  • just to keep myself sane.

  • primaryTitle is going to be from row "primaryTitle."

  • and then let me go ahead and give myself--

  • we already have startYear, so those are the only other three fields I need.

  • So now, I want to go ahead and insert this row from my TSV

  • into a SQLite database.

  • And the operative word that we saw earlier, but that we haven't used yet,

  • is INSERT.

  • We did use SELECT.

  • We did use UPDATE.

  • We did use DELETE.

  • We haven't used INSERT yet.

  • So I'm going to do that in a moment.

  • But first, I need my database to actually exist,

  • so I need to create an actual table.

  • So I'm going to go up here first and do this. db is a reference now,

  • a variable representing my database, and I'm

  • going to call the only function inside of it that's

  • useful for our purposes, called execute.

  • What I can now do is execute any SQL I want.

  • So what do I want to load into this database?

  • I think I want to load in the tconst, the primaryTitle, the startYear,

  • and the genres, just like we had earlier from title.basics.tsv.

  • I want to load rows that represent this kind of data, all right?

  • So how am I going to do this?

  • Well, let me go ahead and create a table.

  • I'm going to call it shows, because that seems nice and conceptually consistent.

  • I'm going to go ahead and create a list of columns now.

  • tconst is going to be one column.

  • primaryTitle is going to be another.

  • startYear is going to be another.

  • And genres is going to be the last.

  • I can literally, that is to say, write SQL inside of a string

  • that I pass to a Python function called db.execute.

  • And because db.execute-- or rather, because db

  • was configured with shows3.db, when I execute this string in Python,

  • it's going to get executed on that database file, shows3.db.

  • So it's a nice way of bridging these two worlds.

  • So I'm going to have to be a little more specific, though.

  • Recall that SQL has a bunch of types.

  • And I'm going to keep it simple.

  • I'm going to go ahead and say that the type of that tconst value is text.

  • The type of the primaryTitle is text.

  • The type of startYear is going to be numeric, kind of a catch-all

  • for dates and datetimes.

  • And then genres is going to be text as well.

  • So the syntax is a little funky.

  • You actually specify the name of the column and then the type,

  • as opposed to the opposite, which we did in C. But that's the way SQL is.

  • So I'm going to go ahead and save that now.

  • And just to comment this, this is going to create

  • a table called shows in a database file called shows3.db,

  • just to be super explicit.

  • So what am I going to do down here?

  • It looks like I have the ability with CS50's library

  • to execute any SQL I want.

  • So let me go ahead and insert into shows the following values,

  • a tconst, a primaryTitle, a startYear, and a genre--

  • and genres.

  • What values?

  • I want to insert these values.

  • Now, I don't know in advance, so I'm literally

  • going to put some question marks here.

  • And it turns out in SQL, this is valid syntax for the library we're using.

  • This is this INSERT query.

  • INSERT INTO the table name a parenthesized list

  • of the columns you want to insert data into.

  • Then, a set of values in separate parentheses.

  • And for now, I'm using question marks for placeholders,

  • for reasons we'll come back to.

  • But I'm going to go ahead and plug the following values

  • into those placeholders, tconst, primaryTitle, startYear, and genres.

  • And what the db.execute function is going

  • to do for me automatically is it's going to look at this SQL query.

  • Notice that, oh, it's got four question marks in it, or placeholders.

  • Those, in SQL, are like the %s was in C, or like the curly braces

  • in Python f-strings.

  • So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately,

  • the following four values, tconst, which is just a variable;

  • primaryTitle, which is the same; startYear, which is the same;

  • and genres, which is the same.

  • So what am I going to do?

  • For every TV series in this file that's not an adult series

  • and that started in 1970 or later, insert it into my database.
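
  • Assembled into one place, a sketch approximating this third version of import.py might look like this:

        import csv
        import cs50

        # Create (or overwrite) an empty database file, then connect to it
        open("shows3.db", "w").close()
        db = cs50.SQL("sqlite:///shows3.db")

        db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)")

        with open("title.basics.tsv", "r") as titles:
            reader = csv.DictReader(titles, delimiter="\t")
            for row in reader:
                if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
                    if row["startYear"] != "\\N":
                        startYear = int(row["startYear"])
                        if startYear >= 1970:
                            tconst = row["tconst"]
                            primaryTitle = row["primaryTitle"]
                            genres = row["genres"]
                            db.execute("INSERT INTO shows (tconst, primaryTitle, startYear, genres) VALUES(?, ?, ?, ?)",
                                       tconst, primaryTitle, startYear, genres)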

  • If I've made no typos, I'm going to go ahead and run this, cross my fingers,

  • and enter.

  • This one is going to take more time, because it turns out writing

  • to a CSV file is actually pretty quick.

  • You can just write row, write row, write row, write row,

  • but inserting into a SQLite database is going to take more time.

  • More time upfront, but it's going to be a lot faster to search thereafter.

  • So let me go ahead and do the cake in the oven thing

  • and go ahead and now open up a file I made

  • in advance in today's src3 directory called shows3.db

  • using SQLite3, that command line program we used earlier.

  • Recall that I can say .schema to see the types of data in the database.

  • And indeed, look what I've done in advance.

  • I created a table called shows with exactly those columns in a--

  • with exactly these four columns, tconst, primaryTitle, startYear, and genres.

  • But I did this in advance to save us time so that I can now do SELECT

  • * FROM shows.

  • And let me not get all of them.

  • Let me do the first 10, semicolon.

  • All right.

  • So we see the first 10 shows from IMDB in whatever order

  • IMDB distributes them.

  • You can see their ID numbers are incrementing.

  • And All My Children.

  • I remember growing up with that show years ago.

  • And it seems that that's a drama--

  • it's not really mystery, but so be it, and a romance show there.

  • But it's indeed 1970, as is every show thereafter, from 1970 onward.

  • I can go ahead and search for more like this.

  • Let me give myself the first 100 shows.

  • All right, so this is a pretty large data set.

  • And let me go ahead and count them all.

  • Recall that you can use a COUNT function so that we don't

  • have to print them all on my screen.

  • It looks like there are 153,331 TV series in IMDB.

  • Crazier than that, you want to know how many of them came out this year?

  • WHERE startYear = 2019;.

  • There were 6,099 new TV shows in the world this year according to IMDB.

  • Just one of those is The Office that won our--

  • won our vote earlier.
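
  • For reference, the queries at the sqlite3 prompt were along these lines (LIMIT caps how many rows come back):

        SELECT * FROM shows LIMIT 10;
        SELECT COUNT(*) FROM shows;
        SELECT COUNT(*) FROM shows WHERE startYear = 2019;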

  • So SELECT *, star denoting wild card, in this case, everything.

  • And unfortunately, SQL has two wild cards.

  • Star means select all of the columns in the table.

  • Percent means let any characters come before or after a quoted

  • string in a WHERE clause.

  • So let me go ahead and SELECT * FROM shows WHERE title = The Office,

  • and we'll actually see-- whoops.

  • primaryTitle, sorry.

  • Let me fix that. primaryTitle = The Office.

  • There are all of those Offices.

  • And indeed, 2005 is probably the one we know and love.

  • If I go to a browser and go to IMDB slash title slash that ID,

  • indeed, that's probably the one we're all thinking of,

  • unless you voted for the UK version instead.

  • So again, this is actually real, live data that we're now playing with.

  • Well, what more can we do?

  • Well, there's one thing that I don't really

  • like about this, which is that when we select all of the columns--

  • and let's go ahead and do this.

  • Let's select another hundred of them before.

  • This feels a little messy that we have all of these nice, clean columns except

  • for when we get to genres.

  • Then we just have this arbitrary comma-separated list.

  • Suppose, for instance, I want to search for all

  • of the comedies that came out in 2019.

  • I could say SELECT * FROM shows where genres = "Comedy" AND--

  • turns out you can use conjunctions like this--

  • startYear = 2019.

  • So that gives me a whole bunch.

  • Let's count them.

  • So COUNT this here, Enter.

  • OK, a thousand of those 6,000 shows are comedies.

  • But I think that's an underestimate.

  • Why is this query buggy at the moment?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, some of them had more than one genre,

  • so comedy is somewhere in that comma-separated list.

  • And so what I should probably do instead is not say genre = "Comedy,"

  • but maybe genres LIKE "Comedy," and allow something maybe to appear before,

  • something maybe to appear after.

  • And that's going to give me 1,593 comedies that

  • came out this year in 2019.
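
  • Side by side, the two approaches-- exact equality, which undercounts, and LIKE with % wildcards on either side:

        -- Misses shows whose genres field is, say, "Comedy,Drama"
        SELECT COUNT(*) FROM shows WHERE genres = 'Comedy' AND startYear = 2019;

        -- Matches "Comedy" anywhere in the comma-separated list
        SELECT COUNT(*) FROM shows WHERE genres LIKE '%Comedy%' AND startYear = 2019;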

  • So that seems a little better.

  • But this is not very robust, right?

  • Once you start resorting to techniques like this,

  • it should start, as a programmer, to rub you the wrong way.

  • It's kind of a hack, right?

  • Like, you're searching for comedy, but there could be something before it

  • or something after it, and odds are, there

  • is no other word I can think of in the world of genres

  • that starts with or ends with comedy, so we're probably OK.

  • But this is kind of hack-ish, that you're just kind of searching.

  • It would be nice if we could just search for a specific column called genre.

  • So how can we go about doing that?

  • Well, let me go ahead and do the following instead.

  • Let me go ahead and open up a final version of my import script,

  • this one that does two things up top.

  • At the top, I'm going to create two tables, one called shows, which has--

  • I'm going to clean up the column names, too.

  • IMDB is a little nonconventional.

  • What most people would do when describing a unique identifier,

  • they're going to call it ID, not tconst.

  • So we're going to rename it to ID.

  • They're not going to call their title primaryTitle.

  • They're going to call it title, so we're going to rename it title.

  • They're not going to name it startYear.

  • We're going to call it year.

  • And then that's it.

  • We'll come back to primary key in just a moment.

  • But notice this.

  • In my new and final version of this script, I'm creating,

  • I propose, a second table called genres whose purpose in life

  • is to contain a value called show_id and another one called genre.
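
  • Roughly speaking-- the staff's actual script may differ in its exact constraints-- those two CREATE TABLE statements might look like this:

        CREATE TABLE shows (
            id INTEGER,
            title TEXT NOT NULL,
            year NUMERIC,
            PRIMARY KEY(id)
        );

        CREATE TABLE genres (
            show_id INTEGER,
            genre TEXT NOT NULL,
            FOREIGN KEY(show_id) REFERENCES shows(id)
        );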

  • So what's going on?

  • Well, let me go ahead and show this.

  • If I load the resulting database from this one, shows4.db, and I do .schema,

  • you'll see that I indeed have two tables.

  • Let me go ahead and SELECT * FROM shows WHERE title, this time,

  • because I've renamed it from primaryTitle, = The Office.

  • OK.

  • That's a lot of Offices.

  • But let's go, AND year = 2005, which is the one we're all thinking about.

  • And it's that one.

  • And now, notice this.

  • Notice that I'm getting back what?

  • An ID, I'm getting back a title, and I'm getting back a year, but no genres.

  • That's because there's another table now called genres that's separate.

  • And you know, I'm kind of curious.

  • I see that a genre table has show_id.

  • Let me go and do this.

  • SELECT * FROM genres WHERE show_id =, and let

  • me do a little copy paste here, = this show_id.

  • And what might I see?

  • Comedy.

  • So what have we done now?

  • For any TV show that was in IMDB's database that

  • was a comma-separated list of genres, I've exploded it, so to speak.

  • I've split that value on the commas.

  • And if the show is a comedy, I've added a row in this genres table,

  • but then I've jotted down the show's ID next to that genre

  • so I remember that that show was of that genre.

  • But if another show has multiple genres-- for instance,

  • let's go ahead and search for not The Office but, say, The Crown.

  • And there's only one of those.

  • And now I do SELECT * FROM genres WHERE show_id = this number,

  • we'll see that, oh, The Crown now has drama and history as a genre.

  • And so therefore, in the genres table, notice

  • that there's two IDs and two genres.

  • But now we can use equality.

  • I can now search for all of the comedies from 2019 in kind of a powerful way,

  • SELECT * FROM shows WHERE id IN--

  • and here's the cool part--

  • SELECT show_id FROM genres WHERE genre = "Comedy", close parenthesis, AND year = 2019.
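
  • Written out with its parentheses, the query is presumably this:

        SELECT * FROM shows
        WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy')
        AND year = 2019;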

  • So this is admittedly a mouthful, but let's consider what's happening.

  • First, I'm selecting star from shows.

  • That means give me all the shows in all of the columns.

  • But filter as follows, only show me those shows where the ID of the show

  • is in the following list of IDs.

  • Now, you can look at a nested query inside the parentheses here.

  • This list here selects all of the show IDs from the genres table

  • where genre = "Comedy".

  • So the highlighted parenthetical right now returns essentially a list

  • of all of the ID numbers of shows that are associated with comedy,

  • even if they're associated with other things, too.

  • And we're making sure that the year equals 2019.

  • So if I now hit Enter, we'll see a whole bunch of results, but we should see,

  • if I count these by using my usual syntax,

  • there were 1,593 shows that are comedies in 2019.

  • That does happen to equal the same count we did earlier by using like,

  • but this is better designed in the sense that there's no ambiguity.

  • You're not just hackishly looking for a substring,

  • so to speak, in a comma-separated list.

  • You can actually now search more robustly

  • by having redesigned your data.

  • And what we've done really is something like this.

  • Instead of storing our data in just one table called shows,

  • and every show has an ID, a title, a year, and genres,

  • we've instead exploded those genres into two separate tables, such

  • that now our shows table looks like this.

  • We have an ID, a title, and a year.

  • And notice that the ID of a show can also appear over here

  • in another table called genres, and just by convention, to keep us sane,

  • instead of calling it ID here, we've called

  • it show_id to connote that it came from a table called shows, plural,

  • but it's a single ID from a show.

  • So by convention, humans often call them table name, minus the S, underscore ID.

  • And then the keyword here, or genre, is comedy or drama or documentary

  • or some other genre as well.

  • And now, this is a little subtle, but the fact

  • that this little symbol here, drawn from the database world,

  • flails out into three separate places, this

  • is a one-to-many relationship, so to speak.

  • You can have one show over here mapping to many genres over here,

  • or maybe zero, but it's zero or more possible genres.

  • All right.

  • Any questions just yet?

  • All right.

  • So the real power then, to be clear, is coming, now,

  • from this kind of expressiveness.

  • So now, let's play around with some other queries and features.

  • But first, let's give ourselves a bit more data.

  • It turns out besides the file called title.basics.tsv,

  • IMDB.com makes a bunch of others available to us as well.

  • There is one called name.basics.tsv.

  • And this is one that has information on all

  • of the names of actors and actresses and directors and writers

  • and other people in the world.

  • So for instance, there is an nconst, which is like a name constant

  • or an ID, nm2946516, which happens to belong

  • to the actor whose primary name is Claire Foy, the star of The Crown.

  • She was born in 1984, and there are some other fields in that file as well.

  • But also juicy is this file, title.principals.tsv,

  • and this is where it gets interesting, too.

  • In this file, notice there are no actual titles.

  • There's no primary titles.

  • There's no actual human names.

  • Instead, there's just two unique identifiers, a tconst and an nconst,

  • which IMDB speak for a title identifier and a name identifier.

  • So for instance, in one of the rows in this TSV file called

  • title.principals.tsv, there is a row that starts with tt4786824;

  • also has nm2946516, and has the word actress,

  • thereby implying that if you look up the nm--

  • the nconst in the names file and you look up the tconst in the titles file,

  • you will be able to, by transitivity, infer that Claire Foy is in The Crown.

  • This allows us to have a many-to-many relationship.

  • One movie or show can have many actors,

  • and one actor can be in many shows, so we're using a sort of join file here,

  • a join table, that's going to somehow allow

  • us to link two different data sets together, and more on that in a moment.

  • But what's really fun in IMDB is that it also has a bunch of ratings

  • that humans have typed in saying, I give this 10 out of 10, or 0 out of 10,

  • and so forth.

  • And they keep track of the number of votes that shows have gotten.

  • And so in title.ratings.tsv, yet another file you can download from IMDB,

  • you can look up a given tconst, the unique identifier for a title,

  • what its average rating is and the number of votes.

  • And in fact, if I pull this up, for instance, on The Crown, if I go back

  • to IMDB itself--

  • IMDB, search for The Crown, the 2016 version,

  • you'll see that indeed, it is an 8.7, which lines up with exactly what

  • we have here.

  • But over time, that number is going to go up or down, because IMDB is updating

  • their data set every day as well.

  • So besides this data, we also then have the ability

  • to consider what this all looks like collectively.

  • So in this case here, here is another diagram.

  • It's more complicated now, but it just captures the intuition

  • that you would derive by just reading through IMDB's documentation, which

  • defines the following.

  • It turns out if you read closely among those files,

  • you'll see that, oh, we can glean a whole bunch of shows that are going

  • to have IDs, title, year, and episodes.

  • I want to go ahead and associate those shows with a whole bunch of stars.

  • But people are the entities in the world that have IDs, names, and birth years.

  • So now things get a little weird.

  • Let's focus only on these two tables here.

  • This is a diagram of the goal at hand, and this

  • is a script I wrote in advance.

  • And in a moment, we'll open up a SQL database that represents this.

  • There's going to be a table called shows, every row of which

  • has an ID, title, year, and some number of episodes,

  • so you can see how long-running it is.

  • There's also going to be a table called people.

  • Claire Foy is going to be among them.

  • She and the other humans will have an ID, name, and birth year associated

  • with them as well.

  • But there's going to be this other table here called stars.

  • And you'll notice there's a line that links shows to stars to people,

  • again by transitivity.

  • If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID,

  • a.k.a.

  • nconst, that links, for instance, Claire Foy to The Crown.

  • It's going to link Steve Carell to The Office.

  • It's going to link every other actor to their show as well.

  • Similarly for writers, we won't play too much with the writers today,

  • but writers are people, too, just as stars are people.

  • And so here's another feature or design goal of SQL.

  • You want to ideally factor out the commonalities, so

  • that Claire Foy appears only in one place by name,

  • but her unique identifier might appear in bunches of places.

  • There's a lot of actors in the world who are also writers themselves.

  • They are, at the end of the day, people, but they

  • might appear both in the stars table and in the writers table

  • by way of their person ID.

  • So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire

  • Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place.

  • You want to have one authoritative place for all

  • of your people, one authoritative place for all of your shows,

  • and then you have these other tables called

  • join tables, which are similar in spirit to the TSV files

  • you can download that somehow link these identifiers together.

  • So if I, on my computer here, open a file that I prepared in advance,

  • that you can also play with online as well, called, finally, shows.db.

  • And I do .schema on this one.

  • This is the largest of the databases, and this is the one that we, the staff,

  • actually generated from all of the TSV files online,

  • but we threw away a lot of information.

  • We threw away anything before 1970, and we filtered out

  • everything except TV series, so that we can actually

  • play with the data in this data set.

  • But honestly, using a black and white prompt and a terminal window

  • tends not to get fun, especially when the data flies over the screen.

  • So there are also things called GUIs, Graphical User Interfaces,

  • and indeed, there is a program that's freely available for Macs,

  • PCs, and other types of operating systems, called DB Browser.

  • And indeed, we'll point you at this online.

  • This is just a program that allows you to explore SQL files on your own Mac

  • or PC much more visibly--

  • or much more visually and much more pleasantly than maybe just

  • a command line interface allows.

  • So let me go ahead and open up, for instance, shows.db.

  • And we'll see a whole bunch of things.

  • First of all, the graphical user interface

  • here shows me the same information, just in slightly prettier format.

  • Shows.db, per my diagram a moment ago, has six tables, people,

  • like I mentioned; shows, like I mentioned;

  • also, stars and writers, ratings, and then that separate table called genres.

  • And if you look over here on the right, you

  • can actually see the SQL code we wrote to create those tables.

  • But cooler than that is that notice these tabs up top here?

  • I am currently on Database Structure, and if you're using Windows,

  • your interface will look a little different,

  • but the options are still there.

  • I'm going to go ahead and click on Browse Data.

  • And now you'll see a little dropdown of all of the tables.

  • So if I want to go ahead and look at a whole bunch of shows,

  • I can actually see all of my data here, not unlike Google Spreadsheets.

  • But notice, this is 153,331 movies--

  • or shows, rather, that I can see altogether here.

  • And lastly, what's cool is that if I go over to the SQL tab,

  • I can now execute some sample queries.

  • So let me go ahead and do this.

  • SELECT * FROM shows;.

  • This is going to give me a whole lot of shows.

  • Enter.

  • So I hit the little Play button that just executes that query,

  • and you see the resulting rows that have come back.

  • So again, how did we get to this point?

  • We, the staff, downloaded all of those TSV files in advance.

  • We wrote a Python script that imported all of the data from those files

  • into memory, threw away the stuff we didn't care about,

  • and then inserted it into tables like this table called shows.

  • And what's nice about this tool is, because if you are an aspiring data

  • scientist or you're just trying to get some analytical work done

  • for this class or any other, or any other project,

  • a graphical tool lends itself to just kind of poking around.

  • So for instance, you'll see that the shows table no longer has any genres.

  • But that's OK.

  • We can reconstitute that data.

  • I can go ahead and SELECT * FROM shows--

  • but you know what I can do?

  • I can actually join in the genres table.

  • So I can take the shows table here, the genres table here,

  • and essentially link them together by way of the ID.

  • How is that possible?

  • Well, if you look at shows, a show has an ID title, year, and episodes.

  • Genres has a show ID.

  • So think of ID in shows as represented by my left hand here.

  • Show ID in genres is represented by my right hand here.

  • What we want to do is lineup ID with show ID

  • to make one larger, wider table that constitutes all of that data together.

  • So how do I do this?

  • Well, in SQL, you can join two tables.

  • You say what table you want to join with what other table and

  • how you want to do it.

  • Well, I want to go ahead and join it on shows.id = genres.show_id;.

  • And now, when I hit execute, it took a moment--

  • indeed, it took 408 milliseconds.

  • But my god, that's a lot faster than writing a whole Python

  • script to do this.

  • Now I have a table with all of the shows as before.

  • But notice the table got wider.

  • This is a temporary table that SQL has returned to me.

  • This one now has genre and show ID.

  • So in fact, let me go ahead and filter this.

  • WHERE title = "The Office" AND year = 2005, Play, we'll just get this.

  • So notice I have a wider table containing all of the columns

  • from both of those joined tables.

  • But if I change this now to The Crown, and that was 2016, and hit play,

  • notice I get back seemingly redundant information.

  • But this might be useful nonetheless, because now I

  • can iterate over all of the rows knowing that every row has not only a title

  • and a year, but also a genre as well.

  • So I can reconstitute the table like that.

  • Well, how can I ask other questions as well,

  • like what are actors that are in other shows as well?

  • Well, let me go ahead and do something like this.

  • For instance, let's select all of Steve Carell's movies.

  • Or-- let's select Steve Carell himself first.

  • So SELECT * FROM--

  • let's see-- no.

  • Let's do this.

  • How about Ellen?

  • So SELECT * from people where name = Ellen DeGeneres.

  • I spelled that right.

  • Semicolon, Play.

  • All right.

  • So this is handy, because I now know that Ellen's birth year is 1958,

  • but her ID is 1122.

  • That's the same thing as that nconst, but we threw away the nm

  • and we got rid of all the zeros and made it into a proper number for efficiency,

  • better to uniquely identify humans, typically-- or anything--

  • by numbers which fit in 32 or 64 bits rather than longer strings.

  • So now I know Ellen's ID is 1122, what can I do?

  • Well, let me go ahead and SELECT * FROM stars WHERE person_id = this.

  • This will tell me what?

  • This will tell me all of the information about shows that Ellen starred in,

  • including, presumably, her own.

  • OK, so I now see person_id is the same, the same, the same,

  • but Ellen is apparently in all of those shows.

  • But that's not that helpful, and also, this

  • is kind of lame that I've just hardcoded Ellen's ID.

  • But I don't have to do that.

  • I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres",

  • closed parenthesis, Play.

  • Now it's dynamic, so now I've not hardcoded anything.

  • But this isn't that useful.

  • Let me go ahead and just select show_id here.

  • So I now have SELECT show_id FROM stars WHERE the person_id

  • = whatever Ellen's ID is here.

  • How can I take this one step further?

  • Well, what if I do SELECT * FROM shows WHERE the ID of the show

  • is in the following list of values?

  • So not only does SQL support equals, when

  • you want to compare one value against the next, much like in Python,

  • you have the keyword IN, where you can say select everything from shows where

  • the ID of the show is in the following list of shows,

  • which happens to represent the list of show IDs that Ellen is in.
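
  • Step by step, then, the queries build up roughly as follows:

        SELECT * FROM people WHERE name = 'Ellen DeGeneres';

        SELECT show_id FROM stars
        WHERE person_id = (SELECT id FROM people WHERE name = 'Ellen DeGeneres');

        SELECT * FROM shows WHERE id IN
            (SELECT show_id FROM stars WHERE person_id =
                (SELECT id FROM people WHERE name = 'Ellen DeGeneres'));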

  • Phew.

  • Let's hit Play on this.

  • It took a moment, but it looks like these are all of the shows

  • that Ellen has been in, according to IMDB.

  • And it looks like The Ellen DeGeneres Show is one.

  • She's been on the air for 2,865 episodes.

  • There is Ellen, her original TV show, which was on for a few seasons back

  • then.

  • Looks like she's doing some game shows these days as of 2017, and so forth.

  • So using these step-by-step thought processes

  • can we actually build up more interesting queries

  • to get back information like that.

  • All right, any questions before we try a few others out as well?

  • No?

  • All right.

  • Well, let me show one other approach to this same problem.

  • It turns out this is what are known as nested queries.

  • You keep using the parenthesization, much like in math,

  • where you can nest arithmetic expressions in parentheses.

  • You can do the same in SQL.

  • But you can also join information in slightly different ways as well.

  • I can actually do something like this.

  • Let me go ahead and SELECT Title FROM the following tables,

  • people JOIN stars ON people.id = stars.person_id--

  • and we'll walk through this in just a second--

  • JOIN-- sorry--

  • JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres."
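
  • Written out as one statement, that explicit join is roughly:

        SELECT title FROM people
        JOIN stars ON people.id = stars.person_id
        JOIN shows ON stars.show_id = shows.id
        WHERE name = 'Ellen DeGeneres';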

  • This, while more of a mouthful, is equivalent to what I've just done.

  • Notice that I've selected the title of the show

  • from the following, the result of joining people on stars.

  • How do you join the people's table and the stars table?

  • Well, people have IDs.

  • Stars have person IDs, according to the diagram.

  • What else do you want to join together?

  • Let's join all of that with shows.

  • How?

  • Well, let's go ahead and say stars.show_id = shows.id.

  • So it's a way of linking, transitively, multiple tables together,

  • and then filter all of that with the name of Ellen DeGeneres.

  • This is what we would call an explicit join.

  • And if I click Play there, notice it's taking a moment.

  • Took, in fact, 1,990 milliseconds, almost two human seconds.

  • So it got slower, but it does give me another syntax

  • via which to select data.

  • And honestly, what's cool about this is I can select other fields as well.

  • They don't have to come from the original query in my nested selects.

  • But let me go ahead and do this, just as a teaser, and then

  • we'll consider some problems before wrapping up.

  • So it turns out 2 millisecond--

  • 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds.

  • If I do it again, it took 1,963 milliseconds,

  • and I'm looking at the time here.

  • That's slow, right?

  • That means if you have a finite number of servers in the world

  • and each of them can only handle some number of users at a time,

  • it's kind of a waste of hardware and of money, arguably, to spend more seconds

  • or milliseconds servicing one query from a given user on your app or your web

  • application.

  • But it turns out we can do something kind of smart here.

  • Notice that in our database structure, we've done a few things here.

  • Let me open up, for instance, people.

  • Rather, let me go over to people, and you'll see a few keywords

  • that I ignored earlier.

  • When you're defining a table, you can specify that one of your columns

  • is what's called a primary key.

  • That is the column by which you're going to uniquely identify all your data,

  • so it's like those numeric IDs we've given every person and show.

  • A foreign key is what we call the same number

  • when it appears in some other table.

  • So when we saw a person_id or show_id, those are what are called foreign keys,

  • because it's the same numbers, but in another table

  • that we're using to join things together, eventually.

  • And you can also define columns to be unique,

  • or you can index them for efficiency.

  • And that's, in fact, where we're going right now.

  • If you look at the same query, you'll notice that I'm

  • searching on a bunch of columns.

  • I'm clearly searching on the name field.

  • I'm also searching on-- that is, joining on-- show_id,

  • and I'm searching on person_id, not to mention the individual ID fields.

  • But the cool thing about a primary key-- and in advance, we defined all of our ID columns,

  • all of the columns called ID, to be primary keys--

  • is that you get a feature for free with SQL.

  • SQL builds up what's called an index, a very fancy data

  • structure, a tree-like structure, that actually allows you to search

  • for information quite efficiently.

  • So when you define a column to be a primary key,

  • you get what's called an index.

  • Specifically, if we go back a few weeks, to tree, our discussion of trees,

  • it looks like this.

  • This is what's called the B-tree, which is not a binary tree.

  • It's a B-tree in the sense that it's got a lot of nodes

  • that might have one or two or more children each,

  • but it's very short with a lot of nodes, very-- a lot of width.

  • So that means when you store your data in a tree-like structure,

  • long story short, it just tends to be very efficiently searchable.

  • So when you define a column to be a primary key,

  • you get that speed for free.

  • But we can also tell our SQL database, you know what?

  • I plan on searching on person IDs and show IDs and names, also,

  • so let's go ahead and create myself some indexes.

  • Let me go ahead and execute the following queries just once.

  • Let me create something called an index called person_index on the stars table

  • using its person_id column.

  • So pretty quick syntax.

  • This means create a B-tree, create a fancy tree structure,

  • called person index, on the stars table by person--

  • on the person_id column.

  • That is, make a tree like this in memory to store all of the person IDs.

  • Why?

  • So I can find them faster than linear search in the column itself.

  • Let me go ahead and execute this.

  • And you'll see it takes a moment.

  • It's thinking.

  • It took me 1.6 seconds.

  • So a little slow, but I only have to do this once.

  • Let me create two more.

  • CREATE INDEX called show_index ON stars ON the show_id column.

  • So almost the same, but this just means give me

  • a tree that looks like this in memory, so

  • that when I query for data like Ellen DeGeneres,

  • it searches a tree-like structure instead of a list-like structure

  • in a column alone.

  • Let me execute that.

  • And I screwed up ON stars show_ID.

  • There we go.

  • That took 1 second.

  • And lastly, let's create an index called name index on people on the name column

  • so that I can search for people by name more efficiently as well.
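
  • Those three statements, for reference:

        CREATE INDEX person_index ON stars (person_id);
        CREATE INDEX show_index ON stars (show_id);
        CREATE INDEX name_index ON people (name);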

  • Otherwise, without these indexes on a column like name,

  • it is going to check every damn cell in the column looking

  • for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N

  • running time, or linear search.

  • So now, let me go back to my query here.

  • I've not made any changes to the query.

  • The last time I ran this, it took almost 2 seconds.

  • Now, after creating these indexes and telling my SQL database

  • I plan to search on those columns, watch this,

  • and watch the blue highlighted number.

  • 8 milliseconds.

  • 8 millisecond.

  • 7 milliseconds.

  • 7, 7, 7.

  • Looks like 4 milliseconds that time.

  • So this is what the Googles of the world, the Facebooks of the world,

  • the Microsofts, who have very large data,

  • they not only store their data in databases

  • like we are here, but they also index their tables intelligently,

  • drawing in ideas from weeks ago, so that the database, for them, and for free,

  • and sort of magically, creates these kinds of structures in memory,

  • but does it so that you can search and insert and update

  • your data all the more efficiently.

  • So there's got to be some problems or some prices paid.

  • So let's consider just a couple of those.

  • So what problems actually arise when using a SQL database like this?

  • Well, the first is what's called a race condition.

  • And perhaps the best way to explain this is by way of a--

  • sort of a story that I was told when I took a class like this years

  • ago in a class called CS161, Operating Systems.

  • So contrived scenario, but consider this.

  • You and your roommates have a fridge in your dorm room,

  • and you and your roommate really like milk.

  • And one of you comes home one day, opens the fridge, and sees, oh, dammit,

  • we're out of milk.

  • And so you close the fridge and you walk into Harvard Square

  • and you head to CVS or some other such store.

  • Meanwhile, your other roommate comes home, opens the fridge,

  • really wants some milk, but, oh, darn it, we're out of milk.

  • So they close the fridge, walk outside, and head

  • to some other store that sells milk, and for reasons that we're contriving,

  • don't bump into you at CVS.

  • A few minutes later, you both get home, of course, having bought some milk.

  • But of course, milk doesn't last that long,

  • so now you've got twice as much milk, and surely some of it's

  • going to go bad.

  • Like horrible, horrible problem.

  • Why, right?

  • It's contrived, but this-- fond memories of when it was taught to me.

  • So what is the fundamental problem here, if you put your finger on it?

  • Both you and your roommate were allowed to inspect

  • the state of that refrigerator without knowing that the other was

  • about to do the same.

  • Better would have been for one of you to leave a note on the fridge saying, gone

  • for milk, so that the other person doesn't do it as well.

  • Or more stringently, just lock the refrigerator

  • so that no one else besides you can see inside of it until you have

  • updated the state of that refrigerator.

  • So this comes into play all too often these days

  • when dealing with really large data sets.

  • The Twitters of the world, the Instagrams of the world, all of you

  • are probably in the habit of liking or uploading posts

  • on those sites or some other.

  • But if a lot of other people are, too, especially when things go viral,

  • there might be code like this underneath the hood that

  • essentially does the following.

  • Here's three lines of representative Python code that use SQL.

  • The first line here selects the number of likes from a table

  • called posts where the ID of the post is something,

  • a variable called ID, 1, 2, 3, 4, 5.

  • That just tells me, what is the total number of likes this post has?

  • Then, I store that answer in a variable called likes.

  • And you'll see this syntax in problem set 7

  • as you begin to play with SQL yourself in the CS50 library.

  • Suppose that your third line of code is to then update posts,

  • setting likes equal to some placeholder, where that

  • placeholder is going to be likes plus 1.
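
  • In other words, the representative (and race-prone) code is roughly this, where id is the post's identifier:

        rows = db.execute("SELECT likes FROM posts WHERE id = ?", id)
        likes = rows[0]["likes"]
        db.execute("UPDATE posts SET likes = ? WHERE id = ?", likes + 1, id)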

  • This is the same problem as the milk example,

  • because if you check how many likes there

  • are on this post, or your roommate checks how--

  • or if you check how much milk is available left in the fridge

  • and then go off and begin to make a decision like buying milk,

  • your roommate might, on a server that's running

  • the same code to handle thousands and thousands of people,

  • might ask that same question.

  • How many likes does this current post have?

  • Suppose the post has a million likes.

  • Both of you execute that first line of code,

  • see that, oh, I have a million likes.

  • I've just clicked plus 1 on the site.

  • And so you try changing the number of likes with this second update query

  • to 1 million plus 1.

  • But if both of you have been making a decision that

  • gets interleaved with the other person, as will absolutely

  • happen with Instagram and Twitter, who have thousands

  • of servers all operating in parallel, instead of changing the number of likes

  • from 1 million to 1,000,002, both of you might change it to 1,000,001,

  • thereby wasting one of the counts, because it gets lost,

  • because you inspected the state of the fridge or the likes.

  • You made a decision.

  • And then you updated it based on that decision

  • without realizing that life changed in between.

  • Your roommate got home and checked the fridge,

  • too, or someone else clicked the upvote.

  • So this is a bad situation, but there are solutions in SQL

  • that we won't look at this week, but what are known as transactions.

  • That is a solvable problem, and more on that in a higher level database class.

  • And one final example of this, because this one you

  • can avoid throughout the coming weeks in CS50 and beyond.

  • A SQL injection attack is when you write bad code that

  • somehow allows some bad actor on the internet or an application

  • to trick your code into running SQL code that you did not intend.

  • For instance, this is how Yale students log on in New Haven.

  • This, of course, is how Harvard students log on here in Cambridge.

  • You're asked for a username and password, or a Harvard key

  • and a password.

  • Suppose, though, you as a user don't cooperate,

  • and instead of typing in your email address and password,

  • suppose you type something like this, like your email address,

  • Malan@Harvard.edu, then maybe a single quote, and then a dash dash.

  • Well, it turns out in SQL, dash dash is how

  • you start a comment, similar to hash in Python or slash slash in C.

  • But suppose that Harvard had implemented its website such

  • that there's a query like this going on underneath the hood,

  • if the Harvard key is implemented in Python, that says this, SELECT * FROM

  • users WHERE username = placeholder AND password = placeholder,

  • passing in username and password.

  • That is good.

  • That is correct code, because those placeholders are important.

  • What the CS50 library does, and other libraries in the world like it, is

  • it escapes user input.

  • If a user tries to trick your code into inputting comments or single quotes,

  • db execute and other libraries will sanitize the user's input

  • and prevent the following from happening.

  • If, however, you do this-- not good, but bad.

  • Suppose that you really practice what we preached last week using f-strings,

  • this is now bad in the context of SQL.

  • If you create a format string like this with a little f

  • that is literally SELECT * from users WHERE username =,

  • and then use the fancy Python notation for curly braces,

  • and password = curly braces, this is a correct SQL construction

  • so long as the human cooperates and puts their username here and password here.
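
  • To contrast the two approaches concretely-- placeholders versus an f-string-- here is a sketch:

        # Good: placeholders, which db.execute escapes before running the query
        rows = db.execute("SELECT * FROM users WHERE username = ? AND password = ?",
                          username, password)

        # Bad: an f-string pastes the user's input, quotes and all, into the query
        rows = db.execute(f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'")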

  • But what if the user is malicious, like me,

  • and actually includes a single quote in their input, and then dash dash?

  • So Malan@Harvard.edu quote dash dash would

  • have the effect of injecting SQL into the query by way of the user's input.

  • But dash dash I claim as a comment, so it's

  • a way of tricking the computer into executing not SELECT * WHERE user--

  • SELECT * FROM users where username = this AND password = that,

  • it has the effect of just saying SELECT * FROM users where

  • username equals Malan@Harvard.edu.

  • And if-- and you'll see this in p set 7--

  • db.execute returns one or more rows from the table--

  • that's how you can actually select data using Python code--

  • and the number of rows equals 1, because it's

  • selected Malan@Harvard.edu, you could log in

  • as me, because you very simply tricked the computer into executing code

  • that you injected.

  • And the biggest takeaway here is that this is a 100% solvable problem.

  • You just need to know that it is solved and not

  • do code like this using f strings in Python, but instead

  • to use library code like db.execute, or after this course, anything like it.

  • With that said, we'll end with a joke.

  • OK.

  • That's it for CS50.

  • We'll see you next time.

[MUSIC PLAYING]
