Placeholder Image

字幕列表 影片播放

  • [MUSIC PLAYING]

  • DAVID MALAN: All right.

  • So this is CS50, and this day we continue our look at Python,

  • but also introduce another language.

  • This one is called SQL, or Structured Query Language,

  • and it's going to allow us to solve some different problems as well

  • as some old problems better.

  • But to do this, let's try to motivate some of the problems initially.

  • So this, of course, is just an example of a spreadsheet.

  • Odds are you use these pretty frequently,

  • whether it's in the form of Google Spreadsheets or Excel

  • or Apple Numbers and the like.

  • And this is actually the simplest form of what we might call a database,

  • just some kind of application that stores lots of data.

  • And odds are if you've ever taken registrations for a student group

  • or had people sign up or fill out a form or a survey,

  • odds are you've used Google Forms to get your data into a Google Spreadsheet.

  • And so we thought we would begin there today,

  • a representative problem whereby you might want to collect a lot of data

  • from users and then do something with it,

  • but ideally more powerfully than you could do as a human on your own.

  • So we've put into place a Google Form that quite simply

  • asks a couple of questions today.

  • What are your favorite TV shows?

  • And we ask you for the title of your favorite show and the genre

  • or genres into which that show falls.

  • So if you wouldn't mind, go ahead and open up this URL here

  • on your laptop or phone.

  • If you don't have one, that's fine.

  • You can use the person's next to you, if I may.

  • And take just a moment to go to that URL there and answer those two questions.

  • What is the title of your favorite TV show, and into what genre or genres

  • does it fall?

  • Favorite TV shows.

  • And when you go to that URL, you'll find yourself at, again, this form here,

  • asking for your title and for your favorite show.

  • So by the look of my own screen here we've

  • got some 62 responses, 74 responses coming in already,

  • so we can track this live.

  • So 111 responses so far.

  • That's a lot of favorite TV shows.

  • This number's-- keep going up.

  • 169, up to, next, 191.

  • Keep them coming.

  • One per person is fine.

  • It will give us more than enough data.

  • 246.

  • We'll give it another few seconds, but suffice it to say,

  • this is kind of a lot of data.

  • And if the next step in this process were

  • for me to analyze the data because I'm curious to know

  • what's the most popular show on campus or if I wanted

  • to do some other analytical work on this,

  • it'll be useful to have all of this data,

  • not in Google Form's proprietary interface,

  • but in Google Spreadsheet form with rows and columns, a format you're probably

  • pretty familiar with.

  • So we'll let this counter keep going up and up and up,

  • and I'm going to go ahead and click on responses.

  • And if we really wanted to, we could look through--

  • The Office is up there, followed by Games of Thrones, Friends, OK.

  • Perhaps pretty predictable here.

  • Let me go ahead and click this icon, which

  • is going to open the Google Spreadsheet version of this.

  • And you'll see that there's three columns by default.

  • [LAUGHTER]

  • Oh, that's funny.

  • [APPLAUSE]

  • Thank you.

  • Thank you.

  • OK.

  • OK.

  • Well-played.

  • Now-- so you'll see that by default we got three columns.

  • Timestamp, Google Forms just does automatically

  • to timestamp the submission of the form.

  • Title comes from the first question.

  • Genres comes from the second question.

  • And if we kept scrolling and scrolling, odds

  • are we're seeing more and more rows.

  • Let's propose that we have enough of these rows already,

  • and how might we go about analyzing the data?

  • Well, if you're pretty expert with Google Spreadsheets or Excel

  • or Numbers, maybe you could use some functions that

  • are built into these programs or you could do some Command-F or Control-F

  • and search for the favorite show that you're

  • looking for to try to run some numbers.

  • But we have programming abilities nowadays.

  • And if you haven't used this feature already,

  • if you go to Google Spreadsheet's file menu.

  • You can go to download, and you can download your data

  • into a bunch of formats.

  • And the most useful one I'd propose for now

  • is going to be this one, Comma-Separated Values or CSV.

  • And we've seen this before.

  • In fact, you've generated this before.

  • So if I go ahead and do that, it pretty quickly downloaded a CSV file

  • that I really wanted to I could open in Excel or Apple Numbers or the like.

  • But let's do something instead with that, and go ahead and put it into,

  • for instance, my IDE so that I can actually do something with it.

  • I'm going to grab my downloaded file here.

  • And I'm going to go ahead and drag that into the IDE.

  • And so now you'll see a whole bunch of files here.

  • Let me put-- whoops.

  • Let me put this right in the root of my folder.

  • And we'll see this file here.

  • And if I double-click on it, you'll see exactly the same data,

  • albeit in CSV format.

  • So it looks the same, but we have commas separating the values.

  • And Google Spreadsheets did that for us automatically.

  • So let's now write a program that analyzes this data

  • and see if we can't make sense of what some of the most popular values

  • are in there.

  • Let me go ahead and close this, open a new file,

  • and I'm going to go ahead and call it, for instance, favorites.py,

  • since the goal is to figure out everyone's favorites here.

  • And let me go ahead and use some syntax from last week

  • that you might be familiar with now from P set 6.

  • Let me go ahead and import CSV, the module,

  • so that I have access to that library of functionality.

  • Then let me do this with Open.

  • And then I'm going to go ahead and open this file, and I could rename it,

  • but you know, I can just grab the whole file name this way, too.

  • It's a little excessive, but I'll go ahead and paste that in in read mode.

  • And that gives me a file.

  • Let me go ahead and zoom out just a little bit there.

  • And once I have this file, I can throw it in to a dict reader,

  • for instance, to make it easier to read the CSV.

  • So reader gets CSV.DictReader.

  • Pass in that file.

  • And then after this, I have the ability, now,

  • to iterate over each row like this, much like you probably did for DNA

  • or one or more other problems, as well.

  • Now, just as a sanity check, let me just go ahead and print out

  • each of the row's titles.

  • So I'm not going to see anything new here,

  • but this will be a little sanity check that I'm writing correct code.

  • So run Python of favorites.py, and voila, fwoom, there's all the data.

  • So The Office is both at the beginning and at the end here, it seems.

  • All right, so that's all fine and good, a nice stepping stone, but let's start

  • to analyze this data, this time using some code.

  • So let me go ahead and say this.

  • Suppose I wanted to keep track and determine

  • what the most popular show is.

  • So I've got this big CSV file with title, title, title, title.

  • Invariably, there's going to be some duplicates.

  • A bunch of you probably like The Office or some other show,

  • and so it'd be nice to kind of aggregate that data.

  • What might be a useful data structure in Python whereby for each row

  • we can use this data structure to just kind of do

  • the equivalent of like 1, 2, 3, and count each

  • of the occurrences of those titles?

  • Any-- OK, I heard a hash table, but more properly in Python?

  • A dictionary.

  • So a dictionary is kind of the Swiss army

  • knife of data structures for a lot of people in the sense

  • that insofar as it allows you to store keys and values,

  • and those keys can be strings, and those values can be anything you want,

  • including, for instance, in this case, numbers, like a counter,

  • it's a really versatile data structure when you just

  • want to throw some data into a data structure and do something with it.

  • So let me go ahead and give myself a dictionary at the top of the file.

  • I can do this in a couple of ways, and if my goal is to store the counts

  • of each of these tiles, I could call it counts,

  • and I could set it equal to dict(), which is just a function that hands me

  • an empty dictionary.

  • Or I recall I can use this syntax, which is

  • equivalent to just shorthand notation.

  • But what I want to do now inside of my loop, every time I

  • iterate through this file, I want to check if my--

  • I've seen this title before, I want to add 1 to my counter.

  • But if I've never seen this title before, what do I want to do instead?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Add the title to the dictionary

  • as a key with the value of what, the first time I see it?

  • Well, the first time I see it.

  • AUDIENCE: 1.

  • DAVID MALAN: So we can just initialize it to 1.

  • So there's two scenarios here.

  • Either we've seen this title before and we want to do plus 1,

  • or we've not seen the title before and we just want to set it equal to 1.

  • So how might we express that logic?

  • Well, let me go ahead and say, first of all,

  • let's get the title into a variable of its own,

  • just to shorten our syntax a little bit.

  • And then I can say something like this, if title in counts--

  • well, what does that mean?

  • That means that inside of my counts dictionary,

  • I can index into it at that location, and I can increment whatever is there

  • by 1, assuming on some previous iteration,

  • else I went ahead and did counts of title equals 1.

  • So these are the two scenarios.

  • If I've seen this title before, go ahead and add 1

  • to its value in the dictionary.

  • Otherwise, just default to I have seen it one time.

  • So some of the shows are going to stay stuck at one if they're not

  • very popular among the audience, or they're

  • going to keep getting plus 1, plus 1, plus 1.

  • So what I often do when writing code like this

  • is, as I'm sort of finding my way, after I'm done with this,

  • I'm not going to write the whole thing.

  • Let me go ahead and just, at the end of this, print counts, right?

  • I'm just curious to see what it is.

  • I'm just going to go ahead and add print counts.

  • Let me go ahead and run now Python of favorites.

  • And voila, you see a huge, messy data structure.

  • But let's go to the very top of it.

  • Notice that it starts with curly braces, indicating it's, indeed, a dictionary.

  • It starts with a quoted value, which is one of the titles in it,

  • and it looks like two of you said Dynasty is your favorite show.

  • 25 of you said The Office is your favorite show.

  • One of you said Blindspot is your favorite show.

  • One of you, 24.

  • Friends, 19, and so forth.

  • So the data is kind of all over the place,

  • because it's based on the order in which you all input it.

  • So let's just clean this up a little bit.

  • And let me go ahead and print this out as follows.

  • Instead of just printing a big, messy dictionary,

  • why don't I go ahead and iterate over the dictionary?

  • And we didn't see this necessarily last week,

  • but you can actually do something like this.

  • If your dictionary's storing titles as keys and counts as values,

  • we can do something fancy like for title comma count in counts,

  • but if you want to iterate over not just the keys but all of the items therein,

  • there's this other function in Python associated

  • with a dictionary that will hand you back two things at once,

  • key and value together.

  • And you can call them anything you want.

  • I'm calling them title comma count, respectively.

  • Now, let me go ahead and print out, for instance, something like the title,

  • and then let me go ahead and just print out the count right next to it.

  • So now, if I save this and run favorites.py, OK, it's prettier,

  • but it's still a little messy in that we have--

  • the numbers are just right there after the name.

  • So let's go ahead and tidy this up.

  • Recall that when you print two things with print

  • separating them with a comma, what character

  • did you get automatically in between them?

  • Yeah, just a single space.

  • We saw that at the very beginning of last week.

  • But we can override that separator, otherwise known as sep,

  • which has a default value of quote-unquote with a space.

  • But suppose I just want to give some visual delineation here,

  • maybe like a vertical bar to kind of sort of make it look like a spreadsheet