字幕列表 影片播放 列印英文字幕 [MUSIC PLAYING] DAVID MALAN: All right. So this is CS50, and this day we continue our look at Python, but also introduce another language. This one is called SQL, or Structured Query Language, and it's going to allow us to solve some different problems as well as some old problems better. But to do this, let's try to motivate some of the problems initially. So this, of course, is just an example of a spreadsheet. Odds are you use these pretty frequently, whether it's in the form of Google Spreadsheets or Excel or Apple Numbers and the like. And this is actually the simplest form of what we might call a database, just some kind of application that stores lots of data. And odds are if you've ever taken registrations for a student group or had people sign up or fill out a form or a survey, odds are you've used Google Forms to get your data into a Google Spreadsheet. And so we thought we would begin there today, a representative problem whereby you might want to collect a lot of data from users and then do something with it, but ideally more powerfully than you could do as a human on your own. So we've put into place a Google Form that quite simply asks a couple of questions today. What are your favorite TV shows? And we ask you for the title of your favorite show and the genre or genres into which that show falls. So if you wouldn't mind, go ahead and open up this URL here on your laptop or phone. If you don't have one, that's fine. You can use the person's next to you, if I may. And take just a moment to go to that URL there and answer those two questions. What is the title of your favorite TV show, and into what genre or genres does it fall? Favorite TV shows. And when you go to that URL, you'll find yourself at, again, this form here, asking for your title and for your favorite show. So by the look of my own screen here we've got some 62 responses, 74 responses coming in already, so we can track this live. So 111 responses so far. That's a lot of favorite TV shows. This number's-- keep going up. 169, up to, next, 191. Keep them coming. One per person is fine. It will give us more than enough data. 246. We'll give it another few seconds, but suffice it to say, this is kind of a lot of data. And if the next step in this process were for me to analyze the data because I'm curious to know what's the most popular show on campus or if I wanted to do some other analytical work on this, it'll be useful to have all of this data, not in Google Form's proprietary interface, but in Google Spreadsheet form with rows and columns, a format you're probably pretty familiar with. So we'll let this counter keep going up and up and up, and I'm going to go ahead and click on responses. And if we really wanted to, we could look through-- The Office is up there, followed by Games of Thrones, Friends, OK. Perhaps pretty predictable here. Let me go ahead and click this icon, which is going to open the Google Spreadsheet version of this. And you'll see that there's three columns by default. [LAUGHTER] Oh, that's funny. [APPLAUSE] Thank you. Thank you. OK. OK. Well-played. Now-- so you'll see that by default we got three columns. Timestamp, Google Forms just does automatically to timestamp the submission of the form. Title comes from the first question. Genres comes from the second question. And if we kept scrolling and scrolling, odds are we're seeing more and more rows. Let's propose that we have enough of these rows already, and how might we go about analyzing the data? Well, if you're pretty expert with Google Spreadsheets or Excel or Numbers, maybe you could use some functions that are built into these programs or you could do some Command-F or Control-F and search for the favorite show that you're looking for to try to run some numbers. But we have programming abilities nowadays. And if you haven't used this feature already, if you go to Google Spreadsheet's file menu. You can go to download, and you can download your data into a bunch of formats. And the most useful one I'd propose for now is going to be this one, Comma-Separated Values or CSV. And we've seen this before. In fact, you've generated this before. So if I go ahead and do that, it pretty quickly downloaded a CSV file that I really wanted to I could open in Excel or Apple Numbers or the like. But let's do something instead with that, and go ahead and put it into, for instance, my IDE so that I can actually do something with it. I'm going to grab my downloaded file here. And I'm going to go ahead and drag that into the IDE. And so now you'll see a whole bunch of files here. Let me put-- whoops. Let me put this right in the root of my folder. And we'll see this file here. And if I double-click on it, you'll see exactly the same data, albeit in CSV format. So it looks the same, but we have commas separating the values. And Google Spreadsheets did that for us automatically. So let's now write a program that analyzes this data and see if we can't make sense of what some of the most popular values are in there. Let me go ahead and close this, open a new file, and I'm going to go ahead and call it, for instance, favorites.py, since the goal is to figure out everyone's favorites here. And let me go ahead and use some syntax from last week that you might be familiar with now from P set 6. Let me go ahead and import CSV, the module, so that I have access to that library of functionality. Then let me do this with Open. And then I'm going to go ahead and open this file, and I could rename it, but you know, I can just grab the whole file name this way, too. It's a little excessive, but I'll go ahead and paste that in in read mode. And that gives me a file. Let me go ahead and zoom out just a little bit there. And once I have this file, I can throw it in to a dict reader, for instance, to make it easier to read the CSV. So reader gets CSV.DictReader. Pass in that file. And then after this, I have the ability, now, to iterate over each row like this, much like you probably did for DNA or one or more other problems, as well. Now, just as a sanity check, let me just go ahead and print out each of the row's titles. So I'm not going to see anything new here, but this will be a little sanity check that I'm writing correct code. So run Python of favorites.py, and voila, fwoom, there's all the data. So The Office is both at the beginning and at the end here, it seems. All right, so that's all fine and good, a nice stepping stone, but let's start to analyze this data, this time using some code. So let me go ahead and say this. Suppose I wanted to keep track and determine what the most popular show is. So I've got this big CSV file with title, title, title, title. Invariably, there's going to be some duplicates. A bunch of you probably like The Office or some other show, and so it'd be nice to kind of aggregate that data. What might be a useful data structure in Python whereby for each row we can use this data structure to just kind of do the equivalent of like 1, 2, 3, and count each of the occurrences of those titles? Any-- OK, I heard a hash table, but more properly in Python? A dictionary. So a dictionary is kind of the Swiss army knife of data structures for a lot of people in the sense that insofar as it allows you to store keys and values, and those keys can be strings, and those values can be anything you want, including, for instance, in this case, numbers, like a counter, it's a really versatile data structure when you just want to throw some data into a data structure and do something with it. So let me go ahead and give myself a dictionary at the top of the file. I can do this in a couple of ways, and if my goal is to store the counts of each of these tiles, I could call it counts, and I could set it equal to dict(), which is just a function that hands me an empty dictionary. Or I recall I can use this syntax, which is equivalent to just shorthand notation. But what I want to do now inside of my loop, every time I iterate through this file, I want to check if my-- I've seen this title before, I want to add 1 to my counter. But if I've never seen this title before, what do I want to do instead? Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Add the title to the dictionary as a key with the value of what, the first time I see it? Well, the first time I see it. AUDIENCE: 1. DAVID MALAN: So we can just initialize it to 1. So there's two scenarios here. Either we've seen this title before and we want to do plus 1, or we've not seen the title before and we just want to set it equal to 1. So how might we express that logic? Well, let me go ahead and say, first of all, let's get the title into a variable of its own, just to shorten our syntax a little bit. And then I can say something like this, if title in counts-- well, what does that mean? That means that inside of my counts dictionary, I can index into it at that location, and I can increment whatever is there by 1, assuming on some previous iteration, else I went ahead and did counts of title equals 1. So these are the two scenarios. If I've seen this title before, go ahead and add 1 to its value in the dictionary. Otherwise, just default to I have seen it one time. So some of the shows are going to stay stuck at one if they're not very popular among the audience, or they're going to keep getting plus 1, plus 1, plus 1. So what I often do when writing code like this is, as I'm sort of finding my way, after I'm done with this, I'm not going to write the whole thing. Let me go ahead and just, at the end of this, print counts, right? I'm just curious to see what it is. I'm just going to go ahead and add print counts. Let me go ahead and run now Python of favorites. And voila, you see a huge, messy data structure. But let's go to the very top of it. Notice that it starts with curly braces, indicating it's, indeed, a dictionary. It starts with a quoted value, which is one of the titles in it, and it looks like two of you said Dynasty is your favorite show. 25 of you said The Office is your favorite show. One of you said Blindspot is your favorite show. One of you, 24. Friends, 19, and so forth. So the data is kind of all over the place, because it's based on the order in which you all input it. So let's just clean this up a little bit. And let me go ahead and print this out as follows. Instead of just printing a big, messy dictionary, why don't I go ahead and iterate over the dictionary? And we didn't see this necessarily last week, but you can actually do something like this. If your dictionary's storing titles as keys and counts as values, we can do something fancy like for title comma count in counts, but if you want to iterate over not just the keys but all of the items therein, there's this other function in Python associated with a dictionary that will hand you back two things at once, key and value together. And you can call them anything you want. I'm calling them title comma count, respectively. Now, let me go ahead and print out, for instance, something like the title, and then let me go ahead and just print out the count right next to it. So now, if I save this and run favorites.py, OK, it's prettier, but it's still a little messy in that we have-- the numbers are just right there after the name.