Placeholder Image

字幕列表 影片播放

  • DAVID MALAN: All right, this is CS50 and this is lecture 10.

  • Welcome to those parents among you.

  • Today, for those tuning in online, is freshmen family weekend.

  • We have quite a few guests with us and we

  • thought we would begin today in fact by perhaps putting everyone

  • at ease since I know this is perhaps the first time in a while

  • that you've been back to college.

  • Maybe this is your first class for the fall semester.

  • And so we thought we'd share with you by proxy

  • what you might be feeling by way of a short film

  • that CS50's video team put together with a little friend of ours

  • just a few semesters ago.

  • [VIDEO PLAYBACK]

  • [MUSIC PLAYING]

  • - Hi.

  • - Hello

  • [END PLAYBACK]

  • DAVID MALAN: All right, well, welcome to Harvard and CS50.

  • So if you'd like to actually return in just a few weeks time,

  • your children or siblings or friends or here in the room

  • will be exhibiting their CS50 final projects in just a few weeks' time.

  • This is the so-called CS50 fair, which is

  • meant to be a campus wide exhibition of students' final projects.

  • And it's worth noting that, as the students in the class know,

  • 68% of the students in CS50 this year had no prior CS experience.

  • And so what you and they will see on display at the CS fair

  • in a few weeks' time in December is all of their accomplishments.

  • And in fact, the goal of this event really

  • is to delight in what everyone has achieved and glancing

  • at each other's screens and phones to see with great surprise what

  • each of our students has pulled off.

  • So you are invited back if you would like for that as well.

  • Now CS50 of course is characterized as a class

  • for those less comfortable and those more comfortable

  • and those somewhere in between.

  • And I daresay some of our guests today are perhaps

  • at one end of that spectrum or the other.

  • And so rest assured that what we'll dive into today,

  • the world of databases and SQL with applications to data science

  • and beyond, is new to most everyone in the room as well.

  • So you are in good company, and I daresay 68% or more

  • of percent of the parents in the room are perhaps feeling exactly as you are.

  • So if we can now turn our attention back to some things lower level,

  • those students among you will recall that we focused in recent days

  • on Python, this other higher level language by which you

  • can solve problems a bit more readily than you can using, say, something

  • lower level like C. Now you might have noticed I'm ready if you dived

  • into problem set 6 as I'm sure you have, the debug 50 only

  • supports C and not Python.

  • So you have relatively fewer tools at your disposal for debugging.

  • But do realize that within the CS50 library for Python

  • is a function like this called eprint, which

  • you might recall if you've used eprintf with C.

  • This is a Python version of the same.

  • And so if you need help debugging your programs,

  • it suffices to write a program in Python.

  • And anywhere you want to see diagnostic output,

  • simply call this eprint function.

  • And it will print the file name and the line number for which you

  • have written that line of code.

  • But let's turn our attention now to things more web-oriented

  • and perhaps something with which you're familiar Facebook

  • and the login screen thereof.

  • So there's so many websites out there today where when you visit them,

  • you're prompted to log in-- something like this.

  • But you're not prompted to log in every time you click a link on facebook.com.

  • You're not prompted to log in every minute or every two minutes

  • because indeed, the websites, if they required as much,

  • they would degrade into something downright unusable.

  • So instead, these websites remember that you're logged in somehow.

  • So you log in once per day, per week, per year.

  • It varies by website.

  • And then somehow the website remembers that you are Maria or you are Brian

  • and you have already logged into this website.

  • Now how might that work?

  • How does Facebook or any site like it, Gmail and beyond,

  • remember that you are logged in once you are logged in even though you only

  • type your username and password once?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: OK, so they store information on the server.

  • And indeed, that's exactly right.

  • On the server, as we're now starting to see with Python and today with SQL,

  • you have the ability to store information on a server.

  • We saw very simply about a week ago the ability to store data

  • in, like, CSV files, just text files on the server.

  • That's not going to be very high performing,

  • so we're going to do better than that today.

  • But you can also store data in variables inside a flask

  • while it's running as we saw as well.

  • But we'll need some more sophistication than that.

  • But yes, we can certainly remember that Maria or Brian have logged in.

  • But how do we remember from which computer or browser

  • Maria or Brian have logged in?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Via the IP address-- so could work.

  • You could just recall that from those virtual envelopes

  • which I no longer have today.

  • We've been writing the to address in the from address, the IP address, on those

  • envelopes.

  • So you could certainly remember, oh, I saw Brian from 1.2.3.4 and Maria

  • from 4.5.6.7 or whenever the numbers actually are.

  • But that's a problem in, like, households and frankly even

  • on campuses these days because of NAT, Network Address Translation,

  • that technique whereby a bunch of us probably in this room right now

  • on our laptops or phones are sharing the same IP address, at least so

  • far as the outside world is concerned.

  • So we could confuse things.

  • And that would be bad because that means if Brian sort of logs into Facebook

  • and Maria logs into Facebook, suddenly Brian

  • might actually see Maria's profile page or the like.

  • So what else might they be doing?

  • What's that?

  • AUDIENCE: Cookie?

  • DAVID MALAN: Cookie, so-- eh, correct answer, but what in the world

  • does that mean?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Don't know--

  • OK.

  • So, perfect-- so you're in the right place.

  • And most everyone in this room probably, if you use the web frequently,

  • have generally heard about cookies and probably that they're dangerous or bad

  • or some way or you've probably googled how

  • to clear your cookies if you've indeed heard they're

  • a bad thing or something like that.

  • And indeed, cookies are the answer to this question.

  • But how does that actually work?

  • Well, if you've been to like an amusement park or a club

  • or someplace where you typically have a ticketed access gate, sometimes if you

  • want to come in go to and from the parking lot or out to the street

  • or wherever, they might stamp your hand using like a little red stamp

  • or invisible stamp or the like.

  • And then you just show it to the bouncer or the person at the checker

  • that you had been there before because you have this hand

  • stamp via the remembering you.

  • So web servers and browsers are kind of the same.

  • When you visit Facebook or Gmail or wherever for the first time

  • and moreover log in, they essentially put like a digital hand

  • stamp on your browser called a cookie.

  • And that handstand happens to be implemented

  • not obviously as ink but like as a big random number that

  • is given just to you.

  • And it's stored either in your computer's RAM

  • or on your computer's hard drive longer term.

  • And your browser, because it speaks HTTP,

  • that protocol we've been talking about, it

  • has been designed to show its hand, so to speak.

  • Every time you revisit Facebook, every time you click an email,

  • every time you send a message, anytime you browse around on that same website,

  • the browser is, without you realizing it, reminding

  • the server of that hand stamp.

  • So how might it be doing it?

  • Where can a browser sort of tuck away information like this such

  • that it's always in that envelope?

  • Yeah.

  • AUDIENCE: It can put it like into a list like you did before with the invites

  • or something like that.

  • DAVID MALAN: The invites, the invites--

  • oh, so that would be lower lev-- so, OK, I remember now.

  • So that would be not quite there because we

  • want to make sure that this has nothing to do fundamentally

  • with our application.

  • This has to do with HTTP.

  • So this was invented years ago before there was like a list of registrants

  • for [INAUDIBLE].

  • So we can't put it there.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: A concept of a cache-- not here, no, no.

  • Yeah.

  • AUDIENCE: [INAUDIBLE] header.

  • DAVID MALAN: The header-- so the header is

  • where we have access to metadata, sort of lower level technical information

  • that most of us never see unless you poke around Chrome

  • or actually sniff your own internet traffic and see.

  • But we do have the ability, within these HTB headers, to send, so to speak,

  • this virtual hand stamp.

  • So let's recall what these headers look like.

  • So the headers that get sent in this virtual envelope

  • when you request Facebook's home page, for instance,

  • might look like this-- get slash, which means get me the home page,

  • using version 1.1.

  • The host I'm visiting, just as a reminder,

  • is facebook.com or www.facebook.com.

  • And then below that is maybe more headers.

  • We've generally waived our hands at it because they're not as interesting

  • but there might be some more information there too.

  • Well, when Facebook replies, especially after you've

  • logged in, it turns out one of the headers we haven't looked

  • at in addition to these which we have seen, the response,

  • the 200 OK from the server, is that Facebook and servers

  • like it are probably sending this header as well, literally

  • Set-Cookie colon and then a name and then a value or key and a value.

  • The name here, for instance, I'm going to start calling session,

  • as is the term of art.

  • And then the value to the right can actually be anything.

  • So value might be 123, 456, or any unique value

  • that's sort of randomly generated by the server to remember who you are.

  • And so session equals value is like this is your hand and this is your value.

  • That is now the cookie that's been planted on the computer.

  • And what your browser thereafter is supposed to do is not send set cookie.

  • That only comes from server to browser.

  • But whenever you on Facebook request another page,

  • click a friend's profile page or the like,

  • your virtual envelope will contain not these familiar headers

  • but also cookie session equals value.

  • And so that is the presentation of that same key value pair.

  • So if Facebook now to your point earlier remembers

  • that user 123456 has logged in and that user happens to be named Maria

  • and we can store that information on the server,

  • every time Facebook sees this same hand stamp

  • or sees this same session equals value, it'll be like, oh, that must be Maria.

  • And so I'll show Maria her messages and her profile page

  • and all that comes with it in a website.

  • So let's take a step back because the mechanism is as relatively

  • simple as this, even though it's certainly arcane

  • and you have to kind of know what you're doing to leverage this.

  • But is there a threat here?

  • Like, why might cookies be dangerous and why do people

  • tell us occasionally to delete them?

  • AUDIENCE: If some one gains access to [INAUDIBLE]..

  • DAVID MALAN: Yeah, if someone gains access to your cookies,

  • you could masquerade as someone else like Maria.

  • And so in the sort of equivalent at a club or a part

  • would be if you kind of do this with someone else's hand and try to say,

  • hey, we're both here even though one is now backwards, that would be kind

  • of a duplication of the session value.

  • So you probably want to make sure these are encrypted.

  • And what's the easiest way to ensure that the traffic between you

  • and a server is encrypted?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Not hashing necessarily.

  • AUDIENCE: SSL.

  • DAVID MALAN: SSL or the HTTPS prefixed URLs.

  • And thankfully-- and the story has changed over the years.

  • Thankfully, web servers are getting better

  • at supporting HTTPS because then all of this stuff,

  • even though the browser can see it and the server can see it,

  • it's at least encrypted or scrambled to anyone

  • who's eavesdropping and trying to see what's going on between the two.

  • But there have been bugs along these lines.

  • In fact, it was just a few years ago in CS50

  • where just days or hours prior, a security researcher

  • had presented at a conference some very easy to use

  • software via which you could hijack other people's sessions,

  • or session highjacking being the term of art here.

  • And what he had implemented was this plug-in for Firefox, another browser.

  • And that would allow you, when you ran this guy's plug-in,

  • to just sniff all of the wireless traffic in the same room.

  • And it would show you, so user friendly, a list of all of the session cookies

  • that were kind of going back and forth through the air, so to speak,

  • that were not encrypted.

  • And this was, I don't know, four or five years ago.

  • When HTTPS existed, it's just not, a lot of websites

  • were actually really using it.

  • So a lot of really popular websites weren't using it.

  • And so the demonstration in class, which thankfully we

  • can't do anymore because most websites have been more secured,

  • it was as easy as clicking a button for me

  • to log into that here Rose's Facebook account because she was using Facebook

  • during lecture and unbeknownst to her, so now was I.

  • And it was simply a matter of capturing this kind of value.

  • Moreover, if the session cookies are stored, as they're called,

  • on your computer's hard drive, there's just more remnants

  • of where you have been on the web.

  • Not just that little drop down menu, but elsewhere on your file system

  • might be reminders for better or for worse of where you have actually been.

  • So for these and other reasons do you sometimes

  • want to clear these or at least be mindful of them.

  • And frankly, from a development perspective,

  • one of the reasons why we keep encouraging you students to use

  • incognito mode occasionally for development purposes

  • is to essentially clear all of your cookies.

  • Incognito mode gets rid of all of your cookies

  • temporarily so that you can interact with the server

  • in a more predictable way.

  • And for software developers, that's just useful

  • not to be confused by older stale cookies.

  • Any questions then about cookies?

  • All right, so who cares?

  • Like, that's all fine and good.

  • Now we can be a little more paranoid when using the web.

  • But what can we actually do with this now that's actionable?

  • Well, if I go ahead into CS50 IDE, let me go ahead and open up an example

  • that you can also look at later on via the course's web site.

  • And this is a little web-based store that I whipped up.

  • It's super common these days to, like, buy things on the web.

  • And of course when you're buying things on the web, hopefully

  • the website is remembering what you are interested in.

  • If you go to amazon.com and click Add to Cart, where does that go?

  • Well, let's think about it.

  • Amazon is somehow remembering, after I logged in, who I am.

  • And we can assume that they're doing that via cookies already.

  • So they're putting some handstamp on my hand

  • and my browser is reminding Amazon on every page click

  • that I am me who logged in previously.

  • So what do you think a server is doing any time you add something to your cart

  • or quantity two of your car or something else in your cart?

  • Where Where's that information being stored as well?

  • Say again?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Also-- so in the header, it suffices just

  • to put that cookie because you proposed actually

  • earlier that we can store stuff in a server, whether it's a CSV file

  • or in RAM or today in a database.

  • If you can store anything you want on the server,

  • it suffices just to remember via this cookie

  • that the person has logged in before and uniquely identify him or her

  • thereafter by that cookie's value--

  • 123456 or whatever it is.

  • Yeah.

  • AUDIENCE: But there's a list associated with your cookie?

  • DAVID MALAN: There's a list associated with my cookie.

  • That is actually the perfect segue to, yes, exactly.

  • And let's take a look at what that actually means.

  • So this is a web based application.

  • To those of you in the audience who weren't here last week

  • or the week prior or the whole semester, this

  • is a programming language called Python.

  • It's the language we've now started using over the past week.

  • And this is using what's called a framework, freely available software

  • that other people wrote, called Flask that makes it easier to implement web

  • based applications or websites.

  • And what we're doing in CS50, now recall, is not making static websites.

  • We're not talking about web design, per se,

  • which tends to be the art of creating static content, very

  • beautiful informational websites, but web programming, whereby we're actually

  • using logic to generate web applications or web sites dynamically.

  • And so Python lets us do this.

  • So this, recall from last week, is the magical line

  • that says give me a Flash based web application.

  • And with this comes certain features, like the ability to say,

  • hey Flask, define a route for slash.

  • That means anytime the user visits slash, the default page

  • on a web site, return what file, just to be clear?

  • Index.html.

  • And then we looked last week at where those index.html are

  • and that they are templates and so forth.

  • But this simply means render index.html.

  • So recall that I can run this web app as follows.

  • If I go into the store directory and type flaskrun,

  • I have now turned on a web server, so to speak, in the cloud,

  • and it's literally telling me what URL to visit.

  • So I'm going to go ahead and click on this and click Open.

  • And we see a super simple store.

  • Let me zoom in.

  • It's really ugly-- haven't used any CSS.

  • This is indeed not web design.

  • This is all functional today.

  • And you'll see, even though the projector is a little cloudy,

  • that there's three text fields there.

  • There's a number for Foo, a number for Bar,

  • and a number for Bas, which if I'm familiar

  • are just go-to words in computer science when

  • you need to talk about something that doesn't need a specific name.

  • It's like x, y, and z in math.

  • So right now, here's a store a la Amazon.com that sells three things--

  • Foo, Bar, and Baz.

  • And I can buy any number of these things.

  • But what we haven't had yet is the ability

  • really to allow a user to interact with a web form

  • and then store that information and change that information.

  • We kind of scratched the surface last time with the CSV file for [INAUDIBLE]..

  • But here now we have the ability to actually store something in a database.

  • So let me go back to the IDE for a moment and focus on exactly that.

  • So first of all, here is how this program works.

  • I'm going to go ahead and buy myself one Foo, two Bars and three Bazes,

  • whatever those are.

  • I'm going to go ahead and click Purchase.

  • And now I see my shopping cart.

  • And just to be clear, if I hover over the URL, I am now at slash cart

  • preceded by the long URL, which is where I began.

  • So notice what's neat here.

  • I can reload and it's still the same.

  • I can reload, still the same.

  • I can even close the tab, open a new one, go back to it, and it's still me.

  • Now let's see what's actually going on underneath the hood in terms of code.

  • Well, to C/cart is actually super simple.

  • Just render cart.html.

  • But cart.html, which I wrote before this morning,

  • didn't know in advance presumably that I was

  • going to buy one Foo, two Bars, and three Bazes.

  • So it's probably not hardcoded.

  • So there's this other feature of templates

  • in a web application like this one whereby

  • in addition to telling the computer show me cart.html,

  • recall that you can pass in values.

  • And I think we did this once last time.

  • So cart is just the name I've given to this concept, like a shopping cart.

  • And I'm going to pass it something called a session.

  • And a session is the term of art that describes

  • whatever it is the server is remembering on that user's

  • behalf per his or her cookie.

  • So the user's browser has the hand stamp, the cookie--

  • 123456.

  • The server actually stores the session, which

  • is just going to be a box filled with values, something

  • that has been put into that box before.

  • And so what I have now in my cart.html is this.

  • Let me go into templates, open up cart.html.

  • And this is going to look especially cryptic.

  • But for those who've seen it before, recall

  • that this just means descend from or inherit from layout.html.

  • That's where, where, like, of the standard HTML is-- head and body

  • and all that.

  • Then this says, hey, server.

  • Here comes the body of my website.

  • And there's relatively little in there.

  • Here's my big bold title--

  • cart.

  • And that's why a moment ago, we saw exactly that, big bold cart.

  • And then over here, notice that I'm using

  • Jinja, that templating language-- very lightweight, very Python-like,

  • and you'll get more used to it over time.

  • Notice that we have here a for loop saying for each item in cart,

  • go ahead and do this colon that, which looks super cryptic.

  • But does anyone recall perhaps from [INAUDIBLE] last time,

  • like, what does this syntax mean with the two curly braces on left and right?

  • AUDIENCE: Variables.

  • DAVID MALAN: Variables-- just plug in these values here.

  • And now as an aside, it turns out that these curly braces are actually

  • escaping our data, so to speak, in a way so

  • that if the user types in kind of dangerous characters

  • at his or her keyboard, this prevents that from becoming a problem,

  • but more on that some other time.

  • And so this is just going to say item colon cart bracket item.

  • And this might be new syntax to some.

  • Where have we seen square brackets before?

  • What does that mean in C?

  • AUDIENCE: Arrays?

  • DAVID MALAN: Arrays.

  • So in Python, we still have arrays, but they're called lists.

  • But we have another feature too, and you wouldn't know this

  • just by glancing at the syntax.

  • But when you see square brackets next to a word

  • like cart inside of which is not a number,

  • but it's going to turn out a word.

  • This word is literally going to be Foo or Bar or Baz.

  • This is a way of using a hash table or a dictionary or a dict in Python,

  • as it's called, whereby if the cart is what's been passed into this template

  • and inside of the cart are a bunch of items like Foo and Bar and Baz,

  • you can get the number of Foos by doing cart brackets Foo, which would look,

  • if I hardcoded it, like this.

  • You can get the number of Bars by doing this.

  • And you can get the number of Bazes by doing that.

  • And yet if I just plug in item, which is a variable,

  • I'm going to get out, within this loop, the Foos,

  • then the Bars, then the Bazes, and so forth.

  • So what is the end result?

  • Let's go back to the browser.

  • When I hit reload here, the reason we're seeing Foo colon 1,

  • Bar colon 2, Baz colon three, is because of that template.

  • And if I reload again and again and again,

  • I'm seeing the same thing because my browser is probably doing something.

  • So let me do that.

  • Let we go to inspect.

  • Let me go to Network.

  • And just for good measure, and you should get into this habit too,

  • click preserve log.

  • Because this way, if you change pages, the log doesn't go away.

  • You can still poke around what you're seeing.

  • And again, this is my Network tab.

  • So I want to look at what Chrome is doing to talk to this web app.

  • I'm going to go ahead now and I'm going to hold Shift and then click reload.

  • Or if you click and hold--

  • you've probably never noticed this.

  • But if you click and hold on your own reload button in Chrome,

  • you can do a normal reload, which is what all of us do all the time,

  • or a hard reload or an empty cache and hard reload, which is just

  • a way of saying get rid of all of my cookies,

  • all remnants, and just forcibly reload everything, which

  • is great for software development so we know what state we're always in.

  • But you can also do that by holding Shift--

  • whoops-- and clicking reload.

  • So let's do that.

  • Reload-- and here we go.

  • That made one request because there's no images or CSS or fancy stuff

  • to send more requests.

  • This is going to be a little overwhelming at first glance.

  • The IED itself uses cookies for other purposes.

  • So you are seeing bunches of cookies that

  • have nothing to do with this application-- has

  • to do with the programming environment.

  • But what we can focus on is this here.

  • Notice that any time I get a response from the server,

  • it's indeed sending me set cookie and then some value.

  • Now this is a longer value than 123456, but it's

  • the same thing-- a big random number.

  • So what's happening is that is the server's way

  • of telling my browser you are user 123456 or whatever

  • that big long number was.

  • And my browser is now supposed to present that again and again and again.

  • So lastly, what is actually going on when we continue shopping and let's

  • say buy 10 of these and 10 of these and 10 of these and click Purchase?

  • Notice that if I look at the form, this is going to slash update.

  • And it's via Post.

  • And so if I now go back to the code to see what happens--

  • let me go back into application.py, focus on update.

  • This now is how I'm updating the contents of my server side cart.

  • I am saying for each item in the form that

  • was submitted, the requests form, go ahead and update session bracket

  • item, which is going to be Foo or Bar or Baz because that is what

  • was in the form, and store there this--

  • request.form.get.

  • So get me the value of that item, whether it's 10 and 10 and 10 or 1

  • and 2 and 3, whatever the numbers are that the human typed in.

  • And why do you think we put this word int?

  • Why might that be necessary?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, it's simple as that.

  • It's kind of a casting in C. What the user typed in was text.

  • It might look like numbers, like 1, 0.

  • Certainly looks like 10.

  • But it's not a number.

  • That's a string that looks like a number.

  • So using int converts it to an actual number so we can do math on it.

  • So long story short, what is it that made this possible?

  • We've not used this before, but you will have seen this perhaps in P set six

  • or rather P set seven's distribution code, which is now on the web site.

  • There's a few lines of code that just enable sessions.

  • With those few lines, are we able to tell Flask, hey Flask,

  • start sending cookies and expecting cookies

  • back so that the programmer can use what is called a session, which

  • I've additionally imported up here?

  • So I've just added some additional functionality.

  • And if you've ever wondered where all this data is going,

  • let me go ahead and stop the server, clear my terminal and type LS.

  • A moment ago, the only two things in this folder

  • were application.pi and templates.

  • Now there's apparently a directory called Flask session.

  • So if you've ever wondered where Amazon stores the contents of your shopping

  • cart or where Facebook stores Maria or Brian once they've logged in,

  • it's probably in a folder in a file like this

  • where these big long strings represent your cookie value

  • or, better yet, it's in a database.

  • And so that's why those are there-- because I apparently

  • created four cookies by reloading and experimenting a few different times.

  • OK, any questions on what the potential then

  • or the usage of cookies and sessions are?

  • All right, so let's then transition to solving

  • problems a little more powerfully with these things called databases.

  • So here is, like, a depiction of what we've been doing the past week

  • and we'll do a bit more today and you'll do with problem set seven as well.

  • We have been having browsers and servers talk to one another

  • using kind of a certain paradigm.

  • And the paradigm is generally called MVC, Model View Controller, though we

  • didn't call it that last week.

  • And MVC means this--

  • you are writing code that is called controller.

  • This is like the logic, the intellectual property,

  • the business logic of your application.

  • Like, if a user submits form, then add update contents of shopping cart.

  • Any of those kinds of ifs and elses and loops

  • are in what's called your controller code, AKA application.pi.

  • You might also have views, V in MVC, which specifically are your templates,

  • the idea being that you separate all of the interesting, juicy, intellectual

  • part of your program in the controller code, the application.pi,

  • and all of the important but kind of aesthetic stuff

  • in your templates directory, AKA views.

  • This way, among other things, you, for instance,

  • can implement all of the logic of a web site

  • whereas a friend of yours who is much better at design certainly than me

  • can do all of the aesthetic stuff, all of the views,

  • in the templates directory, and you can collaborate a lot more effectively.

  • But there's one more letter in MVC.

  • It's Model View Controller.

  • And what we'll focus now on today is the M in MVC.

  • Model is just a generic way of talking about data--

  • like where does the actual data go.

  • So the mental model to have moving forward as you write code,

  • even though the lines are not going to be always as

  • clear as this picture suggests, is any time

  • you're writing code in application.pi that's probably controller code.

  • Like, that's the intellectual workflow of your application.

  • Anytime you're doing HTML or CSS or Jinja,

  • that's the view of your application, the aesthetics, the presentation

  • of information.

  • And any time you're reading and writing data, long term or short term.

  • That's the model.

  • And sometimes the lines will be very blurry

  • and it could all be in the same file.

  • But those three types of operations are the mental model to begin to adopt.

  • So let's get there.

  • You'll recall that in week zero, we introduced this guy, Scratch,

  • and the graphical programming language with which you began the semester.

  • A week later, we introduced C, which had the same ideas and capabilities

  • for the most part but using a keyboard only instead of graphics and a GUI.

  • Then we transitioned a week ago to Python, a couple of weeks

  • ago now to Python, which is very similar in spirit to see.

  • But as you may recall with great delight or frustration,

  • you can get so much more work done quickly--

  • for instance, P set five--

  • using the right language.

  • And so here too we begin to use a new tool

  • for the job called SQL, or Structured Query Language,

  • or S-Q-L. This is yet another language, but it's much smaller than Python and C

  • as we shall see today.

  • But it is a language that you typically use

  • when you want to talk to a database.

  • So based on your familiarity already, what

  • is a database even if you've never actually used or programmed

  • with one before?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Place where your store data.

  • So a database is a place where you store data- a base where you store data even

  • perhaps.

  • And so that's true.

  • And what does that actually mean?

  • Well, a database, like a web server, is actually a piece of software.

  • We might think of it is this, like, physical entity,

  • but a database is technically a database program

  • that is listening for connections just like a web server or an email server.

  • And any time it gets requests, it serves up a response.

  • So we have the same client server relationship that we do with the web

  • itself.

  • Now the database hopefully is better and smarter and faster at storing data

  • than certainly little old me.

  • And it's hopefully better than something like CSV files.

  • Like, recall we used CSV to store the registrants for [INAUDIBLE] last week.

  • I think it was me and Rob and Maria all at once.

  • Then what was good about that?

  • What problem did we solve by introducing CSVs with that example?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: You didn't have to go through my inbox, right?

  • One of the other versions of the registration program

  • were to just email the proctor in charge of [INAUDIBLE],, which

  • is fine because he or she can just use labels and keep track of it there.

  • But that's not a database.

  • You're Just kind of abusing Gmail as a database.

  • So why not put in a Text file instead like a CSV?

  • Because what kinds of programs can open CSVs?

  • Yeah, like Excel and Numbers and similar spreadsheet programs.

  • You can import it into Google Spreadsheets as well.

  • So that was a plus of using CSVs.

  • But what's perhaps a minus even though we didn't really

  • experience the minus so much?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Harder to what?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: It is harder to do stuff with it because at the end of the day,

  • it's just a stupid text file, and we humans

  • have agreed to put commas in between values we care about.

  • But that's it.

  • Like, there's no features.

  • You open up that file, you're just going to see ASCII text and commas with them.

  • You get no, like, features, no functionality.

  • And do you have another thought?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: It's inefficient why?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: It's harder to find things.

  • Certainly not in the case of, like, me and Rob and Maria.

  • But as soon as you start having hundreds of students registered or thousands

  • of customers in your database or hundreds of thousands, like, in CSV,

  • the best they can do is linear time for everything.

  • You have to search the entire file top to bottom.

  • So wouldn't it be better if we can borrow ideas from week zero,

  • integrate them into the context of this problem, databases,

  • and actually query for data, ask for it, change it, delete it, and edit it

  • much more readily than simple little text files allow?

  • So what should you be thinking about when you think about a database?

  • Well, there's different types.

  • And today, we're going to focus on what are called relational databases, which

  • is a fancy way of saying tables of columns and rows.

  • So this is a screenshot of Google Spreadsheets

  • with which you're likely familiar.

  • The projector seems to be hiding the fact that there are actually

  • rows and columns here.

  • And instead, we're just seeing a big white box.

  • But there are indeed rows and columns inside of here

  • as you can see on the screen.

  • And if I use a [INAUDIBLE] this is a screenshot.

  • So I can't even enable rows and columns.

  • So here, we have our columns and rows.

  • So why is this useful?

  • Well, if you've ever store data for your student group, or for your customers

  • or for other applications around the house,

  • you've probably been in the habit of storing things in rows and columns.

  • And what does each column typically represent in a spreadsheet?

  • What kinds of things might it represent?

  • And here's an actual version of Google Spreadsheets.

  • And let me actually highlight this and turn on, just because, darker lines.

  • So there's our rows and columns.

  • So when you're using a spreadsheet, maybe from personal experience,

  • like, what do you put in the columns?

  • I'm sorry?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Categories of what?

  • Give me a concrete example.

  • AUDIENCE: [INAUDIBLE] the name of the student and lab that they're in and--

  • DAVID MALAN: Oh, what a delightful example.

  • So yes, we can say, like, the student's name and the student's dorm and maybe

  • their email or if we collected their phone number

  • and bunches of other things.

  • And you know, just to be sort of tidy, we might boldface it,

  • and you can use features like this, which just kind of lock that row there.

  • But these are just GUI features.

  • These are not sort of intellectually that interesting.

  • But what do I typically put in each row thereafter?

  • The first row is kind of special usually, which

  • is why you can kind of lock it off.

  • What do you put in the rows?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: I'm sorry?

  • AUDIENCE: Elements [INAUDIBLE]

  • DAVID MALAN: Yeah, the actual data--

  • so, students' names.

  • So if, like, David registered from Matthews,

  • you would put his information in there.

  • And then, like, if Rob registered from Thayer put his in--

  • oh, that's not--

  • Rob's from Thayer, not Rob from Bowden.

  • And then Maria, who--

  • damn it, I asked her this the other day.

  • Where does she live?

  • [INAUDIBLE] Matthews, just like me, which is why I should remember it.

  • OK, so here's-- we now might have multiple rows in this database,

  • each of which represents an entity would be the fancy way of describing where

  • an entity is like a real world thing or a physical object or whatever it is you

  • care about.

  • So this is how we might form at this.

  • And of course literally, we might format things using Excel or Numbers

  • or Google Spreadsheet's, like, format column, which kind of

  • presents the data in a different way.

  • Maybe it centers it to the left or-- the lines to the left or right.

  • Maybe it removes leading zeros.

  • Maybe it adds dollar signs, percentages.

  • You can kind of present the information in different ways.

  • But for the most part, at the moment, it's just text that I've typed in.

  • But if I want to actually find someone in this spreadsheet, what can I do?

  • Well, I can do Command-F and then type in Rob.

  • And there we go.

  • We found Rob.

  • But what did the spreadsheet do?

  • Pretty much looked from top to bottom looking for R-O-B and as,

  • soon as it found it, highlighted that row.

  • And if I had thousands of rows or tens of thousands

  • of rows in the spreadsheet, Rob might be all the way

  • at the bottom at which point it could take longer

  • to actually find all instances of Rob.

  • And in fact, programs like Excel and Numbers

  • and even Google Spreadsheet eventually just start to break down.

  • In fact, I experienced this in grad school

  • myself I was doing some research where I wanted people to analyze data

  • and it was just convenient to be able to double click on my CSV files,

  • open them in Excel, and look at or search for some of the rows

  • that I cared about.

  • But it turns out that because of the long running simulations I was doing,

  • I had, like, over 65,000 rows of data, and Excel just wouldn't do it.

  • I would double click and, sorry, this file

  • is too big because someone at Microsoft, in prior versions of Excel,

  • had decided, first and foremost, surely no one

  • needs more than 65,000 rows in this product.

  • But specifically, they used a number of bits to represent the row number.

  • If you recall, we tend to deal in powers of 2 or multiples of 8 specifically--

  • so 8 and 16 and 32 and so forth.

  • Well, 16 bits, 2 to the 16th, is 65,536.

  • And indeed, I must have had 65,537 rows or more.

  • So Excel just kind of wouldn't cooperate,

  • and I would have to use an actual database like Microsoft Access

  • or Oracle or, as we'll see today, something like SQLite.

  • So as soon as you kind of hit some ceiling with the tool

  • whereby it's slow to search or the program just can't open all of the data

  • or you can't kind of edit information or delete information any better

  • than linearly, it's probably time for an actual database.

  • And one such database is called, indeed, SQLite.

  • So SQL, Structured Query Language, is the language we're about to see.

  • SQLite is like a very free and very lightweight

  • implementation of this database.

  • And it's a lightweight in the sense that you

  • don't need special software licenses.

  • You don't really need to install it.

  • All you have to do is store data in binary format.

  • So up until now in CS50, we've pretty much

  • stored almost everything in ASCI or text format except for P set 4's images.

  • Well, today we're back to storing things in binary format, zeros and ones.

  • But again, we are well past the point of caring what those zeros and ones

  • look like, the patterns there of.

  • So SQLite is going to handle all of that complexity for us,

  • and it's going to store in one big binary file containing

  • zeros and ones the same kind of information as David and Rob

  • and Maria and others but much more efficiently

  • because one of the things you get from a database

  • is the result of a lot of smart people's effort trying

  • to figure out how can we, on behalf of the world, store data more efficiently?

  • How can we provide users around the world with searching and deletion

  • and editing capabilities more efficiently than he or she could

  • implement themselves?

  • So again, in the spirit of CS50, we sort of

  • are standing on the shoulders of people who've come before us.

  • Now what is this going to mean in real terms?

  • We're going to use at least one of two tools.

  • There's a tool called PHP lighted min, happens to be

  • written in another language called PHP.

  • But that's incidental.

  • We're going to use it as a graphical tool with which we

  • can interact with the database program.

  • But there turns out there is also a terminal window version of a management

  • tool called SQLite or technically SQLite 3,

  • which is just a text based way of interacting with the database as well.

  • But that's just how we're going to see the database visually

  • in my terminal window or in my browser.

  • At the end of the day, we're going to talk to our database using

  • actual Python code.

  • But before we get there, let's seed ourselves

  • with some of the basic building blocks that we'll have

  • at our disposal with this language.

  • So unlike C and Python, where we're generally focused

  • on the logical controls, conditions and loops and functions and so forth--

  • some of those do exist in SQL.

  • But we're actually going to focus really on just the operations, so

  • to speak-- the verbs that SQL supports.

  • So it's a little different mental model to have than C and Python.

  • Specifically, perhaps the most common operations

  • we're going to use when storing data in not a spreadsheet like Google

  • but rather a binary file like a SQLite file is these five operations.

  • We can create tables.

  • We can create-- we can insert rows.

  • We can select rows from the table.

  • We can update rows in the table and we can delete rows from the table.

  • So these are kind of like what you would expect

  • you would get from a database program.

  • It's just these happen to be the terms of art as well.

  • Like, these are the verbs we use.

  • And it's conventional to capitalize them,

  • but they don't strictly need to be.

  • So let's look at a few examples now.

  • If I wanted to, in an actual database program,

  • create kind of the equivalent of this idea--

  • rows and columns and things called tables--

  • here's how I might do that.

  • So in my administrative program, which we'll see in a moment, or in code,

  • I might do lines like this, one at a time.

  • We'll review.

  • So in yellow now is CREATE TABLE.

  • So if you want to create the binary equivalent, the computer equivalent

  • of like a Google spreadsheet for [INAUDIBLE] registrants,

  • you would literally do CREATE TABLE quote unquote registrants

  • where that's the name of the table or the sheet you want to deal with.

  • And then this didn't fit on one line.

  • So I wrapped it.

  • But this is technically one long command.

  • Then in parentheses, you specify a, separated list of all of the columns

  • that you want to have in that table, but--

  • and this is fancier than Excel and Numbers

  • and Google Spreadsheets-- you specify a few other details as well.

  • For instance, if my leftmost or first column is going to be called ID

  • and my second is going to be called Name and my third

  • is going to be called Dorm--

  • so similar to what you proposed earlier but

  • we're not going to worry about email and phone number and the like.

  • And I have added, for good measure, a unique identifier, an integer

  • to identify each freshmen registering for sports.

  • There's a few other details I'm clearly providing.

  • Like, what type of information is a name going to be clearly?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Text, and indeed, SQLite has a data type

  • called not string but literally text.

  • Similarly, as dorm going to be text.

  • But ID is going to be a little different.

  • It could be text.

  • But we get better efficiency if we just use numbers

  • to uniquely identify humans, right?

  • Because even in my name, D-A-V-I-D, how many bytes is that minimally?

  • How many bytes?

  • D-A-V-I-D. Five-- yes, five bytes total for D-A-V-I-D.

  • But if we instead represent me in a computer using a number like the number

  • one or two or even the number 2 billion if we have a lot of users,

  • how many bytes does it take to store that number?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Well, even fewer.

  • If I just rep-- if my Harvard ID number is just a single number,

  • like one or two or 2 billion, how many bits

  • do you need to represent an int like that?

  • Typically, we've used four.

  • You could use eight if you want a bigger integer,

  • but generally speaking, it's going to be fewer than the number of characters

  • you're actually carrying about.

  • And so that's why I might use an integer.

  • We'll come back to what primary key means in just a moment.

  • But long story short, that line, those lines in yellow

  • if executed in a database program are going

  • to create the equivalent of a spreadsheet containing

  • three columns that I care about.

  • Thereafter, if I want to insert someone like myself into that table.

  • This is how you use the SQL language to do so.

  • You literally execute insert into registrant,

  • which is the name of the table, then a, separated list

  • in parentheses of the columns for which you have values

  • that you're about to provide.

  • And this allows you to change the order if you really want,

  • but I'm just keeping it clean and going left to right.

  • Then you literally write values.

  • And then in parentheses again, you list a comma separated list of the values

  • that you want to plug into those columns.

  • And again, it's up to you to make sure that if you say ID first,

  • you put a number first.

  • Name second, you put the name second, and so forth.

  • That's where we're in control.

  • Meanwhile, if we want to search for everyone who has registered for

  • [INAUDIBLE],, we're going to see syntax like this.

  • You're going to say to the database select star everyone from the database,

  • specifically from the table called registrants.

  • And that's going to return to you literally

  • every row somehow in code from the table Meanwhile.

  • if I want to change my name because I wasn't sort of following directions

  • and they technically wanted my full name for sports, well,

  • here I can do update registrants, the table,

  • set the name column equal to David Mailand,

  • but this last part is pretty key.

  • If I don't put where ID equals 1, what intuitively do you think

  • might happen in the database?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: All the names go to David Mailand, which is probably incorrect.

  • And some of the C course's head teaching fellows

  • will remember years ago-- we actually had

  • a teaching fellow named Michael [? Tingly ?]

  • whose nickname was Tingles.

  • And we accidentally executed a SQL query omitting

  • this condition, this predicate.

  • Where we should have said where ID equals Michael [? Tingly's ?] ID

  • number, instead we left this off.

  • And thereafter, we had like 700 Michael [? Tingles ?] registered in the class

  • until we restored from backup.

  • So it can happen to you as it did to us.

  • Finally, if you wanted to delete someone,

  • same principles, just different verbs-- delete from the table

  • that you want to delete from where you are

  • specifying an actual ID in this case.

  • And actually just for good measure, though it doesn't strictly

  • matter, let me be consistent and just capitalize that Where as well.

  • The reason for the capitalization, just to be clear,

  • is that it kind of helps the human eye to distinguish

  • what is, like, SQL syntax from words that we humans have invented

  • for own spreadsheet.

  • And notice a few characteristics as well.

  • I have standardized on all lowercase for my column names

  • ID, name, and dorm, which is common, but not strictly required.

  • The table as well is all lowercase.

  • But what is absent from all of these field names is I'm not using spaces.

  • I'm not using funky characters like exclamation points,

  • like "registrants!", or anything like that.

  • I'm just kind of keeping it clean with very simple characters

  • and that's good for portability.

  • It's very commonly the case.

  • And you might have seen or experienced this in the IDE.

  • Any time you start using spaces or funky characters,

  • computers sometimes get confused because they're not expecting them.

  • So the best practice with things like database design

  • is just avoid those issues altogether and keep things simple.

  • Use underscores instead of spaces if you really want to put something--

  • a gap between two things.

  • Now meanwhile, there are different data types.

  • There's not terribly many, and the only ones

  • we have to care about for the coming days are these here, one of which

  • is kind of strangely named.

  • But SQLite, the database we're using henceforth, supports text as we've

  • seen, integers as we've seen, or AKA int-- some of these things

  • have aliases or nicknames

  • Null is going to be kind of the opposite of having any value.

  • You can actually specify, no, there's no value here.

  • Real is going to be what, perhaps?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, like a real number, like a floating point value,

  • as we called it in C and in Python.

  • And then BLOB is binary large object.

  • So if you actually want to store 0's and 1's in some form in your database,

  • you store it as a BLOB, which isn't super common

  • because there are better ways to store binary data like, in files.

  • But it is possible as well.

  • And it's worth noting that there's kind of sort, of some other types,

  • but they're not sort of officially there.

  • SQLite is nice because it's pretty flexible.

  • There are a lot of database programs out there.

  • You've probably heard of some of them like the ones I rattled off--

  • Microsoft Access, Oracle, MySQL, Postgres SQL, and others.

  • SQLite is just an alternative that's pretty flexible.

  • And so if you try to load data from like another database like Oracle

  • into SQLite, it will actually tolerate unofficial types

  • like date times, which look like years and time-- or dates and times

  • literally-- numeric, which is when you're not sure

  • if it's going to be a floating point number or an integer,

  • the database can figure it out for you and some others as well.

  • So don't be confused when reading up on SQLite.

  • If you see some things that aren't wholly consistent,

  • it's just consistent with that kind of flexibility.

  • And in fact, here's an exhaustive chart of all the things from left

  • that map to the right.

  • So in the right are all of SQLite's official data types, these five here.

  • And you might see on the internet in SQL tutorials

  • and like bunches of other types here.

  • The most popular might be, like, bigint, which is a 64-bit integer,

  • which is really big, varchar, which is a variable number of charts--

  • either fewer or more, depending on the data--

  • and lots of other things as well.

  • So just realize they're out there, but the five on the right

  • are the ones we really care about.

  • SQL also has some built-in functions.

  • You can do this in Python code too as we'll see.

  • But date and time and datetime allow you to get, like, the current time.

  • So if a user registers for your website and you

  • want to record when he or she registered,

  • you can call functions like this and the database

  • will just figure it out for you.

  • You don't have to check the computer's own clock.

  • But most importantly, there's going to be some of these decisions.

  • But before we forge ahead with too much more, let me pause for any questions.

  • Yeah.

  • AUDIENCE: [INAUDIBLE] when you [INAUDIBLE]

  • DAVID MALAN: Ah, good question.

  • Typically, no.

  • But if you want to be really precise, yes.

  • So I was really precise and I actually read the documentation this morning

  • to make sure I got all of those quotes right.

  • The rule of thumb if you're curious--

  • though for the most part, this won't matter and it shouldn't hang up--

  • is that any time you are referring to the name of an identifier,

  • as it's called, like a column name or a table name,

  • you use double quotes with SQLite.

  • Anytime you're referring to a literal string

  • that the human has typed in somewhere like with David Mailand

  • and like with Matthews, you use single quotes.

  • So if it's kind of user input, single.

  • If it's a name in your database, then it's double coated,

  • but it doesn't typically matter.

  • So don't fret two much.

  • So let's consider then for just a moment where we

  • started this particular conversation.

  • We were trying to create a database or spreadsheet for [INAUDIBLE] IM's,

  • but we kind of stopped short of the juicy part.

  • Like, I typed in some names and some dorms.

  • But I could have started typing email, which

  • would have hopefully some standard format-- something at something dot

  • whatever.

  • And then phone number is kind of interesting, right?

  • Like, how should I store a number?

  • Like, if I just want people calling Harvard's main number,

  • is that, like, 459-- what is it 5,000?

  • Well should I store it like this--

  • 10 digits and two hyphens?

  • Should I instead store it without the hyphens just to kind of save space?

  • Should I, need to be consistent with international folks,

  • store the country code as well, which now Google is confusing for math?

  • So that's what Harvard's phone number is when you add the numbers together

  • apparently.

  • How do you store the information?

  • Or better still, what should you expect the users to type in?

  • When you visit a website and you are asked for your phone number

  • to buy something or create an account, like, in what format

  • are you typically prompted for your number at least in the US?

  • AUDIENCE: Area code, [INAUDIBLE]

  • DAVID MALAN: Yeah, you're often prompted [INAUDIBLE] like an example like this.

  • And then you have to type it in and hit Enter.

  • And then it's stored, like that.

  • What else have you seen-- probably the other formats too?

  • So which way should you do it?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yes, yes, all of these ways are perfectly reasonable,

  • and any website that forces you to type your data into one format is stupid.

  • Like, it is poorly designed.

  • And this is one of my greatest-- well, it's not my greatest pet peeves.

  • But like, one of my pet peeves is when a web site just completely

  • technologically unnecessarily puts the burden of the human to format the data.

  • Like, I should certainly be able to type my information in like this

  • or with the plus one or without the parentheses

  • or with an additional hyphen or the like because you know what you

  • the programmer who made the website can do?

  • You can just use some kind of pattern matching

  • in the spirit even of P set six, throw away all of the punctuation,

  • and then store just the digits if you want or go in

  • and forcibly add the hyphens where you want them

  • or the parentheses where you want them.

  • The burden does not need to be on the human.

  • And there's actually compelling UX, or User Experience,

  • reasons for this beyond me going off kind of on a rant.

  • Like, if you can't copy and paste something into a field,

  • that is a bad user experience.

  • And if you've ever tried to just copy one field into another

  • and the website is disabling that for you, it's not the best practice.

  • Worse yet are websites that don't let you paste your passwords into the forms

  • because if you store your passwords, as is good practice, in like a password

  • manager, encrypted somehow, it's better to have a really long password

  • and then paste it into the form than memorize a really short password

  • that someone else could guess.

  • So in short, as we move forward with web programming,

  • don't necessarily take as canon, like, what you see out there, but consider

  • for your own users, like, what would be the best way to do this

  • and how can I, knowing programming, accommodate different styles

  • and expectations as well?

  • But at the end of the day, there's an issue here

  • because is this phone number now a number, or is it a string or text?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: So OK, so I heard a few strings or text.

  • So that's good because SQLite supports text.

  • So I could specify, when I convert the spreadsheet into SQL commands and a SQL

  • database, I could use text.

  • And so that's probably a good thing.

  • What about zip codes, right?

  • Zip codes, at least if we do the first five digits in the US--

  • like 02138 to.

  • Is that a number or is that text?

  • AUDIENCE: Text.

  • DAVID MALAN: Text?

  • Why?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: I'm sorry?

  • Say again?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, it's more of a category

  • than it is a quantity, certainly.

  • Like I don't have zero, one, two, three, eight of things.

  • That is actually just meant to be a unique symbol.

  • And in fact, as a side effect of the spreadsheet trying to be helpful, here

  • is an actual bug, or reason to store it as text.

  • I typed in 02138, which is Cambridge, Massachusetts,

  • one of its largest zip codes.

  • And unfortunately, Google just decided, you don't really

  • need that leading zero, do you, and it got rid of it.

  • And so this is because, at least in Google Spreadsheets, my data here,

  • if I go under Format, it's being stored automatically.

  • Google, for better or for worse-- often for worse--

  • is inferring that, if it looks like a number, it is a number, therefore let's

  • throw away the leading zeros, which humans don't need.

  • But if I change it to plain text and now do 02138, now it kind of sticks.

  • And another sort of true story, years ago,

  • I used to use Microsoft Outlook for my email.

  • And then eventually, I switched to Gmail.

  • And before I did that, I wanted to copy my contacts,

  • like my address book from Outlook into Gmail.

  • And so Outlook actually has a feature where

  • you can export a big CSV, literally, from Outlook.

  • And that contains columns of, like, all your friends' names and numbers

  • and addresses and zip codes.

  • And I foolishly didn't realize that I probably

  • shouldn't open the file first in Excel to look at it,

  • probably instinctively hit Command S or Control S just to save it,

  • because that's like a habit.

  • As soon as I did that, Excel presumptuously

  • did a Google just did, and dropped all those leading zeros.

  • And to this day, like 10 years later, I'm

  • still finding friends whose addresses are Cambridge, Massachusetts

  • 2138 for that very reason.

  • So data formats matter.

  • And we actually have a number of decisions to make, and moreover,

  • if we want this database to be highly performing-- that is, quick--

  • we actually have a few other design decisions.

  • And so let's take our five-minute break here, and when we come back,

  • we'll actually wrestle with how can you store the data most efficiently so

  • that your web applications are super fast.

  • So we are back.

  • And let's see what problems now arise besides the most immediate, which

  • was the phone number.

  • So let's consider that these aren't necessarily students registering

  • for [INAUDIBLE],, and so therefore they don't just

  • have dorms and so forth, but maybe these are actually

  • like customers in a database, Amazon customers or the like.

  • And so people are indeed going to have names, we'll say.

  • And people are going to have addresses, address.

  • And people will have phone numbers still, email, and maybe

  • some other identifiers, as well.

  • So when it comes to storing actual data here--

  • and just for clarity, these will be my column headings now, so

  • special first column--

  • so what's my name here going to be, David Malan?

  • Should it be David?

  • Like, how do you actually think about something as trivial as this,

  • and what are some of the upsides and downsides?

  • Like, which should it be?

  • AUDIENCE: David Malan.

  • DAVID MALAN: David Malan, OK.

  • So David Malan is what we store.

  • OK?

  • But now, later on, I realize I'd really like

  • to send customized emails to my customers,

  • and it feels kind of lame and informal if I say dear David Malan, where

  • clearly it's not very intimate.

  • I'd like it to say, dear David comma.

  • So how do you address that?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: OK, so maybe we should store first and last names separated.

  • So to do that, maybe I'll just kind of redo this, and actually

  • call this First, and then call this Last.

  • And so now, I can keep my two columns separated.

  • Other thoughts?

  • That's pretty good.

  • Or--

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Sorry?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Oh, so we have like the greeting, as well,

  • which you might be asked by a dropdown or a text box, too.

  • So maybe we need that, as well.

  • So technically, I should probably move this over like this.

  • And this will be like the greeting, and maybe this

  • will be Mr. or some number of other symbols.

  • So we might have to deal with that, as well.

  • What about address?

  • Let's move on to that, because honestly, with most of these questions,

  • there's not necessarily a best answer.

  • There might be some best practices, but even then you're

  • going to hit corner cases.

  • For instance now, anytime I do want the user's first name and last name,

  • I'm gonna have to concatenate them together, which is fine.

  • It's not a big deal in Python.

  • It was a bigger deal in C. So there's these tradeoffs.

  • Like, I just have to now do that, or write more code later.

  • What about the address?

  • So it might be 33 Oxford Street, the CS building in Cambridge, Massachusetts,

  • 02138 USA.

  • Let me even make more room for all of this.

  • Good?

  • Bad?

  • Why?

  • Let's go here.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Sorry?

  • AUDIENCE: [INAUDIBLE] separate the data [INAUDIBLE] zip code [INAUDIBLE]..

  • DAVID MALAN: Well you say need to separate the data into street and city

  • and country and zip code and state, but why?

  • AUDIENCE: [INAUDIBLE] right now to sort people by [INAUDIBLE]..

  • DAVID MALAN: Yeah, that's a great example.

  • So if everything has just kind of been this long string, which

  • we humans and the postmaster general in the US can understand,

  • and computers can parse, it's not that easy to just glean what we care about

  • from that whole long string.

  • Like, we're gonna have to use some kind of heuristics,

  • we're gonna have to look for all numbers to get the zip codes.

  • And I can, as you say, not sort everyone who's in Massachusetts,

  • or everyone who's in 02138, because like it's not a separate field.

  • And so even mechanically in Google Spreadsheets,

  • I can't sort by anything other than 33 Oxford Street.

  • And I probably don't really care to sort people

  • by their street addresses, which really have no notion of ordering

  • that's interesting.

  • So we probably want to split this up.

  • So maybe this is actually Street, and maybe this is actually--

  • let me shrink this to make a little more room now--

  • maybe this is City, maybe this is State, maybe this is Zip,

  • maybe this is now Country.

  • And then over here, we have Email and Phone and so forth.

  • And so now, I could do 33 Oxford Street.

  • And then I could do Cambridge, Mass, 02138, USA, malan@harvard.edu, 6174495.

  • See, this is why you shouldn't allow user input in different formats.

  • 495, 5000 and so forth.

  • OK, so suppose we do that.

  • But you know, Rob has an office in the CS building, as well.

  • And so he's actually Mr. Rob Bowden.

  • So now he and I are both there.

  • And oops, we didn't fix this, so technically let's go back here.

  • 02138.

  • And now it's like rob@cs.harvard.edu.

  • And maybe we share the same office number.

  • And there are certainly, like, dozens of other people in the CS building,

  • if not a couple of hundred.

  • So this is actually yielding another problem.

  • At least if I keep typing, I'm going to dislike this for some other reason.

  • What's also bad about this?

  • Yeah.

  • AUDIENCE: [INAUDIBLE] the office room.

  • DAVID MALAN: The office room, yeah.

  • So we didn't specify the room in the building.

  • So we're missing data all together.

  • What else really rears its head over time?

  • Yeah.

  • AUDIENCE: It's very redundant.

  • DAVID MALAN: What is redundant?

  • AUDIENCE: [INAUDIBLE] wasted space [INAUDIBLE]..

  • DAVID MALAN: Which information is redundant, then?

  • AUDIENCE: The street [INAUDIBLE].

  • DAVID MALAN: Yeah, there's a lot of--

  • I mean, literally I copied and pasted, which in coding in general

  • should kind of be a bad sign or a bad habit,

  • certainly if you do it more than just once.

  • And so Cambridge is identical, Massachusetts is identical,

  • zip code is identical.

  • And even though this is gonna be a small white lie,

  • it's probably the case that every zip code actually maps to a unique city.

  • So I don't-- if I know 02138, do I really need to store Cambridge, Mass,

  • Cambridge, Mass, Cambridge, Mass for everyone?

  • No, it's probably all the same.

  • So I should be able to delete some of this data.

  • And this kind of messiness is a problem that databases can now solve.

  • You could do this in Excel and in Google Spreadsheets and so forth.

  • You can quote unquote normalize your data by factoring out commonalities.

  • But to do so, the most common way, then, is

  • to start assigning unique identifiers to things.

  • So for instance, if I want to go in here-- let me give myself a new column,

  • call it ID.

  • And I'm just going to be user one, Rob will be user two, and then three

  • and four and so forth.

  • We're just gonna use an int to keep it simple and efficient,

  • because now it's only like 32 bits, or maybe 64 bits,

  • no matter how long the humans' names and street names and so forth are.

  • Let me just fix the border so that everything looks uniform.

  • So now I've done this.

  • And you know what, if you've never noticed,

  • like Excel and Numbers and Google support

  • multiple sheets in a spreadsheet.

  • So just for good measure, let me actually rename this to Customers.

  • And you know what?

  • I'm gonna go ahead in Google Spreadsheets,

  • give myself another one called Cities.

  • And then here, I'm gonna do something like this.

  • I'm gonna store over here maybe a let's say city and state.

  • And actually, you know what, because cities I'm

  • gonna claim for the moment are uniquely identified by their zip codes,

  • let's just put zip code there.

  • And then in this spreadsheet, let's do--

  • preemptively this time-- plain text--

  • 02138 shall be Cambridge, Massachusetts.

  • And now, back in my customers database, which

  • of these columns, A through I or so, do I actually not need anymore?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, I don't need this.

  • I don't need this.

  • I probably still need zip.

  • So let me just go ahead and delete these and start

  • to assuage your concern, which was this redundancy of data.

  • Now, I can do a little better, but now at least I'm storing zip code only.

  • And actually, this is getting a little lazy.

  • Let me get rid of country, as well, and just put this in here, as well,

  • for generalizing zip code to be postal code now, say.

  • And this shall be USA.

  • So now my Customers table has a lot less redundancy.

  • So clear upside is that it just takes up less space,

  • and less space means less money, and just more room for other things.

  • So that's surely good.

  • But it probably means too it's easier to update the data, right?

  • If someone-- if Cambridge for whatever reason

  • changes its name back to Newtown, as it apparently once was,

  • we just change it in one place, not for every resident in the Customers table,

  • and so forth.

  • So this is just a good principle, right?

  • Anytime in coding we've done something again and again, or in CSS even,

  • like we've tried to factor it out.

  • In HTML, we factored it out with templates or with functions

  • or with other techniques.

  • So this normalization process feels like a good thing.

  • But you know what?

  • 02138, we can do a little better, because that's a string.

  • We could just store this as a number.

  • So you know what I'm actually gonna do?

  • I'm gonna also have in this spreadsheet a column called

  • ID whose value for this city is gonna be one, and then for subsequent cities

  • it's just gonna an arbitrary number that's just increasing and unique.

  • And so now what I can actually do here is I don't strictly need to store even

  • 02138 dash whatever whatever, I can just store the ID.

  • And I'm gonna go ahead and call that zip ID, just to make super clear to me

  • that now this is just a number.

  • And so by starting to assign arbitrary but consistent unique integers to all

  • of our data, you can kind of relate them back to one another

  • by looking up one value in the other.

  • So for instance, if you're looking at this table as a human

  • and you see that, oh, David and Rob live in zip ID 1,

  • you do have to ask a follow-up question, which is, well, what is zip ID 1?

  • You can only answer that by looking at the Cities table, checking it,

  • and being like, oh, Rob and David both live in Cambridge, Massachusetts,

  • 02138, USA.

  • So it's added a step, but it's also massively shrunk

  • the amount of data we need, assuming there's more people in the world

  • than just the two of us, and there's millions

  • of users or thousands of users, or just otherwise context

  • where you have lots and lots of data.

  • And in fact, as an example of this, what we put together on the course's website

  • is a Google spreadsheet that has a whole bunch of sheets.

  • This is based on a freely available open source

  • database that looks like real data, but it's mostly

  • taken from one guy's iTunes account and then made up names and company names

  • and so forth.

  • And you'll see that this is actually a database for like a digital music

  • store, a la iTunes or Google Play or Spotify or whatever.

  • And so there's different entities in the world of digital music.

  • You have album names, you have artist names,

  • you have customers who might buy those albums and listen to those artists.

  • You have employees at the company who is selling the music.

  • You have genres for those songs.

  • Because this is actually a very slippery slope whereby, over time,

  • you're gonna have lots and lots of data that

  • ideally would be in multiple places at once, but that's just bad design.

  • Keeping it factored out is ideal.

  • So if we glance at this album title, or rather album tab,

  • here are all of the albums that this digital music store sells.

  • Here are the unique IDs, album ID, that have

  • been assigned to those albums, arbitrarily but for permanently.

  • And then here are the artists who published those albums.

  • So the only way to know the artist's name

  • is-- at least in Google, I'm gonna actually have

  • to scroll over to my Artists tab and, OK,

  • here are the artists and their numbers.

  • Meanwhile, all of these albums later-- or songs are associated with genres.

  • So here's all the genres and the IDs associated therewith, and so forth.

  • So you're welcome to poke around this kind of data later.

  • Here's even data involving birth dates, so dates and times and so forth.

  • Here are invoices, like the addresses and the cities

  • of people who bought this music, supposedly, in the sample database.

  • But honestly, this is very, very, very quickly becoming

  • really really annoying, right?

  • Because if you the human just so much want to know, OK,

  • who produced Restless and Wild, artist two, who the heck was artist two?

  • This is not good for humans.

  • But this is good for computers.

  • We just need a language with which to reconstruct what we want and query for

  • and change what we want.

  • And that is the kind of thing that SQL is going to do for us.

  • So let me change to another tab here in CS50 IDE.

  • Let me find the right tab here.

  • So if I go back to CS50 IDE, let me introduce a tool called phpLiteAdmin.

  • Name doesn't really matter, but this is a tool that comes with the IDE--

  • and we downloaded it for free off the web and just installed it for you--

  • that allows us to do this.

  • I have among today's distribution code a file called lecture.db.

  • .db just means it's a database.

  • It could be another file extension, as well,

  • but this is one of the standard ones.

  • And notice that-- let me close the console-- if I double-click this,

  • this is going to open a special tab in the IDE called phpLiteAdmin.

  • What I am now seeing is the content of lecture.db.

  • Lecture.db is a binary file, zeros and ones,

  • that represents a SQLite database that I made before class.

  • So I essentially downloaded all this data, I made a few changes to it,

  • and then saved it in a file called lecture.db.

  • phpLiteAdmin, this GUI here in the tab, is just

  • a way of visualizing for us humans what's in this database.

  • And you can see there's a bunch of tables, I think 11 different tables,

  • that I named identically to the Google spreadsheet.

  • So this is the SQLite equivalent of the Google spreadsheet.

  • But notice just from the links alone-- not that this is a very good design--

  • the overwhelming amount of functionality that I now get.

  • I have the ability apparently already to browse the data, search the data,

  • insert, export the data, empty the data, and so forth

  • that we just get from a database.

  • And we'll see how to do that in just a moment.

  • If I go ahead and click on Album, I'm gonna

  • see the same data as before within this graphical tool,

  • but again I just have some handy functionality.

  • I have the ability to edit each of these albums, delete them.

  • But I can also see the artist ID.

  • Now, this isn't all that useful, but let me go ahead and do this.

  • Let me go ahead and go to the SQL tab.

  • And this is where this tool is actually handy.

  • phpLiteAdmin is not something you put on your resume.

  • This is just like a tool we will use to help visualize what's inside

  • of our database, and to try out queries before we actually

  • write them into code.

  • I recall from before that one of the operations that SQL supports is Select.

  • So let me try that, select star from album.

  • And these are the key words from SQL.

  • This is the name of my table.

  • I could quote it, but it's not strictly necessary to,

  • so I'm not gonna bother here.

  • And if I go ahead now and zoom out-- whoops--

  • and click Go, this is going to select for me

  • all of the albums in the album table.

  • Now, this is not all that interesting, because just a moment ago I

  • just clicked on the word album and I showed you all of the albums.

  • But suppose I don't care about the album ID.

  • I don't care about the artist ID at the moment.

  • Well, let me go to SQL tab again and do select star from album.

  • But instead of star, which was the wildcard, what might

  • I want to select instead?

  • Just the name of the album?

  • So let me do that, select--

  • oh sorry, it was called title, title from album.

  • Click Go.

  • And now I get back a set of rows, a result set, as it's called,

  • of just the titles.

  • Now again, I'm not programming, I'm not using this data.

  • But this is the language, the way I can express myself in the SQL language

  • to get back only the data I care about.

  • And you know what?

  • Let me go ahead and do this now.

  • And this is where it gets more powerful.

  • If I go back to selecting all the data, there's the artist ID.

  • How do I get the artist ID?

  • Well let me go to Artist, and there's that same data, artist ID

  • and all of the names.

  • But here's where SQL gets powerful.

  • I can now take one table, one of whose columns

  • is these numbers, take another table, one of whose columns

  • is these numbers, as well, and I can kind of join them cleverly

  • by lining up the numbers that are identical in both.

  • The syntax is a little funky at first, but let's try this.

  • Select star from album join artist.

  • I want to join the two.

  • But now I have to tell the database what two columns to join.

  • It just looks like numbers, but I have to tell the database which

  • columns should line up.

  • So artist on album dot artist ID equals artist dot artist ID.

  • So this is just a predicate or a condition at the end,

  • but you can kind of read it intuitively.

  • Select everything from the result of joining album and artist,

  • provided you join them on album dot artist

  • ID, which is the artist ID column in the album table on the artist ID

  • column in the artist table.

  • So that's kind of doing this digital equivalent.

  • And so now if I go ahead and click Go, amazing.

  • Like, now I get back all of the data reconstructed

  • in a way that's useful, because now every row contains the album

  • ID and the artist ID, but more importantly the title

  • of the song and the name of the artist.

  • And I now have the ability to query for the data I want without inefficiently

  • storing all of this together.

  • And I can actually store a join on yet more tables in here, too.

  • You'll notice that playlist, for instance, is one of the tables in here,

  • because this digital music store supports the ability to make playlists.

  • And here we have playlist IDs which map to playlist names.

  • But how do you associate songs with a playlist?

  • Well, if a playlist just has a name and an ID,

  • you need some way of associating songs with a playlist by way of these IDs.

  • So we actually need kind of a many to many relationship, so to speak.

  • And that's what a database person would say.

  • If you want multiple people, playlists, to all

  • be able to have different songs but maybe overlapping songs,

  • so that you can't dedicate one song to one person,

  • you want a many to many relationship.

  • And so if we look at Playlist Track, notice this column is crazy.

  • This column-- or rather this table--

  • has just playlist ID, track ID, whereby track ID maps

  • to one of the rows in the track table, playlist ID maps to one of the rows

  • in the playlist table.

  • And by using this table and joining with multiple tables,

  • we can map playlist, which is just names of playlists and IDs,

  • to tracks, which is names of songs on certain albums.

  • And with the right join commands, I can actually

  • reconstruct all of the data I care about, and then in one breath

  • get all of the data back and let the database figure

  • out how to find all of those people and all of those songs

  • and all of those playlists efficiently.

  • A CSV, you would have to read every darn file top to bottom

  • and then figure it all out yourself.

  • But there are some key design decisions to be made.

  • I mentioned earlier not just these data types that SQL supports,

  • but also the constraints pictured here.

  • So these are a few new terms that we now need as ingredients

  • to designing a good database.

  • And we've seen one of these before, Primary Key.

  • It turns out that, anytime in one of these tables

  • we have defined for ourselves an arbitrary but a unique number,

  • generally in the left-hand column, just by convention, called ID or called

  • Artist ID or Album ID, whatever you call it,

  • that we've declared typically as a primary key.

  • This is a way of telling the database, when you design the database,

  • give me a table with an ID column that is meant to be the primary key.

  • That is, database, I promise to you that this column will always be unique.

  • Therefore, database, use this in the future.

  • Anytime you want to do something efficient,

  • rely on that column as uniquely identifying rows.

  • Don't use albums or artist names and titles,

  • which can be longer and non-unique.

  • Instead, use that, as well.

  • But there's other fields that you might want to flag as being unique,

  • but not your primary key, the number that uniquely identifies.

  • Like in the world of customers, which field that we played with a moment ago

  • might you arguably want to make unique?

  • And this was our own little database here of customers.

  • Which are the candidates for unique values here?

  • In back?

  • No, here.

  • AUDIENCE: Phone.

  • DAVID MALAN: No.

  • Phone, phone could be unique, certainly in the world of mobile phones.

  • But some people still have landlines.

  • That might get a little messy.

  • So maybe, maybe not.

  • What else?

  • AUDIENCE: Names?

  • DAVID MALAN: Names, maybe.

  • But if you Google yourself, odds are there's

  • many people in the world with your same name.

  • And you don't want to have one of you only buying things from the store.

  • Yeah.

  • AUDIENCE: [INAUDIBLE].

  • DAVID MALAN: What's that?

  • AUDIENCE: [INAUDIBLE].

  • DAVID MALAN: Email probably more likely.

  • Now, we can probably think of some family members

  • even who are maybe sharing an email address at home,

  • which complicates even that scenario.

  • But maybe.

  • If you do at least want to mandate that email addresses be the same--

  • I'd probably go with email over name, which

  • is gonna collide with multiple people--

  • then maybe we could specify yes, email address for our customers

  • must be unique, because maybe they're using it to log in.

  • Right?

  • If you're using email as a login name, it better

  • be unique so that there can't be two people with different passwords

  • but the same username or email address.

  • But an email address is kind of long.

  • Malan@harvard.edu, rob@cs.harvard.

  • Like, that's more bytes than an integer.

  • So long story short, it might make good sense

  • to uniquely identify your users at the end of the day

  • with just an ID, a numeric ID, the primary key.

  • But one of the features you get from a database is you can tell the database,

  • hey database, make this other column also unique,

  • and make sure that I or my colleagues can never accidentally insert data

  • into the database that is duplicated.

  • The database will defend against that for you.

  • You don't need ifs and else ifs in your own code,

  • you can let the database do that for you.

  • Index, meanwhile, is really powerful.

  • You can specify by the keyword index, hey database,

  • this column is important to me.

  • I am going to want to do searches on this column.

  • I am gonna want to look up people by this column.

  • Therefore, please go ahead and index it by using

  • your sort of secret sauce using fancy data structures and solid algorithms

  • to actually find data efficiently.

  • And here again is where we're standing on the shoulders of others.

  • Other computer scientists have figured out really good algorithms and data

  • structures, usually involving trees, which

  • we did discuss a few weeks back, that allow you to find data generally

  • in logarithmic time, not linear time.

  • And so simply by using this keyword index,

  • we'll see you can tell the database use those years of knowledge

  • to actually find data for little old me more quickly.

  • Not null.

  • You can specify don't let null end up in this column.

  • I must have a user's name.

  • I must have their phone number.

  • It cannot be blank.

  • Foreign key is even more fancy, but we've seen it already, in fact.

  • When we looked in just these examples here, rather for the songs database,

  • notice that in Album we had two types of IDs.

  • We had album ID in the Album table associated with titles,

  • and then each of these albums was associated with an artist ID.

  • And we've got two keywords here to apply.

  • I claim that this is the primary key for this table

  • because it just uniquely identifies the albums.

  • Why is that the primary?

  • Well, the table itself, the spreadsheet is called Album.

  • This is the unique ID that's primarily responsible for identifying rows.

  • That key over there is not primary, because there can only be one,

  • so it's called a foreign key.

  • But it's foreign in what sense?

  • AUDIENCE: [INAUDIBLE].

  • DAVID MALAN: Say it again?

  • AUDIENCE: [INAUDIBLE].

  • DAVID MALAN: To the album.

  • It's related to the album in that it's the artist,

  • but it's foreign in what way?

  • AUDIENCE: [INAUDIBLE].

  • DAVID MALAN: It's in a different file in a different table.

  • So it is a primary key in another table, which

  • is what makes it foreign if we mention it in this table, right?

  • Strictly speaking, this column doesn't need to be here.

  • I could just go ahead and forget about all

  • of the values in this highlighted column.

  • But once I add it, that allows me to link it primarily to another table,

  • because in this context it's just called foreign keys.

  • So that's it for the vocabulary there.

  • So what does this actually all mean?

  • Let me go ahead now into phpLiteAdmin, which

  • is again just our web-based tool for manipulating data.

  • And let me go ahead actually and do this.

  • Let me go ahead and close this version of phpLiteAdmin.

  • Let me go back over to my file browser.

  • And let me open up a terminal window for just a moment.

  • And I'm actually gonna go ahead and do this.

  • If I want to create a new file in Linux, the operating system we're using,

  • you can actually do this pretty easily by just touching the file,

  • even if it doesn't exist.

  • So touch customers.db, and now notice at left customers.db exists.

  • It's just an empty file.

  • I can now, wonderfully, double-click that and begin interacting with it

  • as though it's a database.

  • There's no zeros and ones in it yet until I start manipulating it,

  • and that's why it says no tables in database.

  • It's just an empty file.

  • So let's create an actual table in this database akin to what I was doing

  • in Excel, or in Google Spreadsheets.

  • So here is where in this user interface I

  • can create a new table on database Customers.

  • So let me go ahead and call this my Users table, because they-- well,

  • we're just gonna call it--

  • yeah, Users is good.

  • And number of fields, let's just do four this time.

  • And now I'm gonna go ahead and click Go.

  • And this is again what you get from phpLiteAdmin.

  • It's just a nice, user-friendly interface

  • for, like, answering questions.

  • So what columns do I want in this table?

  • Well, the first one is going to be the customer's name,

  • or maybe it should be first name, so like first underscore name or F Name

  • or whatever you want to call it.

  • I'll just call it First.

  • What should the data type be?

  • AUDIENCE: Text.

  • DAVID MALAN: Yeah, text.

  • And then last name should be text.

  • And actually, I'm getting ahead of myself.

  • I don't want to uniquely identify users by their names.

  • What did we decide?

  • AUDIENCE: ID.

  • DAVID MALAN: Yeah, ID.

  • So just to be kind of tidy, I'm gonna put it first.

  • It doesn't strictly matter, but it's just kind of the norm.

  • So I could call this ID, as would be a norm.

  • You could call it user ID, which might be another norm,

  • so long as you're consistent is what's important.

  • This should be not text but integer.

  • And notice these other features.

  • I can tell the database this is my primary key.

  • And now no other field should be primary.

  • I can tell the database, you know what, you deal with auto incrementing this.

  • I don't want to keep track of who is what number.

  • The database can assign an auto-incremented ID,

  • so one becomes two becomes three over time.

  • Not null is a given, because if it's a primary key it can't be blank.

  • And then this is irrelevant here, but you can give default values.

  • So for instance, if, for whatever reason,

  • people want to remain anonymous in their name, we could say,

  • well, the default name for everyone in my database

  • will be Bob, or something like that.

  • But that is kind of silly, as well.

  • We probably want to say that there is no default value, but rather first name

  • and last name must not be null so that the users must cooperate and provide us

  • with a value.

  • As for the last one here, I don't know, we'll just use email for now.

  • We'll make that text.

  • And it turns out you can't specify in the database

  • that an email address must have something at something.

  • We're gonna have to do that if we ever want in code.

  • But I can at least make it not null.

  • Now notice what happens.

  • Even though this is a GUI, graphical user interface,

  • when I click Create now, I'm told this-- table Users has been created.

  • But what's nice about phpLiteAdmin, which is largely why we use it,

  • is it's a good teaching tool, too.

  • This is what phpLiteAdmin just executed in order to create this table.

  • So it's a way of kind of learning SQL as you go by just click, click,

  • clicking in a user-friendly way, and then seeing immediately feedback

  • as to what command you could have just typed manually

  • in order to create that same table.

  • And it's wrapping because it's a little long, and it's not nicely indented.

  • But here, we have something very reminiscent

  • of what I put on the board earlier.

  • We have an ID field that's an integer, that's a primary key.

  • And by the way, it should also be auto-incremented.

  • That's a new feature.

  • But it cannot be null.

  • Over here, first name is text, must not be null.

  • Last is-- should not be an integer.

  • That's a screw-up.

  • So I should have made that text, so that's flawed at the moment.

  • And then email all the way over here is text, not null.

  • So what does this actually mean?

  • Well, let me show you one other way before we use this in actual code

  • as to how else I could be interacting with this database.

  • If I open up a terminal and I go ahead and type this now, SQLite

  • 3, which is the third version of the program,

  • I'm gonna make this my alternative database and hit Enter.

  • This is not as user-friendly.

  • Like, I have to know what I type, or it seems I can type Help to get some help.

  • And these are all of the commands that I can type at the command line.

  • It's not all that user-friendly.

  • But I do know this.

  • Once I know SQL, I can just interact with this with SQL commands.

  • And I'm gonna go ahead and fix this in my alternative database.

  • I did not mean to say integer, so let me make that text.

  • I'm gonna put a semicolon at the end, which is typically needed,

  • though it's not always needed.

  • So let me go ahead and hit Enter now.

  • Nothing seems to happen.

  • But in SQLite, if I type Schema, schema is like my database design.

  • .schema means show me my database.

  • And any command that starts with a dot is SQLite specific.

  • It is not SQL, it's now part of the language, it's just a local command.

  • Oh, that's kind of cool.

  • It just spits back out what it is my tables look like.

  • I can see this in another way.

  • Let me go ahead and close this for just a moment.

  • And notice that, in today's files on the website,

  • I have lecture.db, which was the music database that we opened up earlier.

  • Let me look at that.

  • SQLite 2, lecture.db.

  • Let me do .schema.

  • Oh.

  • That is all of the data inside of this database redisplayed to me as text.

  • So I can kind of back it up if I want, or do other things.

  • But let's just skim it for something interesting.

  • Let me go to the top.

  • And here we have, nicely formatted, these are the commands

  • that the people who wrote this database online from whom we downloaded it

  • essentially made.

  • Create Table, Album.

  • The square brackets are optional, but they just are alternatives to quotes.

  • One of the fields will be called Album ID, Title, Artist ID.

  • We've seen those things before.

  • One is an integer, one is text, another is integer.

  • And then we'd see these features here, which we didn't see earlier.

  • And I'm gonna wave my hand at them for now.

  • But there is a way to tell the database that there is indeed a foreign key.

  • So you can say, in this table, there's a number that refers to another table,

  • and you can let the database make sure that you can never

  • add an artist ID that doesn't map to an actual artist.

  • So you can't just insert artist number one, two, three

  • if artist number one, two, three doesn't even exist.

  • So again, another feature you get from the database.

  • And for PSET 7 we're just gonna scratch the surface of some of these.

  • But again, this is one of the reasons that you sort of graduate from text

  • files and you graduate from [INAUDIBLE] and start using [? full-pledged ?]

  • database software like SQLite or Oracle or other such tools.

  • So let me pause to see if there's any questions before we dive into Python.

  • All right, so let's now actually use this in code.

  • The value of seeing, hopefully, SQLite 3 and in seeing phpLiteAdmin

  • is just so that you have tools with which you can create your databases

  • and actually see what's inside of them without having

  • to write code to do both, which just makes it very hard and annoying to get

  • started.

  • So phpLiteAdmin is what you'll use for problem set seven.

  • And we'll end with a look at that in a bit.

  • But it's just a way of administering your database.

  • It's not actually the intellectually juicy part.

  • For that, we're gonna need some actual Python code.

  • So for that, let me go ahead and open up my console again.

  • Let me get some tabs open here.

  • And let me go into let's say Lecture 0 folder from today.

  • And this is not online now, but soon will be.

  • Let me open up application.py and see what this app is.

  • Super simple.

  • This is just copy paste pretty much from past examples.

  • And notice that it has one route, slash, that's gonna render index.html.

  • And then what is that template?

  • Well, index.html is one of two files.

  • My layout is big generic HTML structure of the page.

  • Index.html looks like this.

  • It's a big to-do.

  • So let's actually do something with this program as follows.

  • But first, let's take a stepping stone.

  • Let me go into New File, save this as lecture.py.

  • It's just a local script, nothing to do with web programming just yet.

  • And let me go ahead and do this.

  • First, for those in the room who've never seen Python before today,

  • python lecture.py--

  • that is a program written in Python.

  • If you're underwhelmed, that's deliberate at this point.

  • It's a one-line program.

  • But now we can do something so much more powerful.

  • Let me go ahead and essentially do this.

  • In pseudo code, query database for all albums.

  • Whoops, albums.

  • For each album in database, print title of album.

  • Like, suppose this is now the program we want to write.

  • How do I do this?

  • Well, in Python, I'm gonna go ahead and do this.

  • From CS50, import CS50 SQL library, which

  • will be introduced again in PSET 7.

  • And it just makes SQL a little easier to get into.

  • And for your final projects, you're welcome to drop use of the library

  • all together.

  • It's just training wheels.

  • And I'm gonna do this.

  • DB for database, but I could call that anything I want,

  • is going to get a SQL database whose file is of type SQLite colon slash

  • slash slash just because dot DB.

  • So long story short, this one line of code just says,

  • hey Python, open a SQLite database called lecture.db

  • that's in the same directory.

  • That's all.

  • So after that, I have a Python variable called DB.

  • And this is a variable that allows me to talk to the database in code.

  • And this is the missing piece.

  • So far the only way we've interacted with lecture.db thus far--

  • and lecture.db, again, is the musical database--

  • is by using phpLiteAdmin, which is just this administrative GUI,

  • or SQLite 3, which is the command line version of the same.

  • We haven't actually done anything intellectually interesting by querying

  • for data in a programmatic way.

  • I've done it all very manually.

  • But not now.

  • Now let me go ahead and do this.

  • Let me say, go ahead and store in a variable called Rows.

  • All of the rows in the table that are the result of executing,

  • select star from album.

  • So I'm writing Python code, but inside of my Python code I have a string,

  • and that string has SQL in it.

  • So what do I now want to implement this pseudo code?

  • So that's what I've just done.

  • That's my comment.

  • Now I want to iterate over each album in the database.

  • How can I do this?

  • Well, let me go ahead and do that.

  • So for each album in database, so for row in rows, old syntax

  • by now, what do I want to do?

  • I want to just go ahead and print out row--

  • what were the columns in my album table?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, it was album ID and?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: I think it was title in the album one.

  • So I'm gonna do this.

  • Quote unquote-- I think it was called title.

  • We shall see if I'm remembering correctly.

  • And now let me just put my comment over here, as well.

  • So now we have a pretty short program that just executes a query,

  • and it turns out-- oops, I indented excessively-- this variable Rows

  • contains what's called a result set.

  • It is a list, an array, a list of all of the rows that match that query,

  • which hopefully is all of them.

  • Then this is just Python from a week or two ago

  • for each of the row in that Rows list, print out that row's title.

  • Now what is this exactly?

  • You can think of this as a hash table, or a dictionary.

  • Row is one of the rows.

  • It has a bunch of columns.

  • And the syntax in Python for indexing into those columns

  • is just square bracket notation.

  • This is a dict, or a dictionary object in Python.

  • OK.

  • So let's cross my fingers because I'm not sure

  • if I named everything correctly here and do Python of lecture.py, enter.

  • Whew.

  • It worked.

  • So outputted here is every title from the database.

  • So let me go ahead and just pick one.

  • For instance, a soprano inspired.

  • And just do another proof of concept.

  • You know what?

  • I don't want to get all of them.

  • I want to go ahead and get where title equals.

  • And now I will use quotes to distinguish it.

  • So how many rows should this query now return?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Hopefully just one.

  • So let's see.

  • Let me go ahead and rerun it, lecture.py.

  • And very nice.

  • And now two things printed, though, which

  • we didn't see before because there was a lot of output, but it was there.

  • It turns out that the CS50 library, to help you out, especially with problem

  • set seven, always prints out this debugging information that you

  • can technically disable.

  • It's not part of your program, just kind of sent in addition.

  • This shows you exactly what query was executed,

  • which is gonna be helpful because, when you start using web programming,

  • as we're about to, as well, you can then see in your terminal window

  • what your web app is actually doing to make sure your

  • can diagnose any bugs in your code.

  • But we can be fancier than this.

  • Remember for problem set six, or soon see in problem set six,

  • that if you import sys for a Python program, this gives us access to what,

  • among other things?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, command line arguments.

  • So what if I do this, where title equals something.

  • I want that to come from the command line.

  • So how can I go about doing this?

  • Well, turns out there's one other piece of syntax

  • that you need to know from the CS50 library.

  • The Execute method supports what are called name placeholders, where

  • you can essentially say this, colon t, or colon title,

  • or colon whatever you want.

  • It's just a placeholder.

  • And I'm just gonna call it t for title.

  • And what you can then do is this.

  • You can specify that the value of t shall be,

  • for instance, sys.argv bracket 1, which will

  • be the first word that the human typed in at the command line.

  • And what is Python gonna do?

  • It's gonna plug this into colon t.

  • That's all.

  • It's just a placeholder.

  • And just as an aside, we've seen this before.

  • Let me delete this for just a moment.

  • As of like two minutes ago, the way you might have done this

  • is like this, with a so-called f string or format string, just plug it in.

  • This is dangerous, for reasons we'll see in just a moment.

  • So I'm gonna actually do this instead, the colon approach,

  • and let the CS50 library actually scrub or sanitize my data so

  • that the user can't accidentally trick me into, like, deleting all of my data.

  • So now let me go ahead and do this.

  • Python of lecture.py-- whoops.

  • Python of lecture.py.

  • And now let me go ahead and type in that same title, which was--

  • what was it, a soprano?

  • AUDIENCE: Soprano inspired.

  • DAVID MALAN: A soprano inspired?

  • Inspired?

  • And now I'm using quotes.

  • Why?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, it's one phrase.

  • I want it all in argv1.

  • I don't want it partly in argv1 and 2 and 3.

  • It's just gonna be messy.

  • Enter.

  • Whew.

  • It actually worked.

  • So now I've written an interactive program

  • that takes user input at the command line, plugs it into the SQL string,

  • and actually allows me to get the data that I care about.

  • Now, why did I do this?

  • As a quick aside, if you do it this way, which would have been our technique

  • as of, again, like minutes ago, before seeing this, this technique--

  • and actually, now I need to quote it inside the SQL string,

  • the library otherwise does that for me--

  • this is actually very dangerous.

  • This is violating a principle that henceforth

  • you should be very mindful of.

  • Why might this be dangerous, just plugging in the command line

  • argument to that string?

  • AUDIENCE: I could put malicious quotes, or I could put, like, [INAUDIBLE]..

  • DAVID MALAN: Exactly.

  • You can plug in sort of code that you didn't intend.

  • So let me go back down here.

  • What if a really malicious user didn't just type

  • in the beautiful name of a song, but did something

  • like haha semicolon delete from album to try to delete your whole database.

  • This isn't perfectly right.

  • This won't actually execute properly.

  • Haha is not a keyword.

  • But the semicolon means that's it for this query, the delete from album

  • is a valid query that we saw before, and this is

  • what's called a SQL injection attack.

  • And in fact, this is very common even today,

  • because one of the features you're getting from the CS50 library

  • is sanitization or scrubbing of the user input.

  • And the takeaway for today is that you should never

  • henceforth trust the user's input ever, even if it's you.

  • Get into the habit now of distrusting even yourself,

  • lest you somehow inject data into your database that shouldn't be there.

  • Let me demonstrate this as follows.

  • So if you're logging in, for instance, to a Yale University website,

  • your login form looks like this.

  • Or at Harvard University, it looks like this, Harvard [? key. ?]

  • Now I'm gonna g sort of reveal what a password prompt actually

  • looks like without showing the bullets, which is the norm,

  • but in the context of the web, if a user were

  • to type in seemingly dangerous instructions, maybe this could happen.

  • So suppose that my name is me@exampleprovider.com or whatever,

  • and my password is this funky single quote or quote unquote one equals one.

  • Like, this is not a complete thought.

  • But notice it's miss-- it ends a quote here, and it lacks one here.

  • So it's kind of like this partial thought,

  • because if an adversary out there on the internet

  • is trying to hack into a Harvard or a Yale system or the like,

  • he or she might think what if someone at Harvard or Yale wasn't so bright

  • and did not sanitize users' input, and just trusted whatever it is.

  • Maybe I, the malicious adversary, can try

  • to finish their thought, so to speak, in a malicious way.

  • And what does this mean?

  • Well, suppose that the code that someone at Harvard or Yale had written

  • looks like this.

  • Give me a user name variable and a password variable

  • using syntax that we started seeing last week,

  • and you'll see more of in problem set seven.

  • This just says get me whatever the user passed into the web application,

  • as opposed to from the command line.

  • And then suppose that the Harvard or Yale staffer actually

  • executed select star from users where username equals this f string syntax.

  • What's interesting here is that they are correctly

  • quoting whatever the string is supposed to be for both username and password.

  • But remember what the user was about to type in a moment ago.

  • If I kind of close one quote and then like leave one open,

  • I can kind of inject arbitrary statements in here.

  • And so what a SQL injection attack might look like is this.

  • If I type in exactly those values, me@exampleemailprovider.com,

  • and that funky password, notice if you look past the syntactic weirdness what

  • really this is saying.

  • Where password equals quote unquote, which

  • is kind of silly, if your password is blank, or if one equals one.

  • Now, which of those statements is obviously true?

  • One equals one.

  • And this is arbitrary.

  • I could've said foo equals foo or anything.

  • The point is, I sort of finished one thought with that single quote.

  • Then, I typed whatever I wanted because I'm

  • presuming a little cleverly, but also kind of guessing,

  • because the adversary can't see this code,

  • I just have a hunch that, if I construct the right query,

  • I can trick the database into returning any user because one always equals one.

  • So I'm gonna get back a row which is probably gonna let me into the system.

  • So what the end result is is very bad, unless you sanitize the user's input.

  • So this other approach with which I started,

  • actually using these colon-based conventions, which

  • is very similar to percent [? s ?] in C, but it's

  • just now SQL-specific, or SQLite-specific,

  • this now ensures that when the user types

  • in even the most malicious-looking data, CS50's library, or other libraries

  • out there in the world that do the exact same thing,

  • is gonna make sure that any dangerous characters like apostrophes

  • are escaped with backslash.

  • So long story short, you don't even have to know

  • or care what a SQL injection attack is fundamentally

  • if you just practice good database sanitization techniques, which

  • doesn't sound very cool.

  • But it's very easy to do so by using, in our case, the CS50 library,

  • but in the more general case, actually using a library like it.

  • And so this is kind of a fun meme that kind of goes around the internet,

  • supposedly like some actual person's car.

  • This was meant to do what, do you think?

  • AUDIENCE: Hide the license plate.

  • DAVID MALAN: Well, not just hide the license plate.

  • This is probably one of those cities where you actually

  • have cameras overhead that are using optical character recognition to bill

  • the right person for their license plate, or ticket the right person

  • for their license plate.

  • So this person, correctly or incorrectly, was presuming,

  • hey, maybe, you know, the city of Cambridge

  • has some buggy software that's not sanitizing its inputs like that.

  • And before we look at an actual web example,

  • I would be remiss if I didn't teach a generation of aspiring programmers

  • this comic.

  • [LAUGHTER]

  • So henceforth, if you ever hear about Little Bobby Tables,

  • this is what CS people are actually referring to.

  • So let's actually now use this in a web context.

  • So I'm gonna go ahead and do this.

  • I'm gonna go back to the IDE, close out the command line program

  • that I was writing a moment ago.

  • And let's just fill in this to-do.

  • Let me go ahead and go back to application.py, which

  • doesn't do much of anything just yet.

  • But I'm gonna go ahead here and do the following.

  • I am going to initialize my database.

  • So this gives SQL of SQLite colon slash slash lecture.db, so just like before.

  • And then in my index route, I'm gonna do something like this,

  • rows gets db.execute, which is the only method in there that you need to know.

  • Select star from album.

  • And that's gonna get me all of them.

  • And now I'm gonna pass to this template all of the rows.

  • And I'm gonna call it albums, and I'm gonna pass in all the rows.

  • And this is a way now of handing all of those rows and those titles

  • into my template.

  • If I now go to my template where I have this to-do,

  • I can use some Jinja syntax, which honestly is not

  • going to sink in until you start practicing it some more,

  • but I do know from experience it's for album in albums.

  • Now close brace, and then endfor is gonna

  • be how we end this in my Jinja template.

  • And in here, I'm gonna go ahead and output each time

  • the album's quote unquote title, if I didn't get a syntax error.

  • And so now, let me go ahead into my terminal window

  • and do in my source directory in my lecture example

  • flask run to start the web server.

  • Open up the URL.

  • Whoa.

  • So that's kind of a mess.

  • But it kind of looks like all of the titles, probably.

  • Let me clean this up.

  • And here's where HTML now becomes really useful.

  • In HTML, what's the tag for, like, an unordered list, [INAUDIBLE] list?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, UL for unordered list.

  • So let me go ahead and make an unordered list tag there,

  • but close it after this loop.

  • Let me go ahead and nicely indent everything.

  • And now inside of the loop, let me go ahead and output

  • a list item, recall from a few weeks back.

  • And now it still looks super cryptic, but what is the logic saying?

  • Within the body block of my page, go ahead and start an unordered list.

  • Iterate over each of the albums, and then output

  • its title inside of a list item.

  • So if I go back here and reload, what should I see?

  • A whole list of those album titles.

  • Moreover, I can do a little bit better.

  • Let me go ahead back now here and do something like this.

  • Let me go ahead and say q equals request.args.get q.

  • And q is, again, what Google uses for its queries,

  • but I could call this anything I want.

  • Let me now go ahead and pass in the following.

  • Where title equals colon q, and then let me pass in q equals q.

  • This feels a little silly, but just think about what it's doing.

  • I want to substitute for colon q the actual value of q.

  • And maybe I can be more explicit here.

  • If I call this query, what I'm really doing here is passing that in to there.

  • So now that I've done that, what can I do?

  • This is not a very fancy web application,

  • but suppose I want to search only for what was it, the soprano song before,

  • soprano inspired.

  • Actually, let me search for something simpler.

  • How about just worlds.

  • A la hello world.

  • Let me go to my URL.

  • And remember, if you understand HTTP, we can

  • do question mark q slash question mark q equals worlds, and enter.

  • And dammit.

  • Where title equals worlds q-- where title equals worlds.

  • Why did you not come out?

  • request.args.

  • request.

  • Stand by real quick.

  • We were doing so well there.

  • All right, let me just reload.

  • Oh actually, what a good time--

  • none.

  • OK, none is bad.

  • What am I doing wrong?

  • Did I screw up?

  • OK, hang on.

  • Stand by.

  • What a good time to introduce eprint for just a moment from earlier.

  • Now let me go ahead and reload the page to trigger the load.

  • OK, that's bad.

  • Dammit.

  • eprint is not defined from CS50.

  • Oh.

  • Oh, so close to--

  • yes!

  • OK.

  • [APPLAUSE]

  • In fairness, I have no idea what just happened.

  • But it worked.

  • So let's now, with those inspiring words,

  • give us a sense of what it is you're going

  • to be doing with these ingredients.

  • We focused a lot today on SQL and on database design,

  • but the ultimate goal at hand is going to be to use these ingredients

  • and use this new language to query for real information.

  • So here is what we call CS50 Finance, AKA problem set seven.

  • And you'll be given a little bit of distribution code

  • that essentially gives you a user interface similar to this that just

  • allows users to log in and log out.

  • But it doesn't allow users to register.

  • It doesn't allow users to buy or sell stocks yet, as will be the goal.

  • CS50 Finance is like building your own e-trade website

  • where you can buy and sell stocks.

  • But it's going to integrate thanks to code

  • you write with Yahoo Finance, which happens to provide essentially a very

  • lightweight API, application programming interface, whereby if you ask Yahoo

  • for the current stock price of a certain stock ticker, they will respond

  • and actually give you that answer so you can integrate it into your website.

  • So what you'll ultimately be able to do--

  • and this is simply the staff solution--

  • is build a website like this one here, such that I can go ahead now

  • and log in to CS50 Finance as a John Harvard with his password.

  • I can then see that, by default, all of your customers,

  • wonderfully, for a special deal, are going to get 10,000 virtual dollars.

  • I can go ahead and get a quote for a stock, like GOOG is G-O-O-G.

  • Google as of right now is just over $1,000.

  • That's great, because I have $10,000.

  • So I'm gonna go ahead and buy one share of GOOG by typing in its stock ticker

  • symbol and one.

  • And now I actually have one share of Alphabet AKA Google plus some remaining

  • cash.

  • So all of this and more you will implement.

  • Happy to stick around for questions.

  • That is it for today.

  • And thank you so much for today's parents and families for coming.

  • [APPLAUSE]

DAVID MALAN: All right, this is CS50 and this is lecture 10.

字幕與單字

單字即點即查 點擊單字可以查詢單字解釋

B1 中級 美國腔

CS50 2017--閱讀10--SQL (CS50 2017 - Lecture 10 - SQL)

  • 52 5
    小克 發佈於 2021 年 01 月 14 日
影片單字