CS50 2016 - Week 5 - Data Structures

SPEAKER 1: All right, this is CS50 and this is week five.
And let's take a look at where we left off last time.
You may recall this guy here, Binky from our friends at Stanford.
And we used Binky to start talking about pointers.
What is a pointer?
So, a pointer is just an address, the location
of some piece of data in memory, because recall at the end of the day
your computer just has a few pieces of hardware inside of it, one of which
is RAM or Random Access Memory.
And in RAM you have the ability to store bunches and bunches of bytes,
or kilobytes, or megabytes, or gigabytes,
depending on how much memory you have.
And if you assume that no matter how much RAM you have you
can enumerate the bytes-- this is byte 0, this is byte 1, this is byte 2,
and so forth-- you can give each of the bytes of your computer's memory
an address and those addresses are simply called pointers.
And now in C we have the ability to use pointers
both to go to any location in memory that we want
and even to dynamically allocate memory in case we don't necessarily
know a priori how much memory we might need for a program.
Now, in terms of your computer's RAM, recall
that we divided the world into this picture
here whereby if this rectangular region, arbitrarily, represents your computer's
memory, here is how the computer divvies it up
when you're actually using a program.
At the bottom of your computer's area of memory, you have the so-called stack.
And recall that the stack is where any time you call a function,
it gets a slice of memory-- a frame of memory,
if you will-- for all of its local variables, all of its arguments
and anything else that it might need.
On top of that might go another slice or frame of memory
if that first function calls another.
And if that second function in turn calls another function,
you might have a third frame on the stack.
Of course, this doesn't end well if you keep calling function after function
after function after function.
And so, hopefully you don't accidentally induce some kind of infinite loop
such that these frames pile on top of each other infinitely
many times, because eventually they'll run the risk of hitting the heap.
Now, the heap is the same type of physical memory.
You're just using it in a slightly different way.
The heap is used any time you want to dynamically allocate memory,
when you don't know in advance how many bytes
you need but you do know once the program is running how many you now
want.
You can ask via functions like malloc the operating
system for some number of bytes, and those bytes
are allocated from the heap.
So, those too have addresses or numbers.
And so, the operating system, by way of malloc,
just figures out which of those bytes are not yet
being used so that you can now put whatever piece of data
you have in that particular place.
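
As a rough sketch of that idea, the helper below asks malloc for room for some number of ints that is only known while the program is running. The name make_numbers and the choice to fill the block with 0, 1, 2, and so on are assumptions made for illustration, not something from the lecture.

```c
#include <stdlib.h>

// Hypothetical helper: ask the heap for room for `count` ints, where
// `count` is only known once the program is running.
// Returns NULL if malloc could not provide the memory.
int *make_numbers(size_t count)
{
    int *numbers = malloc(count * sizeof(int));
    if (numbers == NULL)
    {
        return NULL;
    }

    // Fill the block so that every byte we asked for holds a known value.
    for (size_t i = 0; i < count; i++)
    {
        numbers[i] = (int) i;
    }
    return numbers;
}
```

The caller owns the returned block and would eventually hand it back with free.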
Now, beyond that appear things like initialized data,
uninitialized data.
That's where things like global variables that are initialized or not
end up that might be outside of your main function.
And then above that is the so-called text segment,
which are these zeros and ones that actually compose your program.
So when you double click an icon on Windows or Mac OS
to run a program or you type dot slash something in the Linux command line
environment in order to run a program, the bits that compose your program
are loaded also into memory up into this region here.
So, at the end of the day, you have access to just pretty generic memory,
but we use it in these different ways.
And it allows us to ultimately solve problems
that we might not have been able to in the past.
Recall for instance this example here, deliberately shown in red because it
was buggy. This does not work.
Now, logically, it does do the swap that we intend whereby a goes into b and b
goes into a.
And we achieve that result by way of this temporary variable
so that we have a temporary placeholder into which to store one of those values
while doing the swap.
But it had no permanent impact on the two variables that were passed into it.
And that was because by default in C any time you pass arguments to a function,
those arguments are passed so to speak, by value.
You get copies of those values being passed into a function.
And so, if main, for instance, has two variables, x and y--
as they did last time-- and you pass x and y
into a function like this one here swap, x and y
are going to get copied as a and b respectively.
So you might perfectly, logically, correctly swap a and b,
but you're having no permanent impact on x and y themselves.
But what if, per this green version here,
we reimplement swap to be a little more complicated
looking, but at the end of the day actually correct?
Notice now we've declared a and b not to be
integers but to be pointers to integers, the addresses of integers.
And that's what's implied by the star that we're putting
right there before the variable's name.
Meanwhile, inside of the body of this function,
we still have three lines of code.
And we're still using a temporary variable, and that in itself
is not a pointer.
It's just an integer as before, but notice
we're using this star notation again, albeit for a different purpose
to actually dereference these pointers.
Recall that int star a and int star b means give me a variable that
can store the address of an integer.
That's declaring a pointer.
Meanwhile, if you just say star a without declaring something
to the left of it with a data type like int,
you're saying go to the address that is in a.
So if a is an address, star a is whatever is at that address,
which of course per its declaration is going to be an integer.
Similarly, star b means go to the address in b.
Star a means go to the address in a and put the former into the latter,
ultimately putting the value of temp at the address in b-- so absolutely more
complicated at first glance, but if you consider again
the first principles of what's going on here, all we are doing
are moving things around in memory.
And we can do that now because we have the ability
to express the locations, the numeric locations of where
things are in memory.
But nicely enough, we, the programmer, don't have
to care where things are in memory.
We can access things symbolically as we're doing here with a and b.
So even though we might have seen on the screen
or you might see while debugging actual addresses of memory,
rarely does that actually matter in practice.
We can deal with everything we've learned thus far symbolically.
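
The pointer-based swap described here might look like the following sketch. The variable name tmp is an assumption; the lecture only calls it a temporary variable.

```c
// a and b hold the addresses of two ints that belong to the caller,
// so changes made through them persist after the function returns.
void swap(int *a, int *b)
{
    int tmp = *a;   // go to the address in a and save the int found there
    *a = *b;        // copy the int at the address in b to the address in a
    *b = tmp;       // put the saved value at the address in b
}
```

A caller with two ints x and y would write swap(&x, &y), passing their addresses rather than copies of their values.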
Now, last time we also took a look at the world of forensics,
and we took a look at how images are implemented and specifically file
formats like BMP, and JPEG, and GIF, and yet others.
And we glanced into [? Asmila's ?] here as we tried to enhance this image,
but of course, there was only a finite amount of information.
So, what you see is what you get in terms of any kind of glint
or suspect in her eyes.
But we did this in part so that we could also
introduce another feature of C that allows us to declare our own data
types, indeed our own data structures.
For instance, we proposed that if you wanted
to write a program that stores a student,
you could actually declare your own student data type
inside of which is a name and inside of which is a dorm,
and anything else that you might actually want.
Meanwhile, this syntax here gives us a new data type called student
so that if we want to write a program that implements students,
we can actually wrap related information together like name and dorm
without having to maintain a whole bunch of strings
for just names and a whole bunch of strings for just dorms.
We can actually encapsulate things all inside of one structure.
And indeed encapsulation is another principle of computer science
that you'll see throughout programming and throughout the field itself.
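
A minimal sketch of such a student type follows, assuming plain char * strings in place of any library string type the original slide may have used.

```c
// One student: a name and a dorm wrapped together in a single structure.
typedef struct
{
    char *name;
    char *dorm;
}
student;
```

With this in place, a variable s of type student has its fields accessed as s.name and s.dorm.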
So, what do we now do this time?
So, today we introduce more sophisticated ingredients
with which we can solve problems and we revisit a problem from the past
that we thought we had rather knocked off and had solved.
So, this might represent a whole bunch of names,
a whole bunch of numbers, a whole bunch of telephone numbers in a phone book
back to back to back to back stored in this case
in the form of an array, the simplest of data structures, so to speak,
that we've discussed thus far.
And an array, again, is a contiguous block of memory each of whose elements--
typically are of the same data type, integers, or strings, or the like--
and they are by definition back to back to back to back,
which allows you random access.
Which means you can jump to any of these locations
instantly just by using in C that square bracket notation
or as we saw last time using pointer arithmetic,
actually using the star operator and maybe adding some number to
an address to get at some subsequent address.
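
The two notations reach the same element, as this small sketch suggests. Both helper names are hypothetical.

```c
// Read element i with square bracket notation.
int get_with_brackets(const int *numbers, int i)
{
    return numbers[i];
}

// Read the same element with pointer arithmetic: add i to the address
// of the first element, then go to the resulting address.
int get_with_arithmetic(const int *numbers, int i)
{
    return *(numbers + i);
}
```

Either way the jump takes the same amount of time no matter which index is requested, which is what random access means here.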
But it turns out there's a few problems with this fundamental approach.
Nice and as simple as it is, it would seem that we rather paint ourselves
into a corner with this approach.
This array has 1, 2, 3, 4, 5, 6 total elements, at least as depicted here.
So that's fine if you want to insert a number, and then
another number, and then four more numbers.
But what if you want to then insert a seventh number,
not to mention an eighth number or a ninth number or the like?
Well, where do you put them?
Well, you might think, well, that's fine.
I'm just going to go put the seventh number over here,
or the eighth number over here, or the ninth number over there.
But you can't just blindly do that.
If this memory is being managed not by you
per se but by malloc and by the computer itself inside-- and your program,
this memory over here, while it might physically exist,
might be used by some other part of your program altogether.
It doesn't necessarily belong to you unless you've asked for it.
And the problem with an array is that as we've seen it typically
you declare their size in advance, as with the square bracket notation,
and say give me six integers or give me six something or others, but that's it.
You have to decide in advance.
You can't just grow it as you can in some programming languages thereafter.
You've rather painted yourself into a corner.
But with malloc and other functions like we saw last time,
you can actually allocate more memory using malloc.
Unfortunately, it might end up in another location in your computer's
memory, so you might have to do some copying
to take the original six elements and move them
elsewhere just to make room for more.
And there is a function for that, something called realloc
or reallocate.
And indeed it can do exactly that.
It can give you a bigger chunk of memory and reallocate
what was previously there to be a larger size.
But you have to do a little bit of work.
You have to invoke it in order to achieve that.
You can't just blindly keep adding things at the end of this array.
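
Here is a sketch of growing an array with realloc, under the assumption that the array was originally obtained from malloc. The helper name grow is hypothetical.

```c
#include <stdlib.h>

// Hypothetical helper: ask for a block big enough for new_count ints.
// realloc copies the old values over if the block has to move.
// On failure it returns NULL and leaves the original block untouched,
// so the caller should keep the old pointer until success is confirmed.
int *grow(int *numbers, size_t new_count)
{
    return realloc(numbers, new_count * sizeof(int));
}
```

A caller would assign the result to a separate variable first, check it against NULL, and only then treat it as the array's new location.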
Now, unfortunately, while that is a solution, it might not be very efficient.
Even if you can allocate a bigger chunk of memory that's bigger than six
because you have more numbers, for instance, to store,
what if that takes a bit of time?
And indeed it's going to.
If you allocate more integers somewhere else in memory
you still have to copy those original values,
and now it just feels like you're wasting time.
Now, instead of just inserting things into the list,
you might have to copy it into a bigger space, reallocate things, grow.
It's a lot more work.
And all of that discussion of running time and performance
comes back into play, because if that whole copying process and reallocating
is costing you time, your algorithm or your program
ultimately might not really be as fast as you might want it.
So, what could we do instead?
What could we do instead in order to solve
this problem dynamically, so to speak, that being the operative word.
And luckily enough, last week we learned that there is dynamic memory allocation
in C by way of that function malloc.
And we also learned that there are ways of representing structures
in C that you don't necessarily get with the language itself,
because they're not primitives.
They're not built in.
In other words, let me propose this as a solution to our problem.
This is a list of, let's see, five numbers it would seem,
9, 17, 22, 26, and 34.
Pretty arbitrary right now, but you might
imagine very simply drawing those same numbers-- 9, 17, 22, 26,
34-- in the form of an array and they're clearly deliberately sorted.
But again, what if you wanted to grow that array or even shrink that array
dynamically over time?
Well, let me propose that we not draw those numbers back to back to back
to back literally next to each other but allow ourselves potentially
a little bit of space?
But if that's the case and nine is here in my computer's memory and 17 is here
and 22 is here, or over here, or over here-- in other words,
what if I relax the constraint that my numbers or my data types more
generally have to be stored contiguously back to back to back to back in memory
and instead allow them to be anywhere, indeed anywhere a function like malloc
wants to give me more memory, that's fine.
If it wants to give me memory up here in my computer, I'll deal with that.
If it wants to give me extra memory over here, that's fine.
I'll deal with it, because I'll use these conceptual arrows to stitch
together my data structure this time.
And now, where have we seen these kinds of arrows before?
What feature of C allows us to connect one thing
to another where a la Chutes and Ladders get from one place to another?
Well, that's exactly what we saw last time which was pointers.
While we've drawn these here per the snippet from a textbook using arrows,
those are really just pointers.
And what does each of these rectangles represent?
Well, clearly a number in the top half of the rectangle,
but I claim that at the bottom half of these rectangles
let's consider that bottom rectangle to just be another piece of data,
specifically an int star, a pointer.
Or rather not a pointer because it seems to be pointing not just to the number
but to this whole rectangle, so I need some new terminology.
I need some kind of structure to contain an integer and this pointer.
And for that, I think I'm going to need a struct.
And indeed let me propose that to solve this problem we give ourselves
this building block as a new C data type called a node.
You can call it anything you want, but the convention
would be to call something like this in a data structure-- that's
like a puzzle piece or a building block in a data structure
would be called a node.
Let me propose that we define it as follows.
I'm using that same syntax from last time with which we declared a student
data type, but here I'm saying inside of this data structure,
this node shall be an int.
And that's pretty straightforward.
Just like a student might have a name and a dorm,
this node will have an int called n arbitrarily.
And then the only piece of detail that's a little bit new now is
the second line, struct node star next.
Now, what does that mean?
It's pretty verbose, but struct node is just recursively, if you will,
referring to this same type of data structure.
Star means this is going to be a pointer, the address of one such thing,
and next is just an arbitrary but pretty reasonable
name to give to such a pointer.
So this line here, struct node star next,
is the incantation in C with which you declare
one of those arrows that will point from one node, one rectangle
to another node, another rectangle.
And the fact that we have a little bit of additional verbiage
up here, typedef struct node, is because again C
is a language that is read top to bottom, left to right,
so words have to exist before you actually use them.
So, whereas last time when we declared a student,
we didn't actually mention struct student or anything like that.
We just said typedef open curly brace.
Today, when declaring a node, we actually
have to have some additional syntax here just called struct node.
And technically this word could be anything,
but I'll leave it as node for consistency.
And that allows me inside of this definition
to specify that the second data member is
going to be a pointer to exactly that kind of data structure.
But typedef, just to be clear, allows me to type a smaller name for this data
structure here, which I will simply call node at this point.
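
Written out, the definition being described would look roughly like this. The stddef.h include is only there so that NULL is available to code that uses the type.

```c
#include <stddef.h>

// struct node is named up front so that the next field can refer to it;
// typedef then gives the whole thing the shorter name node.
typedef struct node
{
    int n;               // the number stored in this node
    struct node *next;   // address of the next node, or NULL at the end
}
node;
```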
So, what can we actually do with this kind of data structure now?
And, indeed, let's give this data structure a name.
Let's start calling it a linked list.
Previously, we had arrays, but now we have linked lists, both of which
at the end of the day are types of lists, but linked lists,
as the name suggests, are linked or threaded together using pointers.
Now, when you have a linked list, what might be some operations,
some algorithms that you might want to run on them?
Well, if you've got a linked list of say numbers, for the sake of discussion,
you might want to insert a new number into that list.
You might want to delete a number from that list
and you might want to search that list.
And that allows us to then consider how we might implement
each of these kinds of things.
But it turns out, while fairly simple intuitively,
we're going to have to be a little careful now by way of our pointers.
So, let's more formally declare a linked list to look something like this.
It's a collection of nodes that are linked together
with pointers as represented by these arrows here,
but we're going to need some special pointer, at least
at the beginning of the list.
Let's just call it first.
It doesn't necessarily store an actual integer.
It itself first is just a pointer to the start of the list.
And by way of that pointer can we access the first actual node in the list.
From there can we get at the second, from there can we get at the third,
and the fourth, and the fifth, and any number of others.
And this syntax over here might just represent null.
Because you don't want to have that pointer just
pointing off into no man's land, that will have to be a null pointer so
that if we check for that with a condition we know,
OK, we're at the end of the list.
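
To make the role of first and the terminating null pointer concrete, here is a sketch that counts the nodes in a list by following next pointers until it reaches NULL. The helper name length is an assumption.

```c
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Start at the node that first points to and follow each next pointer,
// stopping once the pointer is NULL, which marks the end of the list.
int length(node *first)
{
    int count = 0;
    for (node *ptr = first; ptr != NULL; ptr = ptr->next)
    {
        count++;
    }
    return count;
}
```

An empty list is simply a first pointer that is itself NULL, in which case the loop never runs.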
So, let's pause for just a moment and consider these three algorithms--
insert, delete, and search, and consider what's going to be involved.
Well, how would you go about searching for an element of this list?
Suppose I wanted to find the number 22?
What do you do?
Well, me, I, the human can just look at this and be like all right,
22 is right there.
But a computer can't do that.
A computer every time we've had this discussion
can only look at one thing at a time.
But moreover the computer this time is even more constrained
because it can't just use our old friend binary search
or divide and conquer, because how do you get to the middle of a linked list?
Well, you have to find your way there.
The only thing you have in a linked list from the outset
is one pointer called first or whatever it is, but one pointer that leads you
to the beginning of the list, the first node in the list.
So, if you want to get to the second node in the list,
you can't just go to bracket one, or bracket two, or bracket three
to get any number of other elements in the list.
You have to follow these bread crumbs, if you will.
You have to follow these arrows or these addresses to go from one node's address
to the other to the other.
And so, we've paid a price already.
And we'll see that there is still an advantage here,
but what's the running time?
What's an upper bound on the running time of search for a linked list,
even if it is sorted?
Any thoughts?
Is it constant time like big O of 1?
Is it log of n?
Is it n, n squared?
What's the running time going to be?
Well, they're sorted, and that was this magical ingredient, this assumption
we've been allowed to make in the past which was helpful,
but that assumed that we had random access.
In C, we had square bracket notation, so that using some simple arithmetic
we could jump roughly to the middle, and then the next middle,
and the next middle looking for Mike Smith or whatever element it is we're
looking for.
Unfortunately here, one price we have already
paid by taking this step toward linked lists is linear time.
Big O of n would seem to be the running time of searching a linked list,
because the only way you can start is at the beginning,
and the only way you can get through the list is by following these arrows.
And if there's n nodes in the list, you're
going to need as many as n steps to find, something like 22, or 26, or 34,
or any elements altogether.
Well, that's not all that great.
What about insert?
What's an upper bound on the running time of insert?
Well, here too it depends.
Suppose that we don't care about keeping the list sorted.
That's kind of a nice advantage, so I can be a little lazy here.
So, what's the running time going to be if I
want to insert a new number like the number 50 into this list,
but I don't care about keeping it sorted?
Well, instinctively, where would you put this element?
Where would you put it?
You might be inclined-- you kind of want to put it over here,
because it's the biggest element.
But again, if you don't care about keeping it sorted,
where is the fastest, the quickest and dirtiest place to put it?
I would propose let's just put it at the front of the list.
Let's take this first pointer, point it at the new number
50 that we've somehow added to the picture
as by calling malloc, asking malloc for a new node.
And then have 50, in turn, point to the number 9, and then 9 can point to 17,
and 22, and so forth.
What if we want to insert another number, 42,
and we don't care about where it goes?
Well, why don't we just put it at the beginning of the list?
Then we have the first pointer pointing at 42,
which in turn should point at 50, which in turn can point at 9, then 17, then
22, and so forth.
So, if we're just lazy about this, we can actually
achieve a great running time for insert, constant time.
Unfortunately, if we want to keep things sorted then
we're going to have to incur a linear time cost again, right?
Because if we have to insert 42 or 50, worst case
they might belong all the way at the end of the list and that's
Big O of n steps.
And delete, too, unfortunately, whether it's sorted or unsorted
is also like search going to be Big O of n
because you don't necessarily know when you're searching for a number
to delete if it's going to be at the beginning, the middle, or the end.
So, in the worst case, it might indeed be at the end.
You know what?
Why don't we instead of walking through this verbally,
let's see if we can't get some volunteers?
Can we get seven volunteers to play-- wow, to play the role of numbers here.
1, 2, 3, 4, 5, 6, and yes, 7, come on up.
All right, so I have here some printouts for all seven of you
that represent exactly the nodes that we have here on the screen.
Let's meet one of our first contestants.
What is your name?
AUDIENCE: Scully.
SPEAKER 1: Scully, nice to see you.
So, you shall be literally first and represent our first pointer.
So, if you want to come and stand roughly over here.
And then what is your name?
AUDIENCE: Maria.
SPEAKER 1: Maria, nice to see you.
And you can be the number 9 right next to our first contestant.
And your name?
AUDIENCE: Sarah.
SPEAKER 1: Sarah, nice to see you.
You shall be the number 17.
And your name?
[? AUDIENCE: Satoshi. ?]
[? SPEAKER 1: Satoshi, ?] nice to see you.
You shall be 20.
And your name?
[? AUDIENCE: Mosof. ?]
[? SPEAKER 1: Mosof, ?] nice to see you.
And you shall be 22.
AUDIENCE: Jed.
SPEAKER 1: Jed, nice to see you-- 29, formerly 26.
And your name?
AUDIENCE: Erin.
SPEAKER 1: Erin, nice to see you.
You shall be 34.
All right, so what we have here is seven elements, six of which
are very similar to themselves, one of which is fundamentally different.
So, Scully here represents first, and indeed her sheet of paper
is horizontal to suggest that she is just a node.
She is just going to be the pointer to a node in a list.
Everyone else's nodes are vertical, as have
been the rectangles we've been drawing on the screen,
because each of these guys represents a number as well as a next pointer.
Now of course, you're only seeing in front of you the number.
So we're going to go ahead and if you wouldn't mind,
use your left hand to represent the arrow that we've long had on the screen
to point to the person next to you.
And Erin, you're a bit of an anomaly, but also
because you need to have a null pointer at the end of the list,
so that you're not just pointing aimlessly.
And pointing to the ground seems fine, so literally
pointing to the ground will represent-- will infer as null.
So, Scully, you are the only thing keeping this list together,
so to speak.
So, you too need to point with your one pointer to Maria there.
So, here we have a linked list.
And just to make this clear, could everyone separate from each other
by a step or two in any direction?
Notice that the list is still-- it's almost identical to before.
Can some of you take a step forward, a step back,
but still point at the other person?
So, now we're capturing a little more accurately the fact
that these nodes and these pointers can be anywhere in memory,
so long as they're linking themselves together by way of these pointers.
All right, so suppose now that I want to insert
an element like 55, which happens to belong at the end of this list.
Let me go ahead and malloc Christian if I could.
So we have asked malloc for a chunk of memory equivalent
to the size of one integer and one pointer.
That is going to be represented with this rectangle here.
Nice to see you.
AUDIENCE: Nice to see you.
SPEAKER 1: You shall be 55.
And I'll play the role of the temporary pointer as predecessor pointer
or pointer just using my left or right hand
to try to figure out where Christian belongs.
So, just like you might-- just like we might want to search the list,
inserting is fundamentally rather the same.
The only thing I have access to at the outset of this algorithm
is my first pointer, and it's only by way of Scully
that I can even access the rest of the list.
I cannot jump to the middle, jump to the end.
I can only start at the beginning and literally follow the pointer.
So, let me go ahead and do that.
Let me go ahead and point at whatever Scully
is pointing at, which happens to be Maria, which is the number 9.
55, of course, is bigger than that, and I
do want to keep the list sorted for today's purposes.
So I'm going to very carefully follow the next pointer, 17.
Follow the next pointer, 20.
Follow the next pointer, 22.
Follow the next pointer, 29.
Follow the next pointer, 34.
Follow the next pointer, ah dammit, null.
And so this is why it is important with some of these algorithms
to have a predecessor pointer, a second pointer or really my left hand
so that maybe my left hand can still point at Erin.
My right hand can realize, ah, null, so that I still
have access to the last node in the list so that Christian--
if you could come over here.
I'm going to go ahead and tell Erin quite simply to point at Christian.
Good, and let's just say for students' sake come on over here,
but technically we could have left Christian there and just had
Erin pointing at him.
It's just going to get a little confusing before long,
so we'll just cheat and move you right over here.
But now we have a linked list that has one additional member.
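
The walk just acted out, with one hand trailing behind as a predecessor pointer, could be sketched as follows. The function name insert_sorted, the variable name pred, and the choice to pass the address of first are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Insert n so that the list stays sorted in increasing order.
// first is passed by address because the head itself may need to change.
// Returns false if malloc fails.
bool insert_sorted(node **first, int n)
{
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return false;
    }
    new_node->n = n;

    // pred trails one step behind ptr, so when ptr runs off the end
    // (or reaches a bigger number) we still have the node before it.
    node *pred = NULL;
    node *ptr = *first;
    while (ptr != NULL && ptr->n < n)
    {
        pred = ptr;
        ptr = ptr->next;
    }

    // Point the new node at its successor before relinking anything else.
    new_node->next = ptr;
    if (pred == NULL)
    {
        *first = new_node;      // new node belongs at the front
    }
    else
    {
        pred->next = new_node;  // new node belongs after pred
    }
    return true;
}
```

Inserting 55 into the list above walks all the way to null, at which point pred is the node holding 34 and the new node is linked after it.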
Suppose now that we want to make another insertion-- pardon.
Let me go ahead and propose that we insert say the number 5.
Well, the number 5, of course, belongs at the beginning of the list.
So you know what?
I need to malloc.
Can I malloc Jordan off camera five, perhaps?
So malloc, a very slow return value.
OK, we're going to store your node your n value five here.
His pointer is undefined right now, because he's not actually
pointing at anything.
And so where does he ultimately belong?
Well, he belongs at the start of the list.
So, let me deliberately make a mistake.
Let me go ahead and update Scully to point at Jordan,
thereby putting Jordan effectively at the front of the list.
Unfortunately, whom should Jordan now point at technically?
It should be Maria, but this is code.
The only thing we can do is copy pointers in memory,
and if Scully's left hand is no longer pointing at Maria,
I have literally orphaned the entirety of this list.
I have leaked 1, 2, 3, 4, 5, 6, 7 chunks of memory,
seven nodes, because I got my order of operations out of order.
Indeed, I should have done what-- let's undo, control Z.
And now let me go ahead and do what?
Jordan should point at the exact same thing Scully is pointing at,
which has no downside.
Even though it feels redundant, we've not lost any information.
And now that Jordan is pointing at Maria,
Scully's pointer can be pointed at Jordan.
And now, even though the list looks a little weird,
this is a key feature of the linked list.
These nodes could have been malloc'd from anywhere.
So indeed, even though we initially kept everyone physically sorted
left to right-- and you've all cleaned the list up even since-- that's OK.
The point is that all of these nodes are linked together.
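
The order of operations for inserting at the front might be sketched like this. The name prepend and the pointer-to-pointer parameter are assumptions.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Put n at the front of the list in constant time.
// Returns false if malloc fails.
bool prepend(node **first, int n)
{
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return false;
    }
    new_node->n = n;

    // Step 1: the new node points at whatever first points at.
    new_node->next = *first;

    // Step 2: only now is it safe to repoint first at the new node.
    // Doing these two steps in the other order would orphan the old list.
    *first = new_node;
    return true;
}
```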
So, thank you so much to our volunteers.
You can keep these pieces of paper and later on we'll
have some stress balls for you.
But that's the key idea here behind a linked list.
Thank you.
So, of course, there are some more complicated operations
that we might have to deal with.
For instance, if we want to insert into the middle of the list,
that's going to be a little more of a burden on me, the programmer,
keeping track of where things have to go.
But nicely enough, there's only these three cases--
the beginning of the list, the end of the list, and the middle of the list,
because middle of the list doesn't have to mean literally the middle,
just anywhere that's not the beginning or the end.
Of course, we should be careful to make sure
that we handle the empty list scenario, which
is equivalent to putting something at both the beginning of the list
and the end of the list.
But that would be perhaps a special case we could deal with separately.
Of course, there are other operations like inserting-- or rather removing
from the tail of the list, removing from the head of the list,
and removing in the middle.
And that would be the opposite of malloc, if you will.
And in those cases, we have to take care to call
our friend free to free those bytes of memory,
give them back to the operating system so that we don't leak memory.
But there, too, I'm probably going to have to be careful as to what order
I change my pointers and free nodes.
Because what you don't want to do, and what unfortunately you
might very well accidentally do at some point,
is free a pointer and then try to access that pointer or change the pointer,
even after you've told the operating system I'm done with this address.
That can give you what's called a segmentation
fault, which is just one of the ways in which you
can deduce that kind of mistake.
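
One way to respect that ordering when removing a node is sketched below, assuming every node in the list came from malloc. The name remove_value is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Remove the first node holding n, if any. Returns true if one was removed.
bool remove_value(node **first, int n)
{
    node *pred = NULL;
    node *ptr = *first;
    while (ptr != NULL && ptr->n != n)
    {
        pred = ptr;
        ptr = ptr->next;
    }
    if (ptr == NULL)
    {
        return false;   // n is not in the list
    }

    // Relink around the node first, while ptr->next is still safe to read.
    if (pred == NULL)
    {
        *first = ptr->next;
    }
    else
    {
        pred->next = ptr->next;
    }

    // Only after nothing points at the node do we give its bytes back.
    free(ptr);
    return true;
}
```

Calling free before reading ptr->next would be exactly the mistake described above, touching memory after it has been given back.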
So, let's actually implement one of these methods.
And we'll pluck off one that allows us to actually take
a look at the syntax with which we can manipulate pointers.
And let's go ahead and implement a function
called search, for instance, where search
I propose just returns a bool, true or false, indicating whether this number n
is in the given list.
And now, why have I said node star list?
Well, at the end of the day, a linked list is just a whole bunch of nodes.
But the first of those nodes that we keep calling first is of what
data type?
If you have a pointer, a variable, that's
pointing to a linked list, that means it's storing the address of a node,
otherwise known as a node star.
So, this would be the syntax with which you can pass to a function something
like a linked list.
You simply have to pass it a pointer to the first element in that list.
And if I want to go ahead now and implement this,
let me go ahead and propose the following.
Let me go ahead here and give myself a temporary value, so node star pointer
we'll call it, PTR.
And that's going to equal the start of the list.
So, I'm just creating another box of memory
and I'm storing inside of it the same address
that I was passed in, just so that I have a temporary variable that I
can use to update.
After this, let me go ahead and say while that pointer is not
equal to null-- because recall that null is this special sentinel value that
means end of the list.
So inside of this loop, what do I want to do?
I'm going to go ahead and say if pointer--
and now I have to get at the number inside of it.
So, if I recall from the last time, we only spent a little bit of time
on the student example, but we said something like student
dot name or student dot dorm.
And in this case I'm inclined to say pointer dot n, where n is
the number, the integer that's inside.
But pointer this time is not a struct, per se.
It's the address of a node.
It's the address of a struct.
And so, perhaps the most intuitive piece of syntax in C,
at least retrospectively now, is that if you
want to access a piece of data that's inside of a node
and you have a pointer to that node much like our arrows in the pictures imply,
you literally draw an arrow using a hyphen
and then using a right angle bracket.
So, now if we do see-- whoops, let me finish my thought.
If pointer arrow n equals equals the n we're looking for, let me go ahead in here
and say return true.
Or else, let me go ahead and not return false,
because I don't want to just check one element and then blindly say false.
I instead want to say pointer should get pointer arrow next.
And then only after that loop is all complete
should I say something like nope, return false.
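Written out, the function just dictated might look like the following sketch. The node definition is assumed to match the one used for the linked list earlier in the lecture (an int n and a next pointer); the variable is spelled ptr here, following the "PTR" mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

// Assumed node layout: one integer plus a pointer to the next node.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Returns true if n is somewhere in the list, false otherwise.
bool search(int n, node *list)
{
    node *ptr = list;      // temporary variable, so list itself is never changed
    while (ptr != NULL)    // NULL marks the end of the list (or an empty list)
    {
        if (ptr->n == n)   // follow the arrow and look at the n field
        {
            return true;
        }
        ptr = ptr->next;   // move on to the next node
    }
    return false;          // fell off the end without finding n
}
```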
So, what's actually going on here?
The function declaration, again, took in two arguments--
one, an int n that we're looking for, two a pointer to a node,
otherwise known as a node in a linked list.
And per the pictures we've been drawing, you
can access any other element in that linked list by way of the first element
in that list, as suggested here.
So, now I'm just giving myself a temporary variable called pointer,
but I can call it anything I want.
And I'm declaring it as node star, so that it can store the address of a node
as well, and then I'm just initializing it to be
the exact value that was passed in.
So, I don't want to accidentally break the list that I was passed.
I don't want to change the value of my parameter
unnecessarily and complicate things.
I just really want a temporary variable, much
like i in the world of loops, that allows me to constantly iterate
through something and update it as I go, while all along the way
I want to be checking this.
While pointer is not null.
If pointer is null, that means I'm at the end of the list,
or maybe more curiously, I was passed null, in which case there is no list.
And that's a valid scenario that could happen, even though it's a bit strange.
But if pointer is null, I don't want to proceed further inside of this loop.
But so long as pointer is not null, let me go ahead and do this.
Let me follow that pointer and go inside that node and say
is your n value equal equal to the n
value that I've been asked to search for?
And if so, return true.
I don't want to just return false now because otherwise I'd
only ever be checking the first element in a linked list.
So, I now want to do the equivalent in spirit of i plus plus.
But I'm not using i's.
I don't need to use pointer arithmetic here,
and indeed it won't work because I have this thing stitched together.
It's not an array of contiguous memory.
I want to say that my current temporary value
pointer should get whatever pointer arrow next
is, and then let this loop continue.
If it's not null, check again.
Check if its n value equals what I'm looking for, and repeat, repeat,
repeat until pointer equals null.
So, let's make this more concrete.
Let me go ahead and just draw a temporary picture here.
And let me suppose here that what I have been passed
is something like the following.
Let's do a very simple linked list that has maybe the number one,
and has the number two, and has the number three.
And, again, I've drawn gaps between these nodes,
because they could be anywhere in memory.
So, technically they don't need to be left to right like this,
but that'll keep us sane.
And if this is indeed a correct linked list,
there are pointers in each of those fields
that point to the next node in the list, and that slash
I drew in the last one just means null.
You can draw it however you want.
But for a linked list to work, we need to know the beginning of this thing.
So, we'll call this first, and that of course
has to point to the first element in my linked list.
So, here's the state of our world.
It's a linked list quite similar in spirit to the other one.
It's a little shorter, just for the sake of discussion.
And now, let's consider my function search,
which again takes two arguments.
So, that the first argument is of type int called n.
And suppose I'm searching for the number three.
The second argument is a node star, so the address of a node called list.
So, what does that mean?
When this function search is called, let's
suppose that we currently have the value n, which
is going to be 3, because that's arbitrarily
the number I've decided to search for.
And then this other value pointer is going to be initialized-- sorry,
not that.
List is going to be whatever pointer is passed
in as the second argument to this function.
So, let's suppose that this linked list, this sample linked list at the top,
is indeed what is passed in to this function.
So, I've passed in 3, because I want to search for 3.
So what goes here?
If I pass this sample linked list into my search function,
what is the value of list?
Well, list, if I'm passed this first pointer,
is really going to be the pointer to the first element in the list.
That's all we're saying.
Node star list just means give me the address
of a linked list, which means give me the address of the first node
in the linked list, which means that initially when
I call search my picture-- my stack frame, if you will,
in terms of my local arguments-- is going to look like this.
All right, so with that said, how does this code work?
Recall that in this code we have this while loop that
just sits in a loop checking whether the current node's
n equals equals the one we're looking for, and if not, it updates it.
So, we need one more local variable called pointer that's
initialized to the start of this list.
So, this will be a pointer and it's initialized to the same thing
that my second argument is initialized to.
So, this now is the state of our world once one line of code has executed,
that very first one in there.
So, now let's implement the loop.
While pointer does not equal null, so here's pointer.
Does it equal null?
No, because if it did, we would just draw
a slash or some other piece of syntax.
But it's pointing at clearly something that exists.
So, this node here has some valid address.
Pointer is pointing at it, so it's not null.
So, what do I do inside of my code?
I check if the n value inside of pointer, PTR,
equals equals the number I'm looking for,
and if so, return true, otherwise, if not I update pointer.
So let's check.
Let's follow the arrow, PTR, and look at the value n.
Recall that the top of these boxes is n.
The bottom of them is called next-- n, next, n next.
So, I followed this pointer.
I'm looking at the box called n.
Does 3 equal 1?
No, obviously not.
So I update-- pointer gets pointer next.
So, to be clear, pointer gets pointer next,
that second to last line of actual code.
So, what does that mean I need to do?
That means I need to update pointer to be equal to pointer next.
What is pointer next?
Well, here's pointer and here's next.
We were looking a moment ago at n.
Now I'm looking at next.
So, pointer next means that I should update whatever is inside this box--
and a lot more on the screen-- to be equal to pointer next, which
is this field.
This field is pointing at that, so that line of code
has the effect of updating PTR to simply point at the second node.
So, what happens next?
I seem to still be in that loop and I say, well, pointer does not equal null,
and it doesn't.
It's pointing at that second node.
If pointer arrow n equals equals n, but no that's not
the case, because I'm looking for three.
I'm pointing at two, so that is again false.
So, again, I don't return true.
I instead update pointer to equal pointer next.
So, what has to happen here, at the risk of deleting my handiwork again,
now pointer gets pointer next, which is this element, which
is equivalent to pointing at this node here.
And so, now I'm still inside that loop while pointer-- not equal to null.
It's not null.
If pointer arrow n equals equals n, well, let's follow that logic.
If pointer, follow the arrow, n equals equals n,
three-- which is the one I'm looking for-- return true.
And so, how then does this function ultimately behave?
It would seem in this case to return true, because I have eventually
found that number three.
What would happen by contrast if I were looking not for three, but for four
with this code?
In other words, what if I'm not looking for three and I want to go one
step further?
Well, one step further is going to update PTR
to equal null, that slash in my last node.
And that means code wise, I'm going to break out
of that loop, because pointer now does equal null.
And so, by default that very last line of code return false, not found.
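The walk-through above can be reproduced with a variant of the same loop that prints each node as it visits it. This is only an illustrative sketch: search_trace is a hypothetical name, and the node layout (int n, next pointer) is assumed to match the one drawn on the board.

```c
#include <stdbool.h>
#include <stdio.h>

// Assumed node layout: one integer plus a pointer to the next node.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Same logic as the search loop, but narrates every step it takes.
bool search_trace(int n, node *list)
{
    node *ptr = list;
    while (ptr != NULL)
    {
        printf("visiting node with n = %i\n", ptr->n);
        if (ptr->n == n)
        {
            printf("found %i, returning true\n", n);
            return true;
        }
        ptr = ptr->next; // follow the next field to the following node
    }
    printf("ptr is NULL, %i not found, returning false\n", n);
    return false;
}
```

On the sample list 1, 2, 3, a call searching for 3 visits the nodes 1, 2, and 3 and then returns true, while a call searching for 4 visits the same three nodes, reaches NULL, and returns false.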
So, complicated at first glance, and it certainly
looks more complicated than things we've written before,
but again, if you go back to basics, what does each of these lines mean?
Consider that there's no magic here.
This first line means give me a variable that's a pointer to a node.
It'll be, in other words, the address of a node
and assign it whatever I was passed in.
While pointer not equals null, we've seen null before.
It's this special zero value, and I'm just
making sure that the pointer I'm using, PTR, does not equal that special value.
And then inside of this loop I'm using one piece of new syntax, this arrow
notation, which just like the picture suggests means go there, and then
look at the field called n and check if it equals the n you're looking for,
and if so return true.
Otherwise, update yourself much like i plus plus
but specifically update pointer to be whatever the value is when you
follow the arrow in that next field.
So, this of course is just search.
We've not actually changed the list, but imagine, if you will,
that you could now implement insert and delete,
not simply by following these pointers but actually changing
the value of next in a node to the left, a node to the right, or a new node all
together.
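As one hedged illustration of changing next fields rather than just following them, here is a sketch of inserting into a list that is kept sorted. The name insert_sorted and its signature are assumptions, not the lecture's code; it takes the address of the first pointer so that it can update that pointer when the new node belongs at the front.

```c
#include <stdbool.h>
#include <stdlib.h>

// Assumed node layout: one integer plus a pointer to the next node.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Inserts n into an ascending list. Returns false if malloc fails.
bool insert_sorted(node **first, int n)
{
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return false;
    }
    new_node->n = n;

    // Case 1: empty list, or n belongs before the current first node.
    if (*first == NULL || n < (*first)->n)
    {
        new_node->next = *first; // point the new node at the old first node
        *first = new_node;       // then make it the new first node
        return true;
    }

    // Case 2: walk to the node after which n belongs (middle or tail).
    node *ptr = *first;
    while (ptr->next != NULL && ptr->next->n < n)
    {
        ptr = ptr->next;
    }
    new_node->next = ptr->next; // link the new node to the rest first...
    ptr->next = new_node;       // ...then redirect the node to its left
    return true;
}
```

In both cases the new node's next field is set before any existing pointer is redirected, so the rest of the list is never orphaned.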
So, who cares?
Why did we add all of this complexity?
We had arrays, which were working really well for a whole bunch of weeks,
and now we've claimed that arrays are not so good.
We want to use linked lists instead.
But why might we want to use linked lists?
Well, linked lists give us dynamism.
We can call malloc and give ourselves more, and more, and more nodes
and grow our list of numbers, even if we don't know in advance how
many such numbers we need.
And we can shrink them, similarly, so we don't
have to allocate a massive array unnecessarily.
We can shrink our data structure based on how many numbers we actually need.
But we're paying a price.
Search is a little bit slower, delete is a little bit slower.
Insert would be slower if we insist on keeping things sorted,
so we've paid this price.
And indeed, this is thematic.
In CS50, in computer science more generally,
there's often going to be these trade-offs of time,
or space, or just complexity, or your human time.
Any number of resources can be in scarce supply.
And, indeed, we've seen by way of linked lists
that we're solving one problem while introducing another.
It's like that silly situation you might see
memes of where you cover your hands-- put your hand around a hose that
has a leak and all of a sudden another leak springs up over there.
We're just moving the problem elsewhere, but maybe that leak
is less problematic to us than this one here.
So, again, it's this theme of trade-offs.
Now, this here is Mather Dining Hall.
And this, of course, is a whole bunch of trays
where you might go and get some food, lunch, breakfast,
or dinner, or the like, and you pick up this tray and you put food on it.
But what's interesting about trays, as Mather and a lot of cafeterias do,
is trays are stacked one on top of the other.
And it turns out now that we have this second building
block with which to create data structures-- we're not just
using arrays anymore.
We now have pointers and this general idea of linking nodes together
in our toolkit, we can now start to imagine more interesting data
structures that solve problems in slightly different ways.
For instance, suppose that I wanted to implement
this paradigm of stacking things one on top of the other like this.
Indeed, this is a data structure called a stack,
and it generally has two operations associated with it,
push to push something on the stack and pop to take something off of the stack.
And this is perhaps a useful data structure
if you just want to store numbers or something else in really just
the most efficient way for you without regard really to fairness.
So, for instance, if this table here is my initial data structure
and it's empty and I have a piece of information that I want to store,
I'm just going to go ahead and put it right there.
And now suppose I want to push another number onto the stack.
I'm just going to go ahead and simply put it
on top, third number on top, fourth number on top.
But I've now committed to a certain property, if you will,
a certain property whereby the last tray in has to be the first tray out,
otherwise known in computer science as LIFO-- last in,
first out-- because if I want to get that first number, I mean,
I've created a mess for myself.
I have to lift these all up or move them just to get at it.
That just seems stupid.
Intuitively, the easiest thing to grab is probably going to be the top,
but that's not necessarily the first element I put in.
But that's OK.
This might still be a valid data structure--
and indeed later in the term when we introduce web programming
and we look at languages like HTML, there's
actually a number of applications where it's actually super useful
to have a data structure where you can just stack stuff on top of each other
in order to tuck some data away for subsequent use.
And, indeed, when we talked about memory in our own computer,
stacks clearly have some value.
We talked about main and then swap, and then maybe other functions.
There are many contexts, one of which we've seen already, where in life you
actually want to stack things on top of each other
so as to keep track of really what did I do most recently, because that's
the next problem I'm going to deal with or the next frame--
in the case of memory-- that I'm going to pop off of the stack.
So, how might we implement this data structure?
Let me propose that if we want to define our own data type
called the stack that implements that idea of cafeteria trays or frames
in memory, let me go ahead and typedef a structure called stack inside
of which are two data members.
Suppose that, for the sake of discussion,
I'm not going to try to store trays or something that doesn't really
exist computationally but rather numbers, just integers.
Inside of this structure called a stack is
going to be an array called numbers of type int
and it's going to store up to capacity of them.
So capacity, let's assume, is hash defined elsewhere to be some constant.
So, maybe it's 10, maybe it's 1,000, but it's some constant integer elsewhere
that limits, ultimately, the capacity of the stack to some integral value.
And then size-- this seems weird.
I have capacity here and then size here.
Well, there's the semantic distinction here.
Just because you have a stack ready to go,
as I did a moment ago-- just because I had this stack ready to go,
empty initially, it's going to have some capacity.
Realistically, I can only go as high as the ceiling
or until the things fall over.
But there's also a size here, and the size is currently zero,
but the capacity might be like 1,000 or whatever.
So, that's the difference there and the size now is 4
and the capacity is like 1,000 minus 4 at this point--
or rather, capacity is still 1,000, because that's the total possible size,
not the actual size.
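A sketch of the fixed-capacity stack just described, with numbers, size, and a hash-defined capacity. The value 10 for CAPACITY and the exact signatures of push and pop are assumptions made for illustration; only the operation names come from the lecture.

```c
#include <stdbool.h>

#define CAPACITY 10 // assumed value; any constant integer would do

typedef struct
{
    int numbers[CAPACITY]; // room for at most CAPACITY integers
    int size;              // how many are actually stored right now
}
stack;

// Puts n on top of the stack. Returns false if the stack is full.
bool push(stack *s, int n)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->numbers[s->size] = n;
    s->size++;
    return true;
}

// Takes the top number off the stack and stores it in *n.
// Returns false if the stack is empty.
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *n = s->numbers[s->size];
    return true;
}
```

After four pushes, size is 4 while the capacity is still CAPACITY; only size changes as numbers come and go.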
But what if that's a limit?
What if I don't want to restrict myself to some fixed, finite number of trays
or a fixed number of numbers?
Well, I could instead declare my stack as being a pointer.
Now, this pointer initially has no value,
so let's assume that it's probably going to be initialized to null in my code,
but that too is not going to be useful.
Why would I declare numbers now not to be an array, which
felt very straightforward.
Normally we draw arrays left to right, but with the trays,
just imagine turning an array vertically and thinking of the numbers stacked
on top of each other.
But this is just a pointer to a number.
Why might I want to implement a stack as just a pointer to an int?
That seems wrong.
I want lots of numbers, not one number.
So, what could I do?
Well, what if in this world to implement a stack I invoke our friend malloc
and I say to malloc, malloc, give me enough memory
for 2,000 numbers or 5,000 numbers.
What is malloc going to return?
Well, by definition, we know malloc is going
to return the address of a chunk of memory, and that chunk of memory
is going to be of whatever size I ask malloc for,
and the address of the first is really just equivalent to the address
of one integer.
And so long as I, the programmer, remember that I asked malloc
for 2,000 integers or for 5,000 integers,
I know implicitly the end of that chunk of memory and malloc
just needs to tell me the beginning.
So, it's perfectly fine to implement the stack by way of a single pointer,
because all I need to know is, hey, malloc,
where should I put my first integer?
Because I know via pointer arithmetic, per last week,
that I can put my next integer four bytes later,
four bytes later, four bytes later.
And I'm deliberately going up this time, but it really
is just an array where you can think of the array as left and right.
So, this would be a way of giving ourselves a data structure called
a stack that is not fixed from the outset like this previous version
to some specific capacity.
Now, we are limited only by how much physical memory or virtual memory
my computer actually has.
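A sketch of the pointer-based version, under a few assumptions: the struct carries an extra capacity field so the program itself remembers how many integers it asked malloc for, and the helper names stack_init and stack_free are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    int *numbers;  // address of the first integer in the malloc'd chunk
    int size;      // how many integers are currently stored
    int capacity;  // how many integers we asked malloc for (assumed field)
}
stack;

// Asks malloc for room for capacity integers. Returns false on failure.
bool stack_init(stack *s, int capacity)
{
    if (capacity <= 0)
    {
        return false;
    }
    s->numbers = malloc(capacity * sizeof(int));
    if (s->numbers == NULL)
    {
        return false;
    }
    s->size = 0;
    s->capacity = capacity;
    return true;
}

bool push(stack *s, int n)
{
    if (s->size == s->capacity)
    {
        return false; // the chunk we asked for is full
    }
    s->numbers[s->size] = n; // same as *(s->numbers + s->size) = n
    s->size++;
    return true;
}

bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *n = s->numbers[s->size];
    return true;
}

// Gives the chunk back so that memory is not leaked.
void stack_free(stack *s)
{
    free(s->numbers);
    s->numbers = NULL;
    s->size = 0;
    s->capacity = 0;
}
```

The capacity is now chosen at run time, for example stack_init(&s, 2000), rather than fixed when the program is compiled.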
So, suppose Apple or someone similar implemented the lines
outside their stores for the release of the iPhone as a stack.
So, it's weird maybe to think of people stacking on top of each other,
but maybe you could imagine Apple funneling everyone into the glass store
here in Manhattan, and then whoever is the last one in gets their phone first.
Because why?
They're closest to the exit.
So, you have all these people show up super early in the morning or days
before, you pile them all into the store saying everyone, hey,
please go into the corner there.
Please get into the store.
And then as soon as 9:00 AM rolls around and it's
time to give out the iPhones, just for logistical convenience you realize,
all right, why don't we just give the person who came in
last their phone first because they're closest to the exit
and get them out, last in, first out?
Good design, bad design?
It's correct in so far as everyone's going
to get an iPhone if supply is there, and that's never going to be the case.
So, it's not necessarily very equitable or fair,
and indeed the humans are not going to be very pleased with Apple
if they used a LIFO data structure or a stack.
What would these fans of Apple hardware prefer that Apple use?
We call it a line.
If you go to the UK, they call it a queue,
which is actually a perfect answer, because there's this other data
structure in the world called a queue, which
is exactly what you would hope the Apple store line would be,
a line whereby it's first in, first out.
So, the first person there three days before,
at 5:00 AM gets his or her phone first, and the one person
who comes in at 9:01 AM doesn't get their phone
because they're at the last position in the queue or the list.
And a queue, nicely enough, might just have at least two operations--
enqueue and dequeue whereby enqueue means get
into line and dequeue means get out of the line,
but these happen at different places.
For instance, if there's a whole bunch of people lined up here on the stage,
closest over there's the front of the list.
I get here last.
I enqueue myself at the end of this data structure,
but you dequeue someone from the beginning of the list.
By contrast, when we had a stack, when you push someone onto the stack,
you pop it off, or him or her off first by nature
of it being a LIFO data structure.
So, how might we implement a queue?
It's actually slightly more complicated, 50% more
pieces of information you need to keep track of, the front of the list.
But you can still do it in an array.
So, suppose that we do use an array, and let me go ahead and draw this
as follows.
Suppose that like hopscotch we draw the queue for an Apple Store
like an array like this.
And here is the door of the Apple store, so you want to be at location zero,
ideally.
1, 2, 3, 4, 5, 6-- so this is how many people
can fit into our queue in this case.
So, suppose that Alice wants to buy an iPhone and she gets to the store first.
Where should she go to keep things fair?
This is the queue, so we don't want to put her into the corner,
so to speak, in our first example.
We want to put her at the front of the list.
So, Alice belongs right there, pretty straightforward.
Now, Bob arrives and he comes in slightly after Alice,
so he gets to get behind Alice in line.
And so Bob is there, and maybe Charlie arrives thereafter, and then so forth.
David maybe comes in fourth and beyond.
So, that's how people would queue up, so to speak.
And now, when it's time for Apple to open this door
and start selling iPhones, what happens?
We want to take Alice out of the list first.
We want to dequeue Alice.
So, we need to start remembering some information,
because it's not sufficient now to just remove-- whoops,
it's not sufficient just to remove Alice like this,
because suppose that we do keep adding other people, person
d, e-- whoops, that's not an f.
OK, don't know what happened there.
Person f is here, g, h.
Suppose that Alice has bought her phone and left the store,
and then Bob does the same.
He goes ahead and leaves the store, and then Charlie leaves the store.
Where do I put person i who maybe shows up a little late?
It would seem that I want to put them at the end of the queue, which
makes good sense, but right now d, e, f, g, and h are still in the queue,
and this is an array I proposed.
My data structure's an array, so I can't just move d to the front of the line
easily.
I have to actually shift him or move him,
and this might conjure up some memory of our searching and sorting examples
where when we had our humans on stage, we actually
had to physically move people like an insertion sort
to make room for those elements.
And that's fine.
We can absolutely say, hey, David, come over here please and person e,
come over here please, and that's obviously what the Apple store does.
But when you're implementing this idea in memory,
you can't just ask the numbers themselves or the elements themselves
to do that moving.
You need to do it for them, and that's going to cost you time,
and that's going to be a price that you have to pay.
But I propose that we can be clever here.
We do not need to incur the cost of moving d, e, f, g, h to where Alice, Bob,
and Charlie previously were.
Where can we put person i instead?
I mean, there's obviously room in the line,
so maybe why don't we just put person i here?
But, again, we don't want to piss off everyone who's already in the line.
So, if this now is an array, we have to be mindful of the fact
that the front of this list has to be remembered separately.
This data member here front really should store not 0 in perpetuity
but really 0, 1, 2, 3.
It should store the current front of the list.
And I need another variable, presumably, called
size that keeps track of how many elements are in the list, which
in this case is going to be six.
So with a queue, if I'm implementing it using an array,
there's some added complexity if I want to avoid
the inefficiency of moving all of these elements
and incurring the kind of running times we saw when we talked about searching
and sorting the other day.
There's no reason mechanically to move d, e, f, g, h anywhere in the array.
We can in constant time, and maybe with our old friend the modulo operator
that you might have used in [INAUDIBLE], we
can just figure out where i, and j, and everyone else
should go so long as we keep track separately
with a queue of what the front of the list would be.
And why is this 3?
Well, if I continue numbering the array like this, as we've often done,
you can now see that d is the head of the list, or the front of the list.
And so, we should remember his location there as 3.
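A sketch of an array-backed queue that tracks front and size and uses the modulo operator to wrap around instead of shifting anyone. The capacity of 8 is an assumption chosen to match the picture (Alice through h fill slots 0 through 7), and the signatures of enqueue and dequeue are illustrative rather than the lecture's own code.

```c
#include <stdbool.h>

#define CAPACITY 8 // assumed: slots 0 through 7, as in the picture

typedef struct
{
    int numbers[CAPACITY];
    int front; // index of whoever is first in line
    int size;  // how many are in line right now
}
queue;

// Adds n at the back of the line. Returns false if the queue is full.
bool enqueue(queue *q, int n)
{
    if (q->size == CAPACITY)
    {
        return false;
    }
    int back = (q->front + q->size) % CAPACITY; // wraps past the end to slot 0
    q->numbers[back] = n;
    q->size++;
    return true;
}

// Removes the number at the front and stores it in *n.
// Returns false if the queue is empty.
bool dequeue(queue *q, int *n)
{
    if (q->size == 0)
    {
        return false;
    }
    *n = q->numbers[q->front];
    q->front = (q->front + 1) % CAPACITY; // the next in line becomes the front
    q->size--;
    return true;
}
```

With eight people enqueued and three dequeued, front is 3 and size is 5; enqueuing one more computes (3 + 5) % 8 = 0, so the newcomer lands in Alice's old slot, size becomes 6, and nobody has to move.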
But, again, what happens if j, k, and then l shows up?
There is no room for l in this world, not to mention m, n, o.
So what if we solved this problem as before
by changing the array from having some fixed capacity to having
no pre-determined capacity, just use a pointer so that we
can use malloc to dynamically allocate a big chunk of memory,
remember its capacity ultimately, but also
remember the front and the size of this data structure?
So, the same idea there might apply.
So, at the end of the day, what have we done?
We've taken these new building blocks, pointers,
and this notion of linking things together using pointers
much like a linked list, and we've looked back
at our data structure called an array.
And using these now, we can implement what are generally called abstract data
types. A queue and a stack do not have as low-level a
meaning as an array does, which is a very technical concept.
And a linked list, this is a very technical concept.
It's a node with pointers linking things together.
A stack is like a stack of cafeteria trays,
or a queue is something like people lining up outside of an Apple Store.
These are abstract data types that can be implemented clearly
underneath the hood in at least a couple of different ways.
You can use an array and keep things simple, a la
weeks two and beyond in the class, but you're
going to paint yourself into a corner by fixing their size,
as I did a moment ago, by declaring this queue
and before it a stack to be of a fixed capacity.
But now that we have pointers, and malloc, and dynamic memory allocation,
and this spirit of linked lists, we can change
that to actually be numbers and actually just
remember where things are underneath the hood.
And nicely enough, a stack and a queue don't even
need me to stitch things together like a linked list.
I just need malloc in order to allocate those.
So, let's tie these two topics together if you would
and compare and contrast them by way of a wonderful animation
that a fellow educator made and posted online
that we've abbreviated here to give us a sense of the difference between stacks
and queues.
[AUDIO PLAYBACK]
[MUSIC PLAYING]
-Once upon a time, there was a guy named Jack.
When it came to making friends, Jack did not have the knack.
So, Jack went to talk to the most popular guy he knew.
He went to Lou and asked what do I do?
Lou saw that his friend was really distressed.
Well, Lou began, just look how you're dressed.
Don't you have any clothes with a different look?
Yes, said, Jack.
I sure do.
Come to my house and I'll show them to you.
So, they went off to Jack's, and Jack showed Lou the box
where he kept all his shirts, and his pants, and his socks.
Lou said, I see you have all your clothes in a pile.
Why don't you wear some others once in a while?
Jack said, well, when I remove clothes and socks,
I wash them and put them away in the box, then comes the next morning
and up I hop.
I go to the box and get my clothes off the top.
Lou quickly realized the problem with Jack.
He kept clothes, CDs, and books in a stack.
When he reached for something to read or to wear,
he chose the top book or underwear.
Then when he was done he would put it right back,
back it would go on top of the stack.
I know the solution, said a triumphant Lou.
You need to learn to start using a queue.
Lou took Jack's clothes and hung them in a closet,
and when he had emptied the box he just tossed it.
Then he said, now, Jack, at the end of the day put your clothes
on the left when you put them away.
Then tomorrow morning when you see the sunshine,
get your clothes on the right, from the end of the line.
Don't you see, said Lou.
It will be so nice.
You'll wear everything once before you wear something twice.
And with everything in queues in his closet and shelf,
Jack started to feel quite sure of himself all thanks
to Lou and his wonderful queue.
[END PLAYBACK]
SPEAKER 1: All right, so let's take a look at another data type,
this one known as a tree.
Because now that we have the ability to stitch data structures together much
like a linked list, we now have the ability
to stitch things together not just left to right or top to bottom conceptually,
but in any number of directions.
And indeed, there's nothing stopping us from having
one node linked to by way of multiple pointers, multiple nodes.
So, for instance, this picture here from a textbook is a tree structure.
And it's very much like the family trees that you might
have drawn in grade school or the like.
But in this case, you have just one root node,
the node at the top of the data structure, so to speak,
from which everything else descends.
And that node is said to have children.
For instance, 2 and 3 are children of the node number 1 here.
And then there's other semantics in this world of trees in computer science.
Much like family trees, anything that does not
have children-- like 5, 6, and 7, or 8 and 9--
would be called leaves of the tree, because like the leaves
at the end of the branches, there is nothing beyond them.
So, nicely enough we borrow a lot of the language
from family trees and actual trees in order
to discuss this data structure known as a tree.
But why in the world would we want to lay out data in a tree structure?
Now we just seem to be doing things because we can,
it would seem at first glance.
Because, for instance, suppose we had these numbers-- 22, 33, 44, 55, 66, 77,
and 88.
They're clearly sorted.
And suppose that I wanted to lay these out in a data structure
and be able to search them efficiently, assuming the whole time
that they are indeed sorted.
Well, if we wanted to do that, we have our old friend arrays from weeks ago.
And we also have our old algorithm from Mike Smith, our binary search
algorithm, divide and conquer.
And we can find nodes in this data structure super,
super fast in logarithmic time, big O of log n.
So, we've solved that problem.
But it turns out we don't necessarily have to use an array laying out data
from left to right because, again, one of the prices we pay for using arrays,
as we've realized today, is this finiteness.
At the end of the day, the size of an array is fixed.
You have to decide in advance how big your array is going to be.
So, what if you want to add more numbers to it?
What if you want to remove numbers for efficiency
and not waste so much memory?
You can't really do that with an array.
You can, but you have to jump through some hoops.
You have to reallocate the array, as with a function like realloc
if you indeed used malloc in the first place to allocate it.
But then you have to copy the old array into the new array,
so it's all possible.
Nothing's impossible once you have a keyboard at your disposal,
but it's a lot of work, and it's more time,
and it's expensive, therefore, both in terms of your time
and the computer's time.
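To make those hoops concrete, here is a small sketch of growing a heap-allocated array by one element with `realloc`. The helper name `append` is hypothetical. Note that `realloc` may have to copy the whole array to a new location, which is the hidden cost being described.

```c
#include <stdlib.h>

// Appends value to a malloc'd array of *size ints, growing it by one slot.
// Returns the (possibly moved) array, or NULL if realloc fails, in which
// case the original array is left untouched and still owned by the caller.
int *append(int *array, size_t *size, int value)
{
    int *bigger = realloc(array, (*size + 1) * sizeof(int));
    if (bigger == NULL)
    {
        return NULL;
    }
    bigger[*size] = value;
    (*size)++;
    return bigger;
}
```

Starting from `int *numbers = NULL; size_t size = 0;`, each call such as `numbers = append(numbers, &size, 22);` grows the array by one, and the caller eventually calls `free(numbers)`. Growing one slot at a time keeps the sketch short, though it may trigger a copy on every call.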
But could we achieve the beauty of divide
and conquer and binary search from week zero without the constraints
that arrays impose?
And today, the solution to all of our array problems
seems to be linked lists or more generally pointers so
that, one, we can dynamically allocate more memory with malloc when we need it
and then use pointers to thread or stitch together
that node with any existing nodes.
So, indeed let me propose this variant on a tree structure
that the world calls binary search trees or BSTs.
Binary in this case means two, and this just means that every node in this tree
is going to have 0, 1, or at most 2 children.
And now, in this case binary search tree means that for every node in the tree
its left child is less than it and its right child is greater than it.
And that's a recursive definition.
You can look at the root of this tree and ask that same question.
55?
Is it greater than its left child?
Yep.
Is it less than its right child?
Yep.
That is the beginning, it would seem, of a binary search tree.
But it's recursive insofar as this is indeed a binary search
tree only if that statement is true,
if those answers are the same, for every other node in the tree.
33, is its left child smaller?
Yep.
Is its right child bigger?
Yep.
How about over here, 77?
Left child smaller?
Yep.
Right child bigger?
Yep, indeed.
How about the leaves of the tree?
Is 22 greater than its left child?
I mean, yeah, there is no child, so yes, that's a fair statement.
It certainly doesn't violate our guiding principle.
Is it less than its right child, if any?
Yes, there just isn't any.
And so this is a binary search tree.
And indeed, if you took a scissors and snipped off any branch of this tree,
you would have another binary search tree, albeit smaller.
But it's recursive and that definition applies to every one of the nodes.
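The recursive definition can be checked in code. One caveat worth stating: the walkthrough above compares each node only with its immediate children, whereas the usual definition is a bit stronger, requiring everything in a node's left subtree to be smaller and everything in its right subtree to be larger. The sketch below enforces the stronger form by carrying bounds down the tree. It assumes a node type holding an int `n` plus `left` and `right` pointers, and the names `within` and `is_bst` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Checks that every value in tree lies strictly between *low and *high,
// where a NULL bound means "no limit on this side".
static bool within(const node *tree, const int *low, const int *high)
{
    if (tree == NULL)
    {
        return true; // an empty tree violates nothing
    }
    if (low != NULL && tree->n <= *low)
    {
        return false;
    }
    if (high != NULL && tree->n >= *high)
    {
        return false;
    }
    // Everything to the left must stay below this node, everything to the
    // right must stay above it.
    return within(tree->left, low, &tree->n) && within(tree->right, &tree->n, high);
}

bool is_bst(const node *tree)
{
    return within(tree, NULL, NULL);
}
```

On the seven-node tree from the slide this returns true, and it also returns true for any subtree you "snip off", which is the point of the recursive definition.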
But what's beautiful here now is that if we implement this binary search
tree, similar in spirit to how we implemented linked lists
using not arrays but using pointers and not one pointer but two pointers
whereby every node in this tree apparently has up to two pointers--
let's call them not next but how about left and right just to be intuitive.
Well, if every node has a left and a right pointer,
now you can conceptually attach yourself to another node over there
and another node over there, and they too can do the same.
So, we have the syntax already with our pointers with which to implement this.
But why would we?
Well, one, if we're using pointers now and not an array,
I can very, very easily allocate more nodes for this tree.
I can insert 99 or 11 really easily, because I just
call malloc like I did before.
I put the number 99 or 11 inside of that node,
and then I start from the root of the tree,
much like I start from the first element in the linked list,
and I just search for its destined location going left or right
based on the size of that value.
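Here is one way that insertion might look, as a sketch rather than the course's own implementation. The names `insert` and `free_tree` are hypothetical, duplicates are simply rejected (a choice the lecture does not address), and the search for the empty spot happens before the call to malloc rather than after, which does not change the result.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Inserts value into the tree rooted at *root. Returns false if the value
// is already present or if malloc fails.
bool insert(node **root, int value)
{
    // Walk down from the root, going left or right by comparing values,
    // until we reach the empty spot where the new node belongs.
    node **spot = root;
    while (*spot != NULL)
    {
        if (value < (*spot)->n)
        {
            spot = &(*spot)->left;
        }
        else if (value > (*spot)->n)
        {
            spot = &(*spot)->right;
        }
        else
        {
            return false; // already in the tree
        }
    }

    node *fresh = malloc(sizeof(node));
    if (fresh == NULL)
    {
        return false;
    }
    fresh->n = value;
    fresh->left = NULL;
    fresh->right = NULL;
    *spot = fresh;
    return true;
}

// Frees every node in the tree, children before parents.
void free_tree(node *tree)
{
    if (tree == NULL)
    {
        return;
    }
    free_tree(tree->left);
    free_tree(tree->right);
    free(tree);
}
```

Starting from `node *root = NULL;`, inserting 55, 33, 77, 22, 44, 66, 88 in that order rebuilds the tree on the slide. Inserting 99 afterward hangs it off the right of 88, and 11 off the left of 22.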
And what's nice, too, here is notice how short the tree is.
This is not a linked list.
It's not a long list, whether vertically or horizontally.
This is very shallow this tree.
And indeed I claim that if we've got n elements in this tree,
the height of this tree it turns out is log of n.
So, the height of this tree is log of n, give or take one or so.
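The "give or take one" can be pinned down for a perfectly full tree. If every level is completely filled and there are h levels, the node count n and the height are related as follows, shown here for the seven-node example:

```latex
n = 1 + 2 + 4 + \dots + 2^{h-1} = 2^{h} - 1
\quad\Longrightarrow\quad
h = \log_2(n + 1),
\qquad
n = 7 \;\Rightarrow\; h = \log_2 8 = 3
```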
But that's compelling, because how do I search this tree?
Suppose I am asked-- I'm trying to answer the question is 44 on my list?
How do I answer that?
Well, we humans can obviously just look back and it's like, yes, 44 is in it.
It's not how a computer works.
We have to start from what we're given, which in this case
is going to be as the arrow suggests a pointer to the tree
itself, a pointer to the first node.
And I look is this the number 44?
Obviously not.
55 is greater than 44, so I'm going to go down to the left child
and ask that same question.
33, is this 44?
Obviously not, but it's less than it so I'm
going to go down to the right child.
Is this 44?
Yes, and simply by looking at three nodes
have I whittled this problem down to my yes no answer.
And indeed, you can think of it again with scissors.
I'm looking at 55 at the beginning of this story.
Is 44 55?
No, 44 is less.
Well, you know what?
I can effectively snip off the right half of that tree,
much like I tore that phone book in week zero, throwing half of the problem
away.
Here I can throw essentially half of the tree away and search only what remains
and then repeat that process again, and again, and again, whittling
the tree down by half every time.
So therein lies our logarithmic running time.
Therein lies the height of the tree, so long
as I am good about keeping the tree balanced.
There's a danger.
Suppose that I go ahead and start building this tree myself in code
and I'm a little sloppy about doing that.
And I go ahead and I insert, for instance, let's say the number 33.
And it's the first node in my tree, so I'm
going to put it right up here at the top.
And now suppose that the next number that just happens
to get inserted into this tree is 44.
Well, where does it go?
Well, it has no children yet, but it is bigger,
so it should probably go over here.
So, yeah, I'll draw 44 there.
Now, suppose that the inputs to this problem
are such that 55 is inserted next.
Where does it go?
All right, 55, it's bigger, so it should go over here.
And then 66 is inserted next.
All right, it goes over here-- never mind that.
So, what's happening to my binary search tree?
Well, first of all, is it a binary search tree?
It is because this node is bigger than its left child,
if any-- there just isn't any-- and it's less than its right child.
How about here, 44?
It's bigger than its left child, if any-- because there is none--
and it's smaller than its right child.
The same thing is true for 55, the same thing is true for 66.
So, this is a binary search tree and yet somehow what does it look like?
It looks like a linked list, right?
It's at a weird angle.
I've been drawing everything horizontally,
but that's a meaningless artistic detail.
It devolves potentially into a linked list.
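A short sketch makes the difference measurable. The hypothetical `height` function below counts levels, so the balanced seven-node tree from earlier has height 3, while the four-node chain 33, 44, 55, 66 just drawn has height 4, one level per node, exactly like a linked list.

```c
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Number of levels in the tree: 0 for an empty tree, 1 for a lone root.
int height(const node *tree)
{
    if (tree == NULL)
    {
        return 0;
    }
    int left = height(tree->left);
    int right = height(tree->right);
    return 1 + (left > right ? left : right);
}
```

Searching costs one step per level, so the chain's search time grows linearly with the number of nodes, while the balanced tree's grows logarithmically.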
And so, binary search trees if they are balanced, so to speak,
if they are built in the right order or built with the right insertion
algorithm such that they do have this balanced height,
this logarithmic height, do afford us the same logarithmic running time
that the phone book example did and our binary search of an array did.
But we have to do a little bit more work in order
to make sure that these trees are balanced.
And we won't go into detail as to the algorithmics
of keeping the tree balanced.
But realize, again, there's going to be this trade-off.
Yes, you can use a binary search tree or trees more generally to store numbers.
Yes, they can allow you to achieve that same binary or logarithmic running time
that we've gotten so used to with arrays,
but they also give us dynamism such that we
can keep adding or even removing nodes.
But, but, but, but it turns out we're going
to have to think a lot harder about how to keep these things balanced.
And indeed, in higher level CS courses, courses
on data structures and algorithms will you
explore concepts along exactly those lines.
How would you go about implementing insert and delete into a tree
so that you do maintain this balance?
And there are yet more variants on these kinds of trees
that you'll encounter accordingly.
But for our purposes, let's consider how you would implement the tree itself
independent of how you might implement those actual algorithms.
Let me propose this type of node.
Again, notice just the very generic term in programming
where it's usually like a container for one or more other things, and this time
those things are an integer-- we'll call it n
but it could be called anything-- and two pointers.
And instead of next, I'm going to just by convention call them left and right.
And as before, notice that I do need to declare struct node up
here or some word up here.
But by convention I'm just going to do typedef struct node, because C reads
things top to bottom, left to right.
So if I want to refer to a node inside of a node,
I need to have that vocabulary, per this first line, even though later on I
just want to call this whole darn thing a node.
And so, that's the distinction.
This actually has the side effect of creating a data
type by two different names.
One is called struct node, and you can literally in your code
write struct node something, struct node something.
It just feels unnecessarily verbose, so typedef
allows you to simplify this as just node, which
refers to the same structure.
But this is necessary for this innermost implementation detail.
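The slide itself is not reproduced in the transcript, but from the description the declaration is presumably close to this reconstruction:

```c
#include <stddef.h>

// "struct node" must be named on the first line so that the two pointer
// fields can refer to it; the trailing "node" is the shorter typedef alias.
typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;
```

After this, `node leaf;` and `struct node leaf;` declare variables of the same type.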
So, now that we have the ability with a structure
to represent this thing, what can we actually do with it?
Well, here is where recursion from a few weeks ago
actually gets really compelling.
When we introduced that sigma example a while ago and talked in the abstract
about recursion, frankly, it's kind of hard to justify it early on, unless you
actually have a problem that lends itself to recursion in a way that
makes sense to use recursion and not just iteration,
loops-- for loops, while loops, do while, and the like.
And here we actually have a perfect incarnation of that.
What does it mean to search a binary search tree?
Well, suppose I'm searching for a number n
and I'm being given a pointer to the root of the tree,
and I'll call it tree.
So, just like when I was searching a linked list,
I'm given two things, the number I'm looking for
and a pointer to the first thing in the data structure-- the first thing
in a linked list or the first thing in a tree.
And in this case, we would call that first thing
in a tree a root, generally speaking.
So, the first thing I had better do in my search function
is check, wait a minute.
If tree equals equals null, don't do anything.
Do not risk touching any pointers, because as you may have gleaned already
or soon will with some of CS50's problems,
you will cause quite probably a segmentation fault,
a memory-related problem in your program such that it just crashes.
It literally says segmentation fault on this screen
if you touch memory that you should not.
And you should not touch null values.
You should not go to null values.
You should not do star of any value that itself might be null.
And so, if tree equals equals null is super, super important here,
because I want to make sure to immediately
say, well, if you hand me null, that's like handing me no tree whatsoever.
So, my answer is obviously false.
N can't be in a non-existent tree.
But we need that condition up top, because the next case
is the following.
Else if n-- the value we're looking for-- is less
than the value of n in this node-- tree, recall,
doesn't refer to the whole thing, per se, in this context.
It refers to the current node that we've been
passed, which at the beginning of the story is the root of the tree.
So, if the number n in the root of the tree is greater than the number
we're looking for, we want to go to the left.
Else we want to go to the right and search the right subtree.
So, what's the syntax here?
If the n we're looking for, like 44, is less
than the value at the current node, 55, then what do we want to do?
We want to call search, still searching for the same number n
but searching on the left subtree.
And how do you pass in a pointer to the left tree?
Well, you have in tree a pointer to the root node.
Tree arrow left just means go to my left child and pass that value in instead,
pass its address in instead.
Meanwhile, if the number you're looking for
is actually greater than the value in the current node, search
the right subtree, else return true.
Because if the tree is not null-- if there is actually a tree and the number
you're looking for is not less than the current node
and it's not greater than the current node, it must be the current node,
so you can return true.
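Putting the four cases just described together gives something like the following reconstruction of the function being read aloud. The parameter order (number first, then tree) follows the spoken description; the node type is repeated so the sketch stands alone.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Returns true if n is somewhere in the binary search tree rooted at tree.
bool search(int n, node *tree)
{
    if (tree == NULL)
    {
        return false; // no tree at all, so n cannot be in it
    }
    else if (n < tree->n)
    {
        return search(n, tree->left); // ask the left subtree
    }
    else if (n > tree->n)
    {
        return search(n, tree->right); // ask the right subtree
    }
    else
    {
        return true; // neither less nor greater, so this is the node
    }
}
```

Because the NULL check comes first, the later `tree->n` accesses never dereference a null pointer.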
But there's one important detail here.
I didn't just call search.
I called return search in each of these two middle cases.
Why is that?
Well, this is where recursion gets potentially a little mind bending.
Recursion is the act of a function calling itself.
Now, in and of itself, that sounds bad, because if a function calls itself,
why wouldn't it call itself again, and again, and again, and again, and again,
and just do this infinitely many times such
that now you get a stack overflow where all of those frames on the stack
hit the heap and bad things happen.
But no, recursion works beautifully so long as every time you recurse,
every time a function calls itself it takes a smaller bite of the problem.
Or rather, put another way, it throws away
half of the problem, as in this case, and looks only at a remaining half.
Because if you keep shrinking, shrinking, shrinking, shrinking
the problem, you will eventually hit this base case
where either there is no more tree or you're looking right at the node
that you want to find.
And so, by returning search and tree left,
or returning search and tree right, you're deferring the answer.
When you, the search function, are called and asked
is the number 44 in this tree, you might not
know because the node you're looking at at the beginning of the story
was again 55.
But you know who does know?
I bet my left child will know the answer to that if I just
ask it by passing it-- passing to search a pointer to it, my left child,
and passing in that same number 44.
So, saying return search is like saying I don't know.
Ask my left child.
Or I don't know, ask my right child and let me return as my answer
whatever my child's answer is instead.
So, you could do this same function using iteration.
But you could solve it arguably much more elegantly here using recursion,
because a data structure like this-- like a binary search tree,
which again is recursively defined-- each node is conceptually
identical, if numerically different from the others,
allows us to apply this algorithm, this recursive algorithm
to that particular data structure.
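For comparison, here is a sketch of the iterative version mentioned in passing. The name `search_iterative` is hypothetical. It makes the same three-way comparison but replaces each recursive call with a step down the tree inside a loop.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Same answer as the recursive search, but walking down with a loop.
bool search_iterative(int n, const node *tree)
{
    while (tree != NULL)
    {
        if (n < tree->n)
        {
            tree = tree->left;
        }
        else if (n > tree->n)
        {
            tree = tree->right;
        }
        else
        {
            return true;
        }
    }
    return false; // fell off the bottom of the tree
}
```

The loop uses no extra stack frames, while the recursive form mirrors the recursive definition of the data structure more directly.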
Now, let's look at a more concrete incarnation
of trees that allows us to do something pretty neat and pretty real world.
Indeed, this is another problem borne of a real world domain of compression.
We talked a couple weeks ago about encryption,
the art of concealing or scrambling information.
Compression, meanwhile, is the art of taking something that's this big
and compressing it to make it smaller, ideally without losing any information.
It's pretty easy to take a 10 page essay that's
maybe-- that was supposed to be a five page essay
and just remove paragraphs from it or remove sentences from it.
But that changes the meaning of the paper, makes it a worse paper,
even though you're compressing it by making it smaller.
No, what most students would typically do, if you've written 10 pages
and it needs to fit into five, you really, really, really
shrink the font size or increase the margins.
Or maybe more realistically you write a five page paper that's
supposed to be a 10 page paper, and so you increase the font size
or increase the margins so as to expand or decompress the essay.
So, similarly here, what if we wanted to compress text,
but we want to do it losslessly in a way that we
don't lose any information by just throwing away
characters, or paragraphs, or pages, but we
want to use the system with which we're familiar from week zero.
So ASCII, again, is just this code, this mapping of letters to numbers.
And so, A is-- capital A is 65 and that's some pattern of bits,
but it's some pattern of 8 bits-- 7 historically,
but really 8 bits in practice. So every one
of the characters in the English alphabet, at least here,
takes up 8 bits.
Now, that sounds fine.
That allows us to express as many as 256 possible characters, which
is more than enough for English characters, plus some punctuation
and so forth.
But it seems wasteful.
I type A, E, and I, maybe O and U pretty often.
I use the values often-- the vowels often.
B and D, I feel like I use those a lot.
I don't really type Q all that much, Z all that much.
So, there are certain letters that I just
feel like I don't type them that often, and indeed,
probably if we analyzed a dictionary, we wouldn't see them as frequently
as other letters.
Indeed, if you've ever played or watched Wheel of Fortune,
certainly all the contestants on that show
know which are the most popular letters in English words.
And it seems silly and perhaps inefficient--
certainly for a computer scientist-- that we are not somehow
embracing the fact that some letters are more commonly used than others,
and yet we are just blindly using 8 bits, the same amount of memory,
for every darn letter in our alphabet.
Why?
If you keep writing a certain letter again and again,
why not use fewer bits for the more popular letters,
and more bits for the less popular letters
so that at least you're optimizing for the common case, so to speak?
Well, it turns out that someone named Huffman years ago did
figure this out and introduced what's generally known as Huffman coding.
And, at first glance, it's a little similar in spirit to something
some of you might have grown up learning a little something about called Morse
code, but it's better in a couple of ways.
Morse code is typically transmitted with electrical signals or audible signals.
It has dots and dashes where a dot is a quick beep
and a dash is a slightly longer beep, and you
can use those series of dots and dashes, as per this chart here,
to represent letters of the alphabet and some numbers.
The one problem, though, as efficient as this seems-- and then by efficient
I mean look at E. Mr. Morse realized that E is super popular, so he
used literally the shortest symbol for it, just a dot, a simple blip,
to represent an E. And, meanwhile, as I kind of imagined,
Z is not that common, so dash, dash, dot,
dot is longer than just a single dot.
So Z is probably less popular, and that's why we did this.
And Y may be even less popular-- dash, dot, dash, dash--
I don't know why I'm using this voice.
But it's longer than E, so we optimized for the shorter characters.
Unfortunately, suppose that you receive the message dot, dot, dot, dot, dot,
dot, so six dots in a row, and I technically paused in between them.
Six dots, what message did I just send you?
Six dots.
So, I wanted to say hi, so I said dot, dot, dot, dot, which is H,
and then dot, dot which is I. I should not have paused between them,
because the whole point of Morse code is to do this as quickly as possible,
even though you probably do want to pause to resolve ambiguity,
and indeed, that's the problem.
I wanted to send you hi, H-I, but maybe I
just sent you E, E, E, E, E, E, six Es in a row,
because those too were just dots.
So, in other words, Morse code is not immediately decodable
when you're reading, or hearing, or seeing the dots and dashes come
over the wire, so to speak, because there's these ambiguities,
unless the transmitter does indeed pause,
as I accidentally did there, to give you a moment to take your breath
and realize, oh, that was an H. That's an I. As opposed to E, E, E, E, E, E.
So, it's not necessarily the best system insofar as some letters
share prefixes with other letters.
In other words, I, dot dot, has a common prefix with E. Both of them
start with a single dot.
It just so happens that I is a little longer,
and that can lead potentially to ambiguity,
and it certainly means that the transmitter should probably slow down.
So, the whole system is meant to be super fast, super efficient,
but you probably should pause between certain letters
so that the recipient doesn't get confused as
to the message you're actually sending.
Well, thankfully Huffman coding-- which as we'll see in a moment
is based on trees-- does not have that ambiguity.
It is immediately decodable.
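The property that separates the two systems is whether any code is a prefix of another. A small, hypothetical checker (`is_prefix_free` is not a name from the lecture) makes that testable. Codes are written as strings, using '.' and '-' for Morse.

```c
#include <stdbool.h>
#include <string.h>

// Returns true if no code in codes[0..count-1] is a prefix of another one
// (identical codes count as a conflict too).
bool is_prefix_free(const char *codes[], size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        for (size_t j = 0; j < count; j++)
        {
            // Does codes[j] begin with all of codes[i]?
            if (i != j && strncmp(codes[i], codes[j], strlen(codes[i])) == 0)
            {
                return false;
            }
        }
    }
    return true;
}
```

For the Morse codes of E ("."), I (".."), and H ("....") this returns false, because E's single dot is a prefix of both of the others. That is exactly why six dots in a row cannot be decoded without pauses.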
And suppose for the sake of discussion, as per this example here,
you just have a whole bunch of text that you want to transmit.
This is meaningless.
There's no pattern in these As, Bs, Cs, Ds, and Es,
but if you go through and count them up, each of these letters-- A, B, C, D, E--
occur with some frequency in this text.
So, it's meant to be representative of an essay, or a message,
or whatever that you want to send to someone.
Indeed, if you count up all of the As, Bs, Cs, Ds, and Es, and divide
by the total number of letters, it turns out
that 20% of the characters in that random string are As, 10% are Bs,
10% are Cs, 15% are Ds, and 45% are Es, so
it's roughly consistent with what I'm claiming,
which is that E is pretty popular.
So, intuitively, it would be really nice
if I had an algorithm that came up with some representation of bits
that's not just 8 bits for every darn letter
but that is a few bits for the popular letters
and more bits for the less popular letters,
so I optimize, again, so to speak, for the common case.
So, by this logic E, hopefully, should have a pretty short encoding
in binary, and A, and B, and C, and D should have slightly longer encodings,
so that again if I'm using E a lot I want to send as few bits as possible.
But I need this algorithm to be repeatable.
I don't want to just arbitrarily come up with something
and then have to tell you in advance that, hey, we're using this David Malan
system for binary.
We want an algorithmic process here.
And what's nice about trees is that it's one way of seeing and solving
exactly that.
So, Huffman proposed this.
If you have a forest of nodes, so to speak, a whole bunch of trees-- each
of size one, no children-- think of them as each having a weight or a frequency.
So, I've drawn five circles here, per this snippet
from a popular textbook that has 10%, 10%, 15%, 20%, 45% equivalently in each
of those nodes.
And I've just labeled the leaves as B, C, D, A, E,
deliberately from left to right because it will make my tree look prettier,
but technically the lines could cross and it's not a big deal in reality.
We just need to be consistent.
So, Huffman proposed this.
In order to figure out the so-called Huffman tree for this particular text,
in order to figure out what to encode its letters as with zeros and ones,
go ahead and take the two smallest nodes and combine them with a new root node.
So in other words, B and C were both 10%.
Those are the smallest nodes.
Let's go ahead and combine them with a new root node
and add together their weights, so 10% plus 10% is 20%.
And then arbitrarily, but consistently, label
the left child's edge or arrow as a 0 and the right child's edge as a 1.
Meanwhile, repeat.
So, now look for the two smallest nodes.
And I see a 20%-- so ignore the children now.
Only look at the roots of these trees.
And there's now four trees, one of which has children, three of which don't.
So, now look at the smallest roots now and you can go left to right here.
There's a 20%, there's a 15%, there's a 20%, and a 45%.
So, I'm not sure which one to go with, so you just
have to come up with some rule to be consistent.
I'm going to go with the ones on the left,
and so I'm going to combine the 20% with the 15%, the 20% on the left.
Combine their weights.
That gives me 35% in a new root, and again label the left branch 0
and the right branch a 1.
Now, it's not ambiguous.
Let's combine 35% and 20% with a new root that's 55%.
Call its left branch 0, its right branch 1.
And now 55% and 45%, combine those and that gives us 100%, or 1, at the root.
So why did I just do this?
Well now I have built up the so-called Huffman tree for this input text
and Huffman proposed the following.
To figure out what patterns of zeros and ones
to use to represent A, B, C, D, E, simply
follow the paths from the root to each of those leaves.
So what is the encoding for A?
Start at the root and then look for A-- 0, 1, so 0,
1 shall be my binary encoding for A in this world.
How about B?
0, 0, 0, 0 shall be my binary encoding for B. How about C?
0, 0, 0, 1 shall be my encoding for C. How about D?
0, 0, 1.
And beautifully, how about E?
1.
So, to summarize, what has just happened?
E was the most popular letter, B and C, were the least popular letters.
And if we summarize these, you'll see that, indeed,
B and C got pretty long encodings, but E got the shortest encoding.
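The whole procedure, merging the two lightest roots until one tree remains and then reading codes off the root-to-leaf paths, can be sketched as below. Several things here are assumptions made for the sketch: weights are whole percents stored in an int so that ties compare exactly, ties go to the leftmost root as chosen in the walkthrough, the caller supplies storage for the internal nodes, and the names `lightest`, `build_huffman`, and `find_code` are hypothetical.

```c
#include <stddef.h>

typedef struct node
{
    char symbol;    // meaningful only in leaves
    int frequency;  // weight in whole percent, so ties compare exactly
    struct node *left;
    struct node *right;
}
node;

// Index of the lightest root in forest[0..count-1], skipping index skip.
// Ties go to the leftmost root.
static int lightest(node *forest[], int count, int skip)
{
    int best = -1;
    for (int i = 0; i < count; i++)
    {
        if (i != skip && (best == -1 || forest[i]->frequency < forest[best]->frequency))
        {
            best = i;
        }
    }
    return best;
}

// Repeatedly merges the two lightest roots until one tree remains.
// forest holds count >= 1 leaves in left-to-right order; internal must
// have room for count - 1 nodes. Returns the root of the finished tree.
node *build_huffman(node *forest[], int count, node internal[])
{
    int used = 0;
    while (count > 1)
    {
        int a = lightest(forest, count, -1);
        int b = lightest(forest, count, a);
        int lo = a < b ? a : b;
        int hi = a < b ? b : a;

        // The root further left becomes the left (0) child.
        node *parent = &internal[used++];
        parent->symbol = '\0';
        parent->frequency = forest[lo]->frequency + forest[hi]->frequency;
        parent->left = forest[lo];
        parent->right = forest[hi];
        forest[lo] = parent;

        // Close the gap left by the root that was at index hi.
        for (int i = hi; i < count - 1; i++)
        {
            forest[i] = forest[i + 1];
        }
        count--;
    }
    return forest[0];
}

// Writes the root-to-leaf path for symbol into code ('0' = left, '1' = right),
// starting at position depth. Returns 1 if the symbol was found, else 0.
// code must have room for the tree's height plus a terminating '\0'.
int find_code(const node *tree, char symbol, char *code, int depth)
{
    if (tree == NULL)
    {
        return 0;
    }
    if (tree->left == NULL && tree->right == NULL)
    {
        code[depth] = '\0';
        return tree->symbol == symbol;
    }
    code[depth] = '0';
    if (find_code(tree->left, symbol, code, depth + 1))
    {
        return 1;
    }
    code[depth] = '1';
    return find_code(tree->right, symbol, code, depth + 1);
}
```

Given leaves B (10), C (10), D (15), A (20), E (45) in that left-to-right order, the merges happen in the same order as in the walkthrough, and `find_code` yields 01 for A, 0000 for B, 0001 for C, 001 for D, and 1 for E. A different tie-breaking rule would give a different but equally short set of codes.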
And it turns out mathematically you will now
have a system for encoding letters of the alphabet that is optimal,
that is you will use as few bits as possible
because you are biasing things toward short representations
for popular letters, longer representations for less
popular letters.
And mathematically this gives you the most efficient encoding
for the original text without losing any information.
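To see the saving in numbers, weight each code length from the example by how often its letter occurs. The average cost per character for this text works out as:

```latex
\underbrace{0.20 \cdot 2}_{A}
+ \underbrace{0.10 \cdot 4}_{B}
+ \underbrace{0.10 \cdot 4}_{C}
+ \underbrace{0.15 \cdot 3}_{D}
+ \underbrace{0.45 \cdot 1}_{E}
= 0.40 + 0.40 + 0.40 + 0.45 + 0.45
= 2.10 \text{ bits per character}
```

That compares with 8 bits per character in ASCII, before counting whatever space is needed to store the tree or table alongside the compressed text.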
In other words, now if Huffman wanted to send a secret message
to someone in class or over the internet, he and that recipient
simply have to agree on this scheme in advance
and then use these encodings to transmit those messages.
Because when someone receives 0, 1 or 0, 0, 0, 0 or 0, 0, 0, 1 from Mr. Huffman,
they can use that same look-up table, if you will,
and say, oh, he just sent me an A or, oh, he just sent me a B or C. So,
you have to know what tree Huffman built up.
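On the receiving end, decoding with the shared look-up table might look like this sketch. The `entry` type and `decode` function are hypothetical, and bits are represented as a string of '0' and '1' characters for readability rather than as packed bits. Because no code is a prefix of another, at most one table entry can match at each position, so no pauses or separators are needed.

```c
#include <stddef.h>
#include <string.h>

typedef struct
{
    char symbol;
    const char *code;
}
entry;

// Decodes a string of '0'/'1' characters into out using a prefix-free table.
// out must have room for strlen(bits) + 1 characters. Returns 1 on success,
// or 0 if the input contains a pattern that matches no code in the table.
int decode(const char *bits, const entry table[], size_t count, char *out)
{
    size_t written = 0;
    while (*bits != '\0')
    {
        int matched = 0;
        for (size_t i = 0; i < count; i++)
        {
            size_t length = strlen(table[i].code);
            if (length > 0 && strncmp(bits, table[i].code, length) == 0)
            {
                out[written++] = table[i].symbol;
                bits += length; // consume exactly this code's bits
                matched = 1;
                break;
            }
        }
        if (!matched)
        {
            out[written] = '\0';
            return 0;
        }
    }
    out[written] = '\0';
    return 1;
}
```

With the table from the example (A = 01, B = 0000, C = 0001, D = 001, E = 1), the bit string 10100100000001 decodes to EADBC. A real implementation would more likely walk the tree one bit at a time, which avoids rescanning the table.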
And, indeed, what typically happens in actual computers is
when you use Huffman coding to compress some body of text
like we just have here, you store the compressed text
by storing your As, Bs, Cs, Ds, and Es and other letters
using these new encodings, but you somehow
have to embed in that file in the compressed file the tree itself
or this cheat sheet of encodings.
So, with compression-- maybe you're compressing a Microsoft Word
file, or a dot TXT file, or any other type of file,
you have to store not just the compressed text using these shorter
representations-- not 8-bit ASCII, but these shorter representations-- but you
also somewhere, maybe at the beginning of the file or at the end of the file,
somewhere where someone else can find it, you need to store this mapping
or you need to store the tree itself in some digital form.
And so, it's possible by this logic that you
might try to compress a really small file,
and that file could actually become bigger
because you're storing a tree inside the file to--
with which to recover the original information.
Or better yet, most algorithms or most actual
compression programs will realize, wait a minute, if compressing this file
is actually going to make it bigger, let's just not compress it at all
and leave it alone untouched.
So, what if you take a compressed file and compress it again, and compress it
again, and compress it again?
A dangerous assumption to get into is, well, I could just maybe keep
compressing that video file again, and again, and again, and again,
and I can maybe compress my big essay, or my big video file,
or big music file to just maybe one bit.
Right?
That's the logical extreme, just keep compressing,
compressing, compressing, compressing.
But, of course, that can't possibly make sense,
because if you compress some file down to just a single bit, 0 or 1,
you've clearly thrown away information and can't possibly recover it all.
So, at some point, too, you've hit this lower bound on the size of the file
until you need to start throwing actual information away.
At some point, the file just has so much entropy, appears to be so random,
there really is no pattern to start to leverage to compress it further.
And so, there generally is some maximum amount of compression
you can apply to something.
So, how would we represent this?
Let's whip out a C struct here.
So, this time each of the nodes in a Huffman tree
needs a little something different.
They need, at least in the leaves, some kind of character
to remember the symbol.
Now, technically only the leaves need to know what symbols they are,
so it's a little redundant to have this in every node,
but we can keep things simple and use the same type of node for everything.
Float frequency, I could use an integer and treat it exactly as a percentage,
or I can use a float as the nodes were with 0.1 and 0.45 and so forth,
and I'll call that frequency.
And then each of those nodes needs a left child potentially
and a right child potentially.
And, again, I'll call these things a node.
So, again, it's getting a little more involved this node, but it still allows
me to represent it ultimately in C.
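The struct on the slide is not captured in the transcript. Based on the fields just described, it is presumably close to this reconstruction:

```c
#include <stddef.h>

typedef struct node
{
    char symbol;      // the letter; only meaningful in leaves
    float frequency;  // e.g. 0.1 or 0.45, as in the example
    struct node *left;
    struct node *right;
}
node;
```

A leaf for E would then be initialized as `node e = {'E', 0.45f, NULL, NULL};`, while an internal node would leave `symbol` unused and point at its two children.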
And now, it's time to pursue lastly the holy grail of data structures,
if you will.
Thus far, we've been solving problems, creating new problems,
trying to solve those again.
And the problems we've been exploring this week are things like dynamism,
if we want to be able to grow or shrink our data structure.
Malloc and pointers give us that flexibility
but might cost us a bit more time, because we
have to keep things sorted differently or we
have to follow all of those pointers.
And so, a lot of the algorithms we've been discussing today
at least have-- like linear time, searching, or inserting, or deleting
potentially like in a linked list.
Better still would be something logarithmic
like a balanced binary search tree, so still preserving that nice binary
aspect from week zero.
But the holy grail of a data structure for its operations
is Big O of 1 so to speak, constant time.
If you are searching, or inserting, or deleting, and somehow changing
a data structure, wouldn't it be amazing if every darn operation
takes just one step, or maybe two steps, or three
steps but a constant number of steps?
Now, it might be a little naive for us to expect
that we can store an arbitrary amount of data in some fancy way
that we get constant time, but maybe just maybe if we're
clever we can get close to that.
So, let's introduce a step toward that.
It turns out there exists in this world things called hash tables.
And a hash table can be implemented in any number of ways,
but you can think of it really as just an array.
So, for instance, this might be a way of representing a hash table called table,
whose first location is bracket zero and whose last location is bracket
n minus 1 for however long this is.
And I just left it as blanks.
I don't even know what this hash table might want to store.
It could be numbers, it could be names, it could be letters,
it could be anything we want.
But hash table has this nice theoretical property
that if well-designed and thought through,
you can maybe just maybe get constant look up time in it.
And let's do a simple example of a hash table.
Hash tables are often nicely thought of as buckets,
so we borrowed these from the loading dock outside just a little moment ago,
and we've attached thanks to Arturo some of these signs to them.
This is going to be Z, so I'll just put this over here.
This is going to be C, so I'll put this over here, and B here, and A.
And we thought we might get chased away by the folks on the loading dock,
so we didn't bother getting D through Y, so we'll just pretend
that we have 26 such buckets here.
And suppose that the goal at hand is-- I don't know,
it's like at the end of an exam, so we've
got our old blue books that a class might use for students
writing essays in some class.
And it's time for the students to come submit their blue books.
Now, we could just collect them all and make a big mess as would generally
be the case, or we can be a little more methodical to at least make
our jobs easier.
Now, at the end of the day, what's going to be interesting about hash tables
is that there's going to be this distinction
between actual benefits and theoretical benefit, or lack thereof.
So, we'll come to that in just a moment, but here's A, B, C, D, and Z.
And you know what?
I just am going to ask the students in this class-- there are so
many people in the room after an exam, I just
want them to at least make my life 1/26 as difficult
by putting all the As over there, all the Bs here, all the Cs here,
all the Zs here, so that I don't have a massive mountain of As through Zs
that I have to sift through individually.
It would just be nice if they do the first pass
of bucketizing the values based on the first letter in their last name.
In other words, my hash function, my algorithm,
is going to be for each student to consider his or her last name,
look at the first letter therein, and put his or her exam
in the appropriate bucket.
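
As a rough sketch of that hash function in C: the name `hash`, the constant `BUCKETS`, and the choice to ignore case are assumptions made for illustration, not something shown in the lecture. It also assumes the last name is non-empty and starts with an ASCII letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define BUCKETS 26  // one bucket per letter, A through Z

// Hypothetical helper: map a last name to a bucket index from 0 to 25
// using only its first letter, ignoring case.
// Assumes the name is non-empty and begins with an ASCII letter.
unsigned int hash(const char *last_name)
{
    assert(last_name != NULL && isalpha((unsigned char) last_name[0]));
    return (unsigned int) (toupper((unsigned char) last_name[0]) - 'A');
}
```

With this sketch, a name starting with A lands in bucket 0, one starting with C in bucket 2, and one starting with Z in bucket 25, every time you ask.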
So, here is, for instance, someone with the letter
C. I'm going to put that blue book in here.
Here's someone with the letter A. That one's going to go here.
Letter Z?
This one's going to go over here.
Letter B?
This is going to go over here.
C, and B, and F-- Z, I mean, and all of the letters of the alphabet
in between.
So, hashing really has this visual and conceptual equivalence
of putting something in this bucket, putting something in that bucket,
putting something in this other bucket, ultimately
bucketizing all of your elements.
And you can think of this, frankly, as just an array,
but it's not just an array with one spot.
It looks like I can stack multiple numbers or multiple blue books inside
of that array.
So, we're going to have to come back to that, because this clearly
can't be an array.
Normally, the array would be filled the moment you put one value in it.
But this hashing is the interesting part.
The juicy ingredient today is if I take into account as input what it is I'm
trying to store, use some piece of that information to decide where to put it,
that's an algorithm, because I can repeat that process,
so long as it's not random.
You go over here, you go over here.
That's amazing.
Wow, OK, pushing my luck.
OK, so I'm not just randomly putting things here.
I'm actually giving some thought as to where I'm putting things,
and that makes the algorithm deterministic, repeatable, predictable
so that if you insert something now, you can absolutely
find it if present later.
Unfortunately, if our hash table does look
like this, just a simple array from bracket 0 to bracket n minus 1, with dot,
dot, dot in between, and it's just an array
for integers or an array for strings or whatever, once you put something here,
or here, or here, that's it.
There is no more room to put another element there
wide as I might have drawn this table.
If there's an int there, that's it.
So, what could you do?
Suppose that you do have an array structure like this,
and that is unacceptable.
You have a whole bunch of elements here and this table looks like this,
and you consider this table like this.
And maybe it's just where you're supposed to take attendance or put
people's names.
So, if you say, oh, Alice is here today.
Let me go ahead and hash on Alice's name and put her where the As should go.
Oh, Zoe is here, Z-O-E, so we'll put her down there.
And then who else?
Alex is here.
Dammit, Alex, no room for you in our hash table,
because Alice is already there.
This is stupid.
If we have data we want to insert into this data structure,
it would seem that I have 24 available spots into which I could put Alex
and yet I'm just stubbornly trying to put him where only the As belong.
So, why don't I, in this kind of scenario, I need to put Alex in here.
I clearly have space.
You know what?
Let me just probe the array looking for the first available spot.
OK, Alex, you're just going to go here, and if someone else like Aaron appears,
fine.
You just are going to go over here.
So, you try to put the letter As where you want them to go,
but if there's already someone there, just
probe deeper into the data structure looking for the first available slot.
So, this is a general technique in programming called linear probing
whereby you have a data structure.
If you hash to some location like the letter A and there's a collision,
something is there, you probe further in the data structure just
looking for some place you can put it.
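
A minimal sketch of linear probing in C, under some assumptions: the names `table`, `SLOTS`, `hash`, and `insert` are hypothetical, each slot holds a single string pointer with NULL meaning free, and the probe wraps around from the last slot back to the first (the lecture does not say what happens at the end of the array).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SLOTS 26

// One name per slot; NULL means the slot is free.
// File-scope arrays of pointers start out as all NULL.
const char *table[SLOTS];

// Hypothetical hash: first letter of the name, ignoring case.
static unsigned int hash(const char *name)
{
    return (unsigned int) (toupper((unsigned char) name[0]) - 'A');
}

// Store the name at its hashed slot, or at the first free slot after it.
// Returns the index used, or -1 if every slot is taken.
int insert(const char *name)
{
    assert(name != NULL && isalpha((unsigned char) name[0]));
    unsigned int start = hash(name);
    for (unsigned int i = 0; i < SLOTS; i++)
    {
        // Assumption: wrap around to slot 0 after the last slot
        unsigned int slot = (start + i) % SLOTS;
        if (table[slot] == NULL)
        {
            table[slot] = name;
            return (int) slot;
        }
    }
    return -1;
}
```

Inserting Alice, then Zoe, then Alex in this sketch puts Alice in slot 0, Zoe in slot 25, and Alex in slot 1, which is where the Bs were supposed to go. The more names collide, the longer that loop runs.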
So, you get close to constant time decision-making.
Put A here, put Z here.
And because this is an array, you have random access with your square bracket
notation, but if you have lots of As and not too many Zs, or Bs, or Ds,
it's possible this approach could devolve back into linear time.
So, in the ideal we have one A, one B, one Z, and everything
in between, that's constant time.
We have our holy grail, constant time operations for a data structure,
but not if we want to support insertion of other elements,
even those that hash to the same location.
So, what's the fix?
Well, if the problem is that we've already
made room-- we already have used this space for Alice, you know what?
If we need to put someone else here, why don't we just create
dynamically some more space?
We have malloc now.
We have dynamic memory allocation.
Why don't we just extend our data structure laterally, horizontally--
artistically here-- so that, yes, you try to go to that first location.
But if there's multiple people that are meant to go there,
multiple values, go ahead and just link them together,
thereby merging the idea of a hash table and a linked list with a data structure
that might look like this.
So, this is an example, somewhat arbitrary, of 31 days out of a month.
And if you actually hash on people's birth dates,
as I think this author did, you can think of your hash table
still as an array.
But that array does not store strings, it does not store integers.
It only stores pointers, 31 total in this case-- some of which
might be null, per the vertical diagonal slash--
but those pointers in turn point to the beginning of linked lists.
So, if multiple people were born on the fourth of some month,
you would put J. Adams and W. Floyd in a linked list at that location.
If Aaron, and Alex, and Alice, and other students with names starting with A
all belong at that first location in my previous table, that's fine.
Just string them together with a linked list.
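
Here is one way that hash table of linked lists might be sketched in C. The names `node`, `table`, `insert`, and `lookup` are assumptions for illustration, the hash is again just the first letter, and new names go at the front of each list. The sketch stores the caller's pointer rather than copying the string, and it never frees its nodes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26

// One element of a bucket's linked list.
typedef struct node
{
    const char *name;   // assumption: the string outlives the table
    struct node *next;
}
node;

// An array of pointers, each the start of a list (NULL if empty).
node *table[BUCKETS];

// Hypothetical hash: first letter of the name, ignoring case.
static unsigned int hash(const char *name)
{
    return (unsigned int) (toupper((unsigned char) name[0]) - 'A');
}

// Prepend the name to its bucket's list. Returns false if malloc fails.
bool insert(const char *name)
{
    assert(name != NULL && isalpha((unsigned char) name[0]));
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    unsigned int bucket = hash(name);
    n->name = name;
    n->next = table[bucket];
    table[bucket] = n;
    return true;
}

// Walk only the one list the name hashes to.
bool lookup(const char *name)
{
    assert(name != NULL && isalpha((unsigned char) name[0]));
    for (node *p = table[hash(name)]; p != NULL; p = p->next)
    {
        if (strcmp(p->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}
```

Now Alice, Alex, and Aaron can all live at location 0, and looking up any one of them only walks the A list, not the whole table.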
Much like with these buckets, at the end of the day, I'm still creating piles.
And at the end of the day, I still have to go through them all, ultimately.
But each of these piles is 1/26 the size it
would have been if everyone just came up at the end of the exam
and just piled all their books in the same pile.
So, whereas, these algorithms at the end of the day
are still devolving, if you will-- or these data structures
are devolving, if you will, into linear time operations,
in the worst case if these things just get really long and stringy,
at least in actuality they might be as good as 1/31 as long or 1/26 as tall.
And so, now there's this dichotomy in this week five
of asymptotic running time, the theoretical running
time that we've really been belaboring and the actual running time.
Just because something is n squared does not mean it's bad.
If there's only a few elements, n squared is great.
It's going to happen super fast if your computer is 1 gigahertz,
or 2 gigahertz, or faster these days.
N squared in and of itself isn't bad.
It just gets really bad when your data gets large.
But in practice, even n squared divided by 2 is actually better than n squared.
So, a couple weeks ago when I was saying don't worry about the lower order
terms, the constant terms, focus only on n squared
and not n or anything you're dividing by, that's fine theoretically,
but in actuality you're going to feel that kind of difference.
So, here's one last data structure that we'll call a trie-- so trie,
short for retrieval somehow, T-R-I-E, but pronounced try.
And this one is cool because this now is really
like a weird offspring of these data structures from today.
But it's a tree each of whose nodes is an array.
And a trie is really good for storing words like words in a dictionary.
Indeed, one of the problems ahead for you in CS50
is going to be to implement a spell checker, which effectively means build
a dictionary in memory, and you'll be challenged
to spell check words as fast as you can, storing
as many as 100,000 English words somehow in your computer's memory
and answering questions of the form is this a word, is this a word,
is this a word.
That's, after all, what spell checking is.
So, a trie is kind of interesting in that-- and this is an excerpt of an
artist's rendition thereof-- the root node
here represents this-- is this rectangle here, and that of course
looks like an array.
And notice what's implicit in this.
If this is location A and this is location Z,
the author here has just decided to only show you
those letters that matter for the sake of discussion.
But the fact that the M location here is not blank
means there's a pointer there.
Indeed, what are these arrays?
They are arrays of pointers to other nodes.
So, the fact that M is not null and it leads to this node, and notice that A
is not null and it leads to this node, and then this node, and then this node.
And this is where the artist is just taking some liberties.
This tree would be monstrously wide, because all of these arrays
are so darn wide, so he or she is just showing you the width-- or the element
that we care about, M, A, X, W, E, L, L, and then some special sentinel symbol
delta, but it could be anything.
This is null, really.
This is how using a trie a programmer could store the name Maxwell,
M-A-X-W-E-L-L, by simply leaving little bread crumbs, if you will,
from one node to another such that each of those elements in the array is
a pointer to another array.
And if you keep following these pointers, following the bread crumbs
and you eventually find yourself at this special sentinel value--
and actually, it wouldn't be null, it would be like a Boolean saying true,
this is a word-- you can, just by storing a single yes or no at this location way
down here, implicitly reveal that M-A-X-W-E-L-L was in fact a word.
Let's follow another.
So, let's say Turing, T-U-R-I-N-G, check, Boolean true.
Turing is in this dictionary as well.
So, if there are bread crumbs that lead to null, that word is not in here.
So, apparently there are no names starting with A through L,
and there is no one after U through Z or some of the letters in between,
because those pointers are implicitly and pictorially null.
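
To make the picture concrete, here is a hedged sketch in C of what such a node and an insert routine might look like. The names `node`, `is_word`, `children`, `new_node`, and `insert` are assumptions for illustration, and the sketch handles only the 26 ASCII letters, ignoring case.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define LETTERS 26

// Each node is an array of pointers to other nodes, one per letter,
// plus a Boolean that says "a word ends here".
typedef struct node
{
    bool is_word;
    struct node *children[LETTERS];
}
node;

// Allocate a node with every pointer null. Returns NULL if malloc fails.
node *new_node(void)
{
    node *n = malloc(sizeof(node));
    if (n != NULL)
    {
        n->is_word = false;
        for (int i = 0; i < LETTERS; i++)
        {
            n->children[i] = NULL;
        }
    }
    return n;
}

// Leave bread crumbs for each letter of the word, creating nodes
// as needed, then mark the last node as the end of a word.
bool insert(node *root, const char *word)
{
    assert(root != NULL && word != NULL);
    node *cur = root;
    for (const char *p = word; *p != '\0'; p++)
    {
        assert(isalpha((unsigned char) *p));
        int i = toupper((unsigned char) *p) - 'A';
        if (cur->children[i] == NULL)
        {
            cur->children[i] = new_node();
            if (cur->children[i] == NULL)
            {
                return false;
            }
        }
        cur = cur->children[i];
    }
    cur->is_word = true;
    return true;
}
```

After inserting Maxwell into an empty trie built this way, the root's M pointer is no longer null, while its A pointer still is, which is the situation the picture is hinting at.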
But let's consider, then, what is the running time of inserting or looking up
a name in a trie? Thus far, pretty much all of the data
structures we've talked about have pretty slow running times,
linear in the worst case.
So, if we used an array to store people's names
or we used a linked list to store people's names, in the worst case
we had linear running time, unless maybe we sort things, but even
then that costs us some time.
So, linear, maybe logarithmic, was the best we could do.
And even with a hash table, whereby, maybe we
store Maxwell at the M location in our table,
he might still have a linked list of a whole bunch of other M people.
That, again, can devolve into something linear, a linear linked list.
But what about a hash table?
To answer the question is Maxwell in a trie-- sorry, what about a trie?
To answer the question is Maxwell in a trie, what do we do?
We start at the root and we follow the pointer that represents M,
and then we follow the pointer there that represents A, then X, W, E, L, L,
and we look for at the end of that series of steps a true false value.
And if it's true, yes, Maxwell is here.
What about Turing?
Well, we start at the pointer that represents
T, then U, R, I, N, G, then check.
Oh, true.
Turing is in there.
Let's look for David.
No, false.
There's not even a pointer there.
David is not in this dictionary.
So, how many steps did each of those take?
To tell whether Maxwell was in the dictionary,
was M-A-X-W-E-L-L and then look at the Boolean, so that was eight steps.
And to look up Turing was T-U-R-I-N-G. And then that Boolean,
that was seven steps.
Those numbers have nothing to do with how many words are already in the trie.
There might be-- and there's only a couple dozen here--
there are a dozen or so here-- there might be thousands of actual words
in this dictionary, but we're still going to find Alan Turing by way
of T-U-R-I-N-G Boolean seven steps, and M-A-X-W-E-L-L Boolean, eight steps.
It doesn't matter how many other data elements are in this trie.
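
That step counting can be sketched in C as follows. Everything here is an assumption for illustration: the node layout, the names `new_node`, `insert`, and `check`, and the convention that a "step" is one pointer followed per letter plus one final look at the Boolean.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define LETTERS 26

typedef struct node
{
    bool is_word;
    struct node *children[LETTERS];
}
node;

// Allocate a node with every pointer null. Returns NULL if malloc fails.
node *new_node(void)
{
    node *n = malloc(sizeof(node));
    if (n != NULL)
    {
        n->is_word = false;
        for (int i = 0; i < LETTERS; i++)
        {
            n->children[i] = NULL;
        }
    }
    return n;
}

// Add a word, creating nodes along the way as needed.
bool insert(node *root, const char *word)
{
    assert(root != NULL && word != NULL);
    node *cur = root;
    for (const char *p = word; *p != '\0'; p++)
    {
        assert(isalpha((unsigned char) *p));
        int i = toupper((unsigned char) *p) - 'A';
        if (cur->children[i] == NULL && (cur->children[i] = new_node()) == NULL)
        {
            return false;
        }
        cur = cur->children[i];
    }
    cur->is_word = true;
    return true;
}

// Look a word up, counting one step per pointer followed and one more
// for reading the Boolean at the end.
bool check(const node *root, const char *word, int *steps)
{
    assert(root != NULL && word != NULL && steps != NULL);
    *steps = 0;
    const node *cur = root;
    for (const char *p = word; *p != '\0'; p++)
    {
        assert(isalpha((unsigned char) *p));
        cur = cur->children[toupper((unsigned char) *p) - 'A'];
        (*steps)++;
        if (cur == NULL)
        {
            return false;  // the bread crumbs ran out: not a word
        }
    }
    (*steps)++;  // the final look at the Boolean
    return cur->is_word;
}
```

In this sketch, checking Maxwell reports eight steps and checking Turing reports seven, whether the trie holds two words or many more. The loop is driven by the letters of the word being looked up, never by the number of words stored.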
And that's what's powerful, because if there
is an upper bound on the number of letters in an English
word-- which is kind of true.
I've rarely typed words that are longer than, I don't
know, 10 characters, 15 characters.
At some point there might exist these words,
but no one actually says or types these words.
Those are effectively constants.
The maximum length of a word in English is surely some constant,
because there is one word that's the longest.
That's a constant value, which means inserting a name,
or searching for a name, or removing a name from a trie
does depend on the length of the name, but it does not
depend on how many pieces of data are already in the data structure.
And as such, it is constant time.
So, now in C, we have a whole bunch of new syntax
with which to represent data structures, namely actual structs in C,
and we have pointers, and we have malloc with which
we can build more interesting shapes, if you will, in memory.
And we now have a number of abstract data types and actual data structures
we can build using these ingredients with which we can now solve problems
that are going to demand all the more resources, all the more time, all
the more space, in which case efficiency and good design
is going to be ever more important.
All this and more next time.
AUDIENCE: She thought she was doing the right thing.
[AUDIO PLAYBACK]
-Tell me more.
-David was sure it had to be the muppet, something called muppet mode,
but the pressure was too much.
-This is Mario in muppet mode.
Take 23.
[HONKING]
-What's happening?
I thought this is what you always wanted,
to star in the walkthrough videos, to have all of YouTube's eyes
watching you.
[HONKING]
-Yes, you are.
You have to be.
Now stand up straight, tuck in your shirt, look into the camera!
Take it again from the top.