In this video we'll begin our discussion of hash tables; we'll focus first on the
supported operations, and on some of the canonical applications. So hash tables are
insanely useful. If you want to be a serious programmer or a computer scientist,
you really have no choice but to learn about hash tables. I'm sure many of you
have used them in your own programs in the past, in fact. Now on the one hand, what's
funny is they don't actually do that many things in terms of the number of supported
operations, but what they do do, they do really, really well. So what is a hash
table? Well, conceptually, ignoring all of the aspects of the implementation, you may
wanna think of a hash table as an array. So one thing that arrays do super well is
support immediate random access. If you're wondering what's in position number
seventeen of some array, boom, with a couple of machine instructions you can
find out. Wanna change the contents of position number 23 in some array? Done, in
constant time. So let's think about an application in which you want to remember
your friends' phone numbers. So if you're lucky, your friends' parents were all
unusually unimaginative people, and all of your friends' names are integers, let's say
between 1 and 10,000. If this is the case, then you can just maintain an array
of length 10,000. And to store the phone number of, say, your best friend, 173, you
can just use position 173 of this modest-sized array. So this array-based solution
would work great, even if your friends change over time, you gain some here, you
lose some there, as long as all your friends' names happen to be integers
between 1 and 10,000. Now, of course, your friends have more interesting names:
Alice, Bob, Carol, whatever, and last names as well. So in principle you could
have an array with one position for every conceivable name you might encounter,
with, say, at most 30 letters. But of course this array would be way too big;
it would have something like 26 raised to the thirtieth power positions, and you
could never implement it. So what you'd really want is an array of reasonable size,
say, you know, ballpark the number of friends that you'd ever have, so say in
the thousands or something, where its positions are indexed not by integers
between 1 and 10,000, but rather by your friends' names.
And what you'd like is random access to this array based on
your friend's name. So you just look up the quote-unquote Alice position of this
array and, boom, there would be Alice's phone number, in constant time. And this,
on a conceptual level, is basically what a hash table can do for you. So there's a
lot of magic happening under the hood of a hash table, and that's something we'll
discuss to some extent in other videos. You have to have this mapping between the
keys that you care about, like your friends' names, and numerical positions
of some array; that's done by what's called a hash function. But properly
implemented, this is the kind of functionality that a hash table gives you:
like an array, with its positions indexed by the keys that you're storing.
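(To make that concrete, here's a minimal sketch in Python, whose built-in dict is a hash table; the names and numbers below are made up purely for illustration.)

```python
# A hash table behaves like an array whose positions are indexed
# by arbitrary keys (here, friends' names) rather than by the
# integers 1 through 10,000.
phone_numbers = {}                    # an empty hash table

phone_numbers["Alice"] = "555-0173"   # insert: constant time
phone_numbers["Bob"] = "555-0199"

print(phone_numbers["Alice"])         # look-up: constant time
del phone_numbers["Bob"]              # delete: constant time
```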
So you can think of the purpose of a hash table as maintaining a possibly
evolving set of stuff, where of course the set of things that you're maintaining
will vary with the application; it can be any number of things. If you're
running an e-commerce website, maybe you're keeping track of transactions. Or
again, maybe you're keeping track of people, like, for example, your friends and
various data about them. Or maybe you're keeping track of IP addresses, for
example if you wanna know who the unique visitors to your website were.
And so on. So a little bit more formally, you know, the basic operations: you need
to be able to insert stuff into a hash table. In many, but not all, applications,
you need to be able to delete stuff as well. And typically the most important
operation is look-up. And for all three of these operations, you do it in a
key-based way, where as usual a key should just be a unique identifier for the
record that you're concerned with. So, for example, for employees you might be
using social security numbers; for transactions you might have a transaction ID
number; and then IP addresses could act as their own key. Sometimes all you're
doing is keeping track of the keys themselves. So, for example, with IP addresses,
maybe you just want to remember a list of IP addresses; you don't actually have
any associated data. But in many applications, you know, along with the key comes
a bunch of other stuff. So along with the employee's social security number, you
gotta remember a bunch of other data about that employee. But when you do the insert,
when you do the delete, when you do the look-up, you do it based on this key. And
then, for example, on look-up you feed the key into the hash table and the hash table
will spit back out all of the data associated with that key. We sometimes
hear people refer to data structures that support these operations as a dictionary.
So the main thing the hash table is meant to support is look-up, in the spirit of a
dictionary. I find that terminology a little misleading, actually. You know, most
dictionaries that you'll find are in alphabetical order, so they'll support
something like binary search. And I want to emphasize: something a hash table does
not do is maintain an ordering on the elements that it stores. So if you're
storing stuff and you do want to have order-based operations, you wanna find the
minimum or the maximum, or something like that, a hash table's probably not the
right data structure. You want something more; you wanna look at a heap or you
wanna look at a search tree. But for applications in which all you have to do
is basically look stuff up, you gotta know what's there and what's not,
then there should be a light bulb that goes off in your head, and you can say:
let me consider a hash table, that's probably the perfect data structure for
this application. Now, looking at this menu of supported operations, you may be left
kinda unimpressed. Alright, so a hash table, in some sense, doesn't do that many
things; but again, what it does, it does really, really well. So, to first order,
what hash tables give you is the following amazing guarantee: all of these operations
run in constant time. And again, this is in the spirit of thinking of a hash table as
just like an array, where its positions are conveniently indexed by your keys. Just
like an array supports random access in constant time (you can see whether there's
anything in a given position, and what it is), a hash table will let you look up
based on a key in constant time. So what is the fine print? Well, there are
basically two caveats. So the first thing is that hash
tables are easy to implement badly, and if you implement them badly, you will not get
this guarantee. So this guarantee is for properly implemented hash tables. Now, of
course, if you're just using a hash table from a well-known library, it's probably a
pretty good assumption that it's properly implemented. You'd hope. But in the event
that you're forced to come up with your own hash table and your own hash function
(and, unlike with many of the other data structures we'll talk about, some of you
probably will have to do that at some point in your career), then you'll get this
guarantee only if you implement it well. And we'll talk about exactly what that
means in other videos. The second caveat is that, unlike most of the
problems that we've solved in this course, hash tables don't enjoy worst-case
guarantees. You cannot say for a given hash table that for every possible data
set you're gonna get constant time. What's true is that for non-pathological data,
you will get constant-time operations in a properly implemented hash table. So we'll
talk about both of these issues a bit more in other videos, but for now the
high-order bits are, you know: hash tables, constant-time performance, subject to a
couple of caveats. So now that I've covered the operations that hash tables
support and the recommended way to think about them, let's turn our attention to
some applications. All of these applications are gonna be, in some sense,
you know, kinda trivial uses of hash tables, but they're also all really
practical; these come up all the time. So the first application we'll discuss, which
again is a canonical one, is removing duplicates from a bunch of stuff, also
known as the de-duplication problem. So in the de-duplication problem, the input is
essentially a stream of objects, where when I say a stream, I have, you know,
two different things in mind as canonical examples. So first of all, you can imagine
you have a huge file. So you have, you know, a log of everything that happened on
some website you're running, or all of the transactions that were made in a store on
some day. And you do a pass through this huge file; you're just in the middle of
some outer for loop, going line by line through this massive file. The other
example of a stream that I had in mind is where you're getting new data over
time. So here, you might imagine that you're running software to be deployed on
an internet router, and data packets are coming through this router at a constant,
extremely fast rate. So you might be looking at, say, the IP addresses of the
senders of the data packets which are going through your router. That would be
another example of a stream of objects. And now, what do you gotta do? What you
gotta do is you gotta ignore the duplicates. So remember just the distinct
objects that you see in this stream. And I hope you find it easy to imagine why you
might want to do this task in various applications. So, for example, if you're
running a website you might want to keep track of the distinct visitors that you
ever saw in a given day or a given week. If you're doing something like a web
crawl, you might want to identify duplicate documents and only remember them
once. So, for example, it would be annoying if in search results the top
link and the second link both led to identical pages at different URLs, okay,
so search engines obviously want to avoid that, so you want to detect duplicate web
pages and only report unique ones. And the solution using a hash table is laughably
simple. Every time a new object arrives in the stream, you look it up: if it's
there, then it's a duplicate and you ignore it; if it's not there, then this is
a new object and you remember it. QED, that's it. And then after the stream
completes, so for example after you finish reading some huge file, if you just want
to report all of the unique objects, hash tables generally support a linear scan
through them, and you can just report all of the distinct objects when the stream
finishes.
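(As a minimal Python sketch, assuming the stream is any iterable and using a set, which is a hash table storing bare keys, the whole solution looks something like this; the function name is mine, not from the lecture.)

```python
def distinct_objects(stream):
    """Return the distinct objects of a stream, in order of first arrival."""
    seen = set()                 # hash table of everything seen so far
    uniques = []
    for obj in stream:
        if obj not in seen:      # constant-time look-up
            seen.add(obj)        # constant-time insert
            uniques.append(obj)
    return uniques

# e.g., the distinct visitors in a day's log of IP addresses:
print(distinct_objects(["1.2.3.4", "5.6.7.8", "1.2.3.4"]))
# ['1.2.3.4', '5.6.7.8']
```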
So let's move on to a second application, slightly less trivial maybe but still
quite easy; this is the subject of Programming Project number 5. This is a problem
called the two-sum problem. You're given as input an array of n integers; these
integers are in no particular order. You're also given a target sum, which I'll
call t. And what you want to know is: are there two integers among the n you were
given that sum to t? Now, the most obvious and naive way to solve this problem is
just to go over all n choose 2 pairs of integers in the input and check each one
separately. So that's clearly a quadratic-time algorithm. But now, of course, we
need to ask: can we do better? And yes, we can. First of all, let's see what you'd
do if you couldn't use any data structures. So if you were clever, but you didn't
use any data structures like a hash table, here would be a reasonable improvement
over the naive solution. The first step of a better solution is to sort A up front,
for example using merge sort or heap sort, something that runs in n log n time. So
you may be asking about the motivation for sorting. Well, one thing is just, you
know, whenever you're trying to do better than n squared, you might think that
sorting your data somehow helps, right? And you can sort of do it almost for free,
in n log n time. Now, why would sorting the array up front help us? Well, the clever
insight is that for each entry of the array A, say the first entry, we now know
exactly what we're looking for to achieve the given target. If the target sum that
we're trying to get to is 100, and the first entry in the sorted array is 43, then
we know we're looking for a 57 somewhere else in this now-sorted array. And we know
that searching a sorted array is pretty easy, right? That's just binary search,
which takes logarithmic time. So for each of the n array entries, we can look for a
complementary entry: namely, for each x we can look for t - x using binary search,
and each binary search takes log n time. So the sorting up front speeds up this
entire batch of n searches; that's why it's a win. So in the second step, because
we do a linear number of binary searches, the total is just n, the number of
searches, times log n, the time per search. So this is just another theta
of n log n term.
Alright, so that's pretty cool; I don't think you could come up with this
n log n solution without having some basic facility with algorithms. This is
already a really nice improvement over the naive n squared solution. But we can do
even better: there's no reason we're stuck with an n log n bound for the two-sum
problem. Obviously, because the array is unsorted, we have to look at all the
integers, so we're not gonna do better than linear time. But we can do linear time,
via a hash table. So a good question you might ask at this point is: what's the
clue about this problem, about this task, that suggests we want to use a hash
table? Well, hash tables are going to dramatically speed up any application where
the bulk of the work is just repeated look-ups. And if we examine this n log n
solution, once we have this idea of doing a search for t - x for each value of x,
we realize, actually, the only thing we needed the sorted array for was to support
look-ups. That's all binary search here is doing: just looking stuff up. So we say,
aha, all of the work here in step two is from repeated look-ups. We're paying a
relatively exorbitant logarithmic amount of time per look-up, whereas hash tables
can do them in constant time. So, repeated look-ups, ding, ding, ding: let's use a
hash table, and indeed that's what gives
us linear time in this problem. So from the amazing guarantee of hash tables, we
get the following amazing solution for the two-sum problem, although again this is
subject to the same fine print: you'd better use a properly implemented hash table,
and you'd better not have pathological data. So rather than sorting, you just
insert everything in the array into a hash table. Insertions take constant time, so
this first step is gonna be linear time, rather than the n log n we were paying
before. Once all the stuff is in the hash table, we just do the same thing as in
the n log n solution: for each x in the array, we look for its matching element,
t - x, in the hash table, using the constant-time look-up operation exported by the
hash table. And of course, if for some x you do find the matching element t - x,
then you can just report x and t - x; that proves that there is indeed a pair of
integers with target sum t. If for every single element of the input array A you
fail to find the matching element t - x in the hash table, then for sure there is
no pair of integers in the input that sums to t. So this solves the problem
correctly. Moreover, constant-time insertion means the first step is going to be
O(n) time, and constant-time look-up means the second step is also gonna be linear
time. All of this is subject to the caveats that we discussed on the previous slide.
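(And here's a Python sketch of the linear-time solution, again treating "two integers" as two distinct positions; Counter, a hash-table-backed dictionary of counts, handles the case where x and t - x happen to be equal.)

```python
from collections import Counter

def two_sum_hash(a, t):
    """Return True if two entries of a, at distinct positions, sum to t."""
    counts = Counter(a)       # step 1: n constant-time insertions
    for x in a:               # step 2: n constant-time look-ups
        y = t - x
        if y in counts and (y != x or counts[x] > 1):
            return True
    return False

print(two_sum_hash([43, 12, 57, 8], 100))   # True, in O(n) time
```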
So it's kind of amazing how many different applications of computer science boil
down, in their essence, to repeated look-up operations. Therefore, having a
super-fast look-up operation, like that supported by a hash table, permits these
applications to scale to fantastic sizes. It's really amazing, and it drives a lot
of modern technology. So let me just mention a couple of examples. Again, if you
look around or do some research on the web, you'll
quickly find many more. So originally, what prompted researchers to think hard
about data structures that support super-fast look-ups was back when people were
first building compilers. So this is a long time ago, in the fifties or so. And
these repeated look-ups, to figure out what has and has not been defined before,
were emerging as a bottleneck in compilers back in the early days of programming
languages. And that was one of the early applications of hash tables: to support
super-fast look-ups, to speed up compile time, to keep track of function and
variable names and things like that. Hash table technology is also super
useful for software on routers in the Internet. So, for example, you might want
to block network traffic from certain sources. So, for example, maybe you
suspect that a certain IP address has been taken over by spammers and so any traffic
coming from that IP address you just want to ignore. And you don't wanna even let it
get to the end host, to the computer on someone's desktop or to someone's mobile
device; rather, inside the internet, you wanna just drop packets that are coming
from certain senders. So what does that problem boil down to? Well, you might have
a blacklist of IP addresses that you're refusing traffic from, and then the task
faced by the router is really the look-up problem. A data packet comes in at
some insanely fast data rate, and you immediately just look up: is this in the
blacklist or not? If it is in the blacklist, then you drop the packet; if it's
not, then you let it go through.
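(A sketch of that router logic in Python; the addresses below are from documentation ranges and purely illustrative.)

```python
# Hash table (a set) of IP addresses we're refusing traffic from.
blacklist = {"203.0.113.7", "198.51.100.42"}

def should_drop(source_ip):
    # One constant-time look-up per packet, which is what lets the
    # router keep up with an insanely fast data rate.
    return source_ip in blacklist

print(should_drop("203.0.113.7"))   # True: drop this packet
print(should_drop("192.0.2.1"))     # False: let it go through
```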
So a very different application is speeding up search algorithms. And when I say a
search algorithm, what I'm thinking
about here is something like a chess playing program. So something that does
game tree exploration. So we've already talked a fair amount about graph search in
this class, but in our discussion of breadth first and depth first search, we
were thinking about graphs that you could basically write down. You could store them
in the main memory of your machine or, in the worst case, on some big cluster. So
maybe graphs, you know, about the size of the web graph or possibly smaller. But in
the context of something like a chess-playing program, the graph that you're
interested in is way, way, way bigger than the web graph. So what's the graph we
care about for a chess-playing program? Well, the nodes of the graph are going to
correspond to all possible configurations of chess pieces on a chessboard. So
every chessboard that you might ever encounter in a game of chess. So that's a
massive, massive number of configurations. And you're never gonna be able to write down
these vertices. The edges in this graph are going to take you from one
configuration to another, and they're gonna correspond to legal moves. So if you
can move a bishop from one place to another place, and you get from one
configuration to another configuration, there's an edge in the graph corresponding
to that move. Now, you can't write down this graph, so you can't implement
breadth-first or depth-first search exactly as we discussed it before. But you'd
still like to do graph exploration, right? You'd like to have your computer program
reason about at least the short-term ramifications of your possible next move, and
that will correspond to searching through this graph. Now, remember that in graph
search a really important property was that you don't want to do redundant work;
you don't want to re-explore things you've already explored. That would be silly
and might lead to infinite loops and so on. And when you can't write down the
graph, just remembering where you've been is suddenly a non-trivial problem. But
what is remembering where you've been, fundamentally? Fundamentally, that's a
look-up operation. So that is exactly what hash tables are for. So to be a little more
concrete, you know, one way you'd use the hash table in, say, a chess-playing
program is you'd take the initial configuration and sort of imagine trying all
possible moves from this configuration. Then you'd try all possible moves from
your opponent, and then you'd have all your moves in response. And you would
always remember, as you were doing this reasoning, every chessboard configuration
you'd ever looked at before, and you'd stick it in the hash table. And before you
go exploring some configuration, you'd look it up in your hash table to see if
you've already explored it in the past. And if you have, you don't bother; you've
already seen it. All right, so chess-playing programs operate by exploring
systematically as many configurations as they have time for. You know, obviously,
with a budget of three minutes or whatever, you don't wanna waste any work
exploring any given configuration more than once. How do you remember where
you've been? Well, everything you've explored, you stick in a hash table; before
you explore a configuration, you look it up in the hash table and see if you've
already done it.
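(As a minimal sketch of that pattern in Python: game-tree exploration with a hash table of visited configurations. The function names, the depth budget, and representing a board as some hashable value are all my illustrative assumptions, not a real chess engine.)

```python
def explore(config, depth, legal_moves, visited):
    """Search the game tree below config, never revisiting a board."""
    if config in visited or depth == 0:   # constant-time look-up
        return
    visited.add(config)                   # constant-time insert
    for next_config in legal_moves(config):
        explore(next_config, depth - 1, legal_moves, visited)

# legal_moves(config) would return every configuration reachable in
# one legal move; usage: explore(start, 8, legal_moves, set())
```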
So these, of course, are just scratching the surface. I just wanted to highlight a
couple of, you know, fairly different-looking applications, to convince you that
hash tables come up all the time. And the reason they come up all the time is
because, you know, the need for fast look-ups comes up all the time. It's kind of
amazing how much technology is being driven just by, you know, repeated fast
look-ups. So as homework, I encourage you to just sort of think about, you know,
your own life, or think about technology out there in the world, and come up with
some guesses about where hash tables are probably making something out there run
blazingly fast. I think it won't take you more than a few minutes to come up with
some good examples.