Placeholder Image

字幕列表 影片播放

  • In this video we'll begin our discussion of hash tables; we'll focus first on the

  • support operations, and on some of the canonical applications. So hash tables are

  • insanely useful. If you want to be a serious programmer or a computer scientist

  • you really have no choice but to learn about hash tables. I'm sure many of you

  • have used them in your own programs in the past in fact. Now on the one hand what's

  • funny is they don't actually do that many things in terms of the number of supported

  • operations, but what they do, do they do really, really well. So what is a hash

  • table? Well conceptually, ignoring all of the aspects of the implementation, you may

  • wanna think of a hash table as an array. So one thing that arrays do super well is

  • support immediate random access. So if you're wondering what's the position

  • number seventeen of some array, boom, with a couple of machine instructions you can

  • find out, wanna change the contents of position number 23 in some array? Done, in

  • constant time. So let's think about an application in which you want to remember

  • your friends phone numbers. So if you're lucky your friends parents were all u nu,

  • unusually unimaginative people and all of your friends names are integers let's say

  • between one and 10,000. So if this is the case then you can just maintain an array

  • of link 10,000. And to store the phone number of say, your best friend, 173, you

  • can just use position 173 of this modest sized array. So this array based solution

  • would work great, even if your friends change over time, you gain some here you

  • lose some there, as long as all your friends names happen to be integers

  • between 1-10,000. Now, of course, your friends have more interesting names:

  • Alice, Bob, Carol, whatever. And last names as well. So in principal you could

  • have an array with one position in the array for every conceivable name you might

  • encounter, with at least 30 letters set. But of course this array would be way too

  • big. It would be something like 26 raised to the thirtieth power and you could never

  • implement it. So what you'd really want is you'd want an array of reasonable size,

  • say, you know ballpark the number of friends that you'd ever have, so say in

  • the thousands or something, where it's positions are indexed not by the numbers,

  • not integers. [inaudible] Between one and 10,000, but rather by your friends Names

  • And what you'd like to do is you'd like to have random access to this array based on

  • your friend's name. So you just look up the quote unquote Alice position of this

  • array and. Boom, there would be Alice's phone number in constant time. And this,

  • on a conceptual level is basically what a hash table, can do for you. So there's a

  • lot of magic happening under the hood of a hash table and that's something we'll

  • discuss to some extent in other videos. So you have to have this mapping between the

  • keys that you care about, like your friends' names, and, numerical positions

  • of some array. That's done by what's called a hash function, but properly

  • implemented, this is the kind of functionality that hash tables gives you,

  • So like an array with its positions indexed by the keys that you're storing.

  • So you can think of the purpose of the hash table as to maintain a possibly

  • evolving set of stuff. Where of course the set of things that you're maintaining, you

  • know, will vary with the application. It can be any number of things. So if you're

  • running an e-commerce website, maybe you're keeping track of transactions. You

  • know, again, maybe you're keeping track of people, like for example, your friends and

  • various data about them. So maybe you're keeping track of I-P addresses, for

  • example if you wanna know, who was, were there unique visitors to your websites.

  • And so on. So a little bit more formally, you know, the basic operations, you need

  • to be able to insert stuff into a hash table. In many, but not all applications,

  • you need to be able to delete stuff as well. And typically the most important

  • operation is look-up. And for all these three operation you do it in a key based

  • way. Where as usual a key should just be a unique identifier for the record that

  • you're concerned with. So, for example, for employees you might be using social

  • security numbers. For transactions you might have a transaction ID number. And

  • then IP addresses could act as their own key. And so sometimes all you're doing is

  • keeping track of the keys themselves. So, for example, in IP addresses, maybe you

  • just want to remember a list of IP addresses. You don't actually have any

  • associated data but in many applications, you know, along with the key, is a bunch

  • of other stuff. So along with the employee's social security number, you

  • gotta remember a bunch of other data about that employee. But when you do the insert,

  • when you do the delete, when you do the look up, you do it based. On this key, and

  • then for example, on look up you feed the key into the hash table and the hash table

  • will spit back out all of the data associated with that key. We sometimes

  • hear people refer to data structures that support these operations as a dictionary.

  • So the main thing the hash table is meant to support is look up in the spirit of a

  • dictionary. I find that terminology a little misleading actually. You know, most

  • dictionaries that you'll find are in alphabetical order. So they'll support

  • something like binary search. And I want to emphasis something a hash table does

  • not do is maintain an ordering on the elements that it supports. So if you're

  • storing stuff and you do want to have order based operations, you wanna find the

  • minimum or the maximum, or something like that, a hash table's probably not the

  • right data structure. You want something more. You wanna look at a heap or you

  • wanna look at a, a search tree. But for applications in which all you have to do

  • is basically look stuff up you gotta, you gotta know what's there and what's not,

  • then there should be a light bulb that goes off in your head. And you can say,

  • let me consider a hash table, that's probably the perfect data structure for

  • this application. Now, looking at this menu-supported operations, you may be left

  • kinda unimpressed. Alright, so a hash table, in some sense, doesn't do that many

  • things; but again, what it does, it does really, really well. So, to first order.

  • What hash tables give you is the following amazing guarantee. All of these operations

  • run in constant time. And again this is in the spirit of thinking of a hash table as

  • just like an array. Where its positions are conveniently indexed by your keys, So

  • just like an array supports random access in constant time, you can see if, you

  • know, there's anything in the array position, and what it is. As similarly a

  • hash table will let you look up based on the key in constant time. So what is the

  • fine print? Well, there's basically two caveats. So the first thing is that hash

  • tables are easy to implement badly. And if you implement them badly you will not get

  • this guarantee. So this guarantee is for properly implemented hash tables. Now, of

  • course if you're just using a hash table from a well known library, it's probably a

  • pretty good assumption that it's properly implemented. You'd hope. But in the event

  • that you're forced to come up with your own hash table and your own hash function

  • and unlike many of the other data structures we'll talk about, some of you

  • probably will have to do that at some point in your career. Then you'll get this

  • guarantee only if you implement it well. And we'll talk about exactly what that

  • means in other videos. So the second caveat is that, unlike most of the

  • problems that we've solved in this course, hash tables don't enjoy worst case

  • guarantees. You cannot say for a given hash table that for every possible data

  • set you're gonna get cost and time. What's true is that for non-pathological data,

  • you will get cost and time operations in a properly implemented hash table. So we'll

  • talk about both of these issues a bit more in other videos, but for now just high

  • order bits are, you know, hash tables, constant time performance, subject to a

  • couple of caveats. So now that I've covered the operations that hash tables

  • support and the recommend way to think about them, let's turn our attention to

  • some applications. All of these applications are gonna be in some sense,

  • you know, kinda trivial uses of hash tables, but they're also all really

  • practical. These come up all the time. So the first application we'll discuss, which

  • again is a conical one, is removing duplicates from a bunch of stuff, Also

  • known as the deduplication problem. So in the De-duplication problem, the input is

  • essentially a stream of objects. Where, when I say a stream I have kinda, you know

  • two different things in mind as canonical examples. So first of all you can imagine

  • you have a huge file. So you have, you know, a log of everything that happened on

  • some website you're running. Or all of the transactions that were made in a store on

  • some day, And you do a pass through this huge file. So you're just in the middle of

  • some outer for loop going line by line through this massive file. The other

  • example of a stream that I had in mind, is, where you're getting new data over

  • time. So here, you might imagine that you're running software to be deployed on

  • an internet router. And data packets are coming through this router at a constant

  • extremely fast rate. And so you might be looking at, say, the IP addresses and the

  • sender, and use your data packet which is going through your router. So it would be

  • another example of a stream of objects. And now, what do you gotta do? What you

  • gotta do is you gotta ignore the duplicates. So remember just the distinct

  • objects that you see in this stream. And I hope you find it easy to imagine why you

  • might want to do this task in various applications. So, for example, if you're

  • running a website you might want to keep track of the distinct visitors that you

  • ever saw in a given day or a given week. If you're doing something like a web

  • crawl, you might want to identify duplicate documents and only remember them

  • once. So, for example, it would be annoying if in search results both the top

  • link and the second link both led to identical pages at different URLs, okay,

  • so search engines obviously want to avoid that, so you want to detect duplicate web

  • pages and only report unique ones. And the solution using a hash table is laughably

  • simple. So every time a new object arrives in the stream, you look it up. If it?s

  • there, then it?s a duplicate and you ignore it. If it?s not there, then this is

  • a new object and you remember it. Qed, that's it. And so then after the string

  • completes, so for example after you finish reading some huge file, if you just want

  • to report all of the unique objects, hash tables generally support a linear scan

  • through them and you can just report all of the distinct objects when this stream

  • finishes. So let's move on to a second application slightly less trivial maybe

  • but still quite easy, and this is the subject of Programming Projects number

  • five. So this is a problem called the two sum problem. You're given as input an

  • array of N number. These images are in no particular order. You're also given a

  • target sum, which I'll call T. And what you want to know is are there two integers

  • from amongst these N you are given that sum to T. Now the most obvious and naive

  • way to solve this problem is just to go over all N, choose two pairs of integers

  • in the input, and check each one separately. So that's clearly a quadratic

  • time algorithm. But now, of course, we need to ask, can we do better? And, yes,

  • we can. And first of all let's see what you'd do if you couldn't use any data

  • structures. So if you were clever, but you didn't use any data structures like a hash

  • table, here would be a reasonable improvement over the naive one. So the

  • first step of a better solution is to sort A upfront, For example, using word sort or

  • heap sort, something that runs in end log and time. So you may be asking about the

  • motivation for sorting. Well, again, you know, one thing is just, you know whenever

  • you're trying to do better than N squared; you might think that sorting your data

  • somehow helps. Right and you can sort of do it almost for free in N log N time.

  • Now, why would sorting the array up front help us? Well, then the clever insight is

  • that for each entry of the array a, say the first entry, now we know what we're

  • looking for to achieve this given target, right. If the target that we're trying to

  • get to is summed to 100 and the first entry in the sorted array is 43, then we

  • know we're looking for a 57 somewhere else in. This now sorted array. And we know

  • that searching a sorted array is pretty easy, right. That just binary search. That

  • just takes logarithmic time. So for each of the n array entries, we can look for a

  • complementary. Entry, namely of reach X we can look for T - X using binary search.

  • And to use binary search takes log N time. So the sorting upfront speeds up this

  • entire batch of N searches. So that's why it's a win. So, in the second step,

  • because we do a linear number of binary searches, again, this is just n, the

  • number of searches, times log-n, the time per search. So, this is just another theta

  • of N log N factor. Alright, so that's pretty cool. You, I don't think you could

  • come up with this N log N solution without having some basic, facility with

  • algorithms. This is already a really nice improvement over the naive N squared. But

  • we can do even better. It is no reason we're stuck with an N log N lower bound

  • for the [inaudible] problem. Obviously, because the array is unsorted, we have to

  • look at all the integers. So we're not gonna do better than linear time. But we

  • can do linear time via a hash table. So a good question you might ask at this point

  • is what's the clue about this problem, about this task that suggests we want to

  • use a hash table. Well, so hash tables are going to dramatically speed up any

  • application where the bulk of the word is just repeated look-ups. And if we examine

  • this n log n solution, once we have this idea of doing a search for T minus X for

  • each value of X, we realize actually, you know, the only thing we needed the sorted

  • array for was to support look-ups. That's all binary search here is doing, is just

  • looking stuff up. So we say, ah-ha. All of the work here in step two is from repeated

  • look-ups. We're paying an exorbitant relatively, logarithm per amount of time

  • per look-up, whereas hash tables can do them in cost and time. So, repeated

  • look-ups, ding, ding, ding, let's use a hash table; and indeed that's what gives

  • us linear time in this problem. So from the amazing guarantee of hash tables, we

  • get the following amazing solution for the true [inaudible] problem, although again

  • this is subject to the same fine print about you better use it properly

  • implemented hash table and you better not have pathological data. So rather than

  • sorting, you just insert everything in the array into a hash table. So insertions

  • cost time. So this is gonna be linear time rather than the end log [inaudible] we

  • were paying before. Once all the stuff is in the hash table, we just do the same

  • thing as in the n log-n solution. For each x in the array, we look for its matching

  • elements, t-x in the hash table using the cost and time look-up operation exported

  • by the hash table. And of course if for some X, you do find the matching element T

  • minus X. Then you can just report X and T minus X. That proves that there is indeed

  • a pair of integers of target sum T. If for every single element of the input array A,

  • you fail to find this matching element T minus X in the hash table. Then, for sure

  • there is no pair of integers in the input that sums to T. So this solves the problem

  • correctly. Moreover, constant time insertion, so that means this first step

  • is going to be O of end time. And constant time look-up. So that means that the

  • second step is also gonna be linear time. That leaves subjects to the caveats that

  • we discussed on the previous slide. So it's kind of amazing how many different

  • applications of computer science boil down in their essence to repeated look up

  • operations. Therefore, having a super fast look up operation, like that supported by

  • a hash table, permits these applications to scale to fantastic sizes. It's really

  • amazing, and it drives a lot of modern technology. So let me just mention a

  • couple examples. Again, if you look around or do some research on the web, you'll

  • quickly find many more. So originally what prompted researchers to think hard about

  • data structures that support super fast look ups, was back when people were first

  • building compilers. So this is a long time ago. This is in the fifties or so. And

  • these repeated look ups to figure out, you know, what has and has not been defined

  • before was, was emerging as a bottleneck in compilers. Back in the early days of

  • programming languages. And that was one of the early applications of hash tables. Was

  • to support super fast look ups to speed up compile time. To keep track of the

  • function of variable names and things like that. Hash table technology is also super

  • useful for software on routers in the Internet. So, for example, you might want

  • to block network traffic from certain sources. So, for example, maybe you

  • suspect that a certain IP address has been taken over by spammers and so any traffic

  • coming from that IP address you just want to ignore. And you don't wanna even let it

  • get to the end host, to the computer on someone's desktop, or to someone's mobile

  • device but rather inside the internet. You wanna just drop packets that are coming

  • certain, certain centers. So what is that problem boil down to? Well, you might have

  • a blacklist of IP addresses that you're refusing traffic from and then the tasks

  • faced by the router is really the look up problem. So if data packet comes in at

  • some insanely fast data rate, and when you wanna. You immediately, just look up, is

  • this in the blacklist or not, and if it is in the blacklist then you drop the packet,

  • if it?s not, then you let it go through. So a very different application is for

  • speeding up search algorithms. And when I say a search algorithm, what I'm thinking

  • about here is something like a chess playing program. So something that does

  • game tree exploration. So we've already talked a fair amount about graph search in

  • this class, but in our discussion of breadth first and depth first search, we

  • were thinking about graphs that you could basically write down. You could store them

  • in the main memory of your machine or, in the worst case, on some big cluster. So

  • maybe graphs, you know, about the size of the web graph or possibly smaller. But in

  • a context of something like a chess playing program the graph that you're

  • interested in is way, way, way bigger than the web graph. So what's the graph we care

  • about for a chess playing program? Well, the nodes of the graph are going to

  • correspond to all possible configurations of chess pieces On a chess board. So every

  • chess board that you might ever encounter in a game of chess. So that's a. Massive,

  • massive number of configurations. And you're never gonna be able to write down

  • these vertices. The edges in this graph are going to take you from one

  • configuration to another. And there gonna correspond to legal moves. So if you can

  • move a bishop from. One place to another place, and you get from one configuration

  • to another configuration, there's an edge in the graph corresponding to that move.

  • Now you can't write down this graph. So you can't implement breadth versus depth

  • versus search exactly as we discussed it before. But, you'd still like to do graph

  • exploration, right? So you'd like to have your computer program, reason about the at

  • least short term ramifications of your possible next move. So that will

  • correspond to searching through this graph. Now, how are you gonna, it's

  • remembering graphs search a really important property was you don't want to

  • do redundant work, you don't want to re-explore things you've already explored.

  • That would be silly and might lead into loops and so on. And you can't write down

  • the graph just remembering where you've been, is suddenly a non-trivial problem;

  • but what is remembering where you've been, fundamentally? Fundamentally that's a look

  • up operation. So that is exactly what hash tables are for. So to be a little more

  • concrete, you know, one where you use the hash table in, say, a chess playing

  • program, is you'd stake, take the initial configuration. You would sort of imagine

  • trying all possible moves from this configuration. And then you'd try, you'd

  • sort of have all moves from your opponent and then you'd have all your moves in

  • response. And you would always remember, as you were doing this reasoning, every

  • chessboard configuration you'd ever looked at before and you'd stick in the hash

  • table. And before you go exploring some configuration, you'd look it up in your

  • hash table to see if you've already explored it in the past. And if you have,

  • you don't bother. You've already seen it. All right. So chess playing programs

  • operate by exploring systematically as many configurations as they'd have time

  • for. You know, obviously, in a budget of three minutes or whatever you don't wanna

  • waste any work exploring any given configuration more than once. How do you

  • remember where you've been? Well everything you've explored you stick in a

  • hash table Before you explore a configuration you look it up in a hash

  • table and see if you've already done it. So these of course are just scratching the

  • surface. I just wanted to highlight a couple, you know, fairly different looking

  • applications, you know to convince you that hash tables come up all the time. And

  • the reason they come up all the time is because you know the need for fast

  • look-ups comes up all the time. It's kind of amazing how much technology is being

  • driven just by you know repeated fast look-ups. So as homework I encourage you

  • to just sort of think about you know your own life, or think about technology out

  • there in the world, and come up with some. You know, guesses about where probably

  • hash tables are making something out there running blazingly fast. I think it won't

  • take you more than a few minutes to come up with some good examples.

In this video we'll begin our discussion of hash tables; we'll focus first on the

字幕與單字

單字即點即查 點擊單字可以查詢單字解釋

B1 中級 美國腔

14 1 散列表的操作與應用 19分鐘 (14 1 Hash Tables Operations and Applications 19 min)

  • 38 6
    kaka 發佈於 2021 年 01 月 14 日
影片單字