字幕列表 影片播放 列印英文字幕 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. JULIAN SHUN: Hi, good afternoon, everyone. So today, we're going to be talking about graph optimizations. And as a reminder, on Thursday, we're going to have a guest lecture by Professor Johnson of the MIT Math Department. And he'll be talking about performance of high-level languages. So please be sure to attend the guest lecture on Thursday. So here's an outline of what I'm going to be talking about today. So we're first going to remind ourselves what a graph is. And then we're going to talk about various ways to represent a graph in memory. And then we'll talk about how to implement an efficient breadth-first search algorithm, both serially and also in parallel. And then I'll talk about how to use graph compression and graph reordering to improve the locality of graph algorithms. So first of all, what is a graph? So a graph contains vertices and edges, where vertices represent certain objects of interest, and edges between objects model relationships between the two objects. For example, you can have a social network, where the people are represented as vertices and edges between people mean that they're friends with each other. The edges in this graph don't have to be bi-directional. So you could have a one-way relationship. For example, if you're looking at the Twitter network, Alice could follow Bob, but Bob doesn't necessarily have to follow Alice back. The graph also doesn't have to be connected. So here, this graph here is connected. But, for example, there could be some people who don't like to talk to other people. And then they're just off in their own component. You can also use graphs to model protein networks, where the vertices are proteins, and edges between vertices means that there's some sort of interaction between the proteins. So this is useful in computational biology. As I said, edges can be directed, so their relationship can go one way or both ways. In this graph here, we have some directed edges and then also some edges that are directed in both directions. So here, John follows Alice. Alice follows Peter. And then Alice follows Bob, and Bob also follows Alice. If you use a graph to represent the world wide web, then the vertices would be websites, and then the edges would denote that there is a hyperlink from one website to another. And again, the edges here don't have to be bi-directional because website A could have a link to website B. But website B doesn't necessarily have to have a link back. Edges can also be weighted. So you can have a weight on the edge that denotes the strength of the relationship or some sort of distance measure corresponding to that relationship. So here, I have an example where I am using a graph to represent cities. And the edges between cities have a weight that corresponds to the distance between the two cities. And if I want to find the quickest way to get from city A to city B, then I would be interested in finding the shortest path from A to B in this graph here. Here's another example, where the edge weights now are the costs of a direct flight from city A to city B. And here the edges are directed. So, for example, this says that there's a flight from San Francisco to LA for $45. And if I want to find the cheapest way to get from one city to another city, then, again, I would try to find the shortest path in this graph from city A to city B. Vertices and edges can also have metadata on them, and they can also have types. So, for example, here's the Google Knowledge Graph, which represents all the knowledge on the internet that Google knows about. And here, the nodes have metadata on them. So, for example, the node corresponding to da Vinci is labeled with his date of birth and date of death. And the vertices also have a color corresponding to the type of knowledge that they refer to. So you can see that some of these nodes are blue, some of them are red, some of them are green, and some of them have other things on them. So in general, graphs can have types and metadata on both the vertices as well as the edges. Let's look at some more applications of graphs. So graphs are very useful for implementing queries on social networks. So here are some examples of queries that you might want to ask on a social network. So, for example, you might be interested in finding all of your friends who went to the same high school as you on Facebook. So that can be implemented using a graph algorithm. You might also be interested in finding all of the common friends you have with somebody else-- again, a graph algorithm. And a social network service might run a graph algorithm to recommend people that you might know and want to become friends with. And they might use a graph algorithm to recommend certain products that you might be interested in. So these are all examples of social network queries. And there are many other queries that you might be interested in running on a social network. And many of them can be implemented using graph algorithms. Another important application is clustering. So here, the goal is to find groups of vertices in a graph that are well-connected internally and poorly-connected externally. So in this image here, each blob of vertices of the same color corresponds to a cluster. And you can see that inside a cluster, there are a lot of edges going among the vertices. And between clusters, there are relatively fewer edges. And some applications of clustering include community detection and social networks. So here, you might be interested in finding groups of people with similar interests or hobbies. You can also use clustering to detect fraudulent websites on the internet. You can use it for clustering documents. So you would cluster documents that have similar text together. And clustering is often used for unsupervised learning and machine learning applications. Another application is connectomics. So connectomics is the study of the structure, the network structure of the brain. And here, the vertices correspond to neurons. And edges between two vertices means that there's some sort of interaction between the two neurons. And recently, there's been a lot of work on trying to do high-performance connectomics. And some of this work has been going on here at MIT by Professor Charles Leiserson and Professor Nir Shavit's research group. So recently, this has been a very hot area. Graphs are also used in computer vision-- for example, in image segmentation. So here, you want to segment your image into the distinct objects that appear in the image. And you can construct a graph by representing the pixels as vertices. And then you would place an edge between every pair of neighboring pixels with a weight that corresponds to their similarity. And then you would run some sort of minimum cost cut algorithm to partition your graph into the different objects that appear in the image. So there are many other applications. And I'm not going to have time to go through all of them today. But here's just a flavor of some of the applications of graphs. So any questions so far? OK, so next, let's look at how we can represent a graph in memory. So for the rest of this lecture, I'm going to assume that my vertices are labeled in the range from 0 to n minus 1. So they have an integer in this range. Sometimes, your graph might be given to you where the vertices are already labeled in this range, sometimes, not. But you can always get these labels by mapping each of the identifiers to a unique integer in this range. So for the rest of the lecture, I'm just going to assume that we have these labels from 0 to n minus 1 for the vertices. One way to represent a graph is to use an adjacency matrix. So this is going to be n by n matrix. And there's a 1 bit in i-th row in j-th column if there's an edge that goes from vertex I to vertex J, and 0 otherwise. Another way to represent a graph is the edgeless representation, where we just store a list of the edges that appear in the graph. So we have one pair for each edge, where the pair contains the two coordinates of that edge. So what is the space requirement for each of these two representations in terms of the number of edges m and the number of vertices n in the graph? So it should be pretty easy. Yes. AUDIENCE: n squared for the [INAUDIBLE] and m for the [INAUDIBLE]. JULIAN SHUN: Yes, so the space for the adjacency matrix is order n squared because you have n squared cells in this matrix. And you have 1 bit for each of the cells. For the edge list, it's going to be order m because you have m edges. And for each edge, you're storing a constant amount of data in the edge list. So here's another way to represent a graph. This is known as the adjacency list format. And idea here is that we're going to have an array of pointers, 1 per vertex. And each pointer points to a linked list storing the edges for that vertex. And the linked list is unordered in this example. So what's the space requirement of this representation? AUDIENCE: It's n plus m. JULIAN SHUN: Yeah, so it's going to be order n plus m. And this is because we have n pointers. And the number of entries across all of the linked lists is just equal to the number of edges in the graph, which is m. What's one potential issue with this sort of representation if you think in terms of cache performance? Does anyone see a potential performance issue here? Yeah. AUDIENCE: So it could be [INAUDIBLE].. JULIAN SHUN: Right. So the issue here is that if you're trying to loop over all of the neighbors of a vertex, you're going to have to dereference the pointer in every linked list node. Because these are not contiguous in memory. And every time you dereference linked lists node, that's going to be a random access into memory. So that can be bad for cache performance. One way you can improve cache performance is instead of using linked lists for each of these neighbor lists, you can use an array. So now you can store the neighbors just in this array, and they'll be contiguous in memory. One drawback of this approach is that it becomes more expensive if you're trying to update the graph. And we'll talk more about that later. So any questions so far? So what's another way to represent the graph that we've seen in a previous lecture? What's a more compressed or compact way to represent a graph, especially a sparse graph? So does anybody