6.3 The k-Means Algorithm (12:49)

  • Hello, everyone.

  • Welcome back to Mining of Massive Datasets.

  • We're going to continue our discussion of Clustering today,

  • by looking at the k-Means Algorithm.

  • Recall that we previously looked at another way of doing clustering,

  • called agglomerative hierarchical clustering.

  • Now, that algorithm worked very well in terms of producing clusters,

  • but unfortunately, it doesn't work very well for

  • large datasets because of its computational complexity.

  • The k-means algorithm, or the k-means family of algorithms actually,

  • is a way of addressing that problem: creating algorithms that

  • are computationally tractable and work with very large datasets.

  • Now, the k-means algorithm assumes a Euclidean space and

  • the Euclidean distance.

  • And the first step is to pick k, the number of clusters.

  • Now, for now let's assume that we pick a number k, and

  • say that's the number of clusters that we finally want.

  • Towards the end, I'll show you how to actually pick this value k.

  • For now, let's just assume that k is given.

  • Now, we're going to initialize k clusters, by picking one point per cluster.

  • For example, we could just pick k points at random and

  • assign one point to each cluster.

  • That would be one way of picking the points.

  • Later on, we'll examine much better ways of doing this.

  • Now that we have clusters populated with these k points picked at random,

  • here's how we're going to proceed.

  • We're going to go through all the points in our dataset.

  • And for each point, we're going to place it

  • in the cluster whose centroid it's closest to.

  • So, we're going to find the cluster with the closest centroid to the data point,

  • and then we'll assign that data point to that cluster.

  • Okay? And we're going to do this for

  • each data point.

  • Now, when we assign a whole bunch of new data points to a cluster,

  • the centroid of that cluster might change,

  • because of all the new data points that have been added to the cluster.

  • So, we're going to update the locations of the centroids of each of the k clusters,

  • taking into account the new data points that have been added to those clusters.

  • Now, once we do this, we'll find that the centroids

  • of the k clusters have moved, and

  • now a point that was close to one cluster may be closer to another cluster.

  • So, we need to go through and reassign all points to their closest centroid.

  • And sometimes this moves points between the clusters.

  • And notice that we may have to do steps 2 and 3 again and again.

  • And this is exactly what we're going to do:

  • we're going to repeat steps 2 and 3 until convergence.

  • And what we mean by convergence is that the k centroids don't move any further,

  • and the points don't move any further across the clusters.

  • At that point, we have a stable k-means clustering.
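
To make these steps concrete, here is a minimal sketch of the loop just described (initialize, assign, update, repeat until nothing moves), written with NumPy. The function name, the random initialization, and the empty-cluster guard are illustrative choices, not code from the lecture.

```python
import numpy as np

def k_means(X, k, rng=None):
    """Basic k-means on an (N, d) array of points in Euclidean space."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: initialize the k clusters by picking k points at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    while True:
        # Step 2: assign each point to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Convergence: stop when no point changes cluster.
        if np.array_equal(new_assignment, assignment):
            return assignment, centroids
        assignment = new_assignment
        # Step 3: update each centroid to the mean of the points now assigned to it.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:  # keep the old centroid if a cluster goes empty
                centroids[j] = members.mean(axis=0)
```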

  • So let's look at an example.

  • So, here's a set of points, and we're going to

  • say that k is equal to 2, so we're looking to find two clusters in this dataset.

  • Now, in round 1, I'm going to pick two points at random, since k equals 2, and

  • call those the centroids of two clusters.

  • Let's say I pick these two points,

  • the point that I've highlighted in pink here and

  • the point that I've highlighted in blue here, as the two centroids.

  • I've picked these totally arbitrarily and at random.

  • Now, what I'm going to do is go through all the data points, and for

  • each data point, I'm going to assign it to the centroid that it's closest to.

  • So, observe that this point here, for example, is close to the red cluster.

  • So, I'm going to assign it to the red cluster.

  • Whereas, this point here is closer to the blue cluster than to the red cluster, so

  • I'm going to assign it to the blue cluster.

  • And when I go through all the data points and do this,

  • here is the clustering that I end up with.

  • These two points here end up in the pink cluster, and

  • this set of points is closer to the blue point here, and so

  • they end up in the blue cluster.

  • So, that's round 1 of k-means clustering.

  • Now, because I've

  • created these two new clusters, I'm going to compute the new centroids.

  • And when I compute the new centroids, the centroid of

  • the pink cluster moves over a little bit to the left, to this new position here.

  • And the centroid of the blue cluster moves to that new position.

  • Now, because I have new centroids, in round 2 I'm

  • going to go through each data point again and assign it to its closest centroid.

  • And when I do that, observe that this point here, for example,

  • is now closer to the pink cluster than it is to the blue cluster.

  • And so it moves from the blue cluster to the pink cluster.

  • Okay?

  • And so we have this new clustering at the end of round 2.

  • Now that we have the new clustering, I'm going to recompute the centroids once

  • again, and when I recompute the centroids, the centroids migrate further.

  • And because the centroids have migrated further,

  • I have to recompute the assignment of points to the centroids again.

  • And when I do that, these two points here

  • are closer to the pink cluster, so they move to the pink cluster, and

  • we end up with the clustering shown here at the end of round 3.

  • Okay?

  • Now, as it turns out, in this case, Round 3 is the last round.

  • At this point, the centroids and

  • the points don't migrate any further, and we have a stable clustering.

  • So, what we have here is the k-means clustering for

  • k equal to 2 for this dataset.

  • Observe that I started from a completely random assignment of

  • centroids and gradually migrated the centroids to where they finally ended up.

  • Now, the important question here is, how do we pick the right value of k?

  • In the previous example, we arbitrarily picked k equal to 2, and

  • that actually turned out to be the right value for that dataset, because you

  • can see that there are roughly two different clusters of data points here.

  • But in general, how do we know what the right value of k is,

  • so that we can pick the number of clusters up front?

  • So, since we don't know the right value of k,

  • the obvious answer is to try different values of k

  • and see what looks good.

  • Now we have to decide:

  • what do we mean by "looks good"?

  • One obvious answer is to look at the average distance of points from

  • the centroid as k increases.

  • Right? And

  • let's look at an example to see what we mean here.

  • So here's an example with a bunch of data points, and

  • we've picked k equal to 2.

  • Because I picked k equal to 2, we have two clusters.

  • And you can see that that's the centroid of the first cluster.

  • And you can see, there's a lot of points that are very far away

  • from the centroid in this case.

  • Right?

  • Now, if k is too small, as in this case, then there are going to be lots of

  • points that are a large distance away from the centroid.

  • So, the average distance from the centroid is going to be fairly large.

  • Now, suppose I make k equal to 3.

  • As it turns out, this is the right value for

  • this dataset, and you can see that the distances to the

  • centroid shrink quite a lot when I go from k equal to 2 to k equal to 3.

  • That splits the top cluster into two, and that shrinks

  • the average distance to the centroid significantly.

  • Right?

  • But if I increase k further,

  • let's say I make k equal to 4, that splits the cluster further.

  • Now, that does shrink the average distance to the centroid, but

  • not as much as when we went from k equal to 2 to k equal to 3.

  • Right? So, when we make k too large,

  • it does decrease the average distance to the centroid,

  • but not by as much.

  • Now, we can see this very clearly if I plot a graph.

  • Right, if I plot a graph, where the x axis is k, the number of clusters.

  • And the y axis is the average distance to the centroid.

  • You can see that as k increases,

  • the average distance to the centroid keeps falling.

  • But at some point, there's a knee in the curve, and beyond it

  • the average distance to the centroid falls very slowly.

  • The obvious thing to do is to pick a value of k

  • that's close to the knee of the curve, where you get

  • a fairly low average distance to the centroid

  • without too high a value of k.

  • So, in this case, the best value of k is the one that's

  • shown in the picture.
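
One way to see the knee numerically is to run k-means for a range of k values and print the average distance to the centroid. This is a rough sketch: the synthetic dataset is a made-up example, and it reuses the hypothetical k_means function sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: three Gaussian blobs in 2-D (purely illustrative).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

def average_centroid_distance(X, assignment, centroids):
    """Mean Euclidean distance from each point to its assigned centroid."""
    return np.linalg.norm(X - centroids[assignment], axis=1).mean()

# Try a range of k values; the knee of this curve suggests a good k (here, around 3).
for k in range(1, 8):
    assignment, centroids = k_means(X, k)  # the k_means sketch shown earlier
    print(k, round(average_centroid_distance(X, assignment, centroids), 3))
```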

  • The final question we need to address with k-means clustering is how to pick

  • the k initial points to initialize the k clusters.

  • In the examples that we've seen so far, we've picked

  • the initial k points completely at random.

  • Now, that worked out fine in this example, but in general it might not work

  • quite so well.

  • We might, for example, pick k points that all happen to be in the same cluster,

  • in which case the final clustering won't reflect the actual clustering of the data.

  • Or we might pick points that are outliers and

  • are not near any of the real clusters.

  • So, the final clustering depends on the initial k points that we pick,

  • and so it's important that we pick the right

  • k points to start the clustering from.

  • The first approach to picking the initial k points is sampling.

  • So remember that the dataset is very large, and so we

  • can't actually run a complicated algorithm like hierarchical clustering on it.

  • But what we can do is take a smaller sample of the data.

  • And then, on that sample, we can run another algorithm like

  • agglomerative hierarchical clustering, which we covered in the last lecture.

  • And we can run that algorithm until we obtain k clusters.

  • And then we can pick a point from each of the k clusters.

  • For example, we could pick the point that's closest to the centroid for

  • each cluster.

  • And we can call those our initial k centroids.
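
A rough sketch of this sampling-based initialization, assuming SciPy's agglomerative (hierarchical) clustering is available; the sample size and the average-linkage choice are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def init_by_sampling(X, k, sample_size=1000, rng=None):
    """Pick k initial centroids by hierarchically clustering a small sample of X."""
    if rng is None:
        rng = np.random.default_rng(0)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # Run agglomerative hierarchical clustering on the sample until k clusters remain.
    labels = fcluster(linkage(sample, method='average'), t=k, criterion='maxclust')
    inits = []
    for c in range(1, k + 1):
        members = sample[labels == c]
        if len(members) == 0:  # rare: the sample yielded fewer than k clusters
            continue
        # From each cluster, pick the point closest to that cluster's centroid.
        centroid = members.mean(axis=0)
        inits.append(members[np.argmin(np.linalg.norm(members - centroid, axis=1))])
    return np.array(inits)
```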

  • Another approach is not to resort to another clustering algorithm, but

  • to just pick a dispersed set of points:

  • points that are as far away from each other

  • in the dataset as possible.

  • So, one approach to this is to pick the first point entirely at random.

  • Now, the second point we pick to

  • be the point that's as far away from the first point as possible in the dataset.

  • The third point, we pick to be the point that is as far away from both

  • point one and point two as possible.

  • In general,

  • once we've picked the first few points, we pick the next point to be the one

  • whose minimum distance from the already selected points is as large as possible.

  • And then, we repeat this, until we pick k points.

  • So, this ensures that the k points that we've picked are as far apart, or

  • as dispersed, as possible in the dataset.

  • And that gives us a better chance of finding the right clustering.
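
Here is a minimal sketch of that dispersed, farthest-point style of initialization; the function name and details are illustrative assumptions, not the lecture's code. The result can be passed in as the initial centroids for a k-means loop like the one sketched earlier.

```python
import numpy as np

def init_dispersed(X, k, rng=None):
    """Pick k initial points that are as far apart from each other as possible."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pick the first point entirely at random.
    chosen = [X[rng.integers(len(X))]]
    # Track each point's distance to its nearest already-chosen point.
    min_dist = np.linalg.norm(X - chosen[0], axis=1)
    for _ in range(k - 1):
        # Pick the point whose minimum distance to the chosen points is largest.
        nxt = X[np.argmax(min_dist)]
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - nxt, axis=1))
    return np.array(chosen)
```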

  • Let's consider the Complexity of k-means clustering.

  • Now, in each round, we have to examine each input point exactly

  • once, to find the closest centroid to that point and assign it to that centroid.

  • Now, since there are N points and there are k centroids,

  • we have to compute the distance of each point from each centroid, so

  • the algorithm is order k times N, for N points and k clusters.

  • So each round has complexity k times N.

  • Now, this is actually not bad, because when N is really large and

  • k is fairly small, we have an algorithm that is linear in N.

  • Each round is actually linear in N.

  • But the real problem is that the number of rounds to convergence can be really,

  • really large.

  • There is actually no theoretical limit on the number of rounds it takes for

  • the algorithm to converge.

  • And the number of rounds can be really, really large.

  • So, the algorithm could actually spend

  • a really long time getting to convergence.

  • So the question is, if the dataset is really,

  • really large and we don't want to go through this large number of rounds,

  • can we actually do something like k-means clustering in a single pass over the data?

  • Because we have a really large dataset,

  • and we don't want to scan it multiple times.

  • We're going to answer this question in the next segment.
