K-Means Clustering: Pros and Cons of K-Means Clustering


  • In cluster analysis, that’s how a cluster would look in two-dimensional space.

  • There are two dimensions, or two features based on which we are performing clustering.

  • For instance, ‘Age’ and ‘Money spent’.

  • Certainly, it makes no sense to have only one cluster, so let me zoom out of this graph.

  • Here’s a nice picture of clusters.

  • We can clearly see two clusters.

  • I’ll also indicate their centroids.

  • If we want to identify three clusters, this is the result we obtain.

  • And that’s more or less how clustering works graphically.

  • Okay.

  • How do we perform clustering in practice?

  • There are different methods we can apply to identify clusters.

  • The most popular one is k-means, so that’s where we will start.

  • Let’s simplify this scatter to 15 points, so we can get a better grasp of what happens.

  • Cool.

  • Here’s how k-means works.

  • First, we must choose how many clusters we’d like to have.

  • That’s where this method gets its name from.

  • K stands for the number of clusters we are trying to identify.

  • I’ll start with two clusters.

  • The next step is to specify the cluster seeds.

  • A seed is basically a starting centroid.

  • It is chosen at random or is specified by the data scientist based on prior knowledge

  • about the data.

  • One of the clusters will be the green cluster, the other one the orange cluster.

  • And these are the seeds.

  • The following step is to assign each point on the graph to a seed.

  • Which is done based on proximity.

  • For instance, this point is closer to the green seed than to the orange one.

  • Therefore, it will belong to the green cluster.

  • This point, on the other hand, is closer to the orange seed, therefore, it will be a part

  • of the orange cluster.

  • In this way, we can color all points on the graph, based on their Euclidean distance from

  • the seeds.

  • Great!

  • The final step is to calculate the centroid of the green points and the orange points.

  • The green seed will move closer to the green points to become their centroid and the orange

  • will do the same for the orange points.

  • From here, we would repeat the last two steps.

  • Let’s recalculate the distances.

  • All the green points are obviously closer to the green centroid, and the orange points

  • are closer to the orange centroid.

  • What about these two?

  • Both of them are closer to the green centroid, so at this step we will reassign them to the

  • green cluster.

  • Finally, we must recalculate the centroids.

  • That’s the new result.

  • Now, all the green points are closest to the green centroid and all the orange ones to

  • the orange.

  • We can no longer reassign points, which completes the clustering process.

  • This is the two-cluster solution.
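
To make those steps concrete, here is a minimal NumPy sketch of the procedure just described; the fifteen sample points, the function name, and the fixed number of iterations are all made up for illustration:

```python
import numpy as np

def k_means(points, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points at random as the initial seeds (starting centroids)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (assumes no cluster ends up empty, which holds for this toy data)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# 15 made-up two-dimensional points, forming two loose groups
pts = np.array([[1, 2], [1.5, 1.8], [2, 2.2], [1, 1], [2, 1],
                [8, 8], [8.5, 9], [9, 8.5], [8, 9], [9, 9],
                [1.2, 2.5], [8.7, 8.2], [1.8, 1.2], [9.2, 8.8], [2.2, 2.0]])
labels, centroids = k_means(pts, k=2)
print(labels)
print(centroids)
```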

  • Alright.

  • So that’s the whole idea behind clustering.

  • In order to solidify your understanding, we will redo the process.

  • In the beginning we said that with k-means clustering, we must specify the number of

  • clusters prior to clustering, right?

  • What if we want to obtain 3 clusters?

  • The first step involves selecting the seeds.

  • Let’s have another seed.

  • We’ll use red for this one.

  • Next, we must associate each of the points with the closest seed.

  • Finally, we calculate the centroids of the colored points.

  • We already know that k-means is an iterative process.

  • So, we go back to the step where we associate each of the points with the closest seed.

  • All orange points are settled, so no movement there.

  • What about these two points?

  • Now they are closer to the red seed, so they will go into the red cluster.

  • That’s the only change in the whole graph.

  • In the end, we recalculate the centroids and reach a situation where no more adjustments

  • are necessary using the k-means algorithm.

  • We have reached a three-cluster solution.

  • This is the exact algorithm that was used to find the solution to the problem you saw

  • at the beginning of the lesson.

  • Here’s a Python-generated graph, with the three clusters colored.

  • I am sorry they are not the same, but you get the point.

  • That’s how we would usually represent the clusters graphically.

  • Great!

  • I think we have a good basis to start coding!

  • We are going to cluster these countries using k-means in Python.

  • Plus, we’ll learn a couple of nice tricks along the way.

  • Cool.

  • Let’s import the relevant libraries.

  • They are pandas, NumPy, matplotlib, dot, pyplot, and seaborn.

  • As usual, I will set the style of all graphs to the Seaborn one.

  • In this course, we will rely on scikit-learn for the actual clustering.

  • Let’s import KMeans from sklearn, dot, cluster.

  • Note that both the ‘K’ and the ‘M’ in k-means are capital.
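
In code, the imports described above would look roughly like this (sns.set() is one common way to apply the seaborn style; the exact call in the lecture may differ):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply the seaborn style to all matplotlib graphs

# the clustering itself will come from scikit-learn
from sklearn.cluster import KMeans
```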

  • Next, we will create a variable called ‘data’, where we will load the CSV file: ‘3.01.

  • Country clusters’.

  • Let’s see what’s inside.

  • We’ve got Country, Latitude, Longitude, and Language.
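
A sketch of that step, assuming the file carries a .csv extension and sits next to the notebook:

```python
data = pd.read_csv('3.01. Country clusters.csv')
data  # shows Country, Latitude, Longitude and Language for the six countries
```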

  • Let’s see how we gathered that data.

  • Country and language are clear.

  • What about the latitude and longitude values?

  • These entries correspond to the geographic centers of the countries in our dataset.

  • That is one way to represent location.

  • I’ll quickly give an example.

  • If you Google: ‘geographic center of US’, you’ll get a Wikipedia article, indicating

  • it to be some point in South Dakota with a latitude of 44 degrees and 58 minutes North,

  • and a longitude of 103 degrees and 46 minutes West.

  • Then we can convert them to ‘decimal degrees’ using some online converter like the one provided

  • by latlong.net.

  • It’s important to know that the convention is such that North and East are positive,

  • while West and South are negative.
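
As a quick worked example of that conversion and sign convention, a small helper (written here for illustration, not part of the lecture) gives roughly 44.97 and -103.77 for the point above:

```python
def dms_to_decimal(degrees, minutes, direction):
    """Convert degrees and minutes to decimal degrees.
    North and East are positive, South and West are negative."""
    value = degrees + minutes / 60
    return -value if direction in ('S', 'W') else value

print(dms_to_decimal(44, 58, 'N'))   # ~44.97  (latitude)
print(dms_to_decimal(103, 46, 'W'))  # ~-103.77 (longitude)
```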

  • Okay.

  • So that’s what we did.

  • We got the decimal degrees of the geographic centers of the countries in the sample.

  • That’s not optimal as the choice of South Dakota was biased by Alaska and Hawaii, but

  • you’ll see that it won’t matter too much for the clustering.

  • Right.

  • Let’s quickly plot the data.

  • If we want our data to resemble a map, we must set the axes to reflect the natural domain

  • of latitude and longitude.

  • Done.
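
A sketch of that plot, with longitude on the x-axis, latitude on the y-axis, and the axis limits set to their natural domains:

```python
plt.scatter(data['Longitude'], data['Latitude'])
plt.xlim(-180, 180)  # longitude spans -180 to 180 degrees
plt.ylim(-90, 90)    # latitude spans -90 to 90 degrees
plt.show()
```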

  • If I put the actual map next to this one, you will quickly notice that this methodology,

  • while simple, is not bad at all.

  • Alright.

  • Let’s do some clustering.

  • As we did earlier, our inputs will be contained in a variable called ‘x’.

  • We will start by clustering based on location.

  • So, we want ‘X’ to contain the latitude and the longitude.

  • I’ll use the pandas method ‘iloc’.

  • We haven’t mentioned it before and you probably don’t know that, but ‘iloc’ is a method

  • which slices a data frame.

  • The first argument indicates the row indices we want to keep, while the second indicates the

  • column indices.

  • I want to keep all rows, so I’ll put a colon as the first argument.

  • Okay.

  • Remember that pandas indices start from 0.

  • From the columns, I need ‘Latitude’ and ‘Longitude’, or columns 1 and 2.

  • So, the appropriate argument is: 1, colon, 3.

  • This will slice the 1st and the 2nd columns out of the data frame.

  • Let’s print x, to see the result.

  • Exactly as we wanted it.
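
In code, the slice described above looks like this:

```python
x = data.iloc[:, 1:3]  # all rows; columns 1 and 2, i.e. 'Latitude' and 'Longitude'
x
```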

  • Next, I’ll declare a variable called ‘kmeans’.

  • ‘kmeans’ is equal to capital ‘K’, capital ‘M’, and lowercase ‘eans’, brackets,

  • 2.

  • The right side is actually the KMeans class that we imported from sklearn.

  • The value in brackets is the number of clusters we want to produce.

  • So, our variable ‘kmeans’ is now an object which we will use for the clustering itself.

  • Similar to what we’ve seen with regressions, the clustering itself happens using the ‘fit’

  • method.

  • K-means, dot, fit, of x.

  • That’s all we need to write.

  • This line of code will apply k-means clustering with 2 clusters to the input data from X.

  • The output indicates that the clustering has been completed with the following parameters.
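
Those two lines, as described:

```python
kmeans = KMeans(2)  # a KMeans object configured to look for two clusters
kmeans.fit(x)       # perform the clustering on the latitude/longitude inputs
```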

  • Usually though, we don’t need to just perform the clustering but are interested in the clusters

  • themselves.

  • We can obtain the predicted clusters for each observation using the ‘fit_predict’ method.

  • Let’s declare a new variable called: ‘identified clusters’, equal to: kmeans, dot, fit predict,

  • with input, X. I’ll also print this variable.

  • The result is an array containing the predicted clusters.

  • There are two clusters indicated by 0 and 1.

  • You can clearly see that the first five observations are in the same cluster, zero, while the last

  • one is in cluster one.
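
A sketch of that step (the exact label values can differ from run to run, but the grouping is the same):

```python
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)  # one cluster label per country, e.g. [0 0 0 0 0 1]
```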

  • Okay.

  • Let’s create a data frame so we can see things more clearly.

  • I’ll call this data frame ‘data with clusters’ and it will be equal to ‘data’.

  • Then I’ll add an additional column to it called ‘Cluster’, equal to ‘identified

  • clusters’.

  • As you can see, we have our table with the countries, latitude, longitude, language,

  • but also cluster.

  • It seems that the USA, Canada, France, UK, and Germany are in cluster zero, while Australia

  • is alone in cluster one.
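
A sketch of how that data frame is built; the .copy() call is my own precaution so the original frame stays untouched, while the lecture simply starts from ‘data’:

```python
data_with_clusters = data.copy()                     # work on a copy of the original frame
data_with_clusters['Cluster'] = identified_clusters  # add the predicted cluster for each country
data_with_clusters
```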

  • Cool!

  • Finally, let’s plot all this on a scatter plot.

  • In order to resemble the map of the world, the x-axis will be the longitude, while the

  • y-axis the latitude.

  • But that’s the same graph as before, isn’t it?

  • Let’s use the first trick.

  • In matplotlib, we can set the color to be determined by a variable.

  • In our case that will be ‘Cluster’.

  • Let’s write: ‘c’ equals ‘data with clusters’ of ‘Cluster’.

  • We have just indicated that we want to have as many colors for the points as there are

  • clusters.

  • The default color map is not so pretty, so I’ll set the color map to ‘rainbow’.

  • ‘cmap’ equals ‘rainbow’.

  • Okay.

  • We can see the two clusters: one is purple and the other is red.
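
Putting that plot together, roughly:

```python
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'],
            c=data_with_clusters['Cluster'],  # color each point by its cluster label
            cmap='rainbow')
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()
```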

  • And that’s how we perform k-means clustering.

  • What if we wanted to have three clusters?

  • Well, we can go back to the line where we specified the desired number of clusters,

  • and change that to 3.

  • Let’s run all cells.

  • There are three clusters, as wanted: 0, 1, and 2.

  • From the data frame we can see that USA and Canada are in the same cluster, France, UK

  • and Germany in another, and Australia is alone once again.

  • What about the visualization?

  • There are three colors, representing the three different clusters.

  • Great work!

  • It seems that clustering is not that hard after all!

  • Let’s continue the problem from the last lecture.

  • As you can see we had one other piece of information that we did not use: language.

  • In order to make use of it, we must first encode it in some way.

  • The simplest way to do that is by using numbers.

  • I’ll create a new variable called ‘data mapped’, equal to data, dot, copy.

  • Next, I’ll map the languages using the usual method.

  • Data mapped, ‘Language’, equals, data mapped, ‘Language’, dot, map.

  • And I’ll set English to 0, French to 1, and German to 2.

  • Note that this is not the optimal way to encode them, but it will work for now.
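
That mapping step, in code:

```python
data_mapped = data.copy()
data_mapped['Language'] = data_mapped['Language'].map({'English': 0, 'French': 1, 'German': 2})
data_mapped
```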

  • Cool.

  • Next, let’s choose the features that we want to use for clustering.

  • Did you know that we can use a single feature?

  • Well, we certainly can.

  • Let x be equal to: ‘data mapped’, dot, ‘iloc’, colon, comma, 3, colon, 4.

  • I am basically slicing all rows, but keeping only the last column.

  • What we are left with is this.
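
The single-feature slice, in code:

```python
x = data_mapped.iloc[:, 3:4]  # all rows, column 3 only ('Language'), kept as a one-column DataFrame
```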

  • Now we can perform clustering.

  • I have the same code ready, so I’ll just use it.

  • We are running kmeans clustering with 3 clusters.

  • Run, Run, Run, Run, and we are done.

  • The plot is unequivocal: the three clusters are USA, Canada, UK and Australia in the

  • first one, France in the second, and Germany in the third.

  • That’s precisely what we expected, right?

  • English, French, and German.

  • Great!

  • By the way, we are still using the longitude and latitude as axes of the plot.

  • Unlike regression, when doing clustering, you can plot the data as you wish.

  • The cluster information is contained in the ‘Cluster’ column in the data frame and

  • is the color of the points on the plot.

  • Can we use both numerical and categorical data in clustering?

  • Sure!

  • Let’s go back to our input data, X, and take the last 3 Series instead of just one.
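
A sketch of that slice, taking the last three columns:

```python
x = data_mapped.iloc[:, 1:4]  # 'Latitude', 'Longitude' and the encoded 'Language'
```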

  • Run, Run, Run, Run.

  • Okay.

  • This time the three clusters turned out to be based simply on geographical location,

  • instead of language and location.

  • We’ve been juggling with the number of clusters for too long.

  • Isn’t there a criterion for setting the proper number of clusters?

  • Luckily for us, there is!

  • Probably the most widely adopted criterion is the so-called ‘Elbow method’.

  • What’s the rationale behind it?

  • Well, remember that clustering was about minimizing the distance between points in a cluster,

  • and maximizing the distance between clusters?

  • Well, it turns out that for k-means, these two occur simultaneously.

  • If we minimize the distance between points in a cluster, we are automatically maximizing

  • the distance between clusters.

  • One less thing to worry about.

  • Now, ‘distance between points in a cluster’ sounds clumsy, doesn’t it?

  • That distance is measured as a sum of squares, and the academic term is: ‘within-cluster

  • sum of squares’, or WCSS.

  • Not much better, but at least the abbreviation is nice.
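
Spelled out (a standard formulation, not written explicitly in the lecture), WCSS sums the squared Euclidean distances from every point to the centroid of its own cluster:

```latex
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Here C_j is the j-th cluster and mu_j is its centroid.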

  • Okay.

  • Similar to SST, SSR, and SSE from regressions, WCSS is a measure developed within the ANOVA

  • framework.

  • If we minimize WCSS, we have reached the perfect clustering solution.

  • Here’s the problem.

  • If we have the same 6 countries, and each one of them is a different cluster, so a total

  • of 6 clusters, then WCSS is 0.

  • That’s because there is just one point in each cluster, or NO within-cluster sum of

  • squares.

  • Furthermore, the clusters are as far apart as they can possibly be.

  • Imagine this with 1 million observations.

  • A 1 million-cluster solution is definitely of no use.

  • Similarly, if all observations are in the same cluster, the solution is useless and

  • WCSS is at its maximum.

  • There must be some middle ground

  • Applying some common sense, we easily reach the conclusion, that we don’t really want

  • WCSS to be minimized.

  • Instead, we want it to be as low as possible, while we can still have a small number of

  • clusters, so we can interpret them.

  • Alright.

  • If we plot WCSS against the number of clusters we get this pretty graph.

  • It looks like an elbow, hence the name.

  • The point is that the within-cluster sum of squares is a monotonically decreasing function,

  • which is lower for a bigger number of clusters.

  • Here’s the big revelation.

  • In the beginning WCSS is declining extremely fast.

  • At some point, it reaches ‘the elbow’.

  • Afterwards, we are not reaching a much better solution in terms of WCSS by increasing the

  • number of clusters.

  • For our case, we say that the optimal number of clusters is 3, as this is the elbow.

  • That’s the biggest number of clusters for which we are still getting a significant decrease

  • in WCSS.

  • Hereafter, there is almost no improvement.

  • Cool.

  • How can we put that to use?

  • We need two pieces of information: the number of clusters, K, and the WCSS for a

  • specific number of clusters.

  • K is set by us at the beginning of the process, while there is an sklearn attribute that gives

  • us the WCSS.

  • For instance, to get the WCSS for our last example, we just write:

  • Kmeans, dot, inertia, underscore.
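
In other words, after fitting, the WCSS is available as an attribute of the fitted object:

```python
kmeans.inertia_  # the WCSS of the most recently fitted k-means solution
```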

  • To plot the elbow, we actually need to solve the problem with 1,2,3, and so on, clusters

  • and calculate WCSS for each of them.

  • Let’s do that with a loop.

  • First, I’ll declare an empty list called WCSS.

  • For i in range 1 to 7, as we have a total of 6 observations, colon.

  • kmeans equals KMeans, with capital ‘K’ and ‘M’, of i.

  • Next, I want to fit the input data X, using k-means, so: kmeans, dot, fit, X.

  • Then, we will calculate the WCSS for the iteration using the ‘inertia_’ attribute.

  • Let WCSS underscore ‘iter’ be equal to kmeans, dot, inertia.

  • Finally, we will add the WCSS for the iteration to the WCSS list.

  • A handy method to do that is append.

  • If you are not familiar with it, just take the list, dot, append, and in brackets you

  • can include the value you’d like to append to the list.

  • So, WCSS, dot, append, brackets, WCSS iter.
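
The full loop, as described above:

```python
wcss = []
for i in range(1, 7):            # 1 through 6 clusters; we only have 6 observations
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_  # within-cluster sum of squares for this solution
    wcss.append(wcss_iter)
wcss
```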

  • Cool, let’s run the code.

  • WCSS should be a list which contains the within-cluster sum of squares for 1 cluster, 2 clusters,

  • and so on until 6.

  • As you can see, the sequence is decreasing with very big leaps in the first 2 steps,

  • and much smaller ones later on.

  • Finally, when each point is a separate cluster, we have a WCSS equal to 0.

  • Let’s plot that.

  • We have WCSS, so let’s declare a variable called ‘number clusters’ which is also

  • a list from 1 to 6.

  • ‘number clusters’ equals range, 1, 7.

  • Cool.

  • Then, using some conventional plotting code, we get the graph.
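
A sketch of that ‘conventional plotting code’; the title and axis labels are my own choice:

```python
number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()
```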

  • Finally, we will use the ‘Elbow method’ to decide the optimal number of clusters.

  • There are two points which can be the elbow.

  • This one and that one.

  • A 3-cluster solution is definitely the better one, as after it there is not much to gain.

  • A 2-cluster solution in this case, would be suboptimal as the leap from 2 to 3 is very

  • big in terms of WCSS.

  • It wouldn’t be data science if there wasn’t this very important topic: problems with,

  • issues with, or limitations of X.

  • Well, let’s look at the pros and cons of k-means clustering.

  • The pros are already known to you even if you don’t realize it.

  • It is simple to understand and fast to cluster.

  • Moreover, there are many packages that offer it, so implementation is effortless.

  • Finally, it always yields a result.

  • No matter the data, it will always spit out a solution, which is great!

  • Time for the cons.

  • We will dig a bit into them as they are very interesting to explore.

  • Moreover, this lecture will solidify your understanding like no other.

  • The first con is that we need to pick K. As we already saw, the Elbow method fixes that

  • but it is not extremely scientific, per se.

  • Second, k-means is sensitive to initialization.

  • That’s a very interesting problem.

  • Say that these are our points.

  • If we ‘randomly’ choose the centroids here and here, the obvious solution is one

  • top cluster and one bottom cluster.

  • However, clustering the points on the left in one cluster and those on the right in another

  • is a more appropriate solution.

  • Now imagine the same situation but with much more widely spread points.

  • Guess what?

  • Given the same initial seeds, we get the same clusters, because that’s how k-means works.

  • It takes the closest points to the seeds.

  • So, if your initial seeds are problematic, the whole solution is meaningless.

  • The remedy is simple.

  • It is called k-means plus plus, or k-means++.

  • The idea is that a preliminary iterative algorithm is run prior to k-means, to determine the most

  • appropriate seeds for the clustering itself.

  • If we go back to our code, we will see that sklearn employs k-means++ by default.

  • So, we are safe here, but if you are using a different package, remember that initialization

  • matters.
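
For reference, this is controlled by the init parameter of sklearn’s KMeans, which already defaults to k-means++:

```python
kmeans = KMeans(n_clusters=3, init='k-means++')  # 'k-means++' is the default initialization in sklearn
```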

  • A third major problem is that k-means is sensitive to outliers.

  • What does this mean?

  • Well, if there is a single point that is too far away from the rest, it will always be

  • placed in its own one-point cluster.

  • Have we already experienced that?

  • Well, of course we have.

  • Australia was a cluster of its own in almost all the solutions we had for our country clusters

  • example.

  • It is so far away from the others that it is destined to be its own cluster.

  • The remedy?

  • Just get rid of outliers prior to clustering.

  • Alternatively, if you do the clustering and spot one-point clusters, remove them and cluster

  • again.

  • A fourth con.

  • K-means produces spherical solutions.

  • This means that, on the 2D plane we have seen, we would more often see clusters that

  • look like circles, rather than elliptic shapes.

  • The reason for that is that we are using Euclidean distance from the centroid.

  • This is also why outliers are such a big issue for k-means.

  • Finally, we have standardization.

