In cluster analysis, that's how a cluster would look in two-dimensional space. There are two dimensions, or two features, based on which we are performing clustering, for instance 'Age' and 'Money spent'. Certainly, it makes no sense to have only one cluster, so let me zoom out of this graph. Here's a nice picture of clusters. We can clearly see two clusters. I'll also indicate their centroids. If we want to identify three clusters, this is the result we obtain. And that's more or less how clustering works graphically.

Okay. How do we perform clustering in practice? There are different methods we can apply to identify clusters. The most popular one is k-means, so that's where we will start. Let's simplify this scatter to 15 points, so we can get a better grasp of what happens.

Cool. Here's how k-means works. First, we must choose how many clusters we'd like to have. That's where this method gets its name from: K stands for the number of clusters we are trying to identify. I'll start with two clusters. The next step is to specify the cluster seeds. A seed is basically a starting centroid. It is chosen at random or is specified by the data scientist based on prior knowledge about the data. One of the clusters will be the green cluster, the other one the orange cluster. And these are the seeds. The following step is to assign each point on the graph to a seed, which is done based on proximity. For instance, this point is closer to the green seed than to the orange one, therefore it will belong to the green cluster. This point, on the other hand, is closer to the orange seed, so it will be a part of the orange cluster. In this way, we can color all points on the graph based on their Euclidean distance from the seeds. Great! The final step is to calculate the centroid of the green points and the orange points. The green seed will move closer to the green points to become their centroid, and the orange seed will do the same for the orange points.

From here, we would repeat the last two steps. Let's recalculate the distances. All the green points are obviously closer to the green centroid, and the orange points are closer to the orange centroid. What about these two? Both of them are closer to the green centroid, so at this step we will reassign them to the green cluster. Finally, we must recalculate the centroids. That's the new result. Now, all the green points are closest to the green centroid and all the orange ones to the orange centroid. We can no longer reassign points, which completes the clustering process. This is the two-cluster solution.

Alright. So, that's the whole idea behind k-means clustering. In order to solidify your understanding, we will redo the process. In the beginning, we said that with k-means clustering we must specify the number of clusters prior to clustering, right? What if we want to obtain 3 clusters? The first step involves selecting the seeds. Let's have another seed; we'll use red for this one. Next, we must associate each of the points with the closest seed. Finally, we calculate the centroids of the colored points. We already know that k-means is an iterative process, so we go back to the step where we associate each of the points with the closest seed. All orange points are settled, so no movement there. What about these two points? Now they are closer to the red seed, so they will go into the red cluster. That's the only change in the whole graph.
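To make the assign-and-recompute loop concrete, here is a minimal NumPy sketch of the procedure just described. The sample points and the two starting seeds are made up for illustration, and the sketch assumes every seed keeps at least one assigned point in every iteration.

```python
import numpy as np

# Hypothetical 2D points ('Age', 'Money spent') and two starting seeds
points = np.array([[25, 40], [27, 45], [30, 50],
                   [60, 10], [62, 15], [65, 12]], dtype=float)
seeds = np.array([[26.0, 42.0], [63.0, 12.0]])    # one seed per cluster (K = 2)

for _ in range(10):                                # repeat the two steps
    # Step 1: assign each point to the closest seed (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 2: move each seed to the centroid of the points assigned to it
    new_seeds = np.array([points[labels == k].mean(axis=0)
                          for k in range(len(seeds))])

    if np.allclose(new_seeds, seeds):              # nothing moved, so we are done
        break
    seeds = new_seeds

print(labels)   # cluster index (0 or 1) for each point
print(seeds)    # final centroids
```

Scikit-learn, which the rest of the lesson relies on, wraps this kind of loop for us, so we will normally never write it by hand.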
In the end, we recalculate the centroids and reach a situation where no more adjustments are necessary using the k-means algorithm. We have reached a three-cluster solution. This is the exact algorithm which was used to find the solution of the problem you saw at the beginning of the lesson. Here's a Python generated graph with the three clusters colored. I am sorry the colors are not the same, but you get the point. That's how we would usually represent the clusters graphically. Great! I think we have a good basis to start coding!

We are going to cluster these countries using k-means in Python. Plus, we'll learn a couple of nice tricks along the way. Cool. Let's import the relevant libraries. They are pandas, NumPy, matplotlib, dot, pyplot, and seaborn. As usual, I will set the style of all graphs to the seaborn one. In this course, we will rely on scikit-learn for the actual clustering. Let's import KMeans from sklearn, dot, cluster. Note that both the 'K' and the 'M' in KMeans are capital. Next, we will create a variable called 'data', where we will load the CSV file: '3.01. Country clusters'. Let's see what's inside. We've got Country, Latitude, Longitude, and Language.

Let's see how we gathered that data. Country and language are clear. What about the latitude and longitude values? These entries correspond to the geographic centers of the countries in our dataset. That is one way to represent location. I'll quickly give an example. If you Google 'geographic center of US', you'll get a Wikipedia article indicating it to be some point in South Dakota, with a latitude of 44 degrees and 58 minutes North and a longitude of 103 degrees and 46 minutes West. Then we can convert them to decimal degrees using some online converter, like the one provided by latlong.net. It's important to know that the convention is such that North and East are positive, while West and South are negative. Okay. So that's what we did. We got the decimal degrees of the geographic centers of the countries in the sample. That's not optimal, as the US center ends up in South Dakota only because of Alaska and Hawaii, but you'll see that won't matter too much for the clustering.

Right. Let's quickly plot the data. If we want our data to resemble a map, we must set the axes to reflect the natural domain of latitude and longitude. Done. If I put the actual map next to this one, you will quickly notice that this methodology, while simple, is not bad at all.

Alright. Let's do some clustering. As we did earlier, our inputs will be contained in a variable called 'x'. We will start by clustering based on location. So, we want 'x' to contain the latitude and the longitude. I'll use the pandas method 'iloc'. We haven't mentioned it before and you probably don't know it, but 'iloc' is a method which slices a data frame. The first argument indicates the row indices we want to keep, while the second indicates the column indices. I want to keep all rows, so I'll put a colon as the first argument. Okay. Remember that pandas indices start from 0. From the columns, I need 'Latitude' and 'Longitude', or columns 1 and 2. So, the appropriate second argument is: 1, colon, 3. This will slice the 1st and the 2nd columns out of the data frame. Let's print x to see the result. Exactly as we wanted it. Next, I'll declare a variable called kmeans. kmeans is equal to capital 'K', capital 'M', and lowercase 'eans', brackets, 2. The right side is actually the KMeans class that we imported from sklearn.
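For reference, here is the setup described so far written out as code. It is a sketch that follows the narration; the exact CSV file name (with the '.csv' extension) is assumed.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()                                   # use the seaborn style for all graphs

from sklearn.cluster import KMeans

data = pd.read_csv('3.01. Country clusters.csv')

# keep all rows, and columns 1 and 2 ('Latitude' and 'Longitude')
x = data.iloc[:, 1:3]

kmeans = KMeans(2)                          # an object set up to find 2 clusters
```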
The value in brackets is the number of clusters we want to produce. So, our variable 'kmeans' is now an object which we will use for the clustering itself. Similar to what we've seen with regressions, the clustering itself happens using the 'fit' method: kmeans, dot, fit, of x. That's all we need to write. This line of code will apply k-means clustering with 2 clusters to the input data from x. The output indicates that the clustering has been completed with the following parameters. Usually though, we don't just need to perform the clustering; we are interested in the clusters themselves. We can obtain the predicted cluster for each observation using the 'fit predict' method. Let's declare a new variable called 'identified clusters', equal to: kmeans, dot, fit predict, with input x. I'll also print this variable. The result is an array containing the predicted clusters. There are two clusters, indicated by 0 and 1. You can clearly see that the first five observations are in the same cluster, zero, while the last one is in cluster one.

Okay. Let's create a data frame so we can see things more clearly. I'll call this data frame 'data with clusters' and it will be equal to 'data'. Then I'll add an additional column to it called 'Cluster', equal to 'identified clusters'. As you can see, we have our table with the countries, latitude, longitude, and language, but also the cluster. It seems that the USA, Canada, France, UK, and Germany are in cluster zero, while Australia is alone in cluster one. Cool!

Finally, let's plot all this on a scatter plot. In order to resemble the map of the world, the x-axis will be the longitude, while the y-axis will be the latitude. But that's the same graph as before, isn't it? Let's use the first trick. In matplotlib, we can set the color to be determined by a variable. In our case that will be 'Cluster'. Let's write: 'c' equals 'data with clusters' of 'Cluster'. We have just indicated that we want to have as many colors for the points as there are clusters. The default color map is not so pretty, so I'll set the color map to 'rainbow': 'cmap' equals 'rainbow'. Okay. We can see the two clusters: one is purple and the other is red. And that's how we perform k-means clustering.

What if we wanted to have three clusters? Well, we can go back to the line where we specified the desired number of clusters, and change that to 3. Let's run all cells. There are three clusters as wanted: 0, 1, and 2. From the data frame we can see that the USA and Canada are in the same cluster, France, the UK, and Germany in another, and Australia is alone once again. What about the visualization? There are three colors, representing the three different clusters. Great work! It seems that clustering is not that hard after all!

Let's continue the problem from the last lecture. As you can see, we had one other piece of information that we did not use: language. In order to make use of it, we must first encode it in some way. The simplest way to do that is by using numbers. I'll create a new variable called 'data mapped', equal to data, dot, copy. Next, I'll map the languages using the usual method: data mapped, Language, equals, data mapped, Language, dot, map. And I'll set English to 0, French to 1, and German to 2. Note that this is not the optimal way to encode them, but it will work for now. Cool. Next, let's choose the features that we want to use for clustering. Did you know that we can use a single feature? Well, we certainly can.
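Here is a sketch of those steps in code, again following the narration. The only deviation is that 'data with clusters' is created with a copy, so adding the 'Cluster' column does not touch the original frame; the labels in the comment are the ones reported above.

```python
kmeans.fit(x)                                     # perform the clustering
identified_clusters = kmeans.fit_predict(x)       # e.g. array([0, 0, 0, 0, 0, 1])

data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters

# longitude on the x-axis, latitude on the y-axis, so the scatter resembles a map
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'],
            c=data_with_clusters['Cluster'], cmap='rainbow')
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()

# encode the categorical 'Language' column with numbers
data_mapped = data.copy()
data_mapped['Language'] = data_mapped['Language'].map({'English': 0,
                                                       'French': 1,
                                                       'German': 2})
```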
Let x be equal to: data mapped, dot, iloc, colon, comma, 3, colon, 4. I am basically slicing all rows, but only the last column. What we are left with is this. Now we can perform clustering. I have the same code ready, so I'll just use it. We are running k-means clustering with 3 clusters. Run, run, run, run, and we are done. The plot is unequivocal. The three clusters are: the USA, Canada, the UK, and Australia in the first one, France in the second, and Germany in the third. That's precisely what we expected, right? English, French, and German. Great! By the way, we are still using the longitude and latitude as axes of the plot. Unlike regression, when doing clustering you can plot the data as you wish. The cluster information is contained in the 'Cluster' column of the data frame and determines the color of the points on the plot.

Can we use both numerical and categorical data in clustering? Sure! Let's go back to our input data, x, and take the last 3 columns instead of just one. Run, run, run, run. Okay. This time the three clusters turned out to be based simply on geographical location, instead of language and location.

We've been juggling with the number of clusters for too long. Isn't there a criterion for setting the proper number of clusters? Luckily for us, there is! Probably the most widely adopted criterion is the so-called 'Elbow method'. What's the rationale behind it? Remember that clustering was about minimizing the distance between points in a cluster, and maximizing the distance between clusters? Well, it turns out that for k-means, these two occur simultaneously. If we minimize the distance between points in a cluster, we are automatically maximizing the distance between clusters. One less thing to worry about. Now, 'the distance between points in a cluster' sounds clumsy, doesn't it? That distance is measured as a sum of squares, and the academic term is 'within-cluster sum of squares', or WCSS. Not much better, but at least the abbreviation is nice. Okay. Similar to SST, SSR, and SSE from regressions, WCSS is a measure developed within the ANOVA framework. If we minimize WCSS, we have reached the perfect clustering solution.

Here's the problem. If we have the same 6 countries, and each one of them is a different cluster, so a total of 6 clusters, then WCSS is 0. That's because there is just one point in each cluster, so there is no within-cluster sum of squares. Furthermore, the clusters are as far apart as they can possibly be. Imagine this with 1 million observations. A 1-million-cluster solution is definitely of no use. Similarly, if all observations are in the same cluster, the solution is useless and WCSS is at its maximum. There must be some middle ground… Applying some common sense, we easily reach the conclusion that we don't really want WCSS to be minimized. Instead, we want it to be as low as possible, while we still have a small number of clusters, so we can interpret them.

Alright. If we plot WCSS against the number of clusters, we get this pretty graph. It looks like an elbow, hence the name. The point is that the within-cluster sum of squares is a monotonically decreasing function, which is lower for a bigger number of clusters. Here's the big revelation. In the beginning, WCSS is declining extremely fast. At some point, it reaches 'the elbow'. Afterwards, we are not reaching a much better solution in terms of WCSS by increasing the number of clusters. For our case, we say that the optimal number of clusters is 3, as this is the elbow.
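A short sketch of the two slicing choices just mentioned, reusing the objects from the earlier code; the column positions (3 for 'Language', 1 through 3 for latitude, longitude, and language) follow the narration.

```python
# clustering on the language feature alone: all rows, only the last column
x = data_mapped.iloc[:, 3:4]
kmeans = KMeans(3)
identified_clusters = kmeans.fit_predict(x)

# mixing numerical and categorical features: take the last 3 columns instead
x = data_mapped.iloc[:, 1:4]                # 'Latitude', 'Longitude', 'Language'
identified_clusters = KMeans(3).fit_predict(x)
```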
That's the biggest number of clusters for which we are still getting a significant decrease in WCSS. Hereafter, there is almost no improvement. Cool. How can we put that to use? We need two pieces of information: the number of clusters, K, and the WCSS for a specific number of clusters. K is set by us at the beginning of the process, while sklearn exposes an attribute that gives us the WCSS. For instance, to get the WCSS for our last example, we just write: kmeans, dot, inertia, underscore.

To plot the elbow, we actually need to solve the problem with 1, 2, 3, and so on, clusters and calculate WCSS for each of them. Let's do that with a loop. First, I'll declare an empty list called WCSS. For i in range 1 to 7, as we have a total of 6 observations, colon. kmeans equals: KMeans, with capital K and M, of i. Next, I want to fit the input data x using k-means, so: kmeans, dot, fit, x. Then, we will calculate the WCSS for the iteration using the inertia underscore attribute. Let WCSS underscore 'iter' be equal to kmeans, dot, inertia, underscore. Finally, we will add the WCSS for the iteration to the WCSS list. A handy method to do that is append. If you are not familiar with it, just take the list, dot, append, and in brackets you can include the value you'd like to append to the list. So: WCSS, dot, append, brackets, WCSS iter. Cool, let's run the code. WCSS should be a list which contains the within-cluster sum of squares for 1 cluster, 2 clusters, and so on until 6. As you can see, the sequence is decreasing, with very big leaps in the first 2 steps and much smaller ones later on. Finally, when each point is a separate cluster, we have a WCSS equal to 0.

Let's plot that. We have WCSS, so let's declare a variable called 'number clusters' which contains the values from 1 to 6: 'number clusters' equals range, 1, 7. Cool. Then, using some conventional plotting code, we get the graph. Finally, we will use the 'Elbow method' to decide the optimal number of clusters. There are two points which could be the elbow: this one and that one. A 3-cluster solution is definitely the better one, as after it there is not much to gain. A 2-cluster solution in this case would be suboptimal, as the leap from 2 to 3 clusters is very big in terms of WCSS.

It wouldn't be data science if there wasn't this very important topic: problems with, issues with, or limitations of X. Well, let's look at the pros and cons of k-means clustering. The pros are already known to you, even if you don't realize it. It is simple to understand and fast to cluster. Moreover, there are many packages that offer it, so implementation is effortless. Finally, it always yields a result. No matter the data, it will always spit out a solution, which is great!

Time for the cons. We will dig a bit into them, as they are very interesting to explore. Moreover, this lecture will solidify your understanding like no other. The first con is that we need to pick K. As we already saw, the Elbow method fixes that, but it is not extremely scientific per se. Second, k-means is sensitive to initialization. That's a very interesting problem. Say that these are our points. If we 'randomly' choose the centroids here and here, the obvious solution is one top cluster and one bottom cluster. However, clustering the points on the left in one cluster and those on the right in another is a more appropriate solution. Now imagine the same situation but with much more widely spread points. Guess what? Given the same initial seeds, we get the same clusters, because that's how k-means works.
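Here is the elbow loop written out as code, following the narration; the plot title and axis labels are added for readability and are not part of the original.

```python
# within-cluster sum of squares for every K from 1 to 6
wcss = []
for i in range(1, 7):                  # we have 6 observations, so at most 6 clusters
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_        # WCSS for this solution
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()
```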
It takes the closest points to the seeds. So, if your initial seeds are problematic, the whole solution is meaningless. The remedy is simple. It is called k-means plus plus, or k-means++. The idea is that a preliminary iterative algorithm is run prior to k-means to determine the most appropriate seeds for the clustering itself. If we go back to our code, we will see that sklearn employs k-means++ by default. So, we are safe here, but if you are using a different package, remember that initialization matters.

A third major problem is that k-means is sensitive to outliers. What does this mean? Well, if there is a single point that is too far away from the rest, it will always be placed in its own one-point cluster. Have we already experienced that? Well, of course we have. Australia ended up alone in its own cluster in almost all the solutions we had for our country clusters example. It is so far away from the others that it is destined to be its own cluster. The remedy? Just get rid of outliers prior to clustering. Alternatively, if you do the clustering and spot one-point clusters, remove them and cluster again.

A fourth con: k-means produces spherical solutions. This means that on a 2D plane like the ones we have seen, we would more often see clusters that look like circles, rather than elliptic shapes. The reason for that is that we are using the Euclidean distance from the centroid. This is also why outliers are such a big issue for k-means. Finally, we have standardization.
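As a quick check on the initialization point: in scikit-learn, the k-means++ seeding is the default, so writing it out explicitly changes nothing in the earlier code; this snippet just makes the parameter visible.

```python
# equivalent to KMeans(3) with respect to seeding: 'k-means++' is the default init
kmeans = KMeans(n_clusters=3, init='k-means++')
kmeans.fit(x)
```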