
  • Hi, I'm Adriene Hill, and welcome back to Crash Course Statistics.

  • In the last episode, we talked about using Machine Learning with data that already has

  • categories that we want to predict.

  • Like teaching a computer to tell whether an image contains a hotdog or not.

  • Or using health information to predict whether someone has diabetes.

  • But sometimes we don't have labels.

  • Sometimes we want to create labels that don't exist yet.

  • Like if we wanted to use test and homework grades to create 3 different groups of students

  • in your Stats course.

  • If you group similar students together, you can target each group with a specific review

  • session that addresses its unique needs.

  • Hopefully leading to better grades!

  • Because the groups don't already exist, we call this Unsupervised Machine Learning

  • since we can't give our models feedback on whether they're right or not.

  • There are no “true” categories to compare our groups with.

  • Putting data into groups that don't already exist might seem kinda weird but today we'll

  • explore two types of Clustering--the main type of Unsupervised Machine Learning: k-means

  • and Hierarchical clustering.

  • And we'll see how creating new groups can actually help us a lot.

  • INTRO

  • Let's say you own a pizza restaurant.

  • You've been collecting data on your customers' pizza eating habits.

  • Like how many pizzas a person orders a week.

  • And the average number of toppings they get on their pizzas.

  • You're rolling out a new coupon program and you want to create 3 groups of customers

  • and make custom coupons to target their needs.

  • Maybe 2-for-1 five-topping medium pizzas.

  • Or 20% off all plain cheese pizza.

  • Or free pineapple topping!

  • So let's use k-means to create 3 customer groups.

  • First, we plot our data:

  • All we know right now is that we want 3 separate groups.

  • So, what the k-means algorithm does is select 3 random points on your graph.

  • Usually these are data points from your set, but they don't have to be.

  • Then, we treat these random points as the centers of our 3 groups.

  • So we call them “centroids”.

  • We assign each data point (the points in black) to the group of the centroid that it's closest to.

  • This point here is closest to the Green center.

  • So we'll assign it to the green group.

  • Once we assign each point to the group it's closest to, we now have three groups, or clusters.

  • Now that each group has some members, we calculate the actual centroid of each group--the mean of its members' positions.

  • And now that we have the new centroids we'll repeat this process of assigning every point

  • to the closest centroid and then recalculating the new centroids.

  • The computer will do this over and over again until the centroids “converge”.

  • And here, converge means that the centroids and groups stop changing, even as you keep

  • repeating these steps.

  • Once it converges, you have your 3 groups, or clusters.
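
As a rough sketch of the steps just described, here is a minimal k-means loop in Python with NumPy. The customer numbers (pizzas per week, average toppings) are made up for illustration, not data from the example.

```python
import numpy as np

def k_means(points, k=3, seed=0, max_iters=100):
    """Cluster an (n x d) array into k groups by repeating the
    assign-to-nearest-centroid and recompute-centroid steps."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the centroid it is closest to.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster ends up empty
                new_centroids[i] = members.mean(axis=0)
        # "Converged": the centroids (and therefore the groups) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers: [pizzas ordered per week, average toppings per pizza]
customers = [[1, 6], [2, 5], [1, 7], [4, 1], [5, 0], [4, 2], [2, 2], [3, 3]]
labels, centroids = k_means(customers, k=3)
print(labels)
print(centroids)
```

Changing k is all it takes to ask for 5 groups instead of 3.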

  • We can then look at the clusters and decide which coupons to send.

  • For example, this group doesn't order many pizzas each week but when they do, they order

  • a LOT of toppings.

  • So they might like the “Buy 3 toppings, get 2 free” coupon.

  • Whereas this group, who orders a lot of simple pizzas, might like the “20% off Medium

  • 2-Topping Pizzas” coupon.

  • (This is probably also the pineapple group since really, there aren't that many things

  • that pair well with pineapple and cheese.)

  • If you were a scientist, you might want to look at the differences in health outcomes

  • between the three pizza ordering groups.

  • Like whether the group that orders a lot of pizza has higher cholesterol.

  • You may even want to look at the data in 5 clusters instead of 3.

  • And k-means will help you do that.

  • It will even allow you to create 5 clusters of Crash Course Viewers based on how many

  • Raccoons they think they can fight off, and the number of Pieces of Pizza they claim to

  • eat a week.

  • This is actual survey data from you all.

  • A k-means clustering created these 5 groups.

  • We can see that this green group is PRETTY confident that they could fight off a lot

  • of raccoons.

  • But 100 raccoons?

  • No.

  • On the other hand, we also see the light blue group.

  • They have perhaps more reasonable expectations about their raccoon-fighting abilities, and they

  • also eat a lot of pizza each week.

  • Which makes me wonder--could they get the pizza delivery folks to help out if we go

  • to war with the raccoons?

  • Unlike the Supervised Machine Learning we looked at last time, you can't calculate

  • the “accuracy” of your results because there are no true groups or labels to compare.

  • However, we're not totally lost.

  • There's one method, called the silhouette score, that can help us determine how well-fit

  • our clusters are, even without existing labels.

  • Roughly speaking, the silhouette score measures cluster “cohesion and separation”, which

  • is just a fancy way of saying that the data points in that cluster are close to each other,

  • but far away from points in other clusters.

  • Here's an example of clusters that have HIGH silhouette scores.

  • And here's an example of clusters that have LOW silhouette scores.

  • In an ideal world, we prefer HIGH silhouette scores, because that means that there are

  • clear differences between the groups.

  • For example, if you clustered data from lollipops and Filet Mignon based on sugar, fat, and

  • protein content, the two groups would be VERY far apart from each other, with very little

  • overlap--leading to high silhouette scores.

  • But if you clustered data from Filet Mignon and a New York Strip steak, the data would

  • probably have lower silhouette scores, because the two groups would be closer together--there'd

  • probably be more overlap.
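
If you want to try this yourself, scikit-learn has a silhouette_score function that returns a value between -1 and 1, with higher meaning more cohesive, better-separated clusters. The two blobs below are made-up data standing in for the lollipop-versus-steak idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two made-up, well-separated blobs (think lollipops vs. Filet Mignon).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(silhouette_score(points, labels))  # near 1: clear separation between the groups
```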

  • Putting data into groups is useful, but sometimes, we want to know more about the structure of

  • our clusters.

  • Like whether there are subgroups--or subclusters.

  • Like in real life when we could look at two groups: people who eat meat and those who

  • don't.

  • The differences between the groups' health or beliefs might be interesting, but we also

  • know that people who eat meat could be broken up into even smaller groups like people who

  • do and don't eat red meat.

  • These subgroups can be pretty interesting too.

  • A different type of clustering called Hierarchical Clustering allows you to look at the hierarchical

  • structure of these groups and subgroups.

  • For example, look at these ADORABLE dogs.

  • We could use hierarchical clustering to cluster these dogs into groups.

  • First, each dog starts off as its own group.

  • Then, we start merging clusters together based on how similar they are.

  • For example, we'll put these two dogs together to form one cluster, and these two dogs together

  • to form another.

  • Each of these clusters--we could call this one “Retrievers” and this one “Terriers”--

  • is made up of smaller clusters.

  • Now that we have 2 clusters, we can merge them together, so that all the dogs are in

  • one cluster.

  • Again, this cluster is made up of a bunch of sub clusters which are themselves made

  • up of even smaller sub clusters.

  • It's turtles--I mean, clusters--all the way down.

  • This graph of how the clusters are related to each other is called a dendrogram.

  • The further up the dendrogram that two clusters join, the less similar they are.

  • Golden and Curly Coated Retrievers connect lower down than Golden Retrievers and Cairn

  • Terriers.
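
Here's one hedged way to build that kind of dendrogram with SciPy. The dog breeds are from the example, but the two feature columns (height and coat curliness) are invented numbers just to make the tree drawable.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

dogs = ["Golden Retriever", "Curly Coated Retriever", "Cairn Terrier"]
# Hypothetical features per dog: [height in cm, coat curliness on a 0-10 scale]
features = np.array([
    [58, 2],
    [64, 9],
    [30, 3],
])

# Ward linkage merges, at each step, the pair of clusters that keeps clusters most compact.
merges = linkage(features, method="ward")

# Pairs that join lower down the tree are more similar to each other.
dendrogram(merges, labels=dogs)
plt.ylabel("distance at which clusters merge")
plt.show()
```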

  • One compelling application of hierarchical clustering is to look for subgroups of people

  • with Autism Spectrum Disorder--or ASD.

  • Previously, disorders like Autism, Asperger's, and Childhood Disintegrative Disorder (CDD)

  • were considered separate diagnoses, even though they share some common traits.

  • But, in the latest version of the Diagnostic and Statistical Manual of Mental Disorders--or

  • DSM--these disorders are now classified as a single disorder that has various levels

  • of severity, hence the Spectrum part of Autism Spectrum Disorder.

  • ASD now applies to a large range of traits.

  • Since ASD covers such a large range, it can be useful to

  • create clusters of similar people in order to better understand Autism and provide more

  • targeted and effective treatments.

  • Not everyone with an ASD diagnosis is going to benefit from the same kinds and intensities

  • of therapy.

  • A group at Chapman University set out to look more closely at groups of people with ASD.

  • They started with 16 profiles representing different groups of people with an ASD diagnosis.

  • Each profile has a score between 0 and 1 on 8 different developmental domains.

  • A low score in one of these domains means that domain might need improvement.

  • Unlike our pizza example which had only 2 measurements--# of pizza toppings and # of

  • pizzas ordered per week--this time we have 8 measurements.

  • This can make it tough to visually represent the distance between clusters.

  • But the ideas are the same.

  • Just like two points can be close together in 1 or 2 dimensions, they can be close together

  • in 8 dimensions.

  • When the researchers looked at the 16 profiles, they grouped them together based on their

  • 8 developmental domain scores.

  • In this case, we take all 16 profiles and put each one in their own “cluster”, so

  • we have 16 clusters, each with one profile in them.

  • Then, we start combining clusters that are close together.

  • And then we combine those, and we keep going until every profile is in one big cluster.

  • Here's the dendrogram.

  • We can see that there are 5 major clusters, each made up of smaller clusters.
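
A sketch of how you could reproduce that kind of analysis with SciPy: the 16 profiles below are random placeholder scores, not the study's actual data, and fcluster cuts the dendrogram so that 5 clusters remain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder data: 16 profiles, each with 8 developmental-domain scores between 0 and 1.
rng = np.random.default_rng(42)
profiles = rng.random((16, 8))

merges = linkage(profiles, method="ward")

# Cut the tree so that exactly 5 clusters remain, mirroring the 5 major clusters above.
cluster_ids = fcluster(merges, t=5, criterion="maxclust")
print(cluster_ids)  # a cluster label (1-5) for each of the 16 profiles
```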

  • The research team used radar graphs, which look like this, to display each cluster's

  • 8 domain scores on a circle.

  • Low scores are near the center, high scores near the edge of the circle.

  • This main cluster, which they called Cluster E, has scores consistent with someone who

  • is considered high functioning.

  • Before the change to the DSM, individuals in the cluster might have been diagnosed with

  • Asperger's.

  • The Radar graph here shows the scores for the 6 original data points that were put in

  • Cluster E. While there are some small differences, we can see that overall the patterns look

  • similar.

  • So Cluster E might benefit from a less intense therapy plan, while other Clusters with lower

  • scores--like Cluster D--may benefit from more intensive therapy.

  • Creating profiles of similar cases might allow care providers to create more effective, targeted

  • therapies that can more efficiently help people with an ASD diagnosis.

  • If an individual's insurance only covers say 7 hours of therapy a week, we want to

  • make sure it's as effective as possible.

  • It can also help researchers and therapists determine why some people respond well to

  • treatments, and others don't.

  • The type of hierarchical clustering that we've been doing so far is called Agglomerative,

  • or bottom-up clustering.

  • That's because all the data points start off as their own cluster, and are merged together

  • until there's only one.
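
scikit-learn offers the same bottom-up approach through AgglomerativeClustering, which merges points until only the requested number of clusters is left; the points here are made up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D points forming two obvious groups.
points = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

# Bottom-up (agglomerative) merging that stops once 2 clusters remain.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1]
```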

  • Often, we don't have structured groups as a part of our data, but still want to create

  • profiles of people or data points that are similar.

  • Unsupervised Machine Learning can do that.

  • It allows us to use things that we've observed--like the tiny stature of Terriers, or raccoon-fighting

  • confidence--and create groups of dogs, or people that are similar to each other.

  • While we don't always want to categorize people, putting them into groups can help give them

  • better deals on pizza, or better suggestions for books or even better medical interventions.

  • And for the record, I am always happy to help moderately confident raccoon fighting pizza

  • eaters fight raccoons.

  • Just call me. Thanks for watching. I'll see you next time.

