It's becoming increasingly common to use machine learning or AI-driven techniques to make decisions the world over: for example, credit checks and health checks. These can be life-changing, so it's really important we get this right. You could find yourself turned down for a mortgage on your dream house because, quite literally, the computer says no.

Let's talk a little bit about classification. So now we have a data set where we've got labels. We've got some input features (or input attributes, or dimensions), lots of instances, and we've got some labels for these instances. So we've got, for example, books and the type of book, or music and the genre of the music: things that we want to start to try and classify.

Supervised learning is the idea that we've got labels for our data. So we're still gonna have instances, and we're gonna have attributes or dimensions for our instances, but we've also now got labels for our data. Classification is the process of learning how to correctly assign these labels to these instances.

Before we start talking about classifiers, let's talk a little bit about the machine learning process we want to use. It's not enough to say, "I've got my data set and I can correctly predict all of the classes", because then someone will ask: what happens if we have any new data that we haven't seen before? Maybe you've got some medical data and you can correctly diagnose all of the diseases, but a new patient comes along and you could incorrectly diagnose the disease, right?
That's not helped anyone. What we need is a regimented way of training and testing these approaches so that we know how well they apply in the real world.

So what we're going to do is this: we've got some data set, just like before, where we've got some instances and we've got some attributes this way. We might have a lot of attributes or a few; it doesn't really matter. We also now have our labels, which we often call Y, and this is going to be a vector of all of the labels for our data. So this could be label one, these could be a few twos down here, and this could be a few threes. This is a bit like our tennis example, where we had the weather outlook and "are we going to play tennis today, yes or no?" So you could have multiple labels, or just two for binary classification.

It's not enough just to train a classifier over all this data. We want to make sure that this classifier will work properly when we apply new data to it. So what we're going to do is separate this data into training, validation and testing sets. We're going to train on the training set, then we're going to test as we go on the validation set, and then right at the end, when we're finished, we're going to do a final test on our test set. The reason we do this is that it's a very safe way to make sure that we don't accidentally game the system: we don't accidentally report incredibly good results on the training set only because we'd already shown the machine those things. So we hold out the validation and test sets for later to make sure that it will generalize.

Now, exactly how much of your data goes in the training, validation and testing sets is really up to you. Typically you might use something like 70% for training, 15% for validation and 15% for testing; that would be quite a reasonable way of doing it. So what are some good classifiers we could use, given that we've done this? Let's imagine.
We've got our instances, we've got our attributes, and we've split them up, probably randomly, into training, validation and testing. What we want to do is train our classifier on the training set and then test it on the validation and testing sets to see how we're getting on. So what algorithms could we use?

Let's start with the simplest one of all: ZeroR. In ZeroR we just take the most common label, and that's what we predict every time. It's the "you've got five minutes until the deadline, just hand something in" approach to machine learning. In the case of playing tennis or not playing tennis we could say, well, I played tennis more than I didn't, so we'll just assume that I'm going to play tennis and predict yes all the time, regardless of what the weather is. This is not a good way to perform machine learning, but I suppose it does give you a baseline accuracy, right? If your baseline of just saying yes to everything is 60 percent accuracy, then if your machine learning doesn't perform at least at 60 percent, we know we've got a real problem.

We can go one better than that: we can use OneR. OneR is where we pick one of our attributes and make a classification based only on that, and then we pick the best of those attributes. I mean, it's slightly better than ZeroR, but not a lot. You will find references to these in the literature a little bit, but not very much, because we use much more powerful approaches than this.

So let's talk about one example classifier that is very popular, and that's KNN, or k-nearest neighbor. Let's imagine we've got a two-attribute data set. I like to draw in two dimensions.
It's just a little easier for me. So we've got attribute one and attribute two, and we've got some different data points in here. Don't forget also that each of these is going to have a label as well; so this one is going to have, let's say, a label of "did play tennis". When we want to test a new, unseen data point (a new person comes along who may or may not play tennis), they're going to appear over here. We measure them and we find the K nearest neighbors to this point. So that's this one, this one, this one, this one and this one; so this will be 1, 2, 3, 4, 5, 6, and this would be K of 6. Then we take the majority vote, or the average, of these responses. So if four out of six of these people play tennis, this would be assigned to "play tennis". The output is: what have we already seen nearby in the existing data set, and can we use that to make a prediction?

So this is quite a good approach. Obviously choosing K is a little bit difficult to do, and this starts to get very, very slow when you've got hundreds and hundreds of dimensions. Finding the K nearest points to a point when you've got tens of thousands of dimensions, or tens of thousands of instances, is not easy to do even with good data structures, so it starts to get slow quite quickly. Nevertheless, this is an effective and popular approach.

Are there any alternatives? There is one: decision trees. Now, I like decision trees. They have a nice benefit: once we've created a decision tree, which is just a series of decisions on "is the data this? Yes. Is it this? No", we can actually look at the rules and say, OK, that's how a decision was made. And that's quite a good rule set.
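
As an aside, the k-nearest-neighbor vote described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `knn_predict`, the use of Euclidean distance, and the toy tennis points are assumptions made for the example, not anything shown in the video.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (Euclidean is an assumption here).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote over those k labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-attribute data: did each person play tennis?
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5),
                (3.0, 1.5), (3.5, 3.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]

# With K of 6, four of the six nearest neighbors say "yes", so the vote is "yes".
prediction = knn_predict(train_points, train_labels, query=(2.0, 2.0), k=6)
print(prediction)
```

Note that every prediction measures the distance to every training point, which is the slowness mentioned above; real libraries use spatial data structures to reduce that cost.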
So it's kind of a way of writing a sort of if-else programming language, but you're doing it automatically. Let's draw out another data set. We've got our instances down here and our attributes here, and remember, for each of our instances we're going to have some label that we're trying to output. So here, you know, 1, 2, 3, 4, 5, 6 and so on.

Let's imagine that this is a credit check. You've got attributes based on how much money you've got, how much you spend month to month, whether you already have other loans, and what we want to do is make a decision as to whether you should be allowed more credit or not. So the answer is yes or no, quite simply.

A decision tree is going to partition the data up based on the attributes. Let's say the first rule is credit rating: is the credit rating greater than or equal to 5? If the answer is yes, we continue. If the answer is no, then we actually output a leaf node here which says "credit denied". Here we say, OK, the credit rating is five or above, so it's not a no yet. Now we say, OK, do they earn more than, let's say, 10,000 a year or something like that? If the answer is yes, we proceed to the next stage; if it's no, then they don't earn enough: credit denied. This is what a decision tree does. Now, you don't have to design this yourself.
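
Written out by hand, the two rules just described amount to something like the following Python sketch. The function and parameter names are hypothetical, and because the example only says "we proceed to the next stage" after the income check, the sketch stops at a placeholder rather than inventing further rules.

```python
def credit_decision(credit_rating, annual_income):
    """Hand-written version of the two example rules; names are illustrative."""
    # Root node: is the credit rating greater than or equal to 5?
    if credit_rating < 5:
        return "credit denied"  # leaf node
    # Second node: do they earn more than 10,000 a year?
    if annual_income <= 10_000:
        return "credit denied"  # leaf node: they don't earn enough
    # The example does not say what the next check is, so stop here.
    return "next stage"


print(credit_decision(credit_rating=3, annual_income=50_000))
print(credit_decision(credit_rating=7, annual_income=8_000))
print(credit_decision(credit_rating=7, annual_income=20_000))
```

Each `if` is one internal node of the tree and each early `return` is a leaf.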
There are algorithms to produce decision trees for you. The way they work is that they pick one of these attributes at each level that best separates the data out. So, for example, you've got a lot of different instances of yes and no decisions in your training set: is credit rating the best way of separating out the yeses and the nos? One of them is going to be best for each individual step, and we can use all of them in a tree structure like this until we get to a series of leaf nodes which end up with only yeses and only nos. Then it's very simple to apply this: when new data comes along, we apply these rules and we get to a decision.

A decision tree is going to be equivalent to programming a bunch of carefully chosen if statements, but of course the benefit is that you can do this over a huge number of attributes very, very quickly without having to do all this yourself. So yes, it's not much better than doing it yourself, but it's much quicker.

So let's have a look at this in some code. We're going to change and use a different piece of software today, because for things like classification and prediction we're going to use Weka. It's a very simple tool that makes applying things like decision trees very, very simple, and it has some of the same data cleaning processes as R does, but in a graphical form. We've already prepared our credit data, where we have a number of inputs: things like how much money they make and whether they've defaulted on any credit before. We have these in a file, so I'm gonna go in here and find my file. You can load up various file types, JSON files for example; we're gonna load a CSV. It's our credit data.
So we have about 600 rows of whether or not people (I think it was Japan this data originally came from) were given credit. We have things like age, debt, marital status, whether they're a customer at the bank already, whether they've got a driving license and what their current credit score is. You can see that what Weka has done is load all these and work out whether they're nominal values or numerical values already. So, for example, credit score is a numerical value, and you can see here a quick histogram that shows the different types and whether they've been approved for credit. "Approved", at the bottom, Weka has interpreted as the output, or the classification that we're trying to achieve. In this data set we have 307 (you can almost see that font) approved and 383 denied credit.

So let's train up a decision tree and see how it does. We go to Classify and we're going to select a decision tree. We could choose ZeroR, but that's not what we're after, so I'm gonna go down to trees and J48, which is your standard decision tree. We're gonna use a percentage split, and we're going to select 70% for our training set; this one doesn't have a validation set. We're gonna be predicting whether or not they were approved, and then we're gonna train it like this. What happens is Weka will train the decision tree and then produce for us some measurements of its accuracy. You can see it's correctly classified 85% of the testing set, which is good. I mean, it means a lot to these people, so maybe those 15% could be a bit aggrieved. Then we get a confusion matrix down here. We're saying that of the yeses, 76 were correctly allowed credit and 22 were denied incorrectly, and of the nos, 100 were correctly denied and nine were accidentally allowed, right?
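
The 85% figure can be checked directly from the four confusion-matrix counts read out above. The Python sketch below does that arithmetic and, for comparison, works out what a ZeroR-style baseline (always predicting the most common class) would score on the same test rows. The dictionary layout and variable names are just one illustrative way to hold the counts.

```python
# Confusion-matrix counts as read out above: (actual class, predicted class) -> count.
confusion = {
    ("yes", "yes"): 76,   # correctly allowed credit
    ("yes", "no"): 22,    # denied incorrectly
    ("no", "no"): 100,    # correctly denied
    ("no", "yes"): 9,     # accidentally allowed
}

total = sum(confusion.values())
correct = sum(
    count for (actual, predicted), count in confusion.items() if actual == predicted
)
accuracy = correct / total

# ZeroR-style baseline: always predict whichever actual class is most common.
actual_counts = {}
for (actual, _), count in confusion.items():
    actual_counts[actual] = actual_counts.get(actual, 0) + count
baseline_accuracy = max(actual_counts.values()) / total

print(f"test rows: {total}")
print(f"accuracy: {accuracy:.3f}")
print(f"baseline: {baseline_accuracy:.3f}")
```

The counts give 176 correct out of 207 test rows, which is where the 85% comes from; always answering "no" would be right for 109 of those 207 rows.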
So that's the error we can see here. Now, the nice thing about decision trees is we can look at these rules and see what they are. So we can go into "Visualize tree", and you can see that the most important attribute that is decided on is whether or not they defaulted on a loan prior to this. Anyone that has defaulted on a loan before is immediately denied credit. If they haven't defaulted on a loan, then it starts to look at whether they're employed, and if they are, it's going to give them credit. It's a simple rule system, and it's the best it can do given the amount of data we've got. If they aren't employed, it's going to look at their income (maybe they're self-employed) and make a decision then; then whether they're married, where they live, and their income again. So you can use attributes multiple times to make complex decision-making processes. This is a very simple tree which has actually performed pretty well on this data set, and it's not a huge data set; for that, 85% is not too bad.

Once you've used a classifier, so KNN or a decision tree, to classify your data, what you really want to know is how well it performs on your testing set. You could quite simply calculate accuracy: what percentage of the time were we correct? Obviously that's going to be hard to do for many classes, but for credit, yes or no, 85 percent is not bad. If our average when guessing was 50%, it's quite a lot better than that.

There's another type of classifier that's perhaps a little bit more common these days, and a little bit more powerful than decision trees, and that's the support vector machine. So what is a support vector machine?
Well, what we're going to try and do is separate our classes based on a line, or plane, or some separation in the attributes that we have. But what we're going to do is try and maximize the separation between these two classes to make our decision more effective.

So let's imagine we have two attributes, just like before: this is attribute one and this is attribute two. Don't forget this is labeled training data, so we know which classes these are in already; this is not like clustering. So maybe we have some data over here and some data over here. Now, obviously this is quite an easy one. We're going to try to find a decision boundary between these two classes that maximizes the separation. For example, one decision boundary we could pick would be this one here, but it's not perfect because it's very close to this point here and very close to this point here. So these ones on the fringes are close to being misclassified. And you've got to think that this is just a training set; if we start to bring in testing data that may appear around here, or around here, maybe that's the stuff that gets misclassified.

So what a support vector machine will do is pick a line between these data points where the distance to the nearest points is maximized. These nearest points are called support vectors. So this margin here is going to be as big as we can get it; you can imagine that if we move this around, the margin's going to get bigger and smaller. Now, the nice thing about support vector machines is that, in a kind of almost reverse-PCA approach, you can convert this into a higher-dimensional space and perform quite complicated separation of things that aren't really obviously separable like this: things where, essentially, we have to have a nonlinear decision made, right?
So not a simple line, but something more complex, like a curve.

A lot of the time we're going to look at precision and recall. Recall is a measure of, for all the positive things (all the people that should have been granted credit), how many actually were. So where we should have said yes, how many times did we actually say yes? That's a measure of how good our algorithm is at spotting that class. Precision is, of the ones it spotted, what percentage of them were correct. You can imagine a situation where your recall might be very high because you've just said yes to everyone. So yes, you spotted every single person that should have got credit, but your precision is low because you were giving it to loads of people who shouldn't have had it. So a really good algorithm is going to be one that has a very high precision and a very high recall. We combine these measures into one score, F1 or F-score, and this is going to be a value between nought and one, where one is absolutely perfect and zero is "doesn't work at all".

Where did our training data come from? In this case we've got our training data off the Internet. But if you're a credit agency, then what you're going to do is use humans to make these initial decisions; then you're going to train a machine, and you're going to test to see whether it can do as well as people can. Maybe there's nuance there that this decision tree couldn't capture. Those 15 percent of people that were misclassified: is there something we could have done better to help those people?
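
To make precision, recall and F1 concrete, here is a small Python sketch that applies them to the counts from the Weka run earlier, treating "yes, grant credit" as the positive class. It assumes the usual definition of F1 as the harmonic mean of precision and recall, which the video does not spell out, and the function name is illustrative.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class, guarding against empty denominators."""
    spotted = true_pos + false_pos      # everyone we said yes to
    actual = true_pos + false_neg       # everyone who should have got a yes
    precision = true_pos / spotted if spotted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 as the harmonic mean of precision and recall (assumed definition).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Decision tree on the test set: 76 correctly allowed, 9 accidentally allowed,
# 22 denied incorrectly.
precision, recall, f1 = precision_recall_f1(true_pos=76, false_pos=9, false_neg=22)
print(f"tree: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# "Say yes to everyone": all 98 deserving applicants are spotted, but so are the 109 others.
p_all, r_all, f1_all = precision_recall_f1(true_pos=98, false_pos=109, false_neg=0)
print(f"yes to all: precision={p_all:.3f} recall={r_all:.3f} f1={f1_all:.3f}")
```

The second call is the situation described above: recall is perfect, but precision drops because credit is going to many people who should not have had it.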
So what you'll find happens in practice is that you train a system, but maybe you don't rely on it entirely. Maybe for the very obvious yeses we can use a decision tree or some other classifier to just say, yeah, those people are fine. Maybe for the obvious nos we can say no, they're not going to get credit. But for the edge cases, the people in the middle, maybe that's when we bring a human into the loop.

So in our data set, for our training examples, we're going to have all of the attributes, and then we're crucially gonna have an already known label for that data that says yes, that person was denied credit, or they were allowed credit. We're going to use those training examples of input attributes and output yes-or-no decisions to train our classifier, and then we're going to test the results and whether or not it'll work when we use our unseen test data for unknown cases.

Classifiers let us put data into discrete labels: yes or no; A, B or C; depending on what our situation is. They're very powerful, and as long as you've got enough training data, we should be able to use them to make real-life decisions. What we want to do going forward is start to move from just yes or no to: can we actually produce output values? Can we regress actual values out of these algorithms?

Let's talk a little bit about something more powerful: artificial neural networks. Now, any time in the media at the moment when you see the term AI, what they're actually talking about is machine learning, and what they're talking about is some large neural network. Let's keep it a little bit smaller for this, but let's imagine what
Data Analysis 8: Classifying Data - Computerphile (posted by 林宜悉, 14 January 2021)