So now we're going to talk about something that is kind of a specific part of big data: the velocity part. Huge amounts of data are being generated all the time, which essentially is a data stream. That's a flow of instances: you could have a flow of images coming in, a flow of video coming in, or just a flow of lines to go into a database. The thing about this dynamic data is that the patterns within it can change. So if we've got, for example, a static machine learning model, that's not going to deal very well with a changing pattern in the data. We build a single model at the start and use it to make predictions on later data, and the model's accuracy can degrade over time as that data changes.

The problem of designing algorithms to deal with this real-time data has been a research topic for several years now, and there are several real-world applications on top of that as well. If you think about banks trying to detect fraud, as the patterns change and different frauds occur, they want their models to be able to update all the time. It's similar for intrusion detection systems in computer networks: they want to be able to update and keep on top of what is happening. Ideally you would want this to happen automatically, with minimum interference from humans, because otherwise they've got to spot when changes are happening. We just want the machines to be able to do it by themselves.

So if you think about a traditional classification problem on a static batch of data, you assume you have all of that data there already. You have your training and test sets, and you have instances with features, which we call X, and then there's some unknown function f of X which gives you the class label. You want to find a hypothesis that gives you the best prediction possible, so one that approximates this function as well as possible.
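In symbols, the static setup just described can be written roughly as follows. This is only a sketch: the hypothesis space H and the data distribution D are notation assumed here for illustration, not names used in the talk.

```latex
% x: the features of an instance, y: its class label
% f: the unknown target function, h: a candidate hypothesis (model)
y = f(x), \qquad
h^{*} = \arg\min_{h \in \mathcal{H}} \; \Pr_{x \sim \mathcal{D}}\bigl[\, h(x) \neq f(x) \,\bigr]
```

Read this as: among the models we are able to learn, pick the one that disagrees with f least often on the kind of data we expect to see.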
So you have a red class and a green class, and we have instances that look like this. Our function f of X may create a class boundary that looks like this, so anything on this side is red and anything on this side is green. Our model doesn't know that, but we use standard machine learning techniques, decision trees, neural networks, whatever you want, and it learns a boundary that looks like that. That will do okay on whatever data we have. It's not perfect, but it may get the results that we want. This is static classification: we already have all our data. So we've got our data, we've done our machine learning, and this is the decision boundary that we've learnt. The dotted line is what the boundary actually is. This gives okay results.

Let's now say that this is happening in a data stream. We get this data originally and we build this model, but then later on we have a similar distribution of instances arriving. However, what now happens is that some of these instances are in reality in a different class, so the true boundary is now here. But we still have our model with this decision boundary, and so we're now predicting instances here and here into the wrong class if we use that exact same model. What we would see in this case, in percentage accuracy over time, is that at this change point accuracy would plummet. This problem is called real concept drift. What has effectively happened is that the unknown function has changed, but we've kept our hypothesis, our machine learning model, exactly the same, and so it starts to perform badly.

We can also have a similar problem called virtual drift. What happens in this case is that the target decision boundary has stayed the same as the original, but the instances we now see in the stream are somewhere else in the feature space. Let's say we now see data like this. So though the optimal decision boundary is in exactly the same place, we now have different data.
That means that our predicted boundary is going to get this instance wrong, because we haven't got a way of incorporating information from this instance into the original model that we built. Both of these will create this decrease in accuracy.

We can also look at the drift in data streams in terms of the speed at which it happens. Something that would give us an accuracy plot that looks like this is called sudden drift: we go straight from one concept in the data stream, so one decision boundary, straight to another one. Another possible thing that could happen is that our accuracy looks like this. So rather than this sudden switch, the decision boundary gradually shifts.

Say, in real life, we're looking at a very, very oversimplified intrusion detection system. We have only two features that we're looking at. In the original dataset, anything with these features is a security problem, an intrusion, and anything on this side is good. In the real drift case, what happens is that suddenly there's a new way of attacking the network, and so suddenly what was here is now not good. We see those patterns and we say, okay, no, that counts as an intrusion. In the virtual drift case, what it means is that we see something that we've not seen before, so the model hasn't been trained with any similar data. It could get it right: it could fall somewhere up here and we correctly say this is bad. But it could also fall in an area where we didn't learn the decision boundary so well, and so we get that prediction wrong.

We've just looked at what the problems are with using a single static model when we're dealing with incoming data. Over time the distribution changes and we start to see a decrease in accuracy on whatever model we built. So what happens in a stream machine learning algorithm would be this. First of all, you've got X arriving. This is your instance; in our previous example, this would just have two values associated with it. What first happens is that we make a prediction.
So in the classification example, we classify this: yes, it's an intrusion, or no, it's not an intrusion, using the current model that we have. Then what happens is we update whatever model we have using information from X, and we'll talk about some of the ways that this is done in a second. One of the caveats with stream machine learning is that for this to happen you need to have the real class label, if you're doing classification. In order to incorporate information from this instance into whatever model you've got, you need to have that label there. Now, in some cases it's very easy to say we've seen this data, this is what it's classified as, and we do that immediately. If we're thinking about making weather predictions, we can almost immediately say, yes, this is what the weather is like. It may be a day's delay, but that's a pretty immediate thing. For things like fraud detection, for example, you may see a pattern of data, you may predict it as not being fraud, and then two days later this person figures out that actually there's something wrong with their bank account. They phone up and it does turn out to be fraud, and so we'd only have the label for that data after that has happened.

The final bit is to update the model at this point. The goal of updating the model over time is that rather than having a performance plot that looks like this, where we go from 95% accuracy down to 20% accuracy, we instead end up with something where, okay, we may drift a little bit here and have a tiny performance decrease, but the model should very quickly recover back to the original level and we still have high performance. So that's the goal of this model update.
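The predict, get the label, update cycle described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the one-feature stream, the boundaries 0.3 and 0.7, the change point, and the toy nearest-centroid model are not taken from the talk. The sketch compares a model frozen at the change point with one that keeps updating on every labelled instance.

```python
import random


class RunningCentroids:
    """Toy two-class nearest-centroid model over a single feature."""

    def __init__(self, rate=0.05):
        self.rate = rate                   # how far one instance moves a centroid
        self.centroid = {0: 0.0, 1: 1.0}   # initial guess for each class centre

    def predict(self, x):
        # Choose the class whose centroid lies closest to x.
        return min(self.centroid, key=lambda c: abs(x - self.centroid[c]))

    def update(self, x, y):
        # Exponentially weighted mean, so recent instances count for more.
        self.centroid[y] += self.rate * (x - self.centroid[y])


def true_label(x, t, change_point):
    # Sudden real drift: the unknown function f changes at change_point.
    boundary = 0.3 if t < change_point else 0.7
    return int(x > boundary)


def run(n=4000, change_point=2000, seed=0):
    rng = random.Random(seed)
    static, online = RunningCentroids(), RunningCentroids()
    hits_static = hits_online = 0
    for t in range(n):
        x = rng.random()                     # 1. an instance arrives
        guess_static = static.predict(x)     # 2. predict with the current model
        guess_online = online.predict(x)
        y = true_label(x, t, change_point)   # 3. the real label becomes known
        if t >= change_point:                # score only the post-drift part
            hits_static += guess_static == y
            hits_online += guess_online == y
        online.update(x, y)                  # 4. the streaming model keeps updating
        if t < change_point:
            static.update(x, y)              # the static model is frozen at the drift
    after = n - change_point
    return hits_static / after, hits_online / after


static_acc, online_acc = run()
print(f"after drift: static={static_acc:.2f}, online={online_acc:.2f}")
```

In this toy setting the frozen model keeps its old boundary and so scores noticeably worse after the change point, while the updating model dips briefly and then recovers. Here the label is assumed to arrive immediately; with delayed labels, as in the fraud example, step 4 would simply happen later.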
There are various approaches we can take. The first one is explicit drift handling, which means that we first of all detect when a drift happens in the data stream. To do that we have drift detection methods, and these are usually statistical tests that look at some aspect of the data arriving. So if the distribution of the data we see arriving, or the distribution of the classes we see, is changing, they'll flag that as a drift. Some of these will also look at the accuracy of the classifier. If the classifier performance suddenly drops, we can say, well, we've probably got a drift here, and we need to do something to the model to mitigate this.

Who spots that, though? Is there an algorithm that actually spots that something's different to what it should be?

Yes, there are various statistical tests that will do this. They'll measure things like the mean of the data arriving and be able to spot when things have changed, basically. So once we've detected that a drift has happened, we then want to take some action. The first thing that we could do is a complete replacement of the model: we get rid of whatever model we had before, we take a chunk of recent data, and we retrain the model on that and continue using it for predictions until we hit another drift. This is okay, but it means that we could be getting rid of some information in the previous model that is maybe still going to be useful in the future. So there are also methods that will look at specific parts of the model and say, okay, this specific part is causing a performance decrease, so let's get rid of it. We can then learn from new instances something to replace it that will do better, basically. If you think of a decision tree, if you can detect that there are certain branches in that tree that are no longer making good predictions, you can get rid of them and regrow the tree to perform better.

Prune it?
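As an aside, the kind of statistical test mentioned above, one that watches the mean of what is arriving, can be sketched as follows. This is a deliberately simple stand-in, not any particular published detector: the class name, the reference and window sizes, and the three-standard-error threshold are all assumptions chosen for illustration. It is fed the classifier's errors (1 for a wrong prediction, 0 for a correct one) and signals when the recent error rate strays too far from the rate seen at the start.

```python
import math
from collections import deque


class MeanShiftDetector:
    """Flags drift when the mean of recent values strays from a reference mean."""

    def __init__(self, reference_size=100, window_size=30, threshold=3.0):
        self.reference_size = reference_size
        self.window_size = window_size
        self.threshold = threshold           # allowed gap, in standard errors
        self.reference = []                  # first values seen: the "normal" regime
        self.recent = deque(maxlen=window_size)

    def add(self, value):
        """Feed one value; return True if it looks like a drift has happened."""
        if len(self.reference) < self.reference_size:
            self.reference.append(value)
            return False
        self.recent.append(value)
        if len(self.recent) < self.window_size:
            return False
        ref_mean = sum(self.reference) / len(self.reference)
        ref_var = sum((v - ref_mean) ** 2 for v in self.reference) / len(self.reference)
        recent_mean = sum(self.recent) / len(self.recent)
        # Standard error of a window mean if nothing has changed.
        std_err = math.sqrt(max(ref_var, 1e-12) / len(self.recent))
        return abs(recent_mean - ref_mean) > self.threshold * std_err


def first_drift(values, detector):
    # Return the index of the first value at which drift is signalled, if any.
    for i, value in enumerate(values):
        if detector.add(value):
            return i
    return None


# 1 = wrong prediction, 0 = correct: one error in ten, then one in two from step 300.
errors = [int(i % 10 == 0) if i < 300 else int(i % 2 == 0) for i in range(600)]
print(first_drift(errors, MeanShiftDetector()))
```

On this synthetic error stream the signal arrives shortly after step 300. Once it fires, the simplest response described above is to throw the model away and retrain on a chunk of recent data.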
Yeah, exactly. It is called pruning: you prune the branches off the tree that are no longer performing as you want them to.

The alternative to explicit handling is implicit drift handling. So rather than looking at the data or the performance and saying something has changed, we need to take action, we're just continually taking action. There are various approaches to implicit drift handling. The first and probably simplest one is to use a sliding window. If we imagine we have the data stream with instances arriving like this, we could say we have a sliding window of three instances and we learn a model off of them. We then take the next three and learn a model off of them. So as each instance arrives we get rid of the oldest instance. This makes the assumption that the oldest instances are the least relevant. This is usually the case, so it's a valid assumption to make, and this performs okay. The problem with this, though, is that it provides a crisp cut-off point. Every instance within the window is treated as having exactly the same impact on the classifier; they're weighted the same. So we can introduce instance weighting, so that older instances have a lower weight and their impact on the classifier is less. So again, the more recent instances will have the largest impact on the current model. And again, the algorithms that use instance weighting will usually have some threshold.
So once the weight gets below a certain point, they say that's the instance gone, and we delete it.

Presumably the windows can be larger or smaller?

Yes, setting the window size is a pretty important parameter. If you have a window that is too large, then, okay, you're getting a lot of data to construct your model from, which is good in the sense that learning from more data is usually good. What it also means is that if there are very short-term drifts, so a drift happens within the window, then we don't learn from that drift, if that makes sense, because we see it all as one chunk of the data. Then if you set the window to be too small, we can react very well to very short-term drifts in the stream, but you then have a very limited amount of data to work on to construct the model. So there are methods that will automatically adjust the window size. During times of drift the window will get smaller, because we want to be changing the model very rapidly, and during times when everything is very stable the window will grow to be as large as possible, so that we can use as much data to construct the model as possible.

The problem with sliding windows and instance weighting is that you need all of those instances available to construct the model continuously. Every time you add a new instance and delete another one, you need to reconstruct that model. The way we can get around this is by using single-pass algorithms. We see each instance once, use it to update the model, and then get rid of that instance. It's probably still in long-term permanent storage, but in terms of what is being accessed to construct the model, it's gone.

In that respect, then, you've got the information out of the instance, but you don't need the instance itself?
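As an aside, the sliding window and instance weighting ideas discussed above can be sketched side by side. In this illustration the "model" is just a mean, standing in for whatever classifier would be rebuilt from the stored instances. The class names, the decay factor and the weight threshold are assumptions made for the example.

```python
from collections import deque


class SlidingWindowMean:
    """Keeps the last `size` values; every value in the window counts equally."""

    def __init__(self, size):
        self.window = deque(maxlen=size)   # the oldest value is dropped automatically

    def add(self, value):
        self.window.append(value)

    def estimate(self):
        return sum(self.window) / len(self.window)


class WeightedMean:
    """Older values fade: weights shrink by `decay` at every step, and a value
    is deleted once its weight falls below `min_weight`."""

    def __init__(self, decay=0.5, min_weight=0.1):
        self.decay = decay
        self.min_weight = min_weight
        self.items = []                    # (value, weight) pairs, oldest first

    def add(self, value):
        # Age every stored instance, then drop those whose weight is too low.
        aged = [(v, w * self.decay) for v, w in self.items]
        self.items = [(v, w) for v, w in aged if w >= self.min_weight]
        self.items.append((value, 1.0))

    def estimate(self):
        total = sum(w for _, w in self.items)
        return sum(v * w for v, w in self.items) / total


window = SlidingWindowMean(size=3)
weighted = WeightedMean(decay=0.5, min_weight=0.1)
for value in [0, 0, 0, 10]:                # the stream jumps from 0 to 10
    window.add(value)
    weighted.add(value)
print(window.estimate(), weighted.estimate())
```

After the jump, the weighted estimate has moved further towards the new value than the equally weighted window has, because the newest instance carries the most weight. Both versions still have to keep the instances around and recompute from them, which is the cost that single-pass algorithms avoid.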
Yeah, exactly. We see the instance, we incorporate what we can from it into the current model, we get rid of it, and that instance's impact is still in the model. An example would be a decision tree. Decision trees are constructed by splitting nodes where we're going to get a lot of information gain from making a split on a certain attribute. As the data stream changes, the information gain that we might get at some of these nodes may change. So say we get a new instance and it says, okay, this actually makes this a split worth making. We can make that split, continue growing the tree, and then that instance can go. We don't need it anymore, but we still have the information from it in our model.

So we've got our implicit and explicit drift handling approaches. You can also have hybrid approaches. Explicit drift handling is very good at spotting sudden drift: any time there's a sudden change there'll be a sudden drop in performance, and that's very easy to pick up on with a simple statistical test. But when we then add implicit drift handling on top of that, it means that we can also deal very well with gradual drift. Gradual drift is a bit more difficult to identify, simply because if you look at the previous instance, or say the 10 previous instances, with a gradual drift you're not going to see a significant change, so it's a lot harder to detect. By combining the implicit and explicit drift handling methods, we end up with a performance plot that would look something like this: we maintain pretty good performance for the entire duration of the data that's arriving.

The problems of a changing data distribution are not the only problems with streams. If you can imagine a very high-volume, high-speed stream, you've got a lot of data arriving in a very short amount of time. Say you take a single instance of that data stream and it takes you, like, five seconds to process it, but in those five seconds you've had 10 more instances arrive.
You're going to get a backlog of instances very, very quickly. So the model update stage needs to be very quick to avoid getting any backlog. The second problem is that with these algorithms we're not going to have the entire history of the stream available to create the current model. So the models need to be, for example, the single-pass algorithms that can say, we don't need the historical data: we have the information we need from it, but we don't need to access it. Otherwise you just end up with huge, huge datasets having to be used to create these models all the time. And again, these streams are potentially infinite. We don't know when they're going to end and we don't know how much data they're going to end up containing.

Most of the well-known machine learning algorithms have been adapted in various ways to be suitable for streams. They now include update mechanisms, so they're more dynamic methods. This includes decision trees, neural networks and k-nearest neighbours, and clustering algorithms have also been adapted. So basically any classic algorithm you can think of, there are multiple streaming versions of it now.
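The single-pass idea, keeping the information from an instance without keeping the instance, can be illustrated with running statistics. The sketch below uses Welford's online algorithm for mean and variance; it is chosen here purely as a small, well-known example of a single-pass update and is not something named in the talk. Memory use stays constant however long the stream runs.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one value into the summary; x itself is not stored anywhere.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)          # each value is seen once, then discarded
print(stats.mean, stats.variance())
```

Each update costs the same small amount of work no matter how much data came before, which is the property the update stage needs if it is to keep up with a fast stream and avoid a backlog.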
So if you are interested in these streaming algorithms, there are a few bits of software that you could look at. For example, there's the MOA suite of algorithms, which interfaces with the WEKA data mining toolkit. This is free to download and use, and includes implementations of a lot of popular streaming algorithms. It also includes ways to synthesize data streams, so to generate essentially a stream of data that you can then run the algorithms on, and you can control the amount of drift that you get, how sudden it is, and things like that. That's quite good to play around with to see the effects that different kinds of drift can have on accuracy. In terms of big data streams specifically, there's software such as the Spark Streaming module for Apache Spark, and there's also the more recent Apache Flink, which are designed to process very high-volume data streams very quickly.

You just mentioned some software there that people can download and have a play with, but in the real world, in industry, in websites and services that we use every day, who is using these streaming algorithms?

A lot of the big companies, or most companies to be honest, will be generating data constantly that they want to model. So for example, Amazon recommendations, or what to watch next, what to buy next: they want to understand changing patterns so that they can keep updating whatever model they have to get the best recommendations. Again, optimizing which ads to suggest based on whatever search history you have, that's another thing that is being done via this. So yeah, there are a lot of real-world applications for this stuff.
Dealing with Dynamic Data - Computerphile. Posted by 林宜悉 on 14 January 2021.