Name: 處理動態數據 - Computerphile (Dealing with Dynamic Data - Computerphile)
Uploaded: 2021-01-14T10:13:30.000Z
Duration: 17 min 25 s
Description: 【看影片學英語】數萬部 YouTube 影片，搭配英漢字典即點即查，輕鬆掌握單字發音與用法，長久累積看電影不必再看字幕。

So now we're going to talk about something that is kind of a specific part of Big Data

情態副詞

So the velocity part huge amounts of data being generated all the time, which essentially is a data stream

So that's a flow of instances so you could have a flow of images coming in have a flow

Video coming in or just a flow of essentially lines to go into a database the thing about the dynamic data

Is that the patterns within it can change so if we've got for example a static machine learning model?

That's not going to deal very well with a changing pattern happening in the data

We build a single model at the start. We use it to make predictions on later data the model

Accuracy can kind of degenerate over time as that data changes

The problem of kind of designing algorithms to deal with this real time data

There's been a research topic for kind of several years now and there's several real world applications on top of that as well

Banks trying to detect fraud as patterns change of different forwards occurring

They want their models to kind of be able to update all the time similar for intrusion detection systems and computer networks

Ideally, you would want this to happen automatically so minimum interference from humans, because otherwise they've got to spot when changes are happening

We just want the machines to be able to do it by themselves

So if you think about a traditional classification problem on a static batch of data

You assume you have all of that data there already. You have your training test set and you have

Features which X and then there's some unknown

function f of X which gives you the class label and you want to find a

hypothesis that gives you the best prediction possible

So what kind of approximates this function as well as possible?

So you have a red class and a green class and we have instances that look like this our function f of X may create

A class boundary that looks like this. So anything on this side is red. Anything on this side is green

Our model doesn't know that but we use standard machine learning techniques decision trees new or networks

Whatever you want and it learns a boundary

That looks like that and so that will do okay on whatever dates that we have

It's not effect, but it may get the results that we want. This is static classifications. We already have all our data

So we've got our data we've done our machine learning

This is the decision boundary that we've learnt. The dotted line is what is actually the boundary this gives. Okay results

Let's now say that this is happening in a data stream. So we get this data originally and we build this model

But then later on we have a similar distribution of instance arriving

However, what now happens is that some of these instances are now in reality in a different class

so the true boundary is now here, but we still have our

Model with this decision boundary and so we're now predicting instances here and here into the wrong class if we use that

Exact same model. So what we would see in this case in

Centage accuracy over time you would see at this change point

Accuracy would plummet. So this problem here is called real concept drift. What is effectively happened here

is that this function the unknown function has changed but we've kept our hypothesis our machine learning model exactly the same and so

we can also have a similar problem called virtual drift and what would happen in this case is

Target decision boundary has stayed the same from this original

But the instances we now see in the stream are somewhere else in the feature space. Let's say we now see

Kind of optimal decision boundary is in exactly the same place. We now have different data. That means that are predicted boundary

It's going to give this instance as wrong because we haven't got a way of incorporating

information from this instance into the original model that we built both of these will create this decrease in accuracy so we can also

Look at the drift in the data streams in terms of the speed they happen so something that would give us an accuracy plot that

Looks like this is called sudden drift we go from straight from one concept in the data stream

So one decision boundary straight to another one another possible thing that could happen

So rather than this sudden switch this decision boundary gradually shifts save me your life if we're looking at a very very oversimplified

Intrusion detection system. We have only two features that we're looking at in the original dataset

Problem and intrusion anything on this side is good in this case

What happens is that suddenly there's a new way of attacking the network and so suddenly

What was here is now not good. So we see those patterns and we say ok

No, that counts as an intrusion in this case

what it means is that we see something that we've not seen before so the model hasn't been trained with any similar data and

So it could get it, right it could fall somewhere up here and we correctly say this is bad

but it could also fall in an area that we didn't learn the decision boundary so well, so

Yeah, we get that prediction wrong. We just looked at what?

The problems are with using a single static model when we're dealing with incoming data

Over time the distribution changes and we start to see a decrease in accuracy on whatever model we built

So what happens in kind of a stream machine learning algorithm would be so first of all

You've got X arriving. This is your instance in our previous example, this would just have two values associated with it

What would first happen is we make a prediction? So in the classification example, we classify this. Yes

It's an intrusion. No, it's not intrusion using the current model that we have then what happens is we update whatever model we have

using information from X and we'll talk about some of the ways that this is done in a second and

One of the kind of caveats with stream machine learning is that you need for this to happen you?

The real class label if you're doing classification

So in order to incorporate information from this instance into whatever model you've got you need to have that label there now in some cases

It's very easy to say we've seen this data. This is what it's classified us

And we do that immediately if we're thinking about

Making weather predictions we can almost immediately say yes. This is what the weather is like it may be a day's delay

But yeah, we can that's pretty immediate thing four things for example for detection

Predict it is not being fought and then suddenly two days later this person figures out that actually there's something wrong with their bank accounts

They phone up and it does turn out to be fraud

And so we'd only have the label for that data after that has happened

At this point and so the goal of updating the model over time is so that rather than having a performance plot

That looks like this so we go from 95s and accuracy down to 20% accuracy

We instead end up with something that okay

We may drift a little bit here and have a tiny performance decrease

But the model should very quickly recover back to the original level and we still have a high performance

So that's the goal of this model update. There's various approaches we can take so the first one is explicit drift handling

which means that we first of all detect when a drift happens in the data stream

We have drift detection methods and these are usually statistical tests that look at some aspects of the data arriving

So if the distribution of the data we see arriving and the distribution of the classes we see is changing

If morph like that as a drift some of these we'll also look at the performance accuracy of the classifier

So if the classifier performance suddenly drops we can say well, we've probably got a drift here

字幕列表影片播放

處理動態數據 - Computerphile (Dealing with Dynamic Data - Computerphile)

specific

essentially

pattern

basically