AI Safety Gym - Computerphile

  • This episode has been brought to you by Fasthosts.

  • Find out more about them later.

  • So I wanted to talk about this paper out of OpenAI,

  • 'Benchmarking Safe Exploration in Deep Reinforcement Learning'. What comes with this?

  • There's a blog post.

  • So I wanted to explain this, because when I saw the blog post, and I think a lot of Computerphile viewers would have the same reaction, what I thought is: what the hell is a 'Safety Gym'?

  • Some kind of meat-space thing?

  • So it's a bunch of these environments, right?

  • That, um, that allow you to train?

  • Yeah, the OpenAI Safety Gym benchmark suite, which is a bunch of these environments that you can run your systems in and have them learn.

  • Is this anything to do with those?

  • AI Gridworlds sort of thing?

  • Yeah, yeah, kind of.

  • In the same way that the Gridworlds paper did, this paper introduces environments that people can use to test their AI systems, and this is focusing specifically on safe exploration.

  • And it has a few differences.

  • They're kind of complementary. The environments in this are a little bit more complex; they're continuous in time and in space, in a way that the grid worlds aren't. Grid worlds are all, like, very discrete.

  • You take turns and you move by one square.

  • Whereas in this case, it's a lot more like MuJoCo,

  • where you actually have, like, a physics simulation that the simulated robots move around in.

  • So it's a slightly more complex kind of environment.

  • Um, but the idea is to have, in the same way as with grid worlds or anything else, a standardized set of environments so that, you know, everybody's comparing like with like, and you actually have sort of standardized measurements and you can benchmark.

  • You can compare different approaches and actually have metrics that tell you which one's doing better, which is, like... it's not super glamorous, but it's a real prerequisite for how progress actually gets made in the real world.

  • If you can't measure it, it's very hard to make progress, or to know whether you're making progress.

  • The problem of safe exploration is in reinforcement learning, which is one of the most important and popular ways of creating AI systems for various types of problems.

  • Um, the system is interacting with an environment, and it's trying to maximize the amount of reward that it gets.

  • So you write a reward function, and then it's trying to get as much out of that as it can.

  • The way that it learns is by interacting with the environment.

  • And so this basically looks like trial and error, right?

  • It's doing things, and then it gets the reward signal back and it learns, Oh, that was a good thing to do.

  • That was a bad thing to do.
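(To make that trial-and-error loop concrete, here is a minimal sketch in Python. It uses the Gymnasium library and the toy CartPole task purely as a stand-in environment; nothing here comes from the paper, and a real agent would replace the random action choice with a policy it updates from the reward signal.)

```python
# Minimal reinforcement learning interaction loop: act, observe, collect reward.
# CartPole is used purely as a stand-in environment for illustration.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()                          # try something
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                      # feedback: was that good or bad?
    if terminated or truncated:                                 # episode over, start again
        obs, info = env.reset()

print("reward collected by pure trial and error:", total_reward)
env.close()
```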

  • Um, and the problem with that is it's very difficult to do that safely. And it's kind of a fundamental problem, because in order to do exploration, you have to be taking actions where you don't know what the result is going to be, right? The only way that you can learn is by trying things that you're not sure about.

  • But if you're trying random things, some of those things are gonna be things that you really shouldn't be doing.

  • Inherent in any exploration is danger.

  • I mean, that sort of goes with the territory for human explorers.

  • So Apollo 11 right?

  • Right, we'd done the research.

  • We'd sent spaceships out.

  • We had an idea of what's what, but it was still a dangerous thing to go and land on the moon.

  • Exploring comes with danger, right?

  • Right.

  • Yeah, but there are safe ways to do it, or there are safer ways to do it.

  • They could have tried to launch astronauts on the first thing that they ever sent to the moon.

  • They didn't do that because they knew how much they didn't know, and they didn't want to risk it until they actually had a pretty good understanding of what they were dealing with.

  • And it's kind of similar. Like, if you look at some of the standard reinforcement learning approaches to exploration, what they often involve is doing things completely at random, right?

  • You just... especially at the beginning of the training process, where you really don't understand the dynamics of the environment,

  • you just flail and see what happens, right?
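(One standard way that "flailing" shows up in practice is epsilon-greedy exploration: act randomly with some probability, and shrink that probability as training goes on. A rough sketch follows; all the numbers and names here are illustrative assumptions, not anything from the paper.)

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: usually pick the action we currently think is best,
    but with probability epsilon pick completely at random (the 'flail')."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Early in training epsilon is high, so behaviour is almost entirely random;
# it is typically decayed towards a small floor as the agent learns.
epsilon = 1.0
for episode in range(1000):
    # ... run one episode, choosing actions with choose_action(q, epsilon) ...
    epsilon = max(0.05, epsilon * 0.995)
```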

  • And human beings actually pretty much do this as well.

  • It's just that when babies flail, they aren't really able to hurt anything.

  • But if you have a three ton robot arm flailing around trying to learn the dynamics of the environment, it could break itself.

  • It could hurt somebody.

  • But when you mention three-ton robot arms flailing around, I'm guessing that the people who do that kind of development will have done some kind of simulation before they've built it, right?

  • There's two sides to it, right?

  • Part of the reason why we haven't had that much safe exploration research is because simulation is good, but also, part of why we use simulation so much is because we don't know how to do it safely in the real world.

  • For very simple tasks, you can write good simulators that accurately represent all of the dynamics of the environment properly.

  • But if you want a system that's doing something complicated... Like, generally speaking with these robots, for example, you still don't go near them while they're operating.

  • They don't smash themselves up, and they don't smash the environment, because you've simulated that.

  • But how do you write a simulation of how a human being actually moves in an environment with a robot?

  • This is why you look at self-driving cars:

  • They train them a huge amount in simulation, but it's not good enough.

  • It doesn't capture the complexity and the diversity of things that can happen in the real world.

  • And it doesn't capture the way that actual human drivers act and react.

  • So everyone who's trying to make self-driving cars, they are driving millions and millions of real-world miles, because they have to, because simulation doesn't cut it.

  • And that is a situation where, now, they're not just running reinforcement learning on those cars, right?

  • We don't know how to safely explore in a self driving car type situation.

  • In the real world, trying random inputs to the controls is like not viable.

  • If you're using reinforcement learning, if you have something that you don't want the agent to do, you give it a big penalty, right?

  • So you might build a reward function that's like: I want you to get from here to here, and you get points for getting there faster, but if there's a collision, then you get minus a large number of points.
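(A hand-written reward function of that shape might look like the sketch below. Everything in it is made up for illustration; the point is just that somebody has to pick that penalty number.)

```python
def driving_reward(progress_m: float, seconds_elapsed: float, collided: bool) -> float:
    """Naive hand-written reward: reward progress, penalize time,
    and hit collisions with one big fixed penalty."""
    reward = progress_m - 0.1 * seconds_elapsed   # faster progress => more reward
    if collided:
        reward -= 10_000.0                        # "minus a large number of points" -- but why 10,000?
    return reward
```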

  • Sometimes people talk about this problem as though reward functions are like not able to represent the behavior that we actually want.

  • You know, people say you can't write a reward function that represents this, and it's like, well, I mean, you can write a reward function, but, like, can you plausibly write it?

  • It's possible, but, like, how are you actually going to do it?

  • Um, so, like, yeah, you're giving a big penalty to collisions, but like, how do you decide that penalty?

  • What should it be?

  • You have this problem, which is that in the real world, people are actually making a trade off between speed and safety all the time.

  • Everybody does.

  • Any time you go just a little bit after the light has turned red, right?

  • Or just a little bit before the light has turned green.

  • You're accepting some amount of risk for some amount of time.

  • If you go after it's gone red for long enough, you will meet someone who went a bit early on the green and, you know, teach each other things about the trade off between speed and safety that will stay with you for the rest of your life.

  • People talk about it like what we want is no crashes.

  • And that's not actually how it works, because, if you wanted that, that would correspond to sort of infinite negative reward for a collision.

  • And in that scenario, the car doesn't go anywhere.

  • If that was what we really thought, then the speed limit would be 0.1 miles an hour.

  • There is some, like, acceptable tradeoff between speed and safety that we want to make.

  • The question is, how do you actually pick the size of that punishment to make it sensible?

  • Like, how do you find that?

  • Implicitly.

  • It's kind of difficult.

  • One approach that you can take to this, which is the one that this paper recommends, is called constrained reinforcement learning, where you have your reward function, and then you also have constraints on these cost functions.

  • So in standard reinforcement learning, you're just finding the policy that gets the highest reward.

  • Right?

  • Um, whereas in constrained reinforcement learning, you're saying: given only the set of policies that crashes, uh, less than once per however many million miles, find the one of those that maximizes reward.

  • So you're maximizing reward within these constraints.
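(Written out as maths, the setup being described is usually framed roughly as a constrained optimization; the notation below is generic constrained-MDP notation rather than anything quoted from the paper. J_r is expected return, J_c is expected cost, and d is the allowed cost budget.)

$$
\max_{\pi}\; J_r(\pi) \quad \text{subject to} \quad J_c(\pi) \le d,
\qquad
J_r(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_t \gamma^t\, r(s_t,a_t)\Big],\quad
J_c(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_t \gamma^t\, c(s_t,a_t)\Big].
$$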

  • Yeah. Reinforcement learning and constrained reinforcement learning are both sort of frameworks; they're ways of laying out a problem.

  • They're not like algorithms for tackling a problem.

  • They're a formalization that lets you develop algorithms. Like sorting or something, I guess.

  • You know, you've got a general idea: like, you have a bunch of things and you want them to be in order.

  • But, like, how many there are, what kind of things they are, what the process is for comparing them...

  • And then there are different algorithms that can tackle it.

  • I haven't seen a proof of this, but I think that for any constrained reinforcement learning, uh, setup, you could have one that was just a standard reward function.

  • But this is like a much more intuitive way of expressing these things.
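(One common way to make that correspondence concrete, and a standard family of constrained RL methods, is a Lagrangian approach: fold the cost into the reward with a multiplier, and adapt the multiplier until the constraint is met. The sketch below is illustrative only; `agent`, `env`, and `collect_episodes` are hypothetical stand-ins, and this is not claimed to be the paper's specific algorithm.)

```python
# Illustrative Lagrangian-style constrained RL loop (not the paper's specific method).
# `agent`, `env`, and `collect_episodes` are hypothetical stand-ins for your own
# learner, environment, and rollout code.
lam = 0.0          # Lagrange multiplier on the cost constraint
cost_limit = 25.0  # d: allowed cost per episode
lam_lr = 0.01      # step size for the multiplier update

for iteration in range(500):
    episodes = collect_episodes(agent, env)

    # Policy step: an ordinary RL update, but on reward minus lambda * cost,
    # which is just a standard (penalized) reward function.
    agent.update(episodes, reward_fn=lambda r, c: r - lam * c)

    # Dual step: if average episode cost exceeds the limit, increase lambda
    # (penalize unsafe behaviour more); if it's comfortably under, relax it.
    avg_cost = sum(ep.total_cost for ep in episodes) / len(episodes)
    lam = max(0.0, lam + lam_lr * (avg_cost - cost_limit))
```

This also shows why a constrained problem can, in principle, be expressed as "just a reward function": the penalty is there, but its size is being discovered rather than hand-picked.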

  • So it kind of reminds me of... there's a bit in Hitchhiker's Guide where somebody is like, oh, you've got a solution?

  • No, but I've got a different name for the problem.

  • I mean, this is better than that, because it's a different way of formalizing the problem, a different way of sort of specifying what the problem is, and actually, a lot of the time, finding the right formalism is a big part of the battle, right?

  • The problem 'how do you explore safely?' is, like, underdefined.

  • You can't really do computer science on it.

  • You need something that's expressed in the language of mathematics, and that's what constrained reinforcement learning does.

  • It gives you a slightly more intuitive way of specifying, uh, rather than just having this one thing, which is this single value.

  • Are you doing well or not?

  • You get to specify: here's the thing you should do,

  • and then here is also the thing, or things, that you shouldn't do.

  • It's a slightly more natural, slightly more human-friendly formalism that, you would hope, would make it easier to write these functions to get the behavior that you want in the real world.

  • Um, it's also nice because, if you're trying to learn... So I did a video recently on my channel about reward modeling, where you actually learn the reward function rather than writing it.

  • Part of your training system is actually learning what the reward function should be in real time, and the idea is that this might help with that as well, because it's kind of easier to learn these things separately rather than trying to learn several things at the same time.

  • And it also means you can transfer more easily.

  • Like, if you have a robot arm and it's making pens, and you want to retrain it to make mugs or something like that, then it would be that you'd have to relearn the reward function completely.

  • But if you have a constraint that it's learned, that's like 'don't hit humans', that is actually the same between these two tasks.

  • So then it's only having to relearn the bits that are about making the thing, and the constraints it can just keep from one to the next, so it should improve performance and training speed and also safety.

  • And then the other thing that's kind of different about various formulations of constrained reinforcement learning is you care about what happens during training as well, right?

  • In standard reinforcement learning, you are just trying to find a policy that maximizes the reward, and, like, how you do that is kind of up to you.

  • But what that means is that standard reinforcement learning systems, in the process of learning, will do huge amounts of unsafe stuff, right? Whereas in a constrained reinforcement learning setting, you actually want to keep track of how often the constraints are violated during the training process.

  • And you also want to minimize that.

  • Which makes the problem much harder, because it's not just 'make an AI system that doesn't crash'; it's 'make an AI system that, in the process of learning how to drive at all, crashes as little as possible'.

  • Um, which just... yeah, it makes the whole thing much more difficult.
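(That "count the violations while it learns" idea is simple to state in code. The sketch below assumes a Safety-Gym-style environment that reports a per-step cost in the `info` dictionary; the `agent` object and the exact `"cost"` key are assumptions for illustration, not quoted from the paper.)

```python
# Track not just reward, but the total cost (constraint violations) accumulated
# over the whole training run -- every unsafe step taken while learning counts.
total_training_cost = 0.0

for episode in range(1000):
    obs = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = agent.act(obs)                          # hypothetical learning agent
        obs, reward, done, info = env.step(action)
        episode_return += reward
        total_training_cost += info.get("cost", 0.0)     # per-step cost signal
    agent.learn()                                        # hypothetical update step

print("final return:", episode_return, "| cost incurred during training:", total_training_cost)
```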

  • And so we have these simplified environments in which you can test your different approaches, and they're fairly kind of straightforward reinforcement-learning-type setups.

  • You have these simulated robots.

  • There are three of them.

  • You've got Point, which is just a little round robot with a square on the front that can turn and drive around; Car, which is a similar sort of setup but has differential drive,

  • so you have input to both of the wheels, sort of tank-steering-type setup, and that drives around; and you have Doggo.

  • Doggo, I kid you not, which is a quadruped that walks around. And then you have a bunch of these different environments, which are basically like: you have to go over here and press this button, and then when you press the button, a different button will light up,

  • and you have to go over and press that one, and so on; or get to this point; or, like, push this box to this point.

  • You know, it's basic interactions, but then they also have these constraints built in, which are things like hazards, which are areas that you're not supposed to go into,

  • or vases, as they call them, which are just objects that you're not supposed to bump into.

  • And then the hardest one is gremlins, which are objects that you're not supposed to touch, but they move around as well.

  • The idea is, you're trying to create systems that can learn to get to the areas they're supposed to be, or push the box, or press the buttons, or whatever it is that they're trying to do, while simultaneously avoiding all of these hazards,

  • and not breaking the vases, not bumping into the gremlins, or whatever else, and that they can learn while violating these constraints as little as possible during the training process as well.
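(For reference, the released environments are used roughly like any other Gym environment, with IDs that follow a robot + task + difficulty naming pattern. The exact environment ID and the `"cost"` key below are my recollection of the Safety Gym package rather than anything stated in this transcript, so treat them as assumptions.)

```python
import gym
import safety_gym  # importing this registers the Safexp-* environments with gym

# Point robot, Goal task, level 1 (hazards and vases present).
env = gym.make("Safexp-PointGoal1-v0")

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print("reward:", reward, "cost this step:", info.get("cost"))
```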

  • Um, which is a really interesting and quite hard problem. And then they provide some benchmarks, and they show that standard reinforcement learning agents suck at this; they'll do anything to learn.

  • They don't care about the constraints? Exactly, exactly.

  • Then there are a few different other approaches that do better.

  • This is really nice: if you have ideas, then again, like the Gridworlds thing, you can download this and have a go.

  • You can try training your own agents and see how well you can do on these benchmarks.

  • And if you can beat what OpenAI has done, you know, then you've got something that's publishable, that's going to advance the field.

  • So I really like this as a piece of work, because it provides a foundation for more work going forward in a kind of standardized, understandable way.

  • Fasthosts is a UK-based web hosting company which offers a wide range of web hosting products and other services.

  • They aim to support UK businesses and entrepreneurs at all levels, providing effective and affordable hosting packages to suit any need.

  • As you'd expect from someone called Fasthosts, they do domain names.

  • It's easy to register, and they have a huge choice of domains with powerful management features included.

  • One thing they do offer is an e commerce website builder.

  • This provides a fast and simple way for any business to sell online.

  • It's a drag-and-drop interface, so it's easy to build a customized shop on the web, even if you have no technical knowledge.

  • You can create an online store, and you can customize it simply with the drag-and-drop functionality.

  • No designers or developers required.

  • Fasthosts' data centers are based in the UK, alongside their offices.

  • So whether you choose a lightweight web hosting package or go for a fully fledged dedicated box, their expert support teams are available 24/7. Find out more by following the link in the description below.

This episode has been brought to you by Fasthosts.
