[MUSIC PLAYING] JOSH DILLON: Hello, everyone. I'm Josh Dillon, and I'm a lead on the TensorFlow Probability team. And today, I'm going to talk to you about probability stuff and how it relates to TensorFlow stuff. So let's find out what that means. OK, so these days, machine learning often means specifying deep model architectures and then fitting them under some loss. And happily, Keras makes specifying model architecture relatively easy. But what about the loss? Choosing the right loss is tough. Improving one-- even a reasonable one-- can be even tougher. And once you fit your model, how do you know it's good? Does accuracy tell the full picture? Why not use mean, entropy, mode? Wouldn't it be great if there existed some mathematical framework that unified these ideas? Better still, wouldn't it be nice if it were plug and play with Keras and the rest of TF? This would make comparing models easier by simply maximizing likelihood and having readily available evaluative statistics. We could rapidly prototype different generating assumptions and quickly reject the bad ones. In short, wouldn't it be great if we could do this-- just say I want to maximize the log likelihood and then summarize what I learned easily and in a unified way? So let's play with that idea. Here, we have a data set-- these blue dots. And our task-- our pretend task-- is to predict the y-coordinate from the x-coordinate. And the way you might do this is specify some deep model. And of course, you might choose the mean squared error as your loss function. OK. But our wish here is to think probabilistically. And so that means maximizing the log likelihood, as indicated here with this lambda function-- the negative of the random variable's log_prob under y. And what we want, in addition to that, is to get back a distribution-- a thing that has attached to it statistics that we can use to evaluate what we just learned. If only such a thing were possible.
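The "maximize the log likelihood instead of minimizing a loss" idea can be spelled out numerically. Below is a minimal NumPy sketch (not the TFP API; the helper name `normal_nll` and the data are mine): the negative log-likelihood of a Normal with its scale frozen at 1 is exactly half the mean squared error plus a constant, which is the sense in which choosing MSE is already a hidden distributional assumption.

```python
import numpy as np

def normal_nll(y, mu, sigma=1.0):
    # Average negative log-likelihood of y under Normal(mu, sigma).
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
y = rng.normal(size=100)    # stand-in targets
mu = rng.normal(size=100)   # stand-in model predictions

mse = np.mean((y - mu) ** 2)
# With sigma fixed at 1, the NLL is MSE/2 plus the constant 0.5*log(2*pi),
# so minimizing one minimizes the other.
assert np.isclose(normal_nll(y, mu), 0.5 * mse + 0.5 * np.log(2 * np.pi))
```

In TFP itself, the loss in the talk is written against a distribution-valued model output, roughly `lambda y, rv_y: -rv_y.log_prob(y)`.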
Of course, it is, and you can do this now. Using TensorFlow Probability distribution layers, you can specify the model as part of your deep net. And the loss now is actually part of the model, sort of the way it used to be-- the way it's meant to be. And so let's unpack what's happening here. So we have two dense layers. That's sort of business as usual. The second one outputs one float, and that one float is parameterizing a normal distribution's mean. And that's being done through this DistributionLambda layer. In so doing, we're able to find this line. That looks great. And the best part is, once we instantiate this model with test points, we get back a distribution instance-- for which you get not just the mean, which is what you'd get today, but also entropy, variance, standard deviation, all of these things. And you can even compare between this and other distributions, as we'll see later. But if we look at this data, something's still a little fishy here, right? Notice that as the magnitude of x increases, the variance of y also seems to increase. So that means that maybe our model's a little suspicious. So since we're in this probabilistic framework and we're no longer doing loss hacking-- we're actually building a model-- what can we do to fix this? Answer-- learn the variance too. It's actually pretty obvious. If we're fitting a normal, why on earth do we think that the variance would just be 1? And by the way, that's what you're doing when you use mean squared error. And so now, to achieve this, all I had to do is make my previous layer output two floats. I pass one in as the mean of the normal, one in as the standard deviation of the normal. And presto chango, now I've learned the standard deviation from the data itself. That's what the green lines are. So this is really cool, because now, if you're a statistician, you would say, hey, I'm able to handle heteroscedasticity.
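To make "learn the variance too" concrete, here is a hedged NumPy toy (invented data; a grid search stands in for the gradient-trained second output, and the linear form of the noise scale is my assumption, not the talk's): on data whose spread grows with |x|, a learned scale achieves a lower negative log-likelihood than the MSE-style assumption that sigma is 1.

```python
import numpy as np

def normal_nll(y, mu, sigma):
    # Average negative log-likelihood of y under Normal(mu, sigma).
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 500)
# Heteroscedastic data: noise scale grows with |x|, like the talk's scatter plot.
true_sigma = 0.1 + 0.4 * np.abs(x)
y = 2.0 * x + rng.normal(scale=true_sigma)

mu = 2.0 * x                          # pretend the mean is already fit
nll_fixed = normal_nll(y, mu, 1.0)    # the hidden MSE assumption: sigma == 1

# "Output two floats": a crude grid search stands in for learning
# sigma(x) = a + b*|x| by maximum likelihood.
grid = np.linspace(0.01, 1.0, 40)
a, b = min(((a, b) for a in grid for b in grid),
           key=lambda ab: normal_nll(y, mu, ab[0] + ab[1] * np.abs(x)))
nll_learned = normal_nll(y, mu, a + b * np.abs(x))

# Learning the variance strictly improves the likelihood on this data.
assert nll_learned < nll_fixed
```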
If you want a $10 word, you can call this aleatoric uncertainty. And what this really means is that you're learning known unknowns. It means that the data itself had variance, and you learned it. And it cost you basically nothing but a few keystrokes. And furthermore, the way to do this was self-evident from the very fact that you were using a normal distribution, which had this curious constant just sitting there. So this is good. But, hm, I don't know. Is there enough data for which we can reliably claim that this red line is actually the mean, and these green lines are actually the standard deviation? How would we know if we have enough data? Is there anything else we can do? Of course, there is. Why learn just a single set of weights? A Keras dense layer has two components-- a kernel matrix and a bias vector. What makes you think that those point estimates are the best, especially given that your data set itself is random and possibly inadequate to meaningfully and reliably learn those point estimates? Instead, if you use a TensorFlow Probability DenseVariational layer, you can actually learn a distribution over weights. This is the same as learning an ensemble that's infinitely large. But luckily, it doesn't take infinitely long to train this ensemble. In fact, it takes just a little bit longer than what it took to train on the previous slides. And as you can see here, all I had to do is replace the Keras Dense layer with the TFP DenseVariational layer, and in so doing, achieve this kind of Bayesian weight uncertainty. The $10 word here is epistemic uncertainty. But again, I like to think of it as unknown unknowns. I'm not sure what my data is not telling me, so I'm going to be careful in the bookkeeping I do when tracking the weights that I learn. As a consequence, of course, this means that any instantiation of this model is now actually a random variable, because the weights are random variables.
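What a distribution over weights buys you can be sketched without any training at all. The snippet below is a hypothetical illustration (the Gaussian "posterior" over a line's slope and intercept is invented, standing in for what DenseVariational would learn): each draw of the weights gives one member of the ensemble, and the spread of the ensemble's predictions is the epistemic uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)

# Hypothetical learned posterior over the two weights of a line y = w*x + b:
# independent Gaussians, a stand-in for a trained DenseVariational layer.
w_loc, w_scale = 2.0, 0.3
b_loc, b_scale = 0.5, 0.1

# Each draw from the weight distribution is one member of the (in principle
# infinite) ensemble; 200 draws approximate it.
n_draws = 200
w = rng.normal(w_loc, w_scale, size=n_draws)
b = rng.normal(b_loc, b_scale, size=n_draws)
lines = w[:, None] * x[None, :] + b[:, None]   # shape (n_draws, n_points)

pred_mean = lines.mean(axis=0)   # "here's what I think will happen"
pred_std = lines.std(axis=0)     # "here's how much you should trust me"

# Slope uncertainty fans the lines out, so uncertainty grows with |x|.
assert pred_std[0] > pred_std[25]
assert pred_std[-1] > pred_std[25]
```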
And that's why you see here all of the lines. There are many lines. We have an ensemble of them. But if you were to average those and take the sample standard deviation, say, then that would give you an estimate of credible intervals over your prediction. So now you can go to your customer and say, look, here's what I think will happen, and here's how much you should trust me. So this is great, right? But we seem to have lost the heteroscedastic part. Notice that the blue dots are still more dispersed on the right-hand side. So can we do both? Of course, we can. It's all modular. I just have my DenseVariational layer output two floats instead of one, like we did before, and feed that into my output layer, which is a normal distribution. And presto chango, I'm learning both known unknowns and unknown unknowns, and all it cost me was a few keystrokes. And so what you see here now is an ensemble of standard deviations associated with the known unknown parts-- the variance present or observable in the y-axis-- as well as an ensemble of these mean regressions. OK, that's cool. So I like where this is going. But I have to ask, what makes you think a line is even the right thing to fit here? Is there another distribution we could choose, a richer distribution, that would actually find the right form of the data? And of course, the answer is yes. It's a Gaussian process. By tossing in this fancy distribution, it turns out that the data wasn't linear at all. No wonder we had such a hard time fitting it. It was sinusoidal, and the Gaussian process can see this. How can the Gaussian process see this? Because it treats the loss itself as a random variable. Now, how could you do that if you're just specifying mean squared error as your loss? You can't. It has to be part of your model, and that's the power of probabilistic modeling.
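The Gaussian process step can be sketched in a few lines of NumPy. This is a toy posterior-mean computation under assumptions I'm making up for illustration (sinusoidal data, a squared-exponential kernel, hand-picked length scale and noise level; nothing here is the talk's variational GP layer), showing how a GP "sees" a sinusoid that no line could fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, length_scale=0.7, amplitude=1.0):
    # Squared-exponential kernel: similarity between input locations.
    return amplitude**2 * np.exp(
        -0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

# Sinusoidal data, the kind a line has "such a hard time fitting".
x = np.linspace(0, 6, 40)
y = np.sin(x) + rng.normal(scale=0.05, size=x.shape)

noise = 0.05**2
K = rbf(x, x) + noise * np.eye(len(x))   # kernel matrix plus observation noise

xs = np.linspace(0, 6, 200)
Ks = rbf(xs, x)
# GP posterior mean at test points: K_* (K + noise*I)^{-1} y
mean = Ks @ np.linalg.solve(K, y)

# The posterior mean recovers the sinusoid closely on the data range.
assert np.max(np.abs(mean - np.sin(xs))) < 0.2
```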
When you bake these ideas into one model, you get to move things around fluidly between weight uncertainty and variance in the data, and even uncertainty in the loss function you're fitting itself. And so the question is, how can this all be so easy? How does it all fit together? It's TensorFlow Probability. So TensorFlow Probability is a collection of tools designed to make probabilistic reasoning in TensorFlow easier. It is not going to make your job easy. It's just going to give you the tools you need to express the ideas you have. You still have to have domain knowledge and expertise. But you can encode that domain knowledge and expertise in a probabilistic formalism, and TFP has the tools to do that. Statisticians and data scientists will be able to write and launch the same model. Gone are the days of hacking your model in R and porting it over to a faster language, like C++, or even TensorFlow. You can do it all in the same framework. ML researchers and practitioners will be able to make predictions with uncertainty. If you predict the light is green, you'd better be pretty confident that you should go. You can do that with probabilistic modeling and TensorFlow Probability. So we saw one small part of TFP. Broadly speaking, the tools are broken into two components-- those useful for building models and those useful for doing inference on those models. On the model-building side, you saw the normal distribution and the variational Gaussian process distribution. A distribution is just a collection of simple summary statistics, exactly like it is in every other library. There are a few differences. Our distributions support this concept of batch shape, which automatically takes advantage of vector-processing hardware. But for the most part, they should be pretty natural and easy to use. We also have something called bijectors, which is a library for transforming random variables.
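The batch-shape idea mentioned above can be illustrated in plain NumPy (a sketch of the concept, not TFP's actual Distribution class; the parameter values are invented): one vectorized call evaluates a whole batch of distributions at once, which is what lets the hardware do the work.

```python
import numpy as np

def normal_log_prob(y, mu, sigma):
    # Log-density of Normal(mu, sigma); broadcasts over whatever parameter
    # "batch" you pass in, which is the essence of batch shape.
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Hypothetical parameters for a batch of three Normals.
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])

lp = normal_log_prob(0.0, mu, sigma)   # three log-probs from one vectorized call
assert lp.shape == (3,)

# Same numbers as evaluating each distribution separately, one at a time.
assert np.allclose(lp, [normal_log_prob(0.0, m, s) for m, s in zip(mu, sigma)])
```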
In the simplest case, this can be like taking the exp of a normal, and now you have a lognormal. In more complicated cases, it can involve transforming a random variable with a neural network. This includes things like masked autoregressive flows, if you've heard of them, real NVPs, and other sophisticated probabilistic models. You saw layers. We also have some losses that help you build Monte Carlo approximations to otherwise intractable calculations. Edward2 is our probabilistic programming language that helps you combine different random variables as one. On the inference side, no Bayesian library would be complete without Markov chain Monte Carlo tools, within which we have several transition kernels. One of them is called Hamiltonian Monte Carlo, which naturally takes advantage of TensorFlow's automatic differentiation capability. We also have tools for performing variational inference-- again, taking advantage of TF's automatic differentiation and optimizer toolbox. And of course, we have our own optimizers that often come up in probabilistic modeling problems, such as Nelder-Mead, BFGS, things like that. The point is, this toolbox has maybe not everything, but certainly it has most of what you might need to do fancier modeling and actually get more out of your machine learning model. And it doesn't have to be hard. You saw the Keras examples were just a sequence of one-line changes. So of course, TensorFlow Probability is used widely around Alphabet. DeepMind uses it extensively. Google Brain uses it. Google Accelerated Science, product areas-- infrastructure areas even use it for planning purposes. But it's also used outside of Google. So Baker Hughes GE is one of our early adopters of TensorFlow Probability, and they use it to build models to detect anomalies. Anomaly detection is a very hard problem because, hopefully, your data set never has the anomaly you're trying to detect.
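The "exp of a normal gives a lognormal" example is just the change-of-variables formula that bijectors implement. Here is a NumPy check of that identity (the helper names are mine; this is the math a bijector's `log_det_jacobian` bookkeeping automates, not TFP's API): for Y = exp(X), log p_Y(y) = log p_X(log y) minus the log of the Jacobian of exp, which is log y.

```python
import numpy as np

def normal_log_prob(x, mu=0.0, sigma=1.0):
    # Log-density of the base Normal(mu, sigma).
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def lognormal_log_prob_via_bijector(y):
    # Change of variables through the exp transform: pull y back with log,
    # evaluate the base density there, subtract the log-Jacobian log(y).
    return normal_log_prob(np.log(y)) - np.log(y)

# Compare against the closed-form standard log-normal log-density.
y = np.array([0.5, 1.0, 2.0, 5.0])
closed_form = -np.log(y * np.sqrt(2 * np.pi)) - 0.5 * np.log(y) ** 2
assert np.allclose(lognormal_log_prob_via_bijector(y), closed_form)
```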
For example, anyone who flew out here would be happy to know that Baker Hughes GE uses its anomaly detection to predict the lifespan of jet engines. And if we had a data set that had a failing jet engine, that would be a tragedy. And so using math, they get around this by modeling models and then trying to figure out, in the abstract, if those are going to be good models. So what you see is their data processing pipeline. The orange boxes use TensorFlow Probability extensively. The orange-bordered box is where they use TensorFlow. And the basic flow is to treat the model itself as a random variable, and then determine if it's going to be a good model on an otherwise incomplete data set. And from this, they get remarkable results-- dramatic decreases in false positives and false negatives over very large data sets in complicated systems. So the question is, who will be the next success story? Try it out-- it's an open-source Python package built using TensorFlow that makes it easy to combine deep learning with probabilistic models. You can pip install it. Check out tensorflow.org/probability. And if you're interested in learning more about Bayesian approaches, check out this book, which we rewrote using TensorFlow Probability, within which you can learn, like I said, Bayesian methods, but also just how to use TensorFlow Probability. If you're not a Bayesian, that's fine too. We have numerous tools for frequentists. We have a second-order generalized linear model solver, and you should care, because if you're doing linear regression, it can solve that problem in on the order of 30 iterations, which definitely cannot be said of standard gradient descent. And if you want to find out more about this example, you can check out our GitHub repository, where you'll find several Jupyter notebooks. Thanks. [MUSIC PLAYING]
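The claim about a second-order solver finishing in a handful of iterations comes down to using curvature. A NumPy sketch (synthetic data; not TFP's GLM solver, whose internals I'm not reproducing here): for squared loss the Hessian is X^T X, so a single Newton step from zero lands on the least-squares solution, while plain gradient descent with a small step size is still far away after 30 iterations.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Newton step for squared loss: gradient X^T(X b - y), Hessian X^T X.
# Starting from zero, one step solves the problem exactly.
grad = X.T @ (X @ np.zeros(3) - y)
hess = X.T @ X
beta_newton = np.zeros(3) - np.linalg.solve(hess, grad)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_newton, beta_lstsq)

# First-order gradient descent, 30 small steps: still short of the optimum.
beta_gd, lr = np.zeros(3), 0.001
for _ in range(30):
    beta_gd -= lr * X.T @ (X @ beta_gd - y)
assert np.linalg.norm(beta_gd - beta_lstsq) > np.linalg.norm(beta_newton - beta_lstsq)
```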
B1 Intermediate - TensorFlow Probability: Learning with confidence (TF Dev Summit '19). Published by 林宜悉 on January 14, 2021.