TensorFlow Probability (TensorFlow @ O'Reilly AI Conference, San Francisco '18)

  • JOSH DILLON: Hi, I'm Josh, and today I'm going

  • to be talking about TensorFlow Probability, which

  • is a project I've been working on for two years now.

  • So what is TensorFlow Probability?

  • So we are part of the TensorFlow ecosystem.

  • It's a library built using TensorFlow,

  • and the idea is to make it easy to combine deep learning

  • with probabilistic modeling.

  • We are useful for statisticians and data scientists

  • to whom we can provide R-like capabilities, which

  • take advantage of GPU and TPU, and to ML researchers

  • and practitioners so you can build deep models, which

  • capture uncertainty.

  • So why should you care?

  • So a neural net that predicts binary outcomes

  • is just a Bernoulli distribution that's parameterized

  • by something fancy.

  • So suppose you have this.

  • You've got your sort of V1 model.

  • Looks great.

  • Now what?

  • That's where TensorFlow Probability can help you out.

  • Using our software, you can encode additional information

  • in your problem.

  • You can control prediction variance.

  • You can even possibly ask tougher questions.

  • No longer assume that pixels are independent,

  • because guess what?

  • They're not.

  • This is what we're going to be talking about.

  • So the main take home message for this talk

  • is TensorFlow Probability is a bunch of low level

  • tools, a collection of low level tools

  • which are aimed at trying to make it easier

  • for you to express what you know about your problem.

  • To not try to shoehorn your problem

  • into a neural net architecture, but rather

  • describe what you know and take advantage of what you know.

  • And these sort of images over here, we'll talk about a few

  • of them, but each of them represents

  • a part of the TensorFlow Probability package.

  • OK, so in the simplest form, how would you

  • use TensorFlow Probability?

  • Sort of like a get our feet wet type example.

  • So we offer generalized linear models.

  • Think logistic regression, linear regression.

  • Very boring stuff maybe, but it's a good starting point.

  • So you'll see this pattern throughout the TensorFlow

  • Probability software stack and how you use it.

  • But basically, you specify a model,

  • in this case, Bernoulli, corresponding to logistic

  • regression.

  • And then you just fit it.

  • And in this case, we're using L1 and L2 regularization

  • so you can get sparse weights.

  • And why you should care about this

  • is it's using a second order solver under the hood, which

  • means that up to floating point precision,

  • you would never need more than 30 iterations of this.

  • And in practice, maybe three or four is all it takes.

  • And since it takes advantage of GPU,

  • it's basically a drop-in replacement

  • for, say, R in this case.

  • So that's just kind of the canned,

  • like an example of some of the canned stuff we offer.
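
For concreteness, here is a minimal sketch of what that canned GLM path can look like. The synthetic data, regularizer strength, and iteration cap are invented for illustration, not the slide's actual code.

```python
# A hedged sketch of fitting logistic regression with tfp.glm.fit, which uses
# a second-order (Fisher scoring) solver under the hood. Data is synthetic.
import numpy as np
import tensorflow_probability as tfp

n, d = 1000, 5
x = np.random.randn(n, d).astype(np.float32)
true_w = np.array([1., -2., 0., 0., 3.], dtype=np.float32)
y = (np.random.rand(n) < 1. / (1. + np.exp(-x.dot(true_w)))).astype(np.float32)

coeffs, linear_response, is_converged, num_iter = tfp.glm.fit(
    model_matrix=x,
    response=y,
    model=tfp.glm.Bernoulli(),   # i.e., logistic regression
    l2_regularizer=1e-3,         # tfp.glm.fit_sparse adds L1 for sparse weights
    maximum_iterations=30)       # second-order solver: a handful is usually enough
```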

  • Where things really get exciting are

  • this sort of suite of tools.

  • So first we are going to talk about distributions,

  • which are probably what you think they are.

  • We'll also talk about bijectors in this talk.

  • TensorFlow Probability provides probabilistic layers,

  • things that wrap up variational inference

  • with different distributional assumptions.

  • We have a probabilistic programming language, which

  • is the successor of Edward.

  • That's also part of the TensorFlow Probability package.

  • And then on the inference side-- that's

  • kind of for building models.

  • On the inference side, we've got a collection

  • of Markov chain Monte Carlo transition kernels

  • and tools to use them, diagnostic criteria, that sort

  • of thing; tools for variational inference in a numerically

  • stable way; and various optimizers,

  • like stochastic gradient momentum descent, BFGS,

  • [INAUDIBLE], sort of the stuff that isn't plain stochastic gradient

  • descent, maybe some of which are more useful for single machine

  • settings, others baking probability into the optimization.

  • OK, so a distribution, I hope this is boring because nothing

  • here should be really fancy.

  • Capability of drawing samples, you

  • can compute probabilities, CDF, 1 minus the CDF,

  • mean, variance, all the usual stuff.

  • Little more interesting here at the bottom.

  • The event shape and, you can't see it,

  • but it says batch shape.

  • So TensorFlow Probability distributions--

  • to take advantage of vectorized [INAUDIBLE] hardware,

  • you specify--

  • you call the distribution once, but you

  • specify multiple parameters.

  • So here's an example.

  • We're building a normal, but we're passing two location

  • parameters.

  • So when you call sample on this, it's

  • going to return two samples each time you call

  • sample, if that makes sense.

  • One will correspond to the normal distribution

  • parameterized with mean minus 1, the other with mean 1.

  • It turns out this very simple idea is extremely powerful

  • and lets you immediately take advantage

  • of vector computation.
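
A small sketch of those batch semantics, with the shapes spelled out (the location values are just the ones from the example; the scales are arbitrary):

```python
import tensorflow_probability as tfp
tfd = tfp.distributions

# One Normal object, two parameterizations: loc=-1 and loc=1.
d = tfd.Normal(loc=[-1., 1.], scale=[1., 1.])
print(d.batch_shape)    # (2,)  -- two distributions batched together
print(d.event_shape)    # ()    -- each individual event is a scalar
samples = d.sample(5)   # shape (5, 2): every call returns a draw from each normal
logp = d.log_prob(0.)   # shape (2,): evaluated under both parameterizations
```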

  • So not only do distributions have

  • this sort of like small tweak from other libraries

  • or packages, but we've got a bunch of them

  • and you can combine them in interesting ways.

  • So it's not super important what distribution this is.

  • The point is we're making a mixture,

  • combining categorical distributions

  • with multivariate normal with a diagonal parameterization,

  • and it all just kind of fits together,

  • and you can do cool things using simple building blocks.
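
In code, that kind of composition might look like the sketch below; the component count and parameters are arbitrary placeholders.

```python
import tensorflow_probability as tfp
tfd = tfp.distributions

# Mixture of three 2-D diagonal-covariance Gaussians, weighted by a categorical.
gm = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=[0.3, 0.3, 0.4]),
    components_distribution=tfd.MultivariateNormalDiag(
        loc=[[-1., -1.], [0., 0.], [2., 2.]],
        scale_diag=[[0.5, 0.5], [1., 1.], [0.3, 0.3]]))

x = gm.sample(1000)     # shape (1000, 2)
lp = gm.log_prob(x)     # shape (1000,): mixture density evaluated per sample
```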

  • And that's a theme that's pervasive in TensorFlow

  • Probability-- simple ideas scaled up

  • to be a powerful framework and formalism.

  • So here's another example of a distribution we have,

  • Gaussian Processes.

  • I think this is cool because in a few lines,

  • you can learn uncertainty.

  • So notice that the model has sort

  • of different beliefs in areas where there's no data

  • and it's tight where there is.

  • You could easily turn this into a layer in your neural net

  • if you wanted to.
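
A few-line sketch of that Gaussian process regression idea follows. The toy data is invented, and the kernel module path (tfp.math.psd_kernels in recent releases) may differ by TFP version.

```python
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

# Toy 1-D regression data (placeholders, not the data from the slide).
obs_x = np.linspace(-1., 1., 20)[:, None]
obs_y = np.sin(3. * obs_x[:, 0]) + 0.1 * np.random.randn(20)
test_x = np.linspace(-2., 2., 100)[:, None]

gprm = tfd.GaussianProcessRegressionModel(
    kernel=tfp.math.psd_kernels.ExponentiatedQuadratic(),
    index_points=test_x,
    observation_index_points=obs_x,
    observations=obs_y,
    observation_noise_variance=0.01)

mean = gprm.mean()      # posterior mean: tight near the data
stddev = gprm.stddev()  # posterior uncertainty: wide where there is no data
```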

  • OK, so distributions, there's a bunch of them.

  • They have these sort of batch semantics.

  • They're cool.

  • Onto our second building block, bijectors.

  • So a bijector is useful for transforming a random variable.

  • Think log and exp.

  • You may on the forward transformation

  • take the exponential of some random variable,

  • and then to reverse it you take the logarithm.

  • So the forward is useful for computing samples

  • and the inverse is useful for computing probabilities.

  • So a bijector is a bijective diffeomorphism,

  • a differentiable isomorphism between two spaces.

  • And those spaces represent sort of an input random variable

  • and an output random variable.

  • And because we're interested in computing probabilities,

  • we have to keep track of the Jacobian.

  • So it's just a change of variables in an integral.

  • And so that's what this implements.

  • We also have the notion of shape.

  • Because here, again, everything supports these sort

  • of batch shape semantics.
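
As a concrete (hedged) illustration of that forward/inverse split, pushing a Normal through the Exp bijector gives a log-normal-style distribution:

```python
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# Transform a standard normal through exp: forward() drives sampling,
# inverse() plus the log-det-Jacobian drives log_prob (change of variables).
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())

x = log_normal.sample(4)       # exp of normal draws, so always positive
lp = log_normal.log_prob(x)    # computed via Exp().inverse and its Jacobian
```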

  • So what would you use a bijector for?

  • So behind the slide is an amazing idea.

  • You can take a neural net and use

  • it to transform any distribution you want,

  • and sort of get an arbitrarily rich distribution.

  • So this little piece of code here

  • really is just a neural net with

  • two dense hidden layers.

  • And then it's wrapped up inside this autoregressive flow

  • bijector, which transforms a normal.

  • Now, here's why this is amazing.

  • You could plug this in as your loss in the output.

  • Like this could be your loss basically, the final line here.

  • Whoops.

  • I shouldn't have done that.

  • The final line is just this distribution.log_prob.

  • That's an arbitrarily rich distribution

  • capable of learning variance, not

  • prescribed the way a Bernoulli's is, where

  • the variance is p times (1 minus p).

  • And unless your data actually is generated

  • by a Bernoulli distribution, that's

  • a fairly restrictive assumption, in part because anytime that's

  • not the case, it's very sensitive to mis-specification.

  • So this is like a much richer family.

  • And it sort of combines immediately neural nets

  • and distributions.
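
A sketch of that autoregressive-flow construction is below. Layer sizes and the stand-in data are arbitrary, and the constructor names have shifted across TFP versions, so treat this as a recent-API approximation rather than the slide's exact code.

```python
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

dims = 2
# A small autoregressive net (two hidden layers) wrapped in a MAF bijector,
# used to transform a standard normal into an arbitrarily rich distribution.
maf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(loc=0., scale=1.), sample_shape=[dims]),
    bijector=tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=[512, 512])))

data = np.random.randn(100, dims).astype(np.float32)  # stand-in for real data
loss = -maf.log_prob(data)   # the "final line": use the log-prob as your loss
```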

  • So another cool thing, you can reverse bijectors.

  • And this little one line change was a whole other paper.

  • And we see this phenomenon in TensorFlow Probability a lot.

  • Because everything's low level and modular, one little change,

  • brand new idea.
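
That one-line change can be sketched like this (same hypothetical setup as above): wrapping the flow in tfb.Invert swaps the forward and inverse passes, which is essentially the inverse autoregressive flow construction.

```python
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# Same flow as before, but reversed: Invert swaps forward and inverse,
# trading fast density evaluation for fast sampling.
iaf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(loc=0., scale=1.), sample_shape=[2]),
    bijector=tfb.Invert(tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=[512, 512]))))
```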

  • OK.

  • So that's kind of some background.

  • Let's go through an example of how you might use this.

  • So this is from a book, "Bayesian Methods for Hackers,"

  • which we'll talk about at the end.

  • And the question is, so I guess the guy who wrote this book,

  • he got a girlfriend.

  • And at some point his text messaging frequency changed.

  • So the question is, can we find that in the data?

  • And maybe you'd guess 22 days.

  • Or maybe 40 some days.

  • I don't know.

  • Let's see.

  • So here's a simple model.

  • We'll posit that there was a rate

  • of text messages in some pre period

  • and a rate in some post period.

  • And the question is, was there a change over?

  • And that's the sort of math, or statistical program,

  • as I like to call it.

  • That statistical program translates

  • into TensorFlow Probability in an almost one-to-one

  • way, exponential, uniform, flip it over, final Poisson.

  • And to compute the joint log prob,

  • we just add everything up in log space.
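
A hedged sketch of that joint log-prob, close in spirit to the book's TFP notebook (the variable names and the exact prior rate are assumptions):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def joint_log_prob(count_data, lambda_1, lambda_2, tau):
    # Priors: exponential rates for the two regimes, uniform switch point.
    alpha = 1. / tf.reduce_mean(count_data)
    rv_lambda = tfd.Exponential(rate=alpha)
    rv_tau = tfd.Uniform(low=0., high=1.)

    # "Flip it over": lambda_1 before day tau*N, lambda_2 after.
    n = tf.cast(tf.size(count_data), tf.float32)
    days = tf.cast(tf.range(tf.size(count_data)), tf.float32)
    indices = tf.cast(tau * n <= days, tf.int32)
    rate = tf.gather([lambda_1, lambda_2], indices)

    # Final Poisson likelihood; everything is added up in log space.
    rv_counts = tfd.Poisson(rate=rate)
    return (rv_lambda.log_prob(lambda_1)
            + rv_lambda.log_prob(lambda_2)
            + rv_tau.log_prob(tau)
            + tf.reduce_sum(rv_counts.log_prob(count_data)))
```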

  • And using that we can sample from the posterior.

  • And so what we find is, yes, there

  • was one rate around 18 text messages a day I guess.

  • Another around 23.

  • And it turns out that the highest posterior probability

  • was on day 44.

  • So how did we get these posterior samples

  • from the joint log probability?

  • We used MCMC.

  • So our MCMC library has several transition kernels.

  • I think one of the more powerful ones

  • because it takes advantage of automatic differentiation

  • is Hamiltonian Monte Carlo.

  • And all we do to use that is take our joint log

  • prob, which you saw in the previous slide,

  • and just pin whatever you want to condition on.

  • So in this case, we're going to condition on count data.

  • And we want to sample the tau and the two lambdas, the rates

  • and the changeover point.

  • So we set this up.

  • Whoops.

  • We ask for some number of results,

  • burn-in steps, the usual MCMC business.

  • Something a little different here is this transformer.

  • The transformer takes a constrained random variable

  • and unconstrains it, because HMC is taking a gradient step.

  • And it may step out of bounds.

  • And so since the lambda terms are rates of a Poisson,

  • they need to be positive.

  • So the exp bijector maps between the positive

  • reals and the unconstrained reals.

  • So too with tau.

  • That was on the [0, 1] interval.

  • And so using sigmoid, which you can't see here,

  • we transformed to and from.
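
Putting the pieces together, the MCMC setup described here might look roughly like the following. It assumes the joint_log_prob sketch above, and the step size, chain lengths, initial state, and stand-in data are made-up placeholders.

```python
import functools
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors

# Stand-in for the observed daily text-message counts.
count_data = tf.constant(np.random.poisson(lam=18., size=74), dtype=tf.float32)

# Pin what we condition on (the counts); sample lambda_1, lambda_2, tau.
# Assumes joint_log_prob from the earlier sketch.
target_log_prob_fn = functools.partial(joint_log_prob, count_data)

kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        step_size=0.05,
        num_leapfrog_steps=3),
    # Exp maps the positive rates to/from unconstrained space;
    # Sigmoid does the same for tau on the unit interval.
    bijector=[tfb.Exp(), tfb.Exp(), tfb.Sigmoid()])

[lambda_1_samples, lambda_2_samples, tau_samples], _ = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=[tf.constant(10.), tf.constant(10.), tf.constant(0.5)],
    kernel=kernel)
```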

  • And day 44.

  • It turns out that really was when he started dating.

  • And so it seems like Bayesian inference was right.

  • OK.

  • So super hard graphical model, which we won't talk about.

  • But the point is, there's a whole lot of math here,

  • and it's really scary.

  • Not really.

  • Each line basically transforms one to one.

  • So you pull out some graphical model

  • from the literature before neural nets got really popular

  • again.

  • And you can code it up in TensorFlow Probability.

  • And where things get amazing is you can actually

  • parametrize these distributions with a neural net,

  • thus getting the benefit of both.

  • And you can differentiate through the whole thing.

  • So it's really sort of what's old is new again, yet in a way

  • that you can take advantage of modern hardware.

  • So just one to one between math and TFP.

  • OK.

  • So we did see a little bit of the deep learning,

  • the masked autoregressive flow.

  • And I mentioned you can re-parametrize stuff.

  • So here's sort of the idea of re-parametrization.

  • So, as we know, probabilistic graphical models

  • tend to be computationally very intensive.

  • Neural nets are really good at embedding data

  • into a lower dimensional space.

  • Why not take your complex, computationally intensive,

  • probabilistic graphical model and parametrize it

  • with a neural net?

  • And that's kind of what this slide is saying

  • we should think about doing.

  • So you've heard of GANs.

  • So variational autoencoders are kind

  • of the probabilistic analog of the GAN.

  • That's the one where adversarial networks try

  • to fight each other to come up with a good balance.

  • It actually has a probabilistic sort of analog.

  • And this is it.

  • So in this case, the posterior distribution takes, say,

  • an image, and is a distribution over a low dimensional space

  • Z. And the likelihood is a distribution

  • that takes a low dimensional representation

  • and outputs back the image.

  • And using variational inference, which really just consists

  • of 10 lines of code, you can take

  • these different distributions, which

  • are themselves parametrized by neural nets,

  • and just fit it with Monte Carlo variational inference,

  • taking advantage of TensorFlow's automatic differentiation.

  • So it all kind of fits together nicely.
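
The code on the slide isn't reproduced here, but a compressed sketch of the same Monte Carlo variational inference pattern is below. The architecture sizes, the Bernoulli pixel likelihood, and the single-sample ELBO estimate are all assumptions made for illustration.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

latent_dim, data_dim = 16, 784
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

# Encoder net parameterizes q(z|x); decoder net parameterizes p(x|z).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(2 * latent_dim)])    # loc and (pre-softplus) scale
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(data_dim)])          # per-pixel Bernoulli logits

def neg_elbo(x):
    loc, raw_scale = tf.split(encoder(x), 2, axis=-1)
    q_z = tfd.MultivariateNormalDiag(loc=loc, scale_diag=tf.nn.softplus(raw_scale))
    z = q_z.sample()                                        # reparameterized draw
    p_x = tfd.Independent(tfd.Bernoulli(logits=decoder(z)),
                          reinterpreted_batch_ndims=1)
    # Single-sample Monte Carlo estimate of -ELBO, differentiable end to end.
    return -(p_x.log_prob(x) + prior.log_prob(z) - q_z.log_prob(z))
```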

  • OK.

  • So that was a lot of information that we kind of breezed through

  • quickly.

  • We are in the process of rewriting

  • this "Bayesian Methods for Hackers"

  • book using TensorFlow Probability.

  • It already exists.

  • I think there's like a PyMC version of it.

  • And so we've started all the chapters.

  • One and two are in the best shape.

  • So definitely start with those.

  • In chapter one you'll find the text message example.

  • But that's basically it.

  • So in conclusion, TensorFlow Probability

  • helps you combine deep learning with probabilistic modeling

  • so you can encode additional domain

  • knowledge about your problem.

  • pip install tensorflow-probability.

  • Easy to use.

  • And you can check it out as part of the TensorFlow ecosystem

  • to learn more.

  • Thanks.

  • And I've got a few minutes here for questions,

  • if anyone has any.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • JOSHUA DILLON: Yeah.

  • So the question is, can I quantify uncertainty

  • in a neural net basically using this stuff?

  • And the answer is, absolutely yes.

  • That's why you would use this stuff.

  • In fact, the larger question of why would you even

  • use probabilistic modeling is probably because you

  • want to quantify uncertainty.

  • And so I pulled back to this variational autoencoder slide,

  • because what's happening is it's a little hard to see here,

  • because it's just code, but this low dimensional space

  • is basically inducing uncertainty as a bottleneck.

  • And all of your neural nets do this.

  • Often you'll have a smaller hidden layer--

  • going from a larger hidden layer,

  • to a smaller one, back to a larger one.

  • So the point with this is, just do that in a principled way.

  • Keep track of what you lose by sort of compressing it down.

  • And in so doing, then you actually get

  • a measure of how much you lost.

  • And so while this is a variational autoencoder,

  • the supervised learning sort of alternative to this

  • would be variational information bottleneck.

  • And the code for that is almost exactly the same.

  • The only difference is you're reconstructing a label

  • from some input x.

  • So you go from x, to z, to y.

  • So image, low dimensional, back to the thing

  • you're trying to predict.

  • OK.

  • So I'm out of time.

  • And with that, I will hand it over to you.
