JOSH DILLON: Hi, I'm Josh, and today I'm going to be talking about TensorFlow Probability, which is a project I've been working on for two years now. So what is TensorFlow Probability? We're part of the TensorFlow ecosystem. It's a library built using TensorFlow, and the idea is to make it easy to combine deep learning with probabilistic modeling. It's useful for statisticians and data scientists, to whom we can provide R-like capabilities that take advantage of GPUs and TPUs, and for ML researchers and practitioners, so you can build deep models that capture uncertainty.

So why should you care? A neural net that predicts binary outcomes is just a Bernoulli distribution that's parameterized by something fancy. So suppose you have that. You've got your V1 model. Looks great. Now what? That's where TensorFlow Probability can help you out. Using our software, you can encode additional information about your problem. You can control prediction variance. You can even ask tougher questions. You no longer have to assume that pixels are independent, because guess what? They're not. That's what we're going to be talking about.

The main take-home message for this talk is that TensorFlow Probability is a collection of low-level tools aimed at making it easier for you to express what you know about your problem: not to shoehorn your problem into a neural net architecture, but rather to describe what you know and take advantage of it. These images over here, we'll talk about a few of them, but each one represents a part of the TensorFlow Probability package.

OK, so in the simplest form, how would you use TensorFlow Probability? Sort of a get-our-feet-wet example. We offer generalized linear models. Think logistic regression, linear regression. Maybe boring stuff, but it's a good starting point. You'll see this pattern throughout the TensorFlow Probability software stack: you specify a model, in this case a Bernoulli corresponding to logistic regression, and then you just fit it. In this case, we're using L1 and L2 regularization so you can get sparse weights. And the reason you should care is that it uses a second-order solver under the hood, which means that, up to floating-point precision, you would never need more than about 30 iterations, and in practice maybe three or four is all it takes. And since it can take advantage of GPUs, it's like a drop-in replacement for, say, R in this case.

So that's just an example of some of the canned stuff we offer. Where things really get exciting is this suite of tools. First we're going to talk about distributions, which are probably what you think they are. We'll also talk about bijectors in this talk. TensorFlow Probability provides probabilistic layers, things that wrap up variational inference with different distributional assumptions. We have a probabilistic programming language, which is the successor of Edward; that's also part of the TensorFlow Probability package. All of that is kind of for building models. Then there's the inference side.
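To make that GLM example a bit more concrete, here is a minimal sketch of fitting a logistic regression with TFP's GLM module. The data is synthetic and made up for illustration, and exact argument names and return values can vary across TFP versions; the slide from the talk used the sparse (L1-regularized) variant, which lives in tfp.glm.fit_sparse.

```python
import numpy as np
import tensorflow_probability as tfp

# Synthetic logistic-regression data, purely for illustration.
num_examples, num_features = 1000, 5
rng = np.random.RandomState(0)
x = rng.randn(num_examples, num_features).astype(np.float32)
true_w = np.array([1.5, -2.0, 0.0, 0.0, 0.7], dtype=np.float32)
probs = 1.0 / (1.0 + np.exp(-x.dot(true_w)))
y = rng.binomial(1, probs).astype(np.float32)

# Logistic regression is a GLM with a Bernoulli response. tfp.glm.fit uses
# Fisher scoring, a second-order method, which is why a handful of
# iterations is usually enough.
coeffs, linear_response, is_converged, num_iters = tfp.glm.fit(
    model_matrix=x,
    response=y,
    model=tfp.glm.Bernoulli(),
    l2_regularizer=1.0,        # fit_sparse additionally supports an L1 term
    maximum_iterations=30)
```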
On the inference side, we've got a collection of Markov chain Monte Carlo transition kernels and tools to use them, diagnostic criteria, that sort of thing; tools for doing variational inference in a numerically stable way; and various optimizers, like stochastic gradient Langevin dynamics, BFGS, [INAUDIBLE], sort of the stuff that isn't plain stochastic gradient descent, some of which are more useful in single-machine settings, others baking probability into optimization.

OK, so a distribution. I hope this is boring, because nothing here should be really fancy. You can draw samples, compute probabilities, the CDF, one minus the CDF, the mean, the variance, all the usual stuff. It's a little more interesting here at the bottom: the event shape and, you can't see it, but it says batch shape. To take advantage of vectorized [INAUDIBLE] hardware, TensorFlow Probability distributions let you call the distribution once but specify multiple parameters. So here's an example. We're building a normal, but we're passing two location parameters. So when you call sample on this, it's going to return two samples each time you call sample, if that makes sense. One will correspond to the normal distribution parameterized with mean minus 1, the other with mean 1. It turns out this very simple idea is extremely powerful and lets you immediately take advantage of vectorized computation.

So not only do distributions have this small tweak relative to other libraries and packages, but we've got a bunch of them, and you can combine them in interesting ways. It's not super important what distribution this is. The point is we're making a mixture, combining a categorical distribution with multivariate normals with a diagonal parameterization, and it all just fits together, and you can do cool things using simple building blocks. That's a theme that's pervasive in TensorFlow Probability: simple ideas scaled up into a powerful framework and formalism.

Here's another example of a distribution we have, Gaussian processes. I think this is cool because in a few lines you can learn uncertainty. Notice that the model has different beliefs in areas where there's no data and is tight where there is. You could easily turn this into a layer in your neural net if you wanted to.

OK, so distributions: there are a bunch of them, they have these batch semantics, they're cool. On to our second building block, bijectors. A bijector is useful for transforming a random variable. Think log and exp. On the forward transformation you might take the exponential of some random variable, and then to reverse it you take the logarithm. The forward is useful for computing samples, and the inverse is useful for computing probabilities. A bijector is a diffeomorphism, a differentiable isomorphism between two spaces, and those spaces represent an input random variable and an output random variable. And because we're interested in computing probabilities, we have to keep track of the Jacobian; it's just the change of variables in an integral. That's what this implements. We also have the notion of shape, because here, again, everything supports these batch shape semantics.

So what would you use a bijector for? Behind this slide is an amazing idea. You can take a neural net and use it to transform any distribution you want, and get an arbitrarily rich distribution.
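As a small illustration of the batch semantics, the building-block composition, and the bijector idea just described (the parameter values here are made up):

```python
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# One Normal object, two parameterizations: batch_shape=[2].
d = tfd.Normal(loc=[-1., 1.], scale=[1., 1.])
print(d.sample())             # one draw per batch member -> shape [2]
print(d.log_prob([0., 0.]))   # evaluated per batch member -> shape [2]

# Simple building blocks compose: a two-component mixture of
# diagonal-covariance multivariate normals.
mix = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=[0.3, 0.7]),
    components_distribution=tfd.MultivariateNormalDiag(
        loc=[[-1., -1.], [1., 1.]],
        scale_diag=[[0.5, 0.5], [0.5, 0.5]]))
print(mix.sample(3))          # three draws from the mixture -> shape [3, 2]

# A bijector transforms a random variable: exp applied to a Normal gives a
# log-normal, with the Jacobian bookkeeping handled for you.
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())
```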
So this little piece of code here is really just a neural net with two dense hidden layers, and then it's wrapped up inside this autoregressive flow bijector, which transforms a normal. Now, here's why this is amazing. You could plug this in as your loss. Like, this could basically be your loss, the final line here. Whoops, I shouldn't have done that. The final line is just this distribution's log_prob. That's an arbitrarily rich distribution capable of learning its variance, not one prescribed by, say, a Bernoulli, where the variance is p times (1 minus p). And unless your data actually is generated by a Bernoulli distribution, that's a fairly restrictive assumption, in part because anytime that's not the case, it's very sensitive to misspecification. So this is a much richer family, and it immediately combines neural nets and distributions.

Another cool thing: you can reverse bijectors. And this little one-line change was a whole other paper. We see this phenomenon in TensorFlow Probability a lot. Because everything's low-level and modular, one little change is a brand new idea.

OK, so that's some background. Let's go through an example of how you might use this. This is from a book, "Bayesian Methods for Hackers," which we'll talk about at the end. So I guess the guy who wrote this book got a girlfriend, and at some point his text-messaging frequency changed. The question is, can we find that in the data? Maybe you'd guess 22 days. Or maybe 40-some days. I don't know. Let's see.

So here's a simple model. We'll posit that there was one rate of text messages in some pre period and another rate in some post period, and the question is, was there a changeover? That's the math, or the statistical program, as I like to call it. That statistical program translates into TensorFlow Probability in an almost one-to-one way: exponential, uniform, flip it over, final Poisson. And to compute the joint log prob, we just add everything up in log space. Using that, we can sample from the posterior. And what we find is, yes, there was one rate around 18 text messages a day, I guess, and another around 23. And it turns out that the highest posterior probability was on day 44.

So how did we get these posterior samples from the joint log probability? We used MCMC. Our MCMC library has several transition kernels, and I think one of the more powerful ones, because it takes advantage of automatic differentiation, is Hamiltonian Monte Carlo. All we do to use it is take our joint log prob, which you saw on the previous slide, and pin whatever you want to condition on. In this case, we condition on the count data, and we want to sample tau and the two lambdas, the rates and the changeover point. So we set this up. Whoops. We ask for some number of results, burn-in steps, the usual MCMC business. Something a little different here is this transform. It takes a constrained random variable and unconstrains it, because HMC is taking gradient steps and may step out of bounds. Since the lambda terms are rates of a Poisson, they need to be positive, so the Exp bijector maps back and forth between the positive reals and the unconstrained reals. So too with tau: that lives on the unit interval, and using a Sigmoid bijector, which you can't see here, we transform to and from.

And day 44: it turns out that really was when he started dating. So it seems like Bayesian inference was right. OK.
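The MCMC setup for the text message example looks roughly like the sketch below. The joint_log_prob here is a simplified stand-in for the book's actual model, and the counts, step size, and initial state are placeholders. The point is the shape of the API: an HMC kernel wrapped in a TransformedTransitionKernel with Exp and Sigmoid bijectors, driven by sample_chain.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

count_data = tf.constant([13., 24., 8., 24., 7., 35.])  # placeholder counts

def joint_log_prob(lambda_1, lambda_2, tau):
  # Simplified stand-in: exponential priors on two Poisson rates and a
  # uniform changeover point tau in (0, 1).
  alpha = 1. / tf.reduce_mean(count_data)
  rate_prior = tfd.Exponential(rate=alpha)
  n = tf.cast(tf.size(count_data), tf.float32)
  idx = tf.range(n)
  rate = tf.where(idx < tau * n, lambda_1, lambda_2)
  return (rate_prior.log_prob(lambda_1) +
          rate_prior.log_prob(lambda_2) +
          tfd.Uniform(low=0., high=1.).log_prob(tau) +
          tf.reduce_sum(tfd.Poisson(rate=rate).log_prob(count_data)))

# HMC runs in unconstrained space: Exp keeps the rates positive and Sigmoid
# keeps tau inside (0, 1).
kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=joint_log_prob,
        step_size=0.05,
        num_leapfrog_steps=3),
    bijector=[tfb.Exp(), tfb.Exp(), tfb.Sigmoid()])

samples, _ = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=[tf.ones([]), tf.ones([]), tf.constant(0.5)],
    kernel=kernel,
    trace_fn=lambda _, pkr: pkr)
```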
So here's a super hard graphical model, which we won't talk about. The point is, there's a whole lot of math here, and it looks really scary. Not really. Each line basically translates one to one. You can pull some graphical model out of the literature from before neural nets got really popular again, and you can code it up in TensorFlow Probability. And where things get amazing is that you can actually parameterize these distributions with a neural net, thus getting the benefit of both, and you can differentiate through the whole thing. So it's really a case of what's old is new again, but in a way that takes advantage of modern hardware. So, just one to one between the math and TFP.

OK. We did see a little bit of the deep learning side, the masked autoregressive flow, and I mentioned you can reparameterize stuff. So here's the idea of reparameterization. As we know, probabilistic graphical models tend to be computationally very intensive. Neural nets are really good at embedding data into a lower-dimensional space. So why not take your complex, computationally intensive probabilistic graphical model and parameterize it with a neural net? That's what this slide is saying we should think about doing.

So you've heard of GANs, where adversarial networks fight each other to come up with a good balance. Variational autoencoders are kind of the probabilistic analog of the GAN, and this is it. In this case, the posterior distribution takes, say, an image and is a distribution over a low-dimensional space Z, and the likelihood is a distribution that takes a low-dimensional representation and outputs back the image. And using variational inference, which really just consists of about 10 lines of code, you can take these different distributions, which are themselves parameterized by neural nets, and fit them with Monte Carlo variational inference, taking advantage of TensorFlow's automatic differentiation. So it all fits together nicely.

OK, so that was a lot of information that we breezed through quickly. We are in the process of rewriting this "Bayesian Methods for Hackers" book using TensorFlow Probability. The book already exists; I think there's a PyMC version of it. We've started all the chapters, and one and two are in the best shape, so definitely start with those. In chapter one you'll find the text message example. But that's basically it. So, in conclusion, TensorFlow Probability helps you combine deep learning with probabilistic modeling so you can encode additional domain knowledge about your problem. It's a pip install away and easy to use, and you can check it out as part of the TensorFlow ecosystem to learn more. Thanks. And I've got a few minutes here for questions, if anyone has any. Yeah.

AUDIENCE: [INAUDIBLE]

JOSHUA DILLON: Yeah. So the question is, can I quantify uncertainty in a neural net using this stuff? And the answer is absolutely yes. That's why you would use this stuff. In fact, the larger question of why you would even use probabilistic modeling is probably because you want to quantify uncertainty. I pulled back to this variational autoencoder slide because what's happening, and it's a little hard to see here because it's just code, is that this low-dimensional space is basically a bottleneck that induces uncertainty. And all of your neural nets do this. Often you'll go from a larger hidden layer to a smaller one and back to a larger one.
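As a rough sketch of the variational autoencoder being discussed, under assumed network sizes and a single-sample Monte Carlo estimate of the ELBO (this is illustrative, not the code from the slide):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

latent_dim, data_dim = 2, 784   # assumed sizes, e.g. flattened binarized images

# Encoder: x -> parameters of q(z | x). Decoder: z -> parameters of p(x | z).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2 * latent_dim)])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(data_dim)])

prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))  # p(z)

def negative_elbo(x):
  loc, raw_scale = tf.split(encoder(x), 2, axis=-1)
  q_z = tfd.MultivariateNormalDiag(loc=loc,
                                   scale_diag=tf.nn.softplus(raw_scale))
  z = q_z.sample()                  # reparameterized sample, so gradients flow
  p_x = tfd.Independent(tfd.Bernoulli(logits=decoder(z)),
                        reinterpreted_batch_ndims=1)
  # Single-sample Monte Carlo ELBO: log p(x|z) + log p(z) - log q(z|x).
  elbo = p_x.log_prob(x) + prior.log_prob(z) - q_z.log_prob(z)
  return -tf.reduce_mean(elbo)      # minimize with any TF optimizer
```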
So the point of this is: just do that in a principled way. Keep track of what you lose by compressing it down, and in so doing you actually get a measure of how much you lost. And while this is a variational autoencoder, the supervised learning alternative would be the variational information bottleneck, and the code for that is almost exactly the same. The only difference is that you're reconstructing a label from some input x. So you go from x, to z, to y: image, low-dimensional representation, and back to the thing you're trying to predict. OK. So I'm out of time, and with that, I'll hand it over to you.
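To make that comparison concrete, here is a minimal sketch of a variational information bottleneck loss. It mirrors the VAE sketch above, swapping the image reconstruction term for a label term; the beta weight, network shapes, and class count are assumptions for illustration.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

latent_dim, num_classes = 16, 10   # assumed sizes

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2 * latent_dim)])
classifier = tf.keras.layers.Dense(num_classes)   # z -> logits over labels

prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

def vib_loss(x, y, beta=1e-3):
  loc, raw_scale = tf.split(encoder(x), 2, axis=-1)
  q_z = tfd.MultivariateNormalDiag(loc=loc,
                                   scale_diag=tf.nn.softplus(raw_scale))
  z = q_z.sample()                              # stochastic bottleneck
  p_y = tfd.Categorical(logits=classifier(z))   # predict y from z, not from x
  # Same shape as the VAE objective: the reconstruction term targets the
  # label, and the KL term is weighted by beta.
  kl = q_z.log_prob(z) - prior.log_prob(z)      # single-sample KL estimate
  return tf.reduce_mean(-p_y.log_prob(y) + beta * kl)
```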