[MUSIC PLAYING] JOSH DILLON: Hello, everyone. I'm Josh Dillon, and I'm a lead on the TensorFlow Probability team. And today, I'm going to talk to you about probability stuff and how it relates to TensorFlow stuff. So let's find out what that means. OK, so these days, machine learning often means specifying deep model architectures and then fitting them under some loss. And happily, Keras makes specifying model architecture relatively easy. But what about the loss? Choosing the right loss is tough. Improving one-- even a reasonable one-- can be even tougher. And once you fit your model, how do you know it's good? Does accuracy tell the full picture? Why not use mean, entropy, mode? Wouldn't it be great if there existed some mathematical framework that unified these ideas? Better still, wouldn't it be nice if it were plug and play with Keras and the rest of TF? This would make comparing models easier by simply maximizing likelihood and having readily available evaluative statistics. We could rapidly prototype different generating assumptions and quickly reject the bad ones. In short, wouldn't it be great if we could do this-- just say I want to maximize the log likelihood and then summarize what I learned easily and in a unified way? So let's play with that idea. Here, we have a data set-- these blue dots. And our task-- our pretend task-- is to predict the y-coordinate from the x-coordinate. And the way you might do this is specify some deep model. And of course, you might choose the mean squared error as your loss function. OK. But our wish here is to think probabilistically. And so that means maximizing the log likelihood, as indicated here with this lambda function-- the negative of the random variable's log_prob under y. And what we want, in addition to that, is to get back a distribution-- a thing that has attached to it statistics that we can use to evaluate what we just learned. If only such a thing were possible.
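The "maximize the log likelihood instead of minimizing a loss" idea can be spelled out numerically. Below is a minimal NumPy sketch (not the TFP API; the helper name `normal_nll` and the data are mine): the negative log-likelihood of a Normal with its scale frozen at 1 is exactly half the mean squared error plus a constant, which is the sense in which choosing MSE is already a hidden distributional assumption.

```python
import numpy as np

def normal_nll(y, mu, sigma=1.0):
    # Average negative log-likelihood of y under Normal(mu, sigma).
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
y = rng.normal(size=100)    # stand-in targets
mu = rng.normal(size=100)   # stand-in model predictions

mse = np.mean((y - mu) ** 2)
# With sigma fixed at 1, the NLL is MSE/2 plus the constant 0.5*log(2*pi),
# so minimizing one minimizes the other.
assert np.isclose(normal_nll(y, mu), 0.5 * mse + 0.5 * np.log(2 * np.pi))
```

In TFP itself, the loss in the talk is written against a distribution-valued model output, roughly `lambda y, rv_y: -rv_y.log_prob(y)`.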
Of course, it is, and you can do this now. Using TensorFlow Probability distribution layers, you can specify the model as part of your deep net. And the loss now is actually part of the model, sort of the way it used to be-- the way it's meant to be. And so let's unpack what's happening here. So we have two dense layers. That's sort of business as usual. The second one outputs one float, and that one float is parameterizing a normal distribution's mean. And that's being done through this DistributionLambda layer. In so doing, we're able to find this line. That looks great. And the best part is, once we instantiate this model with test points, we get back a distribution instance-- for which you get not just the mean, which is what you'd get today, but also entropy, variance, standard deviation, all of these things. And you can even compare between this and other distributions, as we'll see later. But if we look at this data, something's still a little fishy here, right? Notice that as the magnitude of x increases, the variance of y also seems to increase. So that means that maybe our model's a little suspicious. So since we're in this probabilistic framework and we're no longer doing loss hacking-- we're actually building a model-- what can we do to fix this? Answer-- learn the variance too. It's actually pretty obvious. If we're fitting a normal, why on earth do we think that the variance would just be 1? And by the way, that's what you're doing when you use mean squared error. And so now, to achieve this, all I had to do is make my previous layer output two floats. I pass one in as the mean of the normal, one in as the standard deviation of the normal. And presto chango, now I've learned the standard deviation from the data itself. That's what the green lines are. So this is really cool, because now, if you're a statistician, you would say, hey, I'm able to handle heteroscedasticity.
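To make "learn the variance too" concrete, here is a hedged NumPy toy (invented data; a grid search stands in for the gradient-trained second output, and the linear form of the noise scale is my assumption, not the talk's): on data whose spread grows with |x|, a learned scale achieves a lower negative log-likelihood than the MSE-style assumption that sigma is 1.

```python
import numpy as np

def normal_nll(y, mu, sigma):
    # Average negative log-likelihood of y under Normal(mu, sigma).
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 500)
# Heteroscedastic data: noise scale grows with |x|, like the talk's scatter plot.
true_sigma = 0.1 + 0.4 * np.abs(x)
y = 2.0 * x + rng.normal(scale=true_sigma)

mu = 2.0 * x                          # pretend the mean is already fit
nll_fixed = normal_nll(y, mu, 1.0)    # the hidden MSE assumption: sigma == 1

# "Output two floats": a crude grid search stands in for learning
# sigma(x) = a + b*|x| by maximum likelihood.
grid = np.linspace(0.01, 1.0, 40)
a, b = min(((a, b) for a in grid for b in grid),
           key=lambda ab: normal_nll(y, mu, ab[0] + ab[1] * np.abs(x)))
nll_learned = normal_nll(y, mu, a + b * np.abs(x))

# Learning the variance strictly improves the likelihood on this data.
assert nll_learned < nll_fixed
```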
If you want a $10 word, you can call this aleatoric uncertainty. And what this really means is that you're learning known unknowns. It means that the data itself had variance, and you learned it. And it cost you basically nothing but a few keystrokes. And furthermore, the way to do this was self-evident from the very fact that you were using a normal distribution, which had this curious constant just sitting there. So this is good. But, hm, I don't know. Is there enough data for which we can reliably claim that this red line is actually the mean, and these green lines are actually the standard deviation? How would we know if we have enough data? Is there anything else we can do? Of course, there is. Why learn just a single set of weights? A Keras dense layer has two components-- a kernel matrix and a bias vector. What makes you think that those point estimates are the best, especially given that your data set itself is random and possibly inadequate to meaningfully and reliably learn those point estimates? Instead, if you use a TensorFlow Probability DenseVariational layer, you can actually learn a distribution over weights. This is the same as learning an ensemble that's infinitely large. But luckily, it doesn't take infinitely long to train this ensemble. In fact, it takes just a little bit longer than what it took to train on the previous slides. And as you can see here, all I had to do is replace the Keras Dense layer with the TFP DenseVariational layer, and in so doing, achieve this kind of Bayesian weight uncertainty. The $10 word here is epistemic uncertainty. But again, I like to think of it as unknown unknowns. I'm not sure what my data is not telling me, so I'm going to be careful in the bookkeeping I do when tracking the weights that I learn. As a consequence, of course, this means that any instantiation of this model is now actually a random variable, because the weights are random variables.
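What a distribution over weights buys you can be sketched without any training at all. The snippet below is a hypothetical illustration (the Gaussian "posterior" over a line's slope and intercept is invented, standing in for what DenseVariational would learn): each draw of the weights gives one member of the ensemble, and the spread of the ensemble's predictions is the epistemic uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)

# Hypothetical learned posterior over the two weights of a line y = w*x + b:
# independent Gaussians, a stand-in for a trained DenseVariational layer.
w_loc, w_scale = 2.0, 0.3
b_loc, b_scale = 0.5, 0.1

# Each draw from the weight distribution is one member of the (in principle
# infinite) ensemble; 200 draws approximate it.
n_draws = 200
w = rng.normal(w_loc, w_scale, size=n_draws)
b = rng.normal(b_loc, b_scale, size=n_draws)
lines = w[:, None] * x[None, :] + b[:, None]   # shape (n_draws, n_points)

pred_mean = lines.mean(axis=0)   # "here's what I think will happen"
pred_std = lines.std(axis=0)     # "here's how much you should trust me"

# Slope uncertainty fans the lines out, so uncertainty grows with |x|.
assert pred_std[0] > pred_std[25]
assert pred_std[-1] > pred_std[25]
```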
And that's why you see here all of the lines. There are many lines. We have an ensemble of them. But if you were to average those and take the sample standard deviation, say, then that would give you an estimate of credible intervals over your prediction. So now you can go to your customer and say, look, here's what I think will happen, and here's how much you should trust me. So this is great, right? But we seem to have lost the heteroscedastic part. Notice that the blue dots are still more dispersed on the right-hand side. So can we do both? Of course, we can. It's all modular. I just have my DenseVariational layer output two floats instead of one, like we did before, and feed that into my output layer, which is a normal distribution. And presto chango, I'm learning both known unknowns and unknown unknowns, and all it cost me was a few keystrokes. And so what you see here now is an ensemble of standard deviations associated with the known unknown parts-- the variance present or observable in the y-axis-- as well as an ensemble of these mean regressions. OK, that's cool. So I like where this is going. But I have to ask, what makes you think a line is even the right thing to fit here? Is there another distribution we could choose, a richer distribution, that would actually find the right form of the data? And of course, the answer is yes. It's a Gaussian process. By tossing in this fancy distribution, it turns out that the data wasn't linear at all. No wonder we had such a hard time fitting it. It was sinusoidal, and the Gaussian process can see this. How can the Gaussian process see this? Because it treats the loss itself as a random variable. Now, how could you do that if you're just specifying mean squared error as your loss? You can't. It has to be part of your model, and that's the power of probabilistic modeling.
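The Gaussian process step can be sketched in a few lines of NumPy. This is a toy posterior-mean computation under assumptions I'm making up for illustration (sinusoidal data, a squared-exponential kernel, hand-picked length scale and noise level; nothing here is the talk's variational GP layer), showing how a GP "sees" a sinusoid that no line could fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, length_scale=0.7, amplitude=1.0):
    # Squared-exponential kernel: similarity between input locations.
    return amplitude**2 * np.exp(
        -0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

# Sinusoidal data, the kind a line has "such a hard time fitting".
x = np.linspace(0, 6, 40)
y = np.sin(x) + rng.normal(scale=0.05, size=x.shape)

noise = 0.05**2
K = rbf(x, x) + noise * np.eye(len(x))   # kernel matrix plus observation noise

xs = np.linspace(0, 6, 200)
Ks = rbf(xs, x)
# GP posterior mean at test points: K_* (K + noise*I)^{-1} y
mean = Ks @ np.linalg.solve(K, y)

# The posterior mean recovers the sinusoid closely on the data range.
assert np.max(np.abs(mean - np.sin(xs))) < 0.2
```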
When you bake these ideas into one model, you get to move things around fluidly between weight uncertainty and variance in the data, and even uncertainty in the loss function you're fitting itself. And so the question is, how can this all be so easy? How does it all fit together? It's TensorFlow Probability. So TensorFlow Probability is a collection of tools designed to make probabilistic reasoning in TensorFlow easier. It is not going to make your job easy. It's just going to give you the tools you need to express the ideas you have. You still have to have domain knowledge and expertise. But you can encode that domain knowledge and expertise in a probabilistic formalism, and TFP has the tools to do that. Statisticians and data scientists will be able to write and launch the same model. Gone are the days of hacking your model in R and porting it over to a faster language, like C++, or even TensorFlow. You can do it all in the same framework. ML researchers and practitioners will be able to make predictions with uncertainty. If you predict the light is green, you'd better be pretty confident that you should go. You can do that with probabilistic modeling and TensorFlow Probability. So we saw one small part of TFP. Broadly speaking, the tools are broken into two components-- those useful for building models and those useful for doing inference on those models. On the model-building side, you saw the normal distribution and the variational Gaussian process distribution. A distribution is just a collection of simple summary statistics, exactly like it is in every other library. There are a few differences. Our distributions support this concept of batch shape, which automatically takes advantage of vector-processing hardware. But for the most part, they should be pretty natural and easy to use. We also have something called bijectors, which is a library for transforming random variables.
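The batch-shape idea mentioned above can be illustrated in plain NumPy (a sketch of the concept, not TFP's actual Distribution class; the parameter values are invented): one vectorized call evaluates a whole batch of distributions at once, which is what lets the hardware do the work.

```python
import numpy as np

def normal_log_prob(y, mu, sigma):
    # Log-density of Normal(mu, sigma); broadcasts over whatever parameter
    # "batch" you pass in, which is the essence of batch shape.
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Hypothetical parameters for a batch of three Normals.
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])

lp = normal_log_prob(0.0, mu, sigma)   # three log-probs from one vectorized call
assert lp.shape == (3,)

# Same numbers as evaluating each distribution separately, one at a time.
assert np.allclose(lp, [normal_log_prob(0.0, m, s) for m, s in zip(mu, sigma)])
```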
In the simplest case, this can be like taking the exp of a normal, and now you have a lognormal. In more complicated cases, it can involve transforming a random variable with a neural network. This includes things like masked autoregressive flows, if you've heard of them, real NVPs, and other sophisticated probabilistic models. You saw layers. We also have some losses that help you build Monte Carlo approximations to otherwise intractable calculations. Edward2 is our probabilistic programming language that helps you combine different random variables as one. On the inference side, no Bayesian library would be complete without Markov chain Monte Carlo tools, within which we have several transition kernels. One of them is called Hamiltonian Monte Carlo, which naturally takes advantage of TensorFlow's automatic differentiation capability. We also have tools for performing variational inference-- again, taking advantage of TF's automatic differentiation and optimizer toolbox. And of course, we have our own optimizers that often come up in probabilistic modeling problems, such as Nelder-Mead, BFGS, things like that. The point is, this toolbox has maybe not everything, but certainly it has most of what you might need to do fancier modeling and actually get more out of your machine learning model. And it doesn't have to be hard. You saw the Keras examples were just a sequence of one-line changes. So of course, TensorFlow Probability is used widely around Alphabet. DeepMind uses it extensively. Google Brain uses it. Google Accelerated Science, product areas-- infrastructure areas even use it for planning purposes. But it's also used outside of Google. So Baker Hughes GE is one of our early adopters of TensorFlow Probability, and they use it to build models to detect anomalies. Anomaly detection is a very hard problem because, hopefully, your data set never has the anomaly you're trying to detect.
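The "exp of a normal gives a lognormal" example is just the change-of-variables formula that bijectors implement. Here is a NumPy check of that identity (the helper names are mine; this is the math a bijector's `log_det_jacobian` bookkeeping automates, not TFP's API): for Y = exp(X), log p_Y(y) = log p_X(log y) minus the log of the Jacobian of exp, which is log y.

```python
import numpy as np

def normal_log_prob(x, mu=0.0, sigma=1.0):
    # Log-density of the base Normal(mu, sigma).
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def lognormal_log_prob_via_bijector(y):
    # Change of variables through the exp transform: pull y back with log,
    # evaluate the base density there, subtract the log-Jacobian log(y).
    return normal_log_prob(np.log(y)) - np.log(y)

# Compare against the closed-form standard log-normal log-density.
y = np.array([0.5, 1.0, 2.0, 5.0])
closed_form = -np.log(y * np.sqrt(2 * np.pi)) - 0.5 * np.log(y) ** 2
assert np.allclose(lognormal_log_prob_via_bijector(y), closed_form)
```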
For example, anyone who flew out here would be happy to know that Baker Hughes GE uses its anomaly detection to predict the lifespan of jet engines. And if we had a data set that had a failing jet engine, that would be a tragedy. And so using math, they get around this by modeling models and then trying to figure out, in the abstract, if those are going to be good models. So what you see is their data processing pipeline. The orange boxes use TensorFlow Probability extensively. The orange-bordered box is where they use TensorFlow. And the basic flow is to treat the model itself as a random variable, and then determine if it's going to be a good model on an otherwise incomplete data set. And from this, they get remarkable results-- dramatic decreases in false positives and false negatives over very large data sets in complicated systems. So the question is, who will be the next success story? Try it out-- it's an open-source Python package built using TensorFlow that makes it easy to combine deep learning with probabilistic models. You can pip install it. Check out tensorflow.org/probability. And if you're interested in learning more about Bayesian approaches, check out this book, which we rewrote using TensorFlow Probability, within which you can learn, like I said, Bayesian methods, but also just how to use TensorFlow Probability. If you're not a Bayesian, that's fine too. We have numerous tools for frequentists. We have a second-order generalized linear model solver, and you should care, because if you're doing linear regression, it can solve that problem in on the order of 30 iterations, which definitely cannot be said of standard gradient descent. And if you want to find out more about this example, you can check out our GitHub repository, where you'll find several Jupyter notebooks. Thanks. [MUSIC PLAYING]
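The claim about a second-order solver finishing in a handful of iterations comes down to using curvature. A NumPy sketch (synthetic data; not TFP's GLM solver, whose internals I'm not reproducing here): for squared loss the Hessian is X^T X, so a single Newton step from zero lands on the least-squares solution, while plain gradient descent with a small step size is still far away after 30 iterations.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Newton step for squared loss: gradient X^T(X b - y), Hessian X^T X.
# Starting from zero, one step solves the problem exactly.
grad = X.T @ (X @ np.zeros(3) - y)
hess = X.T @ X
beta_newton = np.zeros(3) - np.linalg.solve(hess, grad)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_newton, beta_lstsq)

# First-order gradient descent, 30 small steps: still short of the optimum.
beta_gd, lr = np.zeros(3), 0.001
for _ in range(30):
    beta_gd -= lr * X.T @ (X @ beta_gd - y)
assert np.linalg.norm(beta_gd - beta_lstsq) > np.linalg.norm(beta_newton - beta_lstsq)
```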
B1 Intermediate - TensorFlow Probability: Learning with confidence (TF Dev Summit '19). Published by 林宜悉 on January 14, 2021.