JACQUES PIENAAR: OK. Good afternoon, everybody. Today, I'll be presenting MLIR, the multi-level intermediate representation compiler infrastructure, and how it fits into TensorFlow. This will be a little bit different than some of the other TF training sessions, as most of them focus on how to better use TF technology, while here we're looking at something that is still coming. So it's mostly forward-looking: things that we want to do and where we're going. As an overview for today, I'm going to start by giving an introduction. What is MLIR? Sort of a 50,000-foot view of what it is and how it all fits together. Then we'll look at why we are developing MLIR. The idea there is to show the past, where we came from and how we got to this point, to point to what we want to do in the future. Then we will look at two applications. One is the new TensorFlow to TensorFlow Lite converter. It's still in pre-alpha, but definitely try it and give some feedback. And then we'll also look at the TF2XLA bridge and some of the forthcoming changes, and I'll walk through a small example there.

OK, so let's start with the main question. What is MLIR? Some people have said it's a little bit of a Rorschach test: you can say you can do anything with MLIR, and everybody has a different opinion about what it is. If we start from a high level, TensorFlow's goal is to be "an open source machine learning framework for everyone." In the same way, with MLIR, which stands for multi-level intermediate representation-- with the ML not standing for "machine learning," for a change-- we're looking at "an open source program optimization framework for everyone." Now, the way I think about it is MLIR as an abstraction-building toolkit-- and I'll show some examples of what I mean by that-- as well as a reusable set of compiler passes for these higher abstractions. In particular, with MLIR we are targeting analysis, program optimization, and code generation. And so with that very high-level view, I'm going to start with why MLIR, and work down to exactly the components of what it is.

So when we started MLIR, we had a question. We looked at ML accelerators and saw that there are many, many accelerators coming forth, and we asked-- how do we support all these accelerators for all the given [INAUDIBLE]? And of course, TensorFlow should provide the best accelerator performance for our users. So [INAUDIBLE], can we make that easier? MLIR started as an exploration of a different approach to doing code generators for accelerators. So we started by looking at our existing code generation framework for accelerators, XLA. Now, XLA is one of the most advanced machine learning compilers. We have targets for CPU, GPU, TPU, and other backends. How can we increase the reuse between the CPU, GPU, and TPU backends? Well, at the moment, the TPU backend doesn't use LLVM, which means more low-level components are needed there, because we're not able to reuse some of the same passes or structures. The TPU backend is specialized for the best TPU performance. But because it's so specialized, there's less reuse with the CPU and GPU components. Also, looking at these different backends, we notice different abstractions. The CPU, GPU, and TPU backends do not share abstractions beyond the HLO level.
So if you look at the different backends, you'll see different levels of support for different loop abstractions, as well as stencil emitters, between the two. This makes reusing code between the two more difficult. Furthermore, we lack actual abstractions between HLO and, for example, TPU LLO or LLVM, which results in big gaps. So you effectively have passes that are one-shot compilers doing [INAUDIBLE] from very coarse-grained ops to multiple lower-level ops. Well, OK, but if we want to support so many different TensorFlow ops on all these devices, we must leverage as much shared infrastructure as possible. So we should find a way to try and unify these different backends and these abstractions, to be able to reuse more of the passes and more of the generators.

So then we come to our first question: can we add a new abstraction to allow greater reuse? Are there some new abstractions that we can add? We thought yes. You might be saying, maybe. We haven't added them yet, but we're looking at it. One of our goals was to address some usability issues with the current stack. One of them was custom ops for accelerators. The other one was dynamic shapes. But now, assuming we had done this, what happens with the rest of the stack? Because the goal is still an end-to-end TensorFlow user experience.

So, roughly splitting the stack into multiple parts: you have TensorFlow, and we're doing optimizations on the TensorFlow Graph-- and this is specifically for targeting TPU. We have the TF2XLA bridge that bridges from TensorFlow ops to HLO. Then on HLO, we have different passes, device independent and device dependent. And finally, in the backends, we have the emission from HLO to, in this case, TPU LLO or, in the case of CPU and GPU, LLVM.

So what we have at the moment is that TensorFlow can see the whole program, but it has no insight into the device. TensorFlow does its optimizations, and then it has these parts where this will run on the device, this will run on XLA. XLA, on the other hand, has deep insight into the devices. It knows all the different backends, and it knows how to optimize for them. But it can't change the TensorFlow Graph. So XLA assumes as fixed all its inputs and the graph structure given to it by TensorFlow. This results in optimization barriers between the different passes. A backend can't dictate to the bridge which HLO to produce; the HLO produced is constrained by what is in TensorFlow. And so this leads to things such as the double transpose trick, which allows people to force certain layouts. But such workarounds actually increase the coupling between the different layers: now the high-level layer and the lower-level layer have an implicit coupling, and a fixed set of assumptions is hard-coded.
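As a rough illustration of what that double transpose trick looks like in user code-- this is just a hypothetical sketch, not an official recipe, and the op and layouts chosen here are only for illustration-- the pattern is to bracket a layout-sensitive op with a pair of transposes, so the compiler is nudged into a particular layout. That bakes a backend assumption directly into the user's graph.

```python
import tensorflow as tf

def conv_with_forced_layout(x_nhwc, filters):
    # Sketch of the "double transpose trick": transpose into the layout we
    # want the compiler to use for the expensive op in the middle...
    x_nchw = tf.transpose(x_nhwc, perm=[0, 3, 1, 2])
    y = tf.nn.conv2d(x_nchw, filters, strides=1, padding="SAME",
                     data_format="NCHW")
    # ...then transpose back, so the rest of the graph is unchanged. The pair
    # of transposes hard-codes a layout decision the backend might otherwise
    # have made differently.
    return tf.transpose(y, perm=[0, 2, 3, 1])
```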
Beyond that, the TF2XLA bridge is bridging two different systems with a large impedance mismatch. On the TensorFlow side, we have the op ecosystem, we have dynamic sizes, we have different types, we have stateful ops. On the XLA side, we have HLOs, which are mostly side-effect free beyond a few ops. We have different types. It's a very different system that the bridge transitions between in one pass. So at the moment, this stack does not abstract out the various functions: we have the top-level technologies tied to the lower-level hardware implementations.

So this results in a large gap of abstractions between these layers, which makes the passes more difficult to write. Now, this is not something unique to machine learning or to TensorFlow and XLA. This same kind of gap also led to domain-specific IRs elsewhere-- in this particular case, in the compiler domain. If you look at the different inputs from languages such as Java, C++, Swift, Rust, and Julia, almost all of these languages that target LLVM have introduced a new mid-level IR on which they can do their higher-level optimizations. Actually, C++ does not do this, so Clang doesn't have one. And I had tried to omit this slide, but then Chris stopped me. He's very upset about the fact that there is no [INAUDIBLE]. It's a point very dear to him that we are missing a lot of optimizations and reuse by not having the [INAUDIBLE].

So what this means is we can do the domain-specific optimizations, and we can do progressive lowering, so we can actually have simpler hops between these different systems. And if we look at the TensorFlow case-- if you look at the CPU/GPU path, from TF Graph to HLO-- we can think of HLO as being this intermediate representation, mid-level between TF Graph and LLVM. So domain-specific IRs are great. They allow high-level, domain-specific optimizations. Progressive lowering actually encourages reuse between the different levels, because you can have smaller passes doing dedicated things. It's great for location tracking. And it enables some flow-sensitive type checking, so you can operate on higher-level, semantically meaningful parts and perform verification at that level. The part that's not so great: it's a huge expense to build this infrastructure. You're reimplementing all the same stuff-- pass managers, location tracking, use-def chains, inlining, all of these things. And more importantly, innovations in one community don't benefit the other communities. So there's a downside to having these domain-specific IRs.

And if we look at the TensorFlow compiler ecosystem, the previous picture is very much simplified, because the real situation is that from the TensorFlow Graph we have multiple different backends and multiple different IRs being generated. From a graph, we generate HLOs; there's an output for TensorRT; there's nGraph, Core ML, TensorFlow Lite-- so many different graph IRs, each with different challenges. So in a lot of these bridges, we have similar but different technologies, and this is not going away anytime soon. But this results in a fragile, poor user experience when failures happen. The location tracking between these different phases is variable, so in some cases there's no location propagated through. If an error occurs, the only thing you know about is what happened at the lower level, without being able to trace it back.

Beyond this, it also leads to a duplication of infrastructure at all levels. We're reimplementing the same infrastructure multiple times. This is true even in TensorFlow today: we have the same optimizations at both the TensorFlow and HLO levels. So in some cases, you have an optimization pass that you actually redo multiple times. And in other cases, you have later passes that ignore the results of previous ones. As an example, we do layout analysis and assignment on TensorFlow Graphs using Grappler, but when we get to HLO, these are mostly ignored-- sometimes for good reason, because XLA actually knows better about the devices.
And sometimes, that's unfortunate, because it actually would have been a good decision, given the higher-level graph structure. But beyond that, we have to duplicate these passes, because we cannot represent these same operations in one uniform way.
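To illustrate that last point, here is a toy sketch-- in Python, with entirely made-up names, so this is not MLIR's actual API-- of what "representing these operations in one uniform way" buys you. The idea is that ops from several abstraction levels ("dialects") coexist in one representation, lowering happens in small, dedicated hops, and shared infrastructure such as location tracking and diagnostics is written once rather than once per IR.

```python
from dataclasses import dataclass, field

# Toy "multi-level IR": ops from different abstraction levels ("dialects")
# live side by side in one module. All names here are illustrative, not
# MLIR's real API.

@dataclass
class Op:
    dialect: str                      # e.g. "tf" or "hlo" (made-up names)
    name: str                         # e.g. "Add"
    operands: list = field(default_factory=list)
    loc: str = "<unknown>"            # source location, carried through lowering

def lower_tf_add_to_hlo(ops):
    """One small, dedicated lowering hop: tf.Add -> hlo.add.

    Progressive lowering is many hops like this, each simple to write and
    test, instead of one giant one-shot translation between two systems.
    """
    out = []
    for op in ops:
        if (op.dialect, op.name) == ("tf", "Add"):
            out.append(Op("hlo", "add", op.operands, op.loc))
        else:
            out.append(op)            # ops at other levels flow through untouched
    return out

def error_at(op, message):
    """Shared, level-agnostic diagnostics: every op keeps its location, so a
    failure deep in a backend can still point back at the user's code."""
    return f"{op.loc}: {op.dialect}.{op.name}: {message}"

# Tiny usage example.
module = [Op("tf", "Add", ["%0", "%1"], loc="model.py:42")]
module = lower_tf_add_to_hlo(module)
print(error_at(module[0], "not supported on this toy backend"))
# -> model.py:42: hlo.add: not supported on this toy backend
```

Again, this is only a cartoon; the real infrastructure is far richer. But the point is that one representation spanning levels lets the passes, the verification, and the location tracking be shared rather than rebuilt for every bridge.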