
  • JACQUES PIENAAR: OK.

  • Good afternoon, everybody.

  • Today, I'll be presenting MLIR, the multi-level intermediate

  • representation compiler infrastructure,

  • and how it's being used in TensorFlow.

  • And so just an overview.

  • This will be a little bit different than some

  • of the other TF training sessions, as most of them

  • focus on how to better use TF technology,

  • while here we're looking at something that is still coming.

  • So this is mostly forward-looking, about things

  • that we want to do and where we're going.

  • So as an overview for today, I'm going

  • to start by giving an introduction.

  • What is MLIR?

  • Sort of like a 50,000-foot view of what it is,

  • how it all fits together.

  • Then we'll look at why we are developing MLIR.

  • So the idea there is to show the past, where we came from,

  • how we got to this point, to point to what

  • we want to do in the future.

  • Then we will look at two applications.

  • One is the new TensorFlow to TensorFlow Lite converter.

  • It's still in pre-alpha, but definitely

  • try it and give some feedback-- there's a short usage sketch right after this overview.

  • And then we'll also look at the TF2XLA bridge,

  • looking at some of the forthcoming changes.

  • And I'll walk through a small example there.
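
For those who want to try the new converter, a minimal sketch of driving it from Python might look like the following. This is only a sketch: the SavedModel path is a placeholder, and the `experimental_new_converter` opt-in flag is an assumption whose name and default behavior vary across TensorFlow releases.

    import tensorflow as tf

    # Load a SavedModel and build a TFLite converter for it.
    # ("/tmp/my_saved_model" is just a placeholder path.)
    converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")

    # Opt in to the new MLIR-based conversion path. The flag name and its
    # default differ across TensorFlow versions; treat this as a sketch.
    converter.experimental_new_converter = True

    tflite_model = converter.convert()
    with open("/tmp/model.tflite", "wb") as f:
        f.write(tflite_model)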

  • OK, so let's start with the main question.

  • What is MLIR?

  • And so some people have said it's

  • a little bit of a Rorschach test.

  • You can say you can do anything with MLIR,

  • and everybody has a different opinion about what it is.

  • So if we start from a high level,

  • TensorFlow's goal is to be "an open source machine learning

  • framework for everyone."

  • Now, in the same way, MLIR, which stands for multi-level

  • intermediate representation--

  • with the ML not representing "machine learning,"

  • for a change--

  • we're looking at "an open source program optimization

  • framework for everyone."

  • Now, the way I think about it is MLIR

  • as an abstraction-building toolkit--

  • and I'll show some examples of what I mean by that--

  • as well as a reusable set of compiler passes

  • for these higher abstractions.

  • So particularly with MLIR, we are

  • targeting analysis, program optimization,

  • and code generation.

  • And so, with that very high-level view,

  • I'm going to start with why MLIR,

  • and then get to exactly the components of what it is.

  • So when we started MLIR initially, we had a question.

  • Well, we looked at the ML accelerators and we saw, well,

  • there are many, many, many accelerators coming forth.

  • And we had the question-- how do we

  • support all these accelerators for all the given [INAUDIBLE]?

  • And I mean, of course, TensorFlow

  • should provide the best accelerator

  • performance for our users.

  • And so [INAUDIBLE], can we make that easier?

  • So MLIR started as an exploration

  • of a different approach to doing code generators

  • for accelerators.

  • So we started looking at our existing code generator

  • framework for accelerators, XLA.

  • Now, XLA is one of the most advanced machine learning

  • compilers.

  • We have targets for CPU, GPU, TPU, and other backends.

  • How can we increase the reuse between the CPU, GPU, and TPU

  • backends?

  • Well, at the moment, the TPU backend doesn't use LLVM,

  • so that means more low-level components

  • are needed there, because we're not

  • able to reuse some of the same passes or structures.

  • The TPU backend is specialized for the best TPU performance.

  • But because it's so specialized, there's

  • less reuse with the CPU and GPU components.

  • But also looking at these different backends,

  • we notice different abstractions.

  • The CPU, GPU, and TPU backends do not share abstractions

  • beyond the HLO level.

  • So if you look at the different backends,

  • you'll have different levels of support for different loop

  • abstractions, as well as stencil emitters between the two.

  • This makes reusing code between the two more difficult.

  • So furthermore, we have a lack

  • of actual abstractions between HLO and, for example, TPU LLO

  • or LLVM, which results in big gaps.

  • So you have passes that are effectively

  • one-shot compilers doing [INAUDIBLE]

  • from very coarse-grained ops to multiple lower level ops.

  • It's like, well, OK, but if we want

  • to support so many different TensorFlow

  • ops on all these devices, we must leverage as much shared

  • infrastructure as possible.

  • So we should find a way to try and unify

  • these different backends and these abstractions

  • to be able to reuse more of the passes, more of the generators.

  • So then we come to our first question.

  • It's like, well, can we add a new abstraction

  • to allow greater reuse?

  • Are there some new abstractions that we can add?

  • I mean, we thought yes.

  • But you might be saying, well, maybe.

  • And we haven't added them yet, but we're looking at it.

  • One of our goals was to address some usability issues

  • with the current stack.

  • One of them was custom ops for accelerators.

  • The other one was dynamic shapes.

  • But now assuming we had done this,

  • what happens with the rest of the stack?

  • Because the goal is still

  • the end-to-end TensorFlow user experience

  • that we are considering.

  • And so we roughly split out the stack into multiple phases.

  • You have TensorFlow, and we're doing optimizations

  • on the TensorFlow Graph.

  • And this is specifically for targeting TPU.

  • We have the TF2XLA bridge that bridges from TensorFlow

  • ops to HLO.

  • Then on HLO, we have different passes,

  • device independent and device dependent.

  • And then finally, in the backends, we have the emission from HLO

  • to, in this case, TPU LLO or, in the case of CPU and GPU, LLVM.
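
To make that flow concrete, here is a minimal sketch of opting a computation into this TF Graph → TF2XLA bridge → HLO → backend path from user code. Take the exact flag spelling as an assumption: `experimental_compile` is the TF 2.1-era name, and later releases renamed it `jit_compile`.

    import tensorflow as tf

    # Asking XLA to compile this function sends it through the stack above:
    # TensorFlow Graph optimizations, the TF2XLA bridge lowering ops to HLO,
    # device-independent and device-dependent HLO passes, and finally backend
    # emission to LLVM IR (CPU/GPU) or LLO (TPU).
    @tf.function(experimental_compile=True)  # `jit_compile=True` in newer TF
    def dense_relu(x, w, b):
        return tf.nn.relu(tf.matmul(x, w) + b)

    x = tf.random.normal([8, 16])
    w = tf.random.normal([16, 4])
    b = tf.zeros([4])
    print(dense_relu(x, w, b).shape)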

  • And so what we have at the moment is

  • TensorFlow can see the whole program,

  • but it has no insight into the device.

  • So TensorFlow does its optimizations,

  • and then it has these parts where

  • this will run on the device, this will run on XLA.

  • XLA, on the other hand, has deep insight into devices.

  • It knows all the different backends,

  • and it knows how to optimize it.

  • But it can't change the TensorFlow Graph.

  • So XLA assumes that its inputs and the graph structure

  • given to it by TensorFlow are fixed.

  • This results in optimization barriers

  • between the different passes.

  • So a backend can't dictate to the bridge

  • which HLO to produce.

  • The HLO produced is constrained by what is in TensorFlow.

  • And so this leads to things such as the double transpose

  • trick, which allows people to force certain layouts.
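
As a rough illustration of that trick (the function name and permutations below are just an assumed example): the two transposes compose to the identity, but the pattern has been used to coax layout assignment into materializing a particular layout.

    import tensorflow as tf

    def hint_layout(x):
        # x is NHWC. Transposing to NCHW and straight back is a logical no-op,
        # but the explicit transpose pair nudges the compiler's layout
        # assignment toward keeping the data in the desired physical layout.
        x = tf.transpose(x, perm=[0, 3, 1, 2])    # NHWC -> NCHW
        return tf.transpose(x, perm=[0, 2, 3, 1])  # NCHW -> NHWC again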

  • But such operations actually tighten the coupling

  • between the different layers.

  • Now the high-level layer and the lower-level layer

  • have an implicit coupling, and a fixed set of assumptions

  • is hard-coded.

  • Beyond that, the TF2XLA bridge is

  • bridging two different systems with a large impedance

  • mismatch.

  • On the TensorFlow side, we have the op ecosystem,

  • we have dynamic sizes, we have different types,

  • we have stateful ops.

  • On the XLA side, we have HLOs.

  • HLOs are mostly side-effect free beyond a few ops.

  • We have different types, but it's a very different system

  • that the bridge, in one pass, transitions between.
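
As a small sketch of that mismatch (the names here are made up for illustration): an ordinary TensorFlow function can mutate state and accept inputs whose leading dimension is only known at runtime, all of which the bridge has to express in terms of mostly side-effect-free, statically shaped HLO.

    import tensorflow as tf

    step = tf.Variable(0, dtype=tf.int64)  # stateful: lives on the TF side

    @tf.function(input_signature=[tf.TensorSpec([None, 10], tf.float32)])
    def train_step(batch):        # leading dimension is dynamic in TensorFlow
        step.assign_add(1)        # a stateful op the bridge must model explicitly
        return tf.reduce_mean(batch, axis=0)

    print(train_step(tf.ones([32, 10])))
    print(train_step(tf.ones([64, 10])))  # same traced function, new batch size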

  • So at the moment, what we have is

  • that this stack does not abstract out

  • the various functions.

  • We have the top level technologies tied to the lower

  • level hardware implementations.

  • So this results in a large abstraction gap

  • between these layers, which makes the passes more

  • difficult to write.

  • Now, this is not something unique to machine learning

  • or to TensorFlow and XLA.

  • A similar kind of gap also led to domain-specific IRs

  • elsewhere, in this particular case,

  • in the compiler domain.

  • If you look at the different inputs from languages such

  • as Java, C++, Swift, Rust, and Julia,

  • almost all of these languages that target LLVM have

  • introduced a new mid-level IR on which they can do their higher

  • level optimizations.

  • And actually, C++ does not do this,

  • so Clang doesn't have this.

  • And I had tried to omit this slide before,

  • but then Chris stopped me.

  • He's very upset about the fact that there is no [INAUDIBLE].

  • And so this is a very dear point that we

  • are missing a lot of optimizations and reuse

  • by not having the [INAUDIBLE].

  • So what this means is we can do the domain-specific

  • optimizations.

  • We can do progressive lowering, so we can actually

  • have simpler hops between these different systems.

  • And if we look at the TensorFlow case--

  • if you look at the CPU/GPU path, TF Graph to HLO--

  • we can think of HLO as being

  • this intermediate representation, mid-level

  • between TF Graph and LLVM.

  • But domain-specific IRs are great.

  • They allow high-level domain-specific optimizations.

  • This progressive lowering actually

  • encourages reuse between the different levels,

  • because you can have smaller passes doing dedicated things.

  • It's great.

  • It's great for location tracking.

  • And this enables some flow-sensitive type checking,

  • so you can operate on higher level, semantically

  • meaningful parts and perform verification at that level.

  • The part that's not so great--

  • it's a huge expense to build this infrastructure.

  • You're doing a reimplementation of all the same stuff-- pass

  • managers, location tracking, use-def chains, inlining,

  • all of these things.

  • And more importantly, innovations in one community

  • don't benefit the other communities.

  • So there's a downside to having these domain-specific IRs.

  • And if we look at the TensorFlow compiler ecosystem,

  • the previous graph is very much simplified.

  • Because the real situation is, from the TensorFlow Graph,

  • we have multiple different backends

  • and multiple different IRs being generated from it.

  • From a graph, we generate HLOs.

  • TensorRT has an output.

  • There's nGraph, Core ML, TensorFlow Lite--

  • so many different Graph IRs, each with different challenges.

  • So in a lot of these bridges, we have similar but different

  • technologies.

  • And this is not going away anytime soon.

  • But this results in a fragile, poor user

  • experience when failures happen.

  • The location tracking between these different phases

  • is variable.

  • So in some cases, there's no location propagated through.

  • So if an error occurs, the only thing you know about

  • is what happens at the lower level,

  • without being able to trace it back.

  • Beyond this, this also leads to a duplication of infrastructure

  • at all levels.

  • We're reimplementing the same infrastructure multiple times.

  • This is true, even in TensorFlow today.

  • We have the same optimizations at both the TensorFlow and HLO

  • level.

  • And so in some cases, you have optimization

  • passes that you actually redo multiple times.

  • And then in some other cases, you

  • have later passes actually ignoring

  • the results of previous ones.

  • So as an example, we do layout analysis and assignment

  • on TensorFlow Graphs using Grappler.

  • But when we get to HLO, these are mostly

  • ignored, sometimes for good reasons,

  • because XLA actually knows more about the devices.

  • And sometimes, it's unfortunate because that actually

  • would have been a good decision, given the higher level graph

  • structure.

  • But beyond that, we need to actually duplicate these

  • passes, because we cannot represent these same operations

  • in one uniform way.