Inside TensorFlow: MLIR for TF developers

  • JACQUES PIENAAR: OK.

  • Good afternoon, everybody.

  • Today, I'll be presenting MLIR, the multi-level intermediate

  • representation compiler infrastructure,

  • and how we're using it in TensorFlow.

  • And so just an overview.

  • This will be a little bit different than some

  • of the other TF training sessions, as most of them

  • focus on how to better use TF technology,

  • while here we're looking at something that is still coming.

  • So this is mostly forward-looking as to things

  • that we want to do and where we're going.

  • So as an overview for today, I'm going

  • to start by giving an introduction.

  • What is MLIR?

  • Sort of like a 50,000-foot view of what it is,

  • how it all fits together.

  • Then look at why are we developing MLIR?

  • So the idea there is to show the past, where we came from,

  • how we got to this point, to point to what

  • we want to do in the future.

  • Then we will look at two applications.

  • One is the new TensorFlow to TensorFlow Lite converter.

  • It's still in pre-alpha, but definitely

  • try it and give some feedback.

  • And then we'll also look at the TF2XLA bridge,

  • looking at some of the forthcoming changes.

  • And I'll walk through a small example there.

  • OK, so let's start with the main question.

  • What is MLIR?

  • And so some people have said, it's

  • a little bit of a Rorschach test.

  • You can say you can do anything with MLIR,

  • and everybody has a different opinion about what it is.

  • So if we start from a high level,

  • TensorFlow's goal is it's "an open source machine learning

  • framework for everyone."

  • Now, in the same way, MLIR, which stands for multi-level

  • intermediate representation--

  • with the ML not representing "machine learning,"

  • for a change--

  • we're looking at "an open source program optimization

  • framework for everyone."

  • Now, the way I think about it is MLIR

  • as an abstraction-building toolkit--

  • and I'll show some examples of what I mean by that--

  • as well as a reusable set of compiler passes

  • for these higher abstractions.

  • So particularly with MLIR, we are

  • targeting analysis, program optimization,

  • and code generation.

  • And so that's the very high-level view.

  • I'm going to start with why MLIR,

  • and sort of go into exactly what its components are.

  • So when we started MLIR initially, we had a question.

  • Well, we looked at the ML accelerators and we saw, well,

  • there are many, many, many accelerators coming forth.

  • And we had the question-- how do we

  • support all these accelerators for all the given [INAUDIBLE]?

  • And I mean, of course, TensorFlow

  • should provide the best accelerator

  • performance for our users.

  • And so [INAUDIBLE], can we make that easier?

  • So MLIR started as an exploration

  • of a different approach to doing code generators

  • for accelerators.

  • So we started looking at our existing code generator

  • framework for accelerators, XLA.

  • Now, XLA is one of the most advanced machine learning

  • compilers.

  • We have targets for CPU, GPU, TPU, and other back ends.

  • How can we increase the reuse between the CPU, GPU, and TPU

  • backends?

  • Well, at the moment, the TPU backend doesn't use LLVM,

  • so that means more low-level components are

  • needed there, because we're not

  • able to reuse some of the same passes or structures.

  • The TPU backend is specialized for the best TPU performance.

  • But because it's so specialized, there's

  • less reuse with the CPU and GPU components.

  • But also looking at these different backends,

  • we notice different abstractions.

  • The CPU, GPU, and TPU backends do not share abstractions

  • beyond the HLO level.

  • So if you look at the different backends,

  • you'll have different levels of support for different loop

  • abstractions, as well as stencil emitters between the two.

  • This makes reusing code between the two more difficult.

  • So furthermore, we have a lack of

  • actual abstractions between HLO and, for example, TPU LLO

  • or LLVM, which results in big gaps.

  • So you effectively have passes that are

  • one-shot compilers doing [INAUDIBLE]

  • from very coarse-grained ops to multiple lower-level ops.

  • It's like, well, OK, but if we want

  • to support so many different TensorFlow

  • ops on all these devices, we must leverage as much shared

  • infrastructure as possible.

  • So we should find a way to try and unify

  • these different backends and these abstractions

  • to be able to reuse more of the passes, more of the generators.

  • So then we come to our first question.

  • It's like, well, can we add a new abstraction

  • to allow greater reuse?

  • Are there some new abstractions that we can add?

  • I mean, we thought yes.

  • But I mean, you'll be saying, like, maybe.

  • And we haven't added them yet, but we're looking at it.

  • One of our goals was to address some usability issues

  • with the current stack.

  • One of them was custom ops for accelerators.

  • The other one was dynamic shapes.

  • But now assuming we had done this,

  • what happens with the rest of the stack?

  • Because I mean, the goal is still

  • this is an end-to-end TensorFlow user experience

  • that we are considering.

  • And so we roughly split out the stack into multiple pieces.

  • You have TensorFlow, and we're doing optimizations

  • on the TensorFlow Graph.

  • And this is specifically for targeting TPU.

  • We have the TF2XLA bridge that bridges between TensorFlow

  • ops to HLO.

  • Then on HLO, we have different passes,

  • device independent, device dependent.

  • And then finally, we have in the backends the emission from HLO

  • to, in this case, TPU LLO or, in the case of CPU, GPU, LLVM.

  • And so what we have at the moment is

  • TensorFlow can see the whole program,

  • but it has no insight on the device.

  • So TensorFlow does its optimizations,

  • and then it has these parts where

  • this will run on the device, this will run on XLA.

  • XLA, on the other hand, has deep insight into devices.

  • It knows all the different backends,

  • and it knows how to optimize it.

  • But it can't change the TensorFlow Graph.

  • So XLA assumes as fixed all its inputs and the graph structure

  • given to it by TensorFlow.

  • This results in optimization barriers

  • between the different passes.

  • So a backend can't dictate to the bridge

  • which HLO to produce.

  • The HLO produced is constrained by what is in TensorFlow.

  • And so this leads to things such as the double transpose

  • trick, which allows people to force certain layouts.

  • But such operations actually tighten the coupling

  • between the different layers.

  • Now the high-level layer and the lower-level layer

  • have an implicit coupling, and a fixed set of assumptions

  • is hard-coded.

  • Beyond that, the TF2XLA bridge is

  • bridging two different systems with a large impedance

  • mismatch.

  • On TensorFlow's side, we have the op ecosystem,

  • we have dynamic sizes, we have different types,

  • we have stateful ops.

  • On the XLA side, we have HLOs.

  • HLOs are mostly side effect free beyond a few ops.

  • We have different types, but it's a very different system

  • that the bridge, in one pass, transitions between.

  • So at the moment, what we have is

  • that this stack does not abstract out

  • the various functions.

  • We have the top level technologies tied to the lower

  • level hardware implementations.

  • So this results in a large gap of abstractions

  • between these layers, which makes the passes more

  • difficult to write.

  • Now, this is not something unique to machine learning

  • or to TensorFlow, XLA.

  • This similar kind of gap also led to domain-specific IRs

  • elsewhere, and in this particular case,

  • in the compiler domain.

  • If you look at the different inputs from languages such

  • as Java, C++, Swift, Rust, and Julia,

  • almost all of these languages that target LLVM have

  • introduced a new mid-level IR on which they can do their higher

  • level optimizations.

  • And actually, C++ does not do this,

  • so Clang doesn't have this.

  • And I had tried to omit this slide before,

  • but then Chris stopped me.

  • He's very upset about the fact that there is no [INAUDIBLE].

  • And so this is a very dear point that we

  • are missing a lot of optimizations and reuse

  • by not having the [INAUDIBLE].

  • So what this means is we can do the domain-specific

  • optimizations.

  • We can do progressive lowering, so we can actually

  • have simpler hops between these different systems.

  • And if we look at the TensorFlow one,

  • we can think of this as-- if you look at the CPU/GPU path,

  • like the TF Graph to HLO, as the HLO being

  • this intermediate representation, mid level

  • between TF Graph and LLVM.

  • But domain-specific IRs are great.

  • They allow high-level domain-specific optimizations.

  • This progressive lowering actually

  • encourages reuse between the different levels,

  • because you can have smaller passes doing dedicated things.

  • It's great.

  • It's great for location tracking.

  • And this enables some flow-sensitive type checking,

  • so you can operate on higher level semantically

  • meaningful parts and perform verification at that level.

  • The part that's not so great--

  • it's a huge expense to build this infrastructure.

  • You're doing a reimplementation of all the same stuff-- pass

  • managers, location tracking, use-def chains, inlining,

  • all of these things.

  • And more importantly, innovations in one community

  • don't benefit the other communities.

  • So there's a downside to having these domain-specific IRs.

  • And if we look at the TensorFlow compiler ecosystem,

  • the previous graph is very much simplified.

  • Because the real situation is, from TensorFlow Graph,

  • we have multiple different backends

  • and multiple different IRs being generated from it.

  • From a graph, we generate HLOs.

  • TensorRT has an output.

  • There's nGraph, Core ML, TensorFlow Lite--

  • so many different Graph IRs, each with different challenges.

  • So in a lot of these bridges, we have similar but different

  • technologies.

  • And this is not going away anytime soon.

  • But this results in a fragile, poor user

  • experience when failures happen.

  • The location tracking between these different phases

  • is variable.

  • So in some cases, there's no location propagated through.

  • So if an error occurs, the only thing you know about

  • is what happens at the lower level,

  • without being able to trace it back.

  • Beyond this, this also leads to a duplication of infrastructure

  • at all levels.

  • We're reimplementing the same infrastructure multiple times.

  • This is true, even in TensorFlow today.

  • We have the same optimizations at both the TensorFlow and HLO

  • level.

  • And so in some cases, you have the optimization

  • pass that you actually redo multiple times.

  • And then in some other cases, you

  • have future passes actually ignore

  • the results of previous ones.

  • So as an example, we do layout analysis and assignment

  • on TensorFlow Graphs using Grappler.

  • But when we get to HLO, these are mostly

  • ignored, sometimes for good reasons

  • because XLA actually knows better about the devices.

  • And sometimes, it's unfortunate because that actually

  • would have been a good decision, given the higher level graph

  • structure.

  • But beyond that, we need to actually duplicate these

  • passes, because we cannot represent these same operations

  • in one uniform way.

  • And an unfortunate side effect of this

  • is we actually end up transforming multiple times

  • back and forth between equivalent representations.

  • So for example, in one unit test,

  • we converted from Graph to GraphDef seven times,

  • from GraphDef to Graph four times, and then once to HLO.

  • And I mean, these are, in a way, not useful transformations.

  • But our problem is we are unable to represent

  • all of these different ops and structures together,

  • so we need to duplicate this.

  • So with that, the goal of MLIR is to enable global improvement

  • to TensorFlow infrastructure.

  • It's an SSA-based design to generalize and improve

  • ML graphs.

  • We want to add better side effect modeling, control flow

  • representation, improved generality

  • of the lowering passes--

  • which I'll explain when we get to the applications--

  • focus on dramatically increasing the code

  • reuse between all these distinct paths, fix location

  • tracking, and other pervasive issues

  • for better user experience.

  • So when a failure occurs, the traceability

  • and the debuggability are greatly improved.

  • And at the moment, we believe there's no reasonable existing

  • answers.

  • And we refuse to copy and paste another SSA-based optimizer

  • six more times.

  • And so that led us to MLIR.

  • A couple of the other goals is, well, we

  • want to embrace TensorFlow.

  • TensorFlow is one of our main targets.

  • We don't want to work around it.

  • We want to support the full generality of TF graphs.

  • We want to allow the aggressive reuse of infra

  • across multiple hardware paths.

  • And similar to TensorFlow, we want

  • to allow open customization.

  • We want to have a target be able to say,

  • implement my JPEG decoder using this block.

  • We want a user to be able to say, hey,

  • I have a custom kernel.

  • For this model I'm running, I want to see its performance,

  • so use my kernel.

  • Beyond that, we want to enable folks to experiment

  • with their own lower level implementations of operations.

  • So if the compiler is not there yet,

  • or the researcher has a better idea,

  • or we have an MLIR algorithm generating code,

  • we want to be able to plug that into the same system

  • and see the effect on an end-to-end behavior.

  • But we also want to embrace the limitations

  • of particular backends.

  • So for example, if your backend only supports convolution,

  • we want to provide convolution.

  • If you don't support control flow,

  • well, we don't give you control flow.

  • If you have static shapes, we can

  • only give you graphs with static shapes, et cetera.

  • This includes what floating point precision

  • operations you support, forced quantization, things like that.

  • We want to avoid those big semantic gaps in lowering.

  • We do not want to have these big gaps where

  • one step to the next in the transformation path

  • has to bridge completely separate systems, which is

  • difficult to debug and verify.

  • And then very importantly, we want to improve the traceability

  • and testability for users.

  • So if your model fails compilation,

  • we want to be able to point back to where it failed.

  • And so with this, what should MLIR provide?

  • Well, it should represent multiple levels of abstraction.

  • We should allow this progressive lowering--

  • so with any given function, having

  • a progressive set of lowerings that

  • gets you to the destination.

  • It should not be these big jumps between separate data

  • structures.

  • We want to be able to lower through multiple different

  • abstractions.

  • This also means that the passes need

  • to be designed to operate on these different levels

  • and properties, rather than looking at fixed ops.

  • And I mean, I think this is especially

  • essential for TensorFlow, which has an open ecosystem with ops

  • being added very, very regularly at a good pace.

  • We also should make it easy to add

  • abstractions or domain-specific IR constructs.

  • An example here is, we have the affine dialect.

  • In our affine dialect, we have affine loops.

  • This isn't a hard coded construct in MLIR.

  • An affine loop, which can be used for some polyhedral code

  • optimization, is something that is added as an extension.

  • The dialect itself provides it; a sketch follows below.
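
As a concrete illustration, here is a minimal sketch of an affine loop nest from the affine dialect in MLIR's textual form. The function name, shapes, and values are made up for illustration, and the exact op spellings (for example, mulf) are from around the time of this talk and have since evolved.

    func @scale(%buf: memref<16x16xf32>, %c: f32) {
      affine.for %i = 0 to 16 {
        affine.for %j = 0 to 16 {
          // Load, scale, and store one element; the loop bounds and
          // subscripts are affine, which is what makes polyhedral
          // analyses and transformations possible on this form.
          %v = affine.load %buf[%i, %j] : memref<16x16xf32>
          %s = mulf %v, %c : f32
          affine.store %s, %buf[%i, %j] : memref<16x16xf32>
        }
      }
      return
    }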

  • Beyond that, we're looking at location

  • as a first class construct.

  • Locations are intrinsically tied to operations

  • and optimizations.

  • You cannot create an op without a location.

  • If you're doing transformation and you're

  • replacing with a new op, you have

  • to specify where this op comes from.

  • So this means we have a path-- hm?

  • AUDIENCE: What do you mean by the location?

  • JACQUES PIENAAR: So location in this case

  • could be file location, could be name, could be stack trace.

  • So we have a couple of different locations,

  • and we actually also have an opaque location

  • that is just interpretable by a dialect, for example--

  • so if your backend has a different concept.

  • The most common ones for TensorFlow

  • is a name location corresponding to the name in the GraphDef

  • or the Python call stack--

  • so the set of calls that got you to creating this op.
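
A rough sketch of how those locations show up when the IR is printed with debug info. The ops, names, and file positions here are invented for illustration, and the exact textual syntax may differ between MLIR versions.

    // A name location, e.g. the node name carried over from the GraphDef.
    %0 = "tf.Relu6"(%arg0) : (tensor<1x3xf32>) -> tensor<1x3xf32> loc("model/relu6")

    // A call-site location built from the Python stack that created the op.
    %1 = "tf.Identity"(%0) : (tensor<1x3xf32>) -> tensor<1x3xf32>
           loc(callsite("my_layer" at "model.py":42:7))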

  • Another problem we also want to work on

  • is, we want to have this framework enable us to complete

  • a patchwork of tools.

  • So at the moment, a couple of users

  • have run into a problem where we have broken paths.

  • You have a tool A that will get you from representation x to y.

  • And you have a tool B that gets you from y prime to z.

  • But if you actually want to get from x to z,

  • you have to do something else, or you

  • have to restrict them all.

  • We want to try and get a path to complete this patchwork

  • and enable end-to-end workflows of interest.

  • And so this is sort of like, where are we applying MLIR?

  • So mostly, I have been talking about the infra.

  • The first application is the TensorFlow

  • to TensorFlow Lite Converter.

  • This is something which is, like I said, pre-alpha.

  • I'll discuss it next.

  • And it's working for a couple of models.

  • We have a couple of new features coming

  • in there that enable some new TensorFlow Lite features.

  • The other target we're working on

  • is the TensorFlow to XLA bridge.

  • You know, looking at the current lowering to XLA,

  • as well as accelerators.

  • We're working with the Grappler team

  • on graph optimizations, shape inference, device partitioning,

  • and placement.

  • And then we also have the TPU and GPU codegen projects

  • going on to evaluate new approaches to generate code

  • for these different devices.

  • So I think of MLIR as three different parts.

  • One, you have the graph compiler,

  • which covers things like op expansions, the lowering to TF Lite,

  • auto outside compilation, parallelism/sharding,

  • that sort of thing. Two, the code generator,

  • which focuses on high-level abstractions for code

  • generation.

  • And so there's polyhedral loop nests,

  • tiled tensor views, things of this nature.

  • And underlying all of this is the MLIR infrastructure.

  • So this is framework-independent IR.

  • You have rewrite generators, like automatic pattern

  • matchers, the mechanisms to define dialects

  • consisting of types and ops.

  • And that ties all of this together.

  • And that leads me to one of our first applications,

  • which is the TensorFlow Lite converter.

  • So the basic flow for TensorFlow Lite Converter

  • is a TensorFlow Graph as input.

  • And here, I'm using Graph--

  • misusing it slightly, because it could be a graph,

  • it could be from a Python script, [INAUDIBLE] a SavedModel.

  • But the goal is to translate it from this representation

  • to the MLIR module, consisting of ops in the TF and TF

  • executor dialect, which I'll come to in a second,

  • to legalize from TF to TensorFlow Lite--

  • and now legalize is just a different way of saying

  • convert--

  • convert all the ops in TensorFlow

  • to TensorFlow Lite ops that are legal.

  • And so the reason we use legalize here

  • is we don't necessarily need to convert all the ops.

  • For example, TensorFlow Lite supports flex ops.

  • It supports custom ops.

  • So some ops may actually remain in the TensorFlow

  • dialect, as in the sketch below.
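
A small sketch of what such a partially legalized module could look like, with TensorFlow Lite and TensorFlow ops coexisting in one function. The second op name is hypothetical, standing in for any op intentionally left in the TF dialect (for example, to run as a flex op); syntax is approximate.

    func @partially_legalized(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
      // Converted to the TensorFlow Lite dialect.
      %0 = "tfl.relu6"(%arg0) : (tensor<1x3xf32>) -> tensor<1x3xf32>
      // Hypothetical op left in the TensorFlow dialect, e.g. run as a flex op.
      %1 = "tf.SomeUnsupportedOp"(%0) : (tensor<1x3xf32>) -> tensor<1x3xf32>
      return %1 : tensor<1x3xf32>
    }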

  • Then beyond that, we have some optimizations,

  • and then translating it back out to TensorFlow Lite flatbuffer.

  • So the converter is to change between the two different graph

  • representations.

  • We have two different runtimes--

  • TensorFlow and TensorFlow Lite.

  • They have different constraints and targets.

  • But the graphs that users train and run,

  • we want those to be the same.

  • And so we have this converter workflow.

  • The converter actually has overlapping goals

  • with regular compilation, because, I mean,

  • Edge devices can also have accelerators.

  • And beyond that, I think of TensorFlow Lite in a way,

  • with this converter, as just a weird ISA.

  • So you have a weird instruction set that you're targeting,

  • but it's still a compiler problem.

  • Now, MLIR's pluggable type and rewrite system

  • simplifies specifying these transforms

  • and expressing what you need to do

  • to convert between these ops.

  • And as an example here, the quantized type

  • is a first class citizen in the TensorFlow Lite dialect.

  • So we have the quantized op representation.

  • So it's not just a uint8, it's actually, you could say,

  • a uniform quantized type with these parameters, which

  • allows for some additional checking and verification.
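
To make that concrete, here is a sketch of a uniform quantized tensor type as it appears in the IR. The storage type, scale, and zero point are invented example values; the point is that the quantization parameters live in the type itself rather than in a bare uint8.

    // An 8-bit uniform quantized element type with scale 0.1 and zero point 128.
    %q = "tfl.quantize"(%x) {qtype = tensor<1x3x!quant.uniform<u8:f32, 0.1:128>>}
           : (tensor<1x3xf32>) -> tensor<1x3x!quant.uniform<u8:f32, 0.1:128>>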

  • So as I mentioned earlier, one of the focuses is usability.

  • Now, usability is one of the top complaints about TOCO

  • among TFLite users--

  • CHECK'ing on errors, unsupported cases,

  • confusing error messages--

  • and this is one of the things we want to improve.

  • One of the ways we want to improve it

  • is in making debugging easier.

  • We want the locations to point back to your source.

  • So when an error is emitted, we want

  • to point back to the TensorFlow Python that caused the error.

  • And for this, we are bolting on and extending the TF debug info

  • work currently ongoing, and we want

  • to track the origin location of instructions as well.

  • So for example, if you have a fused multiply-add,

  • then it has a Fused Location corresponding

  • to both the multiply and the add.

  • So it shows you this new op was created

  • from these previous ops, and that allows you to track back

  • to the original code.

  • After that, we also actually want

  • to be able to say why a model failed to convert.

  • So we want to point to the unsupported ops and types.

  • We also want to say how those types came to be.

  • Oftentimes, the user gets a failure,

  • and they have no idea why this op isn't supported.

  • Because they, in some cases, didn't even

  • mean to have this op there.

  • And so we want to be able to make it easy to find out

  • why it got to a model failing.

  • And as I mentioned, we have dialect types

  • to enable more checking and better reporting.

  • So now, we can have things that are saying,

  • oh, you have an add with two nonconforming quantized types.

  • I'm sorry.

  • This add won't work and would fail at runtime.

  • We can do the checking at compile time.
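
A sketch of the kind of mismatch that can now be seen directly in the IR: an add whose two operands carry different (nonconforming) quantized parameterizations, which a dialect verifier or a converter check can flag before anything runs. The op name, attribute, and whether this exact case is rejected are illustrative assumptions.

    // Two operands with nonconforming quantized types feeding one add.
    %bad = "tfl.add"(%lhs, %rhs) {fused_activation_function = "NONE"}
             : (tensor<4x!quant.uniform<u8:f32, 0.1:128>>,
                tensor<4x!quant.uniform<u8:f32, 0.5:0>>)
             -> tensor<4x!quant.uniform<u8:f32, 0.1:128>>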

  • And so to give an example, if you

  • look at the old TOCO experience for having

  • an unexpected value for an attribute,

  • you get a check failure.

  • And the check failure will point you to a stack trace somewhere.

  • And we want to go from that to where we are today,

  • where we specify like, hey, this node failed to convert,

  • because this TF op actually requires

  • its data_format attribute to be either NHWC or NCHW.

  • And this op was inserted by the following call,

  • from your user code and the libraries

  • in between.

  • And this allows the user to go find where the error occurred.

  • And I'll mention this is also still evolving--

  • if you try it today, you'll actually

  • see carets pointing to the errors,

  • as you would see with Clang compilation errors--

  • so source code interleaved, as long as it's

  • in the same workspace.

  • And so the idea is to make the user experience much easier

  • for debugging these errors.

  • The next application is the new TensorFlow compiler bridge.

  • So at the moment, the TF2XLA bridge

  • is an interop between TensorFlow and XLA.

  • It consists of rewrite passes, as well as transformations

  • to XLA.

  • Now, XLA also targets from multi-node machines

  • to Edge devices.

  • So it's actually not as distinct from TensorFlow Lite.

  • And so also the paths we use to lower

  • to these two different backends should not be as distinct.

  • We should be able to reuse them.

  • As new features become available in one,

  • everyone should be able to take advantage of them.

  • And I'll actually mention that I want

  • to get rid of the word "bridge" here, because one of our goals

  • is not to have this big span between different abstractions.

  • We want to have this more progressive,

  • with shorter, lowering steps that are individually testable.

  • And beyond that, we don't want it

  • to be just XLA-specific, looking at custom compiler backends.

  • We want to make it easy for other compiler frameworks

  • to integrate into TensorFlow in a unified way.

  • So the different compilers can have their own dialect,

  • for example.

  • Dialects can integrate in the same framework.

  • You define your operation and types,

  • and you can have the pattern to rewrite

  • specifying the transformations you

  • need to target your backend.

  • And it also means that custom pipelines can

  • reuse the existing components.

  • We want to make reusing the pipeline as easy as possible.

  • So MLIR, one of the goals is to be a reusable set of compiler

  • passes.

  • And so one of them is to translate

  • from TensorFlow or the XLA dialect to your dialect.

  • And so folks then have the option of saying,

  • what ops are legal in the dialect?

  • And then of course, throughout all this,

  • we want to be able to optimize the graph

  • and make it easier for the backends

  • to focus on the parts that are important to them.

  • I'll give a little example of the current

  • versus the new approach in the TF2XLA bridge,

  • for converting a TensorFlow op to an XLA op.

  • And so taking Relu6 as an example--

  • so we have Relu6.

  • You define it using a class.

  • You register it as an XLA op.

  • This op is an XLA op kernel, which actually derives

  • from the TensorFlow op kernel.

  • Inside of it, it uses the XlaOpKernelContext

  • and its SetOutput and Input methods.

  • This is very familiar to folks adding new TensorFlow ops;

  • one of the differences here is that this is actually

  • constructing an XLA expression.

  • This XLA expression is something that

  • is kept in a side data structure captured by the context.

  • And the output set here are actually--

  • the values flowing through the graph

  • are tensors containing pointers to this XLA expression.

  • So what you have here is actually

  • a pointer being fed into the output of this op

  • and flowing through the TensorFlow graph.

  • But you can represent it, and it's very familiar,

  • I think, to folks to how to do this.

  • And I think that's one of the problems.

  • Because it's so complicated, that means

  • if something goes wrong, it's difficult to find out

  • what went wrong.

  • And so the way we're testing this at the moment

  • is by writing Python tests.

  • And for example, [INAUDIBLE] test Relu6,

  • we have a UnaryOpTest that derives from the XLATest classes.

  • And so we have an end-to-end test case

  • which starts from TensorFlow and ends with execution

  • on different devices.

  • So per floating point type, per device,

  • we execute Relu6 both compiled and using plain TensorFlow.

  • So we're testing runtime values for different types

  • on different devices and checking approximate equality.

  • But very important here, you actually

  • have to construct this test to avoid all

  • the optimizers along the way.

  • So one of my favorite examples was,

  • I was once debugging why our [INAUDIBLE] take so long.

  • And looking at the longest running test,

  • I actually found out it was a no-op, because it was being

  • constant-propagated away.

  • And we were just testing constant folding over,

  • and over, and over again, instead

  • of actually running anything.

  • So it can be very difficult to ensure

  • that the test you think you are writing

  • is actually the one you're writing.

  • And another point is, this is testing the nn_ops.relu6.

  • In this case, it actually corresponds to a TensorFlow op.

  • But when I talk about TensorFlow ops later,

  • I'm referring to the ops as registered by op registration--

  • so the C++ ops, versus the Python constructs.

  • Anyway, the current approach with TF2XLA

  • is it's a symbolic execution of the TF

  • graph using the executor.

  • We're storing pointers to a side data structure in tensors

  • flowing through the graph.

  • We capture the XLA type in tensors in different data

  • structures, depending on the TensorFlow

  • type flowing through the graph.

  • We're mostly using end-to-end tests,

  • using Python, for constructing it, as this is the easiest way.

  • And it allows for a very complicated test.

  • But we need to thread these test

  • cases past O(n) optimizers to actually ensure they work.

  • Now, the new approach we want to have here

  • is, we want to make it so that you can write the directed unit

  • tests simply.

  • In MLIR, the source of truth for the operations is the IR.

  • The IR is round trippable through textual form, which

  • means you can take an IR, run an optimization pass, and dump it.

  • And that's exactly what you would have gotten

  • with the in-memory changes.

  • Beyond that, we want to ensure that there's

  • no optimization run beyond what a developer specified.

  • We want the types to be representable in the IR itself,

  • and in one way.

  • There should not be a confusion as to what type is represented,

  • where, and what it is.

  • And also, we want to enable having

  • these multiple abstractions to lower,

  • without having large jumps between the different systems.

  • And so just a plug at the start--

  • and this slide's actually out of date--

  • but so we have mlir-opt and the equivalent of that,

  • tf-opt, which are optimization tools similar

  • to LLVM's opt tool.

  • It's a tool for testing compiler passes.

  • Effectively, what you have is you have IR as input.

  • You run MLIR opt, specifying an exact pass,

  • and you get IR out again.

  • So textual in, and textual out.

  • This allows for pretty cheap verification of the output,

  • because you're verifying the transformation.

  • You're verifying the structure of the transformation.

  • So you do not need to actually run it on different devices,

  • if you just want to verify the transformation.

  • You do not need to compute values,

  • you're verifying the structure.
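
For a sense of what such a directed test looks like in practice, here is a minimal, hypothetical .mlir test file: the RUN line pipes the file through one named pass of mlir-opt, and FileCheck verifies the structure of the printed result. The pass and function are purely illustrative, and the op spellings are the standard-dialect ones from around the time of the talk.

    // RUN: mlir-opt %s -canonicalize | FileCheck %s

    // CHECK-LABEL: func @fold_add
    func @fold_add() -> i32 {
      %c1 = constant 1 : i32
      %c2 = constant 2 : i32
      // The canonicalizer folds the add of two constants into one constant.
      // CHECK: constant 3
      // CHECK-NOT: addi
      %sum = addi %c1, %c2 : i32
      return %sum : i32
    }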

  • And so in this case, if we look at the TF Relu6 example,

  • you can create a test case using MLIR Translate, which

  • is another tool that goes from a TF graph into MLIR.

  • Or you can manually write it, if you enjoy writing SSA-based IR

  • in textual form.

  • So here is an example of a function that takes this input

  • tensor of a fixed shape.

  • This is actually corresponding to a previous example.

  • And I'm not going to get into details,

  • but you can see the TF dialect RFC

  • for more information about the two different dialects.

  • But here, we have the TF executor dialect

  • consisting of the graph and an island within it.

  • Within the island, you have the ops in the TF dialect.

  • And so in this case, you have the Relu6 operation

  • with the normal attributes of T, device, and name, taking

  • in a tensor of 1 times 3 times f32

  • and producing the same one again, which gets yielded.
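
Roughly, the imported form being described looks like the following. The syntax is approximate and the attribute spellings have changed over time: a tf_executor.graph wrapping a single island, with the tf.Relu6 op and its T, device, and name attributes inside it.

    func @main(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
      %result = tf_executor.graph {
        %out, %ctl = tf_executor.island {
          // The TF dialect op, still carrying its GraphDef attributes.
          %0 = "tf.Relu6"(%arg0) {T = "tfdtype$DT_FLOAT", device = "", name = "Relu6"}
                 : (tensor<1x3xf32>) -> tensor<1x3xf32>
          tf_executor.yield %0 : tensor<1x3xf32>
        }
        tf_executor.fetch %out : tensor<1x3xf32>
      }
      return %result : tensor<1x3xf32>
    }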

  • Now from this, we can actually convert to the TF dialect,

  • simply because in this case we have a single island

  • due to no control dependencies and no side-effect ops.

  • And so this is the pure TF dialect representation

  • of this, where we don't actually need an island for this.

  • And you'll see a single Relu6 operation.

  • But you'll actually see duplicate information

  • stored here, because now we have the explicit types.

  • And so with the explicit types, we can actually

  • get rid of all these different attributes for T,

  • and the mapping from T to the result type, all of this,

  • because we have the types.

  • So these are derived attributes from the op itself,

  • with the type being the source of truth for them.

  • And so in the import and export between

  • GraphDefs and NodeDefs, we can use this information

  • and derive it from the ops.

  • And so you can have this simpler form

  • to just specify the op and the result type.
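
After lowering out of the executor dialect and dropping the derived attributes, the same function reduces to roughly this sketch, with the element type carried only by the tensor types:

    func @main(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
      // T, device, and name are gone; the types are the source of truth.
      %0 = "tf.Relu6"(%arg0) : (tensor<1x3xf32>) -> tensor<1x3xf32>
      return %0 : tensor<1x3xf32>
    }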

  • And then from here--

  • oh, and one thing you might have noticed in the previous slide,

  • it actually highlighted the names, as well as

  • the attributes.

  • And that's because all ops have locations.

  • And from TensorFlow, one of the most common ones

  • is that the name is used as the location.

  • If you have the debug info, you can also

  • use the call stack as that information is provided.

  • And so locations for ops are optionally printed.

  • And so in this case, if we actually

  • printed the locations as well, you can see the original name

  • from the import as well.

  • Now names are one location, file line is another,

  • and then call stack's another.

  • And then for example, if you look at the fused ones--

  • and this is from one of the other examples-- you

  • can see that the location of this op was actually

  • due to a fusing of a Relu, BiasAdd, convolution,

  • and this constant b.

  • And now you have this single op.

  • That is its location.

  • You can trace back how you got to that op.
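
A sketch of such a fused location on a single fused convolution op; the operands, shapes, attribute values, and node names are invented, but the loc(fused[...]) part is the mechanism being described:

    %0 = "tfl.conv_2d"(%input, %filter, %bias)
           {dilation_h_factor = 1 : i32, dilation_w_factor = 1 : i32,
            fused_activation_function = "RELU", padding = "SAME",
            stride_h = 1 : i32, stride_w = 1 : i32}
           : (tensor<1x8x8x3xf32>, tensor<16x3x3x3xf32>, tensor<16xf32>)
           -> tensor<1x8x8x16xf32>
           // The fused op remembers every original op it came from.
           loc(fused["conv1/Conv2D", "conv1/BiasAdd", "conv1/Relu", "conv1/b"])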

  • And now if you want to lower from TF ops to XLA ops,

  • well, we also have--

  • sorry-- previously, as shown, we have the TF dialect

  • to match the TF ops.

  • So these are ops defined via REGISTER_OP.

  • And so this is differentiated from the Python ops.

  • And similarly, we also have an XLA dialect with ops

  • to match HLOs.

  • Converting between these two, you

  • can use a general legalization framework.

  • And this framework is a general graph rewriting infrastructure.

  • And for example, we have patterns,

  • so you specify patterns to go from a source

  • DAG to a destination DAG.

  • So in this case, convert from a TF Relu6

  • op that is a tensor of one of those types.

  • If that's true, then convert it to an XLA ClampOp

  • with 0 and 6 around the input.

  • So the XLA ClampOp has as its first argument min, then the operand,

  • then max.

  • These are declarative rules.

  • They're effectively a statement in equality

  • under conditions that we can use to optimize for a given target

  • dialect.

  • And these rules actually also allow for dynamic constraints.

  • So you can say, only fire this rule if the following is true.

  • And that allows you to specialize the rules

  • for certain cost models, certain backends, all

  • these kinds of things.

  • But with that rule added into the system,

  • we can run tf-opt again.

  • We have dash xla-legalize.

  • And now from our previous input, we now get this as output.

  • So now you have a very directed change

  • from the TF Relu6 to the XLA Clamp,

  • with two extra constant ops added in the function.
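
The output being described looks roughly like this sketch, with the tf.Relu6 replaced by a clamp between two newly introduced constants. The xla_hlo dialect and op spellings are as of around the time of this talk and have since been renamed upstream.

    func @main(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
      %min = xla_hlo.constant dense<0.000000e+00> : tensor<f32>
      %max = xla_hlo.constant dense<6.000000e+00> : tensor<f32>
      // clamp(min, operand, max), with scalar min/max applied across the input.
      %0 = "xla_hlo.clamp"(%min, %arg0, %max)
             : (tensor<f32>, tensor<1x3xf32>, tensor<f32>) -> tensor<1x3xf32>
      return %0 : tensor<1x3xf32>
    }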

  • And of course, backends can define different lowerings.

  • But this just means this transformation to clamp now

  • can be simply verified textually.

  • No execution is needed.

  • No explicit values need to be verified.

  • The verification of the correct execution of the different ops

  • can now be done independently of the verification

  • of the transformation.

  • AUDIENCE: But like, if you wanted to change the way

  • this pattern's implemented to go from a single clamp to two

  • clamps, like 1, 4, 0 and 1, 4, 6, for example,

  • that would break all your unit tests,

  • but it would still be correct.

  • JACQUES PIENAAR: Correct.

  • Yes.

  • Yes.

  • So I mean it's--

  • AUDIENCE: You're not worried about that?

  • JACQUES PIENAAR: It's a question of--

  • I'm not.

  • No, because I mean, what I'm verifying at the moment

  • is the transformation.

  • And you're saying, well, I can get

  • the same equivalent numerical result

  • by multiple different transformations.

  • AUDIENCE: And if someone one day changes one, a lot of tests

  • are going to break.

  • JACQUES PIENAAR: But you should not

  • be verifying this transformation.

  • Depending on where you verify this, yes.

  • AUDIENCE: OK.

  • Yeah.

  • JACQUES PIENAAR: OK.

  • And so what I did not show here was also how we autogenerate

  • C++ classes for operations, with helper functions.

  • So in this case, for example, the XLA clamp op in C++

  • actually has accessors, such as min, operand, and max.

  • So you don't have to specify get operand 3.

  • We also generate docs from one description,

  • so we actually have one description format from which

  • we generate the exporting to GraphDef, TFLite, Flatbuffer,

  • XLA protos.

  • All the ops defined in MLIR, you can

  • specify verifications for the ops,

  • as well as structural verifications

  • for regions and functions, which means

  • you can capture the ops and their invariants together.

  • You can verify graphs after every step

  • of every transformation, if you want to.

  • So you can narrow down failure cases.

  • And it also means that passes can actually assume valid graphs,

  • so that you operate on a valid graph

  • without having to repeat the same pattern everywhere

  • to be defensive.

  • I also didn't speak a lot about the pluggable type

  • system and the code generation.

  • And so a lot of these, in the future

  • when we actually have more examples and more time,

  • we can definitely go into more of these.

  • But this was just like a whirlwind tour

  • for one of the applications we're looking at.

  • And sort of as in conclusion, MLIR

  • is a new compiler infrastructure to unify graph and codegen

  • for TensorFlow.

  • We're looking at representing multiple levels of abstractions

  • and ops from different dialects coexisting

  • and being optimized together.

  • We want to enable progressive lowering

  • in a testable and verifiable manner,

  • making it easier to add these tests verifying behavior.

  • And beyond that, we want to make the infrastructure as

  • unopinionated as possible.

  • We want to be able to get out of the way of the developer

  • and enable them to define their own abstractions for targeting

  • their use cases in their backends.

  • OK, with that, I want to thank everybody.

  • And time for questions.

  • [APPLAUSE]

  • AUDIENCE: I'm going to riff on this word, "opinioneated."

  • Do you have an opinion on memory safety for MLIR dialects?

  • JACQUES PIENAAR: OK, that is a broad question.

  • AUDIENCE: And so I can imagine, in the sense

  • of progressive lowering, to lower

  • a dialect that uses raw pointers instead of symbolic handles

  • to tensors.

  • JACQUES PIENAAR: Yes.

  • AUDIENCE: Is there expected to be an infrastructure that

  • will talk about safe and unsafe regions in programs?

  • Because if we're shipping them around to people,

  • it would be unfortunate if this became a vector for--

  • JACQUES PIENAAR: Yes.

  • So I think, again, sort [INAUDIBLE] upwards.

  • I mean, I think, in even simple cases,

  • if you think about numerical equivalence of transformations,

  • right?

  • So a case where we have one op that

  • has certain [INAUDIBLE] behavior and trapping behavior,

  • converting it to a different dialect which has,

  • I don't know, perhaps none of that.

  • So for example, let's say we go to a dialect which says,

  • well, fast math is fine.

  • So all the optimizations, division by zero

  • never happens, so you can always do an invert and multiply.

  • And I think that boils down to this: the rules

  • you use to do the transformations for where you're heading need

  • to be aware of that.

  • So I mean, if you're lowering to a different dialect which

  • has fewer guarantees, I think that is

  • up to the legality of the end to determine that, right?

  • So meaning the memory safety verification,

  • I feel like in the dialect where it's safe,

  • we have to insert all the verification and whatnot.

  • If we're going towards the dialect which is unsafe,

  • we have one or two options-- either insert runtime checks

  • to do the verification and any extra sanitization

  • and [INAUDIBLE] them where we know it's possible,

  • or we have to say, well now, it's unsafe.

  • And sorry, you're taking an unsafe input.

  • I mean, that--

  • AUDIENCE: No, no--

  • JACQUES PIENAAR: I haven't thought about this much,

  • but that's--

  • Well, I will say, it's going to be more fun

  • as we start playing with this.

  • And so, I mean, I encourage folks

  • to start pushing and prodding it and find bugs.

  • I mean, at the moment, it's very much infrastructure

  • that's probably not in your usage path today.

  • But you know, we want to make it used for that.

  • Anyway, and a button to--

  • and this will be in the TensorFlow GitHub repo

  • later today.

  • Like, everything will be open.

  • AUDIENCE: Everything is in the TF and TF control flow

  • dialects?

  • JACQUES PIENAAR: TF control flow dialects, XLA, TF Lite.

  • AUDIENCE: TF Lite?

  • JACQUES PIENAAR: Yeah.

  • AUDIENCE: And there are also going to be the open design

  • meetings [INAUDIBLE]?

  • AUDIENCE: So can you point us [INAUDIBLE]?

  • AUDIENCE: Sorry?

  • AUDIENCE: Yeah, a quick question here.

  • Quick question.

  • Can you share a simple example of, I

  • don't know, like TensorFlow with a fully connected layer?

  • And then that we can go, like, step by step?

  • For example, converting it to MLIR, then

  • the optimization step you described,

  • and then converting it to LLO, HLO, all the steps?

  • I just want to dive deeper [INAUDIBLE].

  • JACQUES PIENAAR: Sure.

  • AUDIENCE: Do you have this kind of end-to-end example,

  • like not just when I just run a command line and run convert,

  • and then it's converted.

  • I want to see intermediate data too.

  • JACQUES PIENAAR: Yes.

  • And so you can actually do this by--

  • and I don't have an example offhand to show you,

  • but in our testing directory, you'll

  • see a couple of tests that do some of these manually.

  • But effectively, what you can do is

  • you can link together multiple phases of mlir-translate piping

  • into an mlir-opt, piping into an mlir-opt,

  • piping into an mlir-translate, to see all these phases.

  • I mean, actually, you can specify

  • mlir-opt with multiple different passes, one after the other,

  • and then a translate at the end.

  • And if you want to see all the intermediate data,

  • just slap in a tee in between.

  • I actually do have an example from a TF Lite [INAUDIBLE]

  • phone, but I do not have that slide handy.

  • Because I think that's actually one of the things that

  • is quite nice-- you can see every step

  • as you are progressively changing things and running

  • different passes.

  • AUDIENCE: And is it covering mobile devices, like a phone?

  • JACQUES PIENAAR: Well, I mean, our first use case is TF Lite.

  • AUDIENCE: Can I run it on mobile phone?

  • JACQUES PIENAAR: Oh, you mean running it on?

  • AUDIENCE: Yeah.

  • Sounds good.

  • Yep.

  • Yes.

  • JACQUES PIENAAR: I mean, I want to run it everywhere.

  • AUDIENCE: From TensorFlow, right?

  • OK.

  • Cool.

  • Cool.

  • Sounds good.

  • JACQUES PIENAAR: Thank you.

  • [MUSIC PLAYING]
