Inside TensorFlow: MLIR for TF developers

JACQUES PIENAAR: OK.
Good afternoon, everybody.
Today, I'll be presenting MLIR, the multi-level intermediate
representation compiler infrastructure,
and how it is used in TensorFlow.
And so just an overview.
This will be a little bit different than some
of the other TF training sessions, as most of them
focus on how to better use TF technology,
while here we're looking at something that is still coming.
So this is mostly forward-looking as to things
that we want to do and where we're going.
So as an overview for today, I'm going
to start by giving an introduction.
What is MLIR?
Sort of like a 50,000-foot view of what it is,
how it all fits together.
Then look at why are we developing MLIR?
So the idea there is to show the past, where we came from,
how we got to this point, to point to what
we want to do in the future.
Then we will look at two applications.
One is the new TensorFlow to TensorFlow Lite converter.
It's still in pre-alpha, but definitely
try it and give some feedback.
And then we'll also look at the TF2XLA bridge,
looking at some of the forthcoming changes.
And I'll walk through a small example there.
OK, so let's start with the main question.
What is MLIR?
And so some people have said, it's
a little of a Rorschach test.
You can say you can do anything with MLIR,
and everybody has a different opinion about what it is.
So if we start from a high level,
TensorFlow's goal is it's "an open source machine learning
framework for everyone."
Now, in the same way, MLIR, which stands for multi-level
intermediate representation--
with the ML not representing "machine learning,"
for a change--
we're looking at "an open source program optimization
framework for everyone."
Now, the way I think about MLIR
is as an abstraction building toolkit--
and I'll show some examples of what I mean by that--
as well as a reusable set of compiler passes
for these higher abstractions.
So particularly with MLIR, we are
targeting analysis, program optimization,
and code generation.
And so with that very high-level view,
I'm going to start with why MLIR,
and then go to exactly the components of what it is.
So when we started MLIR initially, we had a question.
Well, we looked at the ML accelerators and we saw, well,
there are many, many, many accelerators coming forth.
And we had the question-- how do we
support all these accelerators for all the given [INAUDIBLE]?
And I mean, of course, TensorFlow
should provide the best accelerator
performance for our users.
And so [INAUDIBLE], can we make that easier?
So MLIR started as an exploration
of a different approach to doing code generators
for accelerators.
So we started looking at our existing code generator
framework for accelerators, XLA.
Now, XLA is one of the most advanced machine learning
compilers.
We have targets for CPU, GPU, TPU, and other back ends.
How can we increase the reuse between the CPU, GPU, and TPU
backends?
Well, at the moment, TPU backend doesn't use LLVM,
so that means we have more low level components that
are needed there, because we're not
able to reuse some of the same passes or structures.
The TPU backend is specialized for the best TPU performance.
But because it's so specialized, there's
less reuse with the CPU and GPU components.
But also looking at these different backends,
we notice different abstractions.
The CPU, and GPU, and TPU do not share abstractions
beyond the HLO level.
So if you look at the different backends,
you'll have different levels of support for different loop
abstractions, as well as stencil emitters between the two.
This makes reusing code between the two more difficult.
Furthermore, we lack
actual abstractions between HLO and, for example, TPU LLO
or LLVM, which results in big gaps.
So you effectively have passes that are
one-shot compilers doing [INAUDIBLE]
from very coarse grain ops to multiple lower level ops.
It's like, well, OK, but if we want
to support so many different TensorFlow
ops on all these devices, we must leverage as much shared
infrastructure as possible.
So we should find a way to try and unify
these different backends and these abstractions
to be able to reuse more of the passes, more of the generators.
So then we come to our first question.
It's like, well, can we add a new abstraction
to allow greater reuse?
Are there some new abstractions that we can add?
I mean, we thought yes.
But I mean, you'll be saying, like, maybe.
And we haven't added them yet, but we're looking at it.
One of our goals was to address some usability issues
with the current stack.
One of them was custom ops for accelerators.
The other one was dynamic shapes.
But now assuming we had done this,
what happens with the rest of the stack?
Because I mean, the goal is still
this is an end-to-end TensorFlow user experience
that we are considering.
And so we roughly split out the stack into multiple stages.
You have TensorFlow, and we're doing optimizations
on the TensorFlow Graph.
And this is specifically for targeting TPU.
We have the TF2XLA bridge that bridges between TensorFlow
ops to HLO.
Then on HLO, we have different passes,
device independent, device dependent.
And then finally, we have in the backends the emission from HLO
to, in this case, TPU LLO or, in the case of CPU, GPU, LLVM.
And so what we have at the moment is
TensorFlow can see the whole program,
but it has no insight on the device.
So TensorFlow does its optimizations,
and then it has these parts where
this will run on the device, this will run on XLA.
XLA, on the other hand, has deep insight into devices.
It knows all the different backends,
and it knows how to optimize it.
But it can't change the TensorFlow Graph.
So XLA assumes as fixed all its inputs and the graph structure
given to it by TensorFlow.
This results in optimization barriers
between the different passes.
So a backend can't dictate to the bridge
which HLO to produce.
The HLO produced is constrained by what is in TensorFlow.
And so this leads to things such as the double transpose
trick, which allows people to force certain layouts.
But such operations actually constrain the coupling
between the different layers.
Now the high level layer and the lower level layer
have an implicit coupling, and a fixed set of assumptions
is hard-coded.
Beyond that, the TF2XLA bridge is
bridging two different systems with a large impedance
mismatch.
On the TensorFlow side, we have the op ecosystem,
we have dynamic sizes, we have different types,
we have stateful ops.
On the XLA side, we have HLOs.
HLOs are mostly side effect free beyond a few ops.
We have different types, but it's a very different system
that the bridge transitions between in one pass.
So at the moment, what we have is
that this stack does not abstract out
the various functions.
We have the top level technologies tied to the lower
level hardware implementations.
So this results in a large gap of abstractions
between these layers, which makes the passes more
difficult to write.
Now, this is not something unique to machine learning
or to TensorFlow, XLA.
This similar kind of gap also led to domain-specific IRs
elsewhere, and in this particular case,
in the compiler domain.
If you look at the different inputs from languages such
as Java, C++, Swift, Rust, and Julia,
almost all of these languages that target LLVM have
introduced a new mid-level IR on which they can do their higher
level optimizations.
And actually, C++ does not do this,
so Clang doesn't have this.
And I have tried to omit the slide before,
but then Chris stopped me.
He's very upset about the fact that there is no [INAUDIBLE].
And so this is a very dear point that we
are missing a lot of optimizations and reuse
by not having the [INAUDIBLE].
So what this means is we can do the domain-specific
optimizations.
We can do progressive lowering, so we can actually
have simpler hops between these different systems.
And if we look at the TensorFlow one,
we can think of this as-- if you look at the CPU/GPU path,
like the TF Graph to HLO, as the HLO being
this intermediate representation, mid level
between TF Graph and LLVM.
But domain-specific IRs are great.
They allow high-level domain-specific optimizations.
This progressive lowering actually
encourages reuse between the different levels,
because you can have smaller passes doing dedicated things.
It's great.
It's great for location tracking.
And this enables some flow-sensitive type checking,
so you can operate on higher level semantically
meaningful parts and produce verification at that level.
The part that's not so great--
it's a huge expense to build this infrastructure.
You're doing a reimplementation of all the same stuff-- pass
managers, location tracking, use-def chains, inlining,
all of these things.
And more importantly, innovations in one community
don't benefit the other communities.
So there's a downside to having these domain-specific IRs.
And if we look at the TensorFlow compiler ecosystem,
the previous graph is very much simplified.
Because the real situation is, from TensorFlow Graph,
we have multiple different backends
and multiple different IRs being generated from it.
From a graph, we generate HLOs.
TensorRT has an output.
There's nGraph, Core ML, TensorFlow Lite--
so many different Graph IRs, each with different challenges.
So in a lot of these bridges, we have similar but different
technologies.
And this is not going away anytime soon.
But this results in a fragile, poor user
experience when failures happen.
The location tracking between these different phases
is variable.
So in some cases, there's no location propagated through.
So if an error occurs, the only thing you know about
is what happens at the lower level,
without being able to trace it back.
Beyond this, this also leads to a duplication of infrastructure
at all levels.
We're reimplementing the same infrastructure multiple times.
This is true, even in TensorFlow today.
We have the same optimizations at both the TensorFlow and HLO
level.
And so in some cases, you have the optimization
pass that you actually redo multiple times.
And then in some other cases, you
have future passes actually ignore
the results of previous ones.
So as an example, we do layout analysis and assignment
on TensorFlow Graphs using Grappler.
But when we get to HLO, these are mostly
ignored, sometimes for good reasons
because XLA actually knows better about the devices.
And sometimes, it's unfortunate because that actually
would have been a good decision, given the higher level graph
structure.
But beyond that, we need to actually duplicate these
passes, because we cannot represent these same operations
in one uniform way.
And an unfortunate side effect of this
is we actually end up transforming multiple times
back and forth between equivalent representations.
So for example, in one unit test case,
we converted from Graph to GraphDef seven times,
from GraphDef to Graph four times, and then once to HLO.
And I mean, these are, in a way, not useful transformations.
But our problem is we are unable to represent
all of these different ops and structures together,
so we need to duplicate this.
So with that, the goal of MLIR is to enable global improvement
to TensorFlow infrastructure.
It's an SSA-based design to generalize and improve
ML graphs.
We want to add better side effect modeling, control flow
representation, improved generality
of the lowering passes--
which I'll explain at the time we're on the applications--
focus on dramatically increasing the code
reuse between all these distinct paths, fix location
tracking and other pervasive issues
for better user experience.
So when a failure occurs, the traceability
and the debug-ability is greatly improved.
And at the moment, we believe there are no reasonable existing
answers.
And we refuse to copy and paste another SSA-based optimizer
six more times.
And so that led us to MLIR.
A couple of the other goals are, well, we
want to embrace TensorFlow.
TensorFlow is one of our main targets.
We don't want to work around it.
We want to support the full generality of TF graphs.
We want to allow the aggressive reuse of infra
across multiple hardware paths.
And similar to TensorFlow, we want
to allow open customization.
We want to have a target be able to say,
implement my JPEG decoder using this block.
We want a user to be able to say, hey,
I have a custom kernel.
For this model I'm running, I want to see its performance,
so use my kernel.
Beyond that, we want to enable folks to experiment
with their own lower level implementations of operations.
So if the compiler is not there yet,
or the researcher has a better idea,
or we have an ML algorithm generating code,
we want to be able to plug that into the same system
and see the effect on an end-to-end behavior.
But we also want to embrace the limitations
of particular backends.
So for example, if your backend only supports convolution,
we want to provide convolution.
If you don't support control flow,
well, we don't give you control flow.
If you have static shapes, we can only
give you graphs with static shapes, et cetera.
This includes what floating point precision
operations you support, for quantization, things like that.
We want to avoid those big semantic gaps in lowering.
We do not want to have these big gaps where
one step to the next in the transformation path
is bridging completely separate systems, which are
difficult to debug and verify.
And then very importantly, we want to improve the traceability
and testability for users.
So if your model fails compilation,
we want to be able to point back to where it failed.
And so with this, what should MLIR provide?
Well, it should represent multiple levels of abstraction.
We should allow this progressive lowering--
so with any given function, having
a progressive set of lowerings that
gets you to the destination.
It should not be these big jumps on two separate data
structures.
We want to be able to lower through multiple different
abstractions.
This also means that the passes need
to be designed to operate on these different levels
and properties, rather than looking at fixed ops.
And I mean, I think this is especially
essential for TensorFlow, which has an open ecosystem with ops
being added very, very regularly at a good pace.
We also should make it easy to add
abstractions or domain-specific IR constructs.
An example here is, we have the affine dialect.
In our affine dialect, we have affine loops.
This isn't a hard coded construct in MLIR.
An affine loop, which can be used for some polyhedral code
optimization, is something that is extended.
The dialect itself has it.
Beyond that, we're looking at location
as a first class construct.
Locations are intrinsically tied to operations
and optimizations.
You cannot create an op without a location.
If you're doing transformation and you're
replacing with a new op, you have
to specify where this op comes from.
So this means we have a path-- hm?
AUDIENCE: What do you mean by the location?
JACQUES PIENAAR: So location in this case
could be file location, could be name, could be stack trace.
So we have a couple of different locations,
and we actually also have an opaque location
that is just interpretable by a dialect, for example--
so if your backend has a different concept.
The most common ones for TensorFlow
are a name location corresponding to the name in the GraphDef
or the Python call stack--
so the set of calls that got you to creating this op.
Another problem we also want to work on
is, we want to have this framework enable us to complete
a patchwork of tools.
So at the moment, a couple of users
have run into a problem where we have broken paths.
You have a tool A that will get you from representation x to y.
And you have a tool B that gets you from y prime to z.
But if you actually want to get from x to z,
you have to do something else, or you
have to restrict the model.
We want to try and get a path to complete this patchwork
and enable end-to-end workflows of interest.
And so this is sort of like, where are we applying MLIR?
So mostly, I have been talking about the infra.
The first application is the TensorFlow
to TensorFlow Lite Converter.
This is something which, like I said, is pre-alpha.
I'll discuss it next.
And it's working for a couple of models.
We have a couple of new features coming
in there that enable some new TensorFlow Lite features.
The other target we're working on
is the TensorFlow to XLA bridge.
You know, looking at the current lowering to XLA,
as well as accelerators.
We're working with the Grappler team
on graph optimizations, shape inference, device partitioning,
and placement.
And then we also have the TPU and GPU codegen projects
going on to evaluate new approaches to generate code
for these different devices.
So I think of MLIR as three different parts.
One, you have the graph compiler,
which is like the op expansions, the lowering to TF Lite,
auto outside compilation, parallelism/sharding,
that sort of thing.
Two, the code generator,
which focuses on high level abstractions for code
generation.
And so there are polyhedral loop nests,
tiled tensor views, things of this nature.
And underlying all of this is the MLIR infrastructure.
So this is framework-independent IR.
You have rewrite generators, like automatic pattern
matchers, the mechanisms to define dialects
consisting of types and ops.
And that ties all of this together.
And that leads me to one of our first applications,
which is the TensorFlow Lite converter.
So the basic flow for the TensorFlow Lite converter
is that a TensorFlow Graph is the input.
And here, I'm using Graph--
misusing it slightly, because it could be a graph,
it could be from a Python script [INAUDIBLE] SavedModel.
But the goal is to translate it from this representation
to the MLIR module, consisting of ops in the TF and TF
executor dialect, which I'll come to in a second,
to legalize from TF to TensorFlow Lite--
and now legalize is just a different way of saying
convert--
convert all the ops in TensorFlow
to TensorFlow Lite ops that are legal.
And so the reason we use legalize here
is we don't necessarily need to convert all the ops.
For example, TensorFlow Lite supports flex ops.
It supports custom ops.
So some ops may actually remain in the TensorFlow
dialect.
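[Editor's sketch] The idea of legalization described here, converting what the target supports and letting explicitly permitted ops stay in the source dialect, can be modelled in a few lines of Python. This is only a toy under stated assumptions: the op names, the `TF_TO_TFL` table, and the `legalize` helper are hypothetical and are not the converter's real tables or API.

```python
# Toy model of legalization: convert supported ops, keep permitted leftovers.
# The mapping below is illustrative only, not the real converter's op table.
TF_TO_TFL = {
    "tf.Relu6": "tfl.relu6",
    "tf.Add": "tfl.add",
}


def legalize(ops, legal_leftovers):
    """Return (new_ops, illegal) for a list of source-dialect op names.

    legal_leftovers is the set of source ops allowed to remain unconverted,
    for example because the runtime can execute them as flex or custom ops.
    """
    new_ops, illegal = [], []
    for op in ops:
        if op in TF_TO_TFL:
            new_ops.append(TF_TO_TFL[op])  # converted to the target dialect
        elif op in legal_leftovers:
            new_ops.append(op)  # legal as-is, stays in the source dialect
        else:
            new_ops.append(op)  # left in place, but reported as illegal
            illegal.append(op)
    return new_ops, illegal


converted, illegal = legalize(
    ["tf.Relu6", "tf.MyCustomOp", "tf.Add"], legal_leftovers={"tf.MyCustomOp"}
)
print(converted)
print(illegal)
```

With the custom op declared legal, conversion succeeds even though one op stays in the source dialect; without that declaration the same op would be reported back to the user instead.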
Then beyond that, we have some optimizations,
and then translating it back out to TensorFlow Lite flatbuffer.
So the converter changes between the two different graph
representations.
We have two different runtimes--
TensorFlow and TensorFlow Lite.
They have different constraints and targets.
But the graphs we want users to run, that they trained and run,
we want to be the same.
And so we have this converter workflow.
The converter actually has overlapping goals
with regular compilation, because, I mean,
Edge devices can also have accelerators.
And beyond that, I think of TensorFlow Lite in a way,
with this converter, as just a weird ISA.
So you have a weird instruction set that you're targeting,
but it's still a compiler problem.
Now, MLIR's pluggable type and rewrite system
simplifies specifying these transforms
and expressing what you need to do
to convert between these ops.
And as an example here, the quantized type
is a first class citizen in the TensorFlow Lite dialect.
So we have the quantized op representation.
So it's not just a uint8, it's actually, you could say,
a uniform quantized type with these parameters, which
allows for some additional checking and verification.
So as I mentioned earlier, one of the focuses is usability.
Now, usability is one of TOCO's top complaints
among TFLite users--
CHECK'ing on errors, unsupported cases,
confusing error messages--
and this is one of the things we want to improve.
One of the ways we want to improve it
is in making debugging easier.
We want the locations to point back to your source.
So when an error is emitted, we want
to point back to the TensorFlow Python that caused the error.
And for this, we are building on and extending the TF debug info
work currently ongoing, and we want
to track the location origin of instructions as well.
So for example, if you have a fused multiply-add,
then it has a fused location corresponding
to both the multiply and the add.
So it shows you this new op was created
from these previous ops, and that allows you to track back
to the original code.
After that, we also actually want
to be able to say why a model failed to convert.
So we want to point to the unsupported ops and types.
We also want to say how those types came to be.
Oftentimes, the user gets a failure,
and they have no idea why this op isn't supported.
Because they, in some cases, didn't even
mean to have this op there.
And so we want to be able to make it easy to find out
why it got to a model failing.
And as I mentioned, we have dialect types
to enable more checking and better reporting.
So now, we can have things that are saying,
oh, you have an add with two nonconforming quantized types.
I'm sorry.
This add won't work and will fail at runtime.
We can do the checking at compile time.
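[Editor's sketch] To illustrate what checking quantized types at compile time might look like, here is a small Python model. It assumes, purely for illustration, that "conforming" means the two operand types have identical parameters; the talk does not give the real dialect's rules, and `UniformQuantizedType`, `VerificationError`, and `verify_add` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniformQuantizedType:
    """Hypothetical stand-in for a uniform quantized type with parameters."""
    storage_bits: int
    scale: float
    zero_point: int


class VerificationError(Exception):
    """Raised when an op fails verification before anything is executed."""


def verify_add(lhs, rhs):
    # Assumption for this sketch: operands conform only if the types are equal.
    if lhs != rhs:
        raise VerificationError(
            f"add operands have non-conforming quantized types: {lhs} vs {rhs}"
        )
    return lhs  # result type of the add


q_a = UniformQuantizedType(storage_bits=8, scale=0.5, zero_point=128)
q_b = UniformQuantizedType(storage_bits=8, scale=0.25, zero_point=128)

print(verify_add(q_a, q_a))
try:
    verify_add(q_a, q_b)
except VerificationError as err:
    print("rejected at compile time:", err)
```

Because the parameters live in the type, the mismatch is caught by inspecting the IR alone, before any kernel runs.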
And so to give an example, if you
look at the old TOCO experience for having
an unexpected value for an attribute,
you get a CHECK failure.
And the CHECK failure will point you to a stack trace somewhere.
And we want to go from that to where we are today,
where we specify like, hey, this node failed to convert,
because this TF op actually requires
the data_format attribute to be either NHWC or NCHW.
And this op was inserted by this following call
from your libraries and from your user code
and libraries in between.
And this allows the user to go find where the error occurred.
And I'll mention this is also evolving--
if you try it today, you'll actually
see carets pointing to the errors,
as you would see with Clang compilation errors--
so source code interleaved, as long as it's
in the same working space.
And so the idea is to make the user experience much easier
for debugging these errors.
The next application is the new TensorFlow compiler bridge.
So at the moment, the TF2XLA bridge
is an interop between TensorFlow and XLA.
It consists of rewrite passes, as well as transformations
to XLA.
Now, XLA also targets everything from multi-node machines
to Edge devices.
So it's actually not as distinct from TensorFlow Lite.
And so the paths we use to lower
to these two different backends should not be disjoint.
We should be able to reuse them.
As new features become available in one,
everyone should be able to take advantage of that.
And I'll actually mention that I want
to get rid of the word "bridge" here, because one of our goals
is not to have this big span between different abstractions.
We want to have this more progressive,
with shorter, lowering steps that are individually testable.
And beyond that, we don't want it
to be just XLA-specific, looking at custom compiler backends.
We want to make it easy for other compiler frameworks
to integrate into TensorFlow in a unified way.
So other compilers can have their own dialect,
for example.
Dialects can integrate in the same framework.
You define your operation and types,
and you can have the pattern to rewrite
specifying the transformations you
need to target your backend.
And it also means that custom pipelines can
reuse the existing components.
We want to make reusing the pipeline as easy as possible.
So MLIR, one of the goals is to be a reusable set of compiler
passes.
And so one of them is to translate
from TensorFlow or the XLA dialect to your dialect.
And so folks then have the option of saying,
what ops are legal in the dialect?
And then of course, throughout all this,
we want to be able to optimize the graph
and make it easier for the backends
to focus on the parts that are important to them.
I'll give a little example of the current
versus the new approach, looking at the TF2XLA bridge,
for converting a TensorFlow op to an XLA op.
And so taking Relu6 as an example--
so we have Relu6.
You define it using a class.
You register it as an XLA op.
This op is an XLA op kernel, which actually derives
from the TensorFlow op kernel.
Inside of it, it uses the XlaOpKernel context,
and SetOutputs, and SetInputs,
as is very familiar to folks adding new TensorFlow ops.
One of the differences here is that this is actually
constructing an XLA expression.
This XLA expression is something that
is built in a side data structure captured by the context.
And the outputs set here are actually--
the values flowing through the graph
are tensors containing pointers to this XLA expression.
So what you have here is actually
a pointer being fed into the output of this op
and flowing through the TensorFlow graph.
But you can represent it, and it's very familiar,
I think, to folks how to do this.
And I think that's one of the problems.
Because it's so complicated, that means
if something goes wrong, it's difficult to find out
what went wrong.
And so the way we're testing this at the moment
is by writing Python tests.
And for example, [INAUDIBLE] test Relu6,
we have a UnaryOpTest derived from the XLATest classes.
And so we have an end-to-end test case
starting from TensorFlow and ending on execution
on different devices.
So per floating point type, per device,
we execute Relu6 both compiled and using TensorFlow.
So we're testing runtime values for different types
on different devices and checking approximate equality.
But very important here, it's actually you
have to construct this test to avoid all
the optimizers along the way.
So one of my favorite examples was,
I was once debugging why our [INAUDIBLE] take so long.
And looking at the longest running test,
I actually found out it was a no-op, because it was being
constant propagated away.
And we were just testing constant folding over,
and over, and over again, instead
of actually running anything.
So it can be very difficult to ensure
that the test you think you are writing
is actually the one you're writing.
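[Editor's sketch] The pitfall described here, an end-to-end test that only exercises constant folding, can be reproduced with a toy model. Everything below is hypothetical: `relu6_kernel`, `constant_fold`, and `run` are stand-ins for a device kernel, an optimizer pass, and an executor, not TensorFlow APIs.

```python
kernel_runs = 0  # counts how often the "device kernel" actually executes


def relu6_kernel(x):
    """Stand-in for the kernel the test is supposed to exercise."""
    global kernel_runs
    kernel_runs += 1
    return min(max(x, 0.0), 6.0)


def constant_fold(graph):
    """Fold relu6 ops whose input is a constant.

    graph is a list of (op, (kind, payload)) where kind is "const" or "arg".
    """
    folded = []
    for op, (kind, payload) in graph:
        if op == "relu6" and kind == "const":
            # Computed by the optimizer at compile time; the kernel never runs.
            folded.append(("const", ("const", min(max(payload, 0.0), 6.0))))
        else:
            folded.append((op, (kind, payload)))
    return folded


def run(graph, args):
    """Execute the graph, calling the kernel only for remaining relu6 ops."""
    results = []
    for op, (kind, payload) in graph:
        value = payload if kind == "const" else args[payload]
        results.append(relu6_kernel(value) if op == "relu6" else value)
    return results


# Test written with a constant input: correct answer, kernel never executed.
out_const = run(constant_fold([("relu6", ("const", 7.5))]), {})
runs_after_const_test = kernel_runs

# Same test fed through a runtime argument: the kernel really runs.
out_arg = run(constant_fold([("relu6", ("arg", "x"))]), {"x": 7.5})
runs_after_arg_test = kernel_runs

print(out_const, runs_after_const_test)
print(out_arg, runs_after_arg_test)
```

Both variants return the right value, which is exactly why the problem is easy to miss: only the execution count reveals that the first test never reached the kernel.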
And another point is, this is testing the nn_ops.relu6.
In this case, it actually corresponds to a TensorFlow op.
But when I talk about TensorFlow ops later,
I'm referring to the ops as registered by op registration--
so the C++ ops, versus the Python constructs.
Anyway, the current approach with TF2XLA
is it's a symbolic execution of the TF
graph using the executor.
We're storing pointers to a side data structure in tensors
flowing through the graph.
We capture the XLA type in tensors in different data
structures, depending on the TensorFlow
type flowing through the graph.
We're mostly using end-to-end tests,
using Python, for constructing it, as this is the easiest way.
And it allows for very complicated tests.
But we need to thread these test
cases past O(n) optimizers to actually ensure they work.
Now, the new approach we want to have here
is, we want to make it so that you can write the directed unit
tests simply.
In MLIR, the source of truth for the operations is the IR.
The IR is round trippable through textual form, which
means you can take an IR, run optimization pass, and dump it.
And that's exactly what you would have gotten
in the in-memory changes.
Beyond that, we want to ensure that there's
no optimization run beyond what a developer specified.
We want the types to be representable in the IR itself,
and in one way.
There should not be a confusion as to what type is represented,
where, and what it is.
And also, we want to enable having
these multiple abstractions to lower,
without having large jumps between the different systems.
And so just to plug at the start--
and this slide's actually out of date--
but so we have mlir-opt and the equivalent of that,
tf-opt, which are optimization tools similar
to LLVM's opt tool.
It's a tool for testing compiler passes.
Effectively, what you have is you have IR as input.
You run mlir-opt, specifying an exact pass,
and you get IR out again.
So textual in, and textual out.
This allows for pretty cheap verification of the output,
because you're verifying the transformation.
You're verifying the structure of the transformation.
So you do not need to actually run it on different devices,
if you just want to verify the transformation.
You do not need to compute values,
you're verifying the structure.
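[Editor's sketch] A textual-in, textual-out test can be as simple as running a pass over IR text and checking that expected fragments appear in order, in the spirit of the FileCheck-style tests used with LLVM's opt. The Python below is a minimal model of that workflow. The IR syntax, the `toy.*` op names, and the `rename_pass` helper are made up for illustration and are not real MLIR.

```python
def check_in_order(text, expected):
    """Return True if every string in expected occurs in text, in that order."""
    pos = 0
    for needle in expected:
        found = text.find(needle, pos)
        if found < 0:
            return False
        pos = found + len(needle)
    return True


def rename_pass(ir_text):
    """Stand-in for a compiler pass: rewrites one op into another, textually."""
    return ir_text.replace("toy.double", "toy.add_self")


# Made-up textual IR, only loosely shaped like an SSA-based form.
INPUT_IR = """func @f(%x) {
  %0 = toy.double %x
  return %0
}"""

output_ir = rename_pass(INPUT_IR)
print(output_ir)

# The test verifies structure only: no device, no runtime values.
structure_ok = check_in_order(
    output_ir, ["func @f", "toy.add_self %x", "return %0"]
)
print(structure_ok)
```

The check says nothing about numerical results; it only pins down what the transformation produced, which is the separation of concerns being described.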
And so in this case, if we look at the TF Relu6 example,
you can create a test case using mlir-translate, which
is another tool that goes from a TF graph into MLIR.
Or you can manually write it, if you enjoy writing SSA-based IRs
in textual form.
So here is an example of a function that takes this input
tensor of a fixed shape.
This is actually corresponding to a previous example.
And I'm not going to get into details,
but you can see the TF dialect RFC
for more information about the two different dialects.
But here, we have the TF executor dialect
consisting of the graph and an island within it.
Within the island, you have the ops in the TF dialect.
And so in this case, you have the Relu6 operation
with the normal attributes of T, device, and name, taking
in a tensor of 1 times 3 times f32
and producing the same one again, which gets yielded.
Now from this, we can actually convert to the TF dialect,
simply because in this case we have a single island
due to no control dependencies and no side-effect ops.
And so this is the pure TF dialect representation
of this where we don't actually need an island for this.
And you'll see a single Relu6 operation.
But you'll actually see duplicate information
stored here, because now we have the explicit types.
And so with the explicit types, we can actually
get rid of all these different attributes for T,
and then the mapping from T to the result type, all of this,
because we have the types.
So these are derived attributes from the op itself,
with the type being the source of truth for them.
And so in the import and export between
GraphDefs and NodeDefs, we can use this information;
we derive it from the ops.
And so you can have this simpler form
to just specify the op and the result type.
And then from here--
oh, and one thing you might have noticed in the previous slide,
it actually highlighted the names, as well as
the attributes.
And that's because all ops have locations.
And from TensorFlow, one of the most common ones
is the name is used as location.
If you have the debug info, you can also
use the call stack as that information is provided.
And so locations for ops are optionally printed.
And so in this case, if we actually
printed the location as well, you can see the original name
as from the import as well.
Now names are one location, file line is another,
and then call stack's another.
And then for example, if you look at the fused ones--
and this is from all the other examples-- you
can see the location of this op was actually
due to a fusing of a Relu, BiasAdd, convolution,
and this constant b.
And now you have this single op.
That is its location.
You can trace back how you got to that op.
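[Editor's sketch] The location kinds mentioned in the talk (name, file and line, fused) and the rule that an op cannot exist without a location can be modelled with a few Python dataclasses. This is a toy under stated assumptions: the class names, the `fuse` and `origins` helpers, and the example op names and paths are hypothetical, not MLIR's actual classes.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NameLoc:
    name: str  # e.g. the node name from a GraphDef


@dataclass(frozen=True)
class FileLineLoc:
    filename: str
    line: int


@dataclass(frozen=True)
class FusedLoc:
    parts: Tuple["Location", ...]  # locations of the ops that were fused


Location = Union[NameLoc, FileLineLoc, FusedLoc]


@dataclass(frozen=True)
class Op:
    name: str
    loc: Location  # no default value: an op cannot be built without one


def fuse(new_name, ops):
    """Create a replacement op whose location records where it came from."""
    return Op(new_name, FusedLoc(tuple(op.loc for op in ops)))


def origins(loc):
    """Flatten a possibly nested fused location into its original locations."""
    if isinstance(loc, FusedLoc):
        result = []
        for part in loc.parts:
            result.extend(origins(part))
        return result
    return [loc]


relu = Op("Relu", NameLoc("model/relu"))
bias = Op("BiasAdd", NameLoc("model/bias_add"))
conv = Op("Conv2D", FileLineLoc("model.py", 42))
fused = fuse("FusedConv", [relu, bias, conv])
print(origins(fused.loc))
```

Because the transformation has to supply a location when it creates the fused op, an error reported on that op can still be traced back to every op it replaced.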
And now if you want to lower from TF ops to XLA ops,
well, we also have--
sorry-- previously, as shown, we have the TF dialect
to match the TF ops.
So these are ops defined via REGISTER_OP.
And so this differentiates from the Python ops.
And similarly, we also have an XLA dialect with ops
to match HLOs.
Converting between these two, you
can use a general legalization framework.
And this framework is a general graph rewriting structure.
And for example, we have patterns,
so you specify patterns to go from a source
DAG to a destination DAG.
So in this case, convert from a TF Relu6
op that is a tensor of one of those types.
If that's true, then convert it to an XLA ClampOp
clamping the input between 0 and 6.
So the XLA ClampOp has as its first argument min, then the operand,
then max.
These are declarative rules.
They're effectively a statement of equality
under conditions that we can use to optimize for a given target
dialect.
And these rules actually also allow for dynamic constraints.
So you can say, only fire this rule if the following is true.
And that allows you to specialize the rules
for certain cost models, certain backends, all
these kinds of things.
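A rule like the Relu6-to-Clamp one, including a condition that decides whether it fires, could be modeled roughly as follows. This is a toy Python sketch and not the declarative format used in MLIR itself: the `Op` and `Pattern` classes, the op names `tf.Relu6`, `xla.Clamp`, and `xla.Constant`, and the set of allowed element types are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Op:
    """Toy op: a name, operands (ops or argument names), and an element type."""
    name: str
    operands: list = field(default_factory=list)
    elem_type: str = "f32"
    value: Optional[float] = None  # only meaningful for constants


@dataclass
class Pattern:
    """Rewrite rule: a source op name, a constraint, and a result builder."""
    source: str
    constraint: Callable[[Op], bool]
    build: Callable[[Op], Op]


def relu6_to_clamp(op: Op) -> Op:
    # Clamp takes min first, then the operand, then max.
    zero = Op("xla.Constant", elem_type=op.elem_type, value=0.0)
    six = Op("xla.Constant", elem_type=op.elem_type, value=6.0)
    return Op("xla.Clamp", [zero, op.operands[0], six], op.elem_type)


RELU6_PATTERN = Pattern(
    source="tf.Relu6",
    # Only fire for element types this hypothetical target supports.
    constraint=lambda op: op.elem_type in {"f16", "f32", "i32"},
    build=relu6_to_clamp,
)


def legalize(op, patterns: List[Pattern]):
    """Rewrite operands first, then apply the first pattern that matches."""
    if not isinstance(op, Op):
        return op  # a plain argument name such as "%arg0"
    op.operands = [legalize(o, patterns) for o in op.operands]
    for pattern in patterns:
        if op.name == pattern.source and pattern.constraint(op):
            return pattern.build(op)
    return op


result = legalize(Op("tf.Relu6", ["%arg0"], "f32"), [RELU6_PATTERN])
print(result.name, [o if isinstance(o, str) else o.value for o in result.operands])
```

Keeping the constraint separate from the builder is what lets the same rule be switched on or off per backend without touching the rewrite itself.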
But with that rule added into the system,
we can run tf-opt again.
We pass -xla-legalize.
And now from our previous input, we now get this as output.
So now you have a very directed change
from the TF Relu6 to the XLA Clamp,
with two extra constant ops added in the function.
And of course, backends can define different lowerings.
But this just means this transformation to clamp now
can be simply verified textually.
No execution is needed.
No explicit values need to be verified.
The verification of the correct execution of the different ops
can now be done independently of the verification
of the transformation.
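Checking a transformation textually can be as simple as looking for expected fragments, in order, in the printed output. The sketch below is similar in spirit to LLVM's FileCheck tool but is not that tool; the helper `check_in_order` and the printed IR in `after` (including its op names and syntax) are assumptions for the example.

```python
from typing import List


def check_in_order(output: str, expected: List[str]) -> bool:
    """Return True if every expected fragment occurs in output, in order."""
    position = 0
    for fragment in expected:
        found = output.find(fragment, position)
        if found < 0:
            return False
        position = found + len(fragment)
    return True


# Hypothetical printed IR after legalization; the exact syntax is an assumption.
after = """
func @relu6(%arg0: tensor<4xf32>) -> tensor<4xf32> {
  %0 = "xla.Constant"() {value = 0.0} : () -> tensor<f32>
  %1 = "xla.Constant"() {value = 6.0} : () -> tensor<f32>
  %2 = "xla.Clamp"(%0, %arg0, %1) : (tensor<f32>, tensor<4xf32>, tensor<f32>) -> tensor<4xf32>
  return %2 : tensor<4xf32>
}
"""

expected = ['"xla.Constant"', '"xla.Constant"', '"xla.Clamp"(%0, %arg0, %1)']
print(check_in_order(after, expected))
```

Nothing here executes the ops: the test only states what the rewritten program should look like, which is why numerical correctness of the ops themselves can be tested separately.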
AUDIENCE: But like, if you wanted to change the way
this pattern's implemented to go from a single clamp to two
clamps, like one for 0 and one for 6, for example,
that would break all your unit tests,
but it would still be correct.
JACQUES PIENAAR: Correct.
Yes.
Yes.
So I mean it's--
AUDIENCE: You're not worried about that?
JACQUES PIENAAR: It's a question of--
I'm not.
No, because I mean, what I'm verifying at the moment
is the transformation.
And you're saying, well, I can get
the same equivalent numerical result
by multiple different transformations.
AUDIENCE: And if someone one day changes one, a lot of tests
are going to break.
JACQUES PIENAAR: But you should not
be verifying this transformation.
Depending on where you verify this, yes.
AUDIENCE: OK.
Yeah.
JACQUES PIENAAR: OK.
And so what I did not show here was also how we autogenerate
C++ classes for operations, with helper functions.
So in this case, for example, the XLA clamp op in C++
actually has accessors, such as min, operand, and max.
So you don't have to specify get operand 3.
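The benefit of generated accessors can be illustrated with a small Python analogue. The real generated classes are C++; the `Operation` and `ClampOp` classes below, and the op name `xla.Clamp`, are hypothetical stand-ins for the example.

```python
class Operation:
    """Generic op: operands are only reachable by position."""

    def __init__(self, name, operands):
        self.name = name
        self.operands = list(operands)

    def get_operand(self, index):
        return self.operands[index]


class ClampOp(Operation):
    """Hypothetical wrapper of the kind a generator could emit from an op description."""

    def __init__(self, min_value, operand, max_value):
        super().__init__("xla.Clamp", [min_value, operand, max_value])

    @property
    def min(self):
        return self.get_operand(0)

    @property
    def operand(self):
        return self.get_operand(1)

    @property
    def max(self):
        return self.get_operand(2)


clamp = ClampOp("%zero", "%x", "%six")
# Named access reads better than remembering which index is which.
print(clamp.min, clamp.operand, clamp.max)
```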
We also generate docs from one description,
so we actually have one description format from which
we generate the exporting to GraphDef, TFLite FlatBuffer,
XLA protos.
All the ops defined in MLIR, you can
specify verifications for the ops,
as well as structural verifications
for regions and functions, which means
you can capture the ops and their invariants together.
You can verify graphs after every step
of every transformation, if you want to.
So you can narrow down failure cases.
And it also means that a pass can actually assume valid graphs,
so that you operate on a valid graph
without having to repeat the same pattern everywhere
to be defensive.
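Running a verifier after every pass, so that a failure points at the pass that caused it, might look like this in miniature. It is a toy Python model, not MLIR's pass manager: the tuple-based program format, the `verify` invariant, and the pass names are assumptions for the example.

```python
class VerificationError(Exception):
    """Raised when a program violates a structural invariant."""


def verify(ops, args):
    """Toy invariant: every operand is defined earlier, and no result is redefined."""
    defined = set(args)
    for result, name, operands in ops:
        for operand in operands:
            if operand not in defined:
                raise VerificationError(f"{name} uses undefined value {operand}")
        if result in defined:
            raise VerificationError(f"{result} is defined twice")
        defined.add(result)


def run_passes(ops, args, passes):
    """Run each pass in turn and verify the program after every one."""
    verify(ops, args)
    for current_pass in passes:
        ops = current_pass(ops)
        try:
            verify(ops, args)
        except VerificationError as err:
            raise VerificationError(f"after pass {current_pass.__name__}: {err}") from err
    return ops


def identity(ops):
    """A pass that changes nothing."""
    return list(ops)


def drop_constants(ops):
    """Deliberately buggy pass: removes constants that are still in use."""
    return [op for op in ops if op[1] != "const"]


program = [("%0", "const", []), ("%1", "relu6", ["%arg0"]), ("%2", "add", ["%0", "%1"])]
try:
    run_passes(program, ["%arg0"], [identity, drop_constants])
except VerificationError as err:
    print(err)
```

Since the program is known to be valid on entry to each pass, the passes themselves do not need defensive checks for malformed input.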
I also didn't speak a lot about the pluggable type
system and the code generation.
And so a lot of these, in the future
when we actually have more examples and more time,
we can definitely go into more of these.
But this was just like a whirlwind tour
for one of the applications we're looking at.
And sort of in conclusion, MLIR
is a new compiler infrastructure to unify graph and codegen
for TensorFlow.
We're looking at representing multiple levels of abstractions
and ops from different dialects coexisting
and being optimized together.
We want to enable progressive lowering
in a testable and verifiable manner,
making it easier to add these tests verifying behavior.
And beyond that, we want to make the infrastructure as
unopinionated as possible.
We want to be able to get out of the way of the developer
and enable them to define their own abstractions for targeting
their use cases in their backends.
OK, with that, I want to thank everybody.
And time for questions.
[APPLAUSE]
AUDIENCE: I'm going to riff on this word, "opinionated."
Do you have an opinion on memory safety for MLIR dialects?
JACQUES PIENAAR: OK, that is a broad question.
AUDIENCE: And so I can imagine, in the sense
of progressive lowering, to lower
a dialect that uses raw pointers instead of symbolic handles
to tensors.
JACQUES PIENAAR: Yes.
AUDIENCE: Is there expected to be an infrastructure that
will talk about safe and unsafe regions in programs?
Because if we're shipping them around to people,
it would be unfortunate if this became a vector for--
JACQUES PIENAAR: Yes.
So I think, again, sort [INAUDIBLE] upwards.
I mean, I think, in even simple cases,
if you think about numerical equivalence of transformations,
right?
So a case where we have one op that
has certain [INAUDIBLE] behavior and trapping behavior,
converting it to a different dialect which has,
I don't know, perhaps none of that.
So for example, let's say we go to a dialect which says,
well, fast math is fine.
So all the optimizations, division by zero
never happens, so you can always do an invert and multiply.
And I think that boils down to this: the rules
you use to do the transformations for where you're heading need
to be aware of that.
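Why a rewrite such as "divide" to "invert and multiply" needs to know the target's guarantees can be seen with ordinary floating-point numbers. The following Python sketch is only an illustration of the numerical point, not anything from MLIR; the function name `first_reciprocal_mismatch` is made up for the example.

```python
def first_reciprocal_mismatch(limit=1000):
    """Find the first integer n where n / n and n * (1 / n) differ as floats."""
    for n in range(1, limit):
        x = float(n)
        if x / x != x * (1.0 / x):
            return n
    return None


n = first_reciprocal_mismatch()
if n is not None:
    x = float(n)
    # The two forms are equal over the reals but not always in floating point.
    print(n, x / x, x * (1.0 / x))
```

A dialect that promises exact division semantics cannot apply this rewrite blindly, while a dialect that has declared fast math acceptable can.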
So I mean, if you're lowering to a different dialect which
has fewer guarantees, I think that is
up to the legality of the end to determine that, right?
So meaning the memory safety verification,
I feel like in the dialect where it's safe,
we have to insert all the verification and whatnot.
If we're going towards the dialect which is unsafe,
we have one or two options-- either insert runtime checks
to do the verification and any extra sanitization
and [INAUDIBLE] them where we know it's possible,
or we have to say, well now, it's unsafe.
And sorry, you're taking an unsafe input.
I mean, that--
AUDIENCE: No, no--
JACQUES PIENAAR: I haven't thought about this much,
but that's--
Well, I will say, it's going to be more fun
as we start playing with this.
And so, I mean, I encourage folks
to start pushing and prodding it and find bugs.
I mean, at the moment, it's very much infrastructure
that's probably not in your usage path today.
But you know, we want to make it used for that.
Anyway, and a button to--
and this will be in the TensorFlow GitHub repo
later today.
Like, everything will be open.
AUDIENCE: Everything is in the TF and TF control flow
dialects?
JACQUES PIENAAR: TF control flow dialects, XLA, TF Lite.
AUDIENCE: TF Lite?
JACQUES PIENAAR: Yeah.
AUDIENCE: And there are also going to be open design
meetings [INAUDIBLE]?
AUDIENCE: So can you point us [INAUDIBLE]?
AUDIENCE: Sorry?
AUDIENCE: Yeah, a quick question here.
Quick question.
Can you share a simple example of, I
don't know, like TensorFlow with a fully connected layer?
And then that we can go, like, step by step?
For example, converting it to MLIR, then
you described like optimization step,
and then converting it to LLO, HLO, all the steps?
I just want to dive deeper [INAUDIBLE].
JACQUES PIENAAR: Sure.
AUDIENCE: Do you have this kind of end-to-end example,
like not just when I just run a command line and run convert,
and then it's converted.
I want to see intermediate data too.
JACQUES PIENAAR: Yes.
And so you can actually do this by--
and I don't have an example offhand to show you,
but in our testing directory, you'll
see a couple of tests that do some of these manually.
But effectively, what you can do is
you can link together multiple phases of mlir-translate piping
into mlir-opt, piping into mlir-opt,
piping into mlir-translate, to see all these phases.
I mean, actually, you can specify
mlir-opt with multiple different passes, one after the other,
and then a translate at the end.
And if you want to see all the intermediate data,
just slap in a tee in between.
I actually do have an example from a TF Lite [INAUDIBLE]
phone, but I do not have that slide handy.
Because I think that's actually one of the things that
is quite nice-- you can see every step
as you are progressively changing things and running
different passes.
AUDIENCE: And is it covering mobile devices, like a phone?
JACQUES PIENAAR: Well, I mean, our first use case is TF Lite.
AUDIENCE: Can I run it on mobile phone?
JACQUES PIENAAR: Oh, you mean running it on?
AUDIENCE: Yeah.
Sounds good.
Yep.
Yes.
JACQUES PIENAAR: I mean, I want to run it everywhere.
AUDIENCE: From TensorFlow, right?
OK.
Cool.
Cool.
Sounds good.
JACQUES PIENAAR: Thank you.
[MUSIC PLAYING]