JACQUES PIENAAR: OK. Good afternoon, everybody. Today, I'll be presenting MLIR, the multi-level intermediate representation compiler infrastructure, and how it fits into TensorFlow. And so just an overview. This will be a little bit different than some of the other TF training sessions, as most of them focus on how to better use TF technology, while here we're looking at something that is still coming. So this is mostly forward-looking as to things that we want to do and where we're going.

So as an overview for today, I'm going to start by giving an introduction. What is MLIR? Sort of a 50,000-foot view of what it is and how it all fits together. Then we'll look at why we are developing MLIR. The idea there is to show the past, where we came from, how we got to this point, to point to what we want to do in the future. Then we will look at two applications. One is the new TensorFlow to TensorFlow Lite converter. It's still in pre-alpha, but definitely try it and give some feedback. And then we'll also look at the TF2XLA bridge and some of the forthcoming changes there. And I'll walk through a small example.

OK, so let's start with the main question. What is MLIR? Some people have said it's a little bit of a Rorschach test. You can say you can do anything with MLIR, and everybody has a different opinion about what it is. So if we start from a high level, TensorFlow's goal is to be "an open source machine learning framework for everyone." Now, in the same way, MLIR, which stands for multi-level intermediate representation-- with the ML not representing "machine learning," for a change-- is looking to be "an open source program optimization framework for everyone." Now, the way I think about MLIR is as an abstraction-building toolkit-- and I'll show some examples of what I mean by that-- as well as a reusable set of compiler passes for these higher abstractions. So particularly with MLIR, we are targeting analysis, program optimization, and code generation.

And so with that very high-level view, I'm going to start with why MLIR, and then get to exactly what its components are. So when we started MLIR initially, we had a question. We looked at the ML accelerators and we saw, well, there are many, many accelerators coming forth. And we had the question-- how do we support all these accelerators for all the given [INAUDIBLE]? And of course, TensorFlow should provide the best accelerator performance for our users. And so [INAUDIBLE], can we make that easier? So MLIR started as an exploration of a different approach to doing code generators for accelerators.

So we started looking at our existing code generator framework for accelerators, XLA. Now, XLA is one of the most advanced machine learning compilers. We have targets for CPU, GPU, TPU, and other backends. How can we increase the reuse between the CPU, GPU, and TPU backends? Well, at the moment, the TPU backend doesn't use LLVM, so that means there are more low-level components needed there, because we're not able to reuse some of the same passes or structures. The TPU backend is specialized for the best TPU performance. But because it's so specialized, there's less reuse with the CPU and GPU components. And also, looking at these different backends, we notice different abstractions. The CPU, GPU, and TPU backends do not share abstractions beyond the HLO level.
So if you look at the different backends, you'll have different levels of support for different loop abstractions, as well as stencil emitters, between the two. This makes reusing code between the two more difficult. Furthermore, we lack actual abstractions between HLO and, for example, TPU LLO or LLVM, which results in big gaps. So you have passes that are effectively one-shot compilers doing [INAUDIBLE] from very coarse-grained ops to multiple lower-level ops. It's like, well, OK, but if we want to support so many different TensorFlow ops on all these devices, we must leverage as much shared infrastructure as possible. So we should find a way to try and unify these different backends and these abstractions to be able to reuse more of the passes, more of the generators.

So then we come to our first question. Can we add a new abstraction to allow greater reuse? Is there some new abstraction that we can add? We thought yes. But you'll be saying, like, maybe. And we haven't added them yet, but we're looking at it. One of our goals was to address some usability issues with the current stack. One of them was custom ops for accelerators. The other one was dynamic shapes. But now, assuming we had done this, what happens with the rest of the stack? Because the goal is still an end-to-end TensorFlow user experience that we are considering.

So, roughly splitting the stack into multiple pieces: you have TensorFlow, and we're doing optimizations on the TensorFlow Graph. And this is specifically for targeting TPU. We have the TF2XLA bridge that bridges between TensorFlow ops and HLO. Then on HLO, we have different passes, device independent and device dependent. And then finally, we have in the backends the emission from HLO to, in this case, TPU LLO or, in the case of CPU and GPU, LLVM.

And so what we have at the moment is that TensorFlow can see the whole program, but it has no insight into the device. So TensorFlow does its optimizations, and then it has these parts where this will run on the device, this will run on XLA. XLA, on the other hand, has deep insight into devices. It knows all the different backends, and it knows how to optimize for them. But it can't change the TensorFlow Graph. So XLA assumes as fixed all its inputs and the graph structure given to it by TensorFlow. This results in optimization barriers between the different passes. So a backend can't dictate to the bridge which HLO to produce. The HLO produced is constrained by what is in TensorFlow. And so this leads to things such as the double transpose trick, which allows people to force certain layouts. But such workarounds actually create coupling between the different layers. Now the high-level layer and the lower-level layer have an implicit coupling, and a fixed set of assumptions is hard-coded.

Beyond that, the TF2XLA bridge is bridging two different systems with a large impedance mismatch. On TensorFlow's side, we have the op ecosystem, we have dynamic sizes, we have different types, we have stateful ops. On the XLA side, we have HLOs. HLOs are mostly side-effect free beyond a few ops. We have different types. But it's a very different system, and the bridge transitions between them in one pass. So at the moment, what we have is that this stack does not abstract out the various functions. We have the top-level technologies tied to the lower-level hardware implementations.
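A rough sketch of the double transpose trick mentioned above, written as TF-dialect MLIR of the kind introduced later in this talk. This is a hypothetical illustration, not taken from the talk: the shapes and permutations are made up, and the point is only the pattern of a transpose immediately undone by its inverse, which is numerically a no-op but nudges the compiler's layout choice for the value in between.

```mlir
// Hypothetical sketch: NHWC -> NCHW -> NHWC. Numerically the identity, but the
// visible transposes constrain the layout assigned to the intermediate value.
func @pin_layout(%x: tensor<1x224x224x3xf32>) -> tensor<1x224x224x3xf32> {
  %p0 = "tf.Const"() {value = dense<[0, 3, 1, 2]> : tensor<4xi32>} : () -> tensor<4xi32>
  %p1 = "tf.Const"() {value = dense<[0, 2, 3, 1]> : tensor<4xi32>} : () -> tensor<4xi32>
  // Transpose into the layout the user wants to force on the device.
  %nchw = "tf.Transpose"(%x, %p0) : (tensor<1x224x224x3xf32>, tensor<4xi32>) -> tensor<1x3x224x224xf32>
  // ... and straight back again.
  %nhwc = "tf.Transpose"(%nchw, %p1) : (tensor<1x3x224x224xf32>, tensor<4xi32>) -> tensor<1x224x224x3xf32>
  return %nhwc : tensor<1x224x224x3xf32>
}
```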
So this results in a large gap of abstractions between these layers, which makes the passes more difficult to write. Now, this is not something unique to machine learning or to TensorFlow and XLA. This same kind of gap also led to domain-specific IRs elsewhere, in this particular case in the compiler domain. If you look at the different inputs from languages such as Java, C++, Swift, Rust, and Julia, almost all of these languages that target LLVM have introduced a new mid-level IR on which they can do their higher-level optimizations. Actually, C++ does not do this, so Clang doesn't have one. And I had tried to omit this slide, but then Chris stopped me. He's very upset about the fact that there is no [INAUDIBLE]. And so it's a very dear point that we are missing a lot of optimizations and reuse by not having the [INAUDIBLE].

So what this means is we can do the domain-specific optimizations. We can do progressive lowering, so we can actually have simpler hops between these different systems. And if we look at the TensorFlow one, we can think of this as-- if you look at the CPU/GPU path, like the TF Graph to HLO-- HLO being this intermediate representation, mid level between TF Graph and LLVM.

So domain-specific IRs are great. They allow high-level, domain-specific optimizations. This progressive lowering actually encourages reuse between the different levels, because you can have smaller passes doing dedicated things. It's great for location tracking. And it enables some flow-sensitive type checking, so you can operate on higher-level, semantically meaningful parts and do verification at that level. The part that's not so great-- it's a huge expense to build this infrastructure. You're doing a reimplementation of all the same stuff-- pass managers, location tracking, use-def chains, inlining, all of these things. And more importantly, innovations in one community don't benefit the other communities. So there's a downside to having these domain-specific IRs.

And if we look at the TensorFlow compiler ecosystem, the previous picture is very much simplified. Because the real situation is, from a TensorFlow Graph, we have multiple different backends and multiple different IRs being generated. From a graph, we generate HLOs. TensorRT has an output. There's nGraph, Core ML, TensorFlow Lite-- so many different graph IRs, each with different challenges. So in a lot of these bridges, we have similar but different technologies. And this is not going away anytime soon. But this results in a fragile, poor user experience when failures happen. The location tracking between these different phases is variable. So in some cases, there's no location propagated through. So if an error occurs, the only thing you know about is what happened at the lower level, without being able to trace it back.

Beyond this, this also leads to a duplication of infrastructure at all levels. We're reimplementing the same infrastructure multiple times. This is true even in TensorFlow today. We have the same optimizations at both the TensorFlow and HLO levels. And so in some cases, you have an optimization pass that you actually redo multiple times. And in some other cases, you have later passes actually ignoring the results of previous ones. So as an example, we do layout analysis and assignment on TensorFlow Graphs using Grappler. But when we get to HLO, these are mostly ignored, sometimes for good reasons, because XLA actually knows better about the devices.
And sometimes it's unfortunate, because that actually would have been a good decision, given the higher-level graph structure. But beyond that, we need to actually duplicate these passes, because we cannot represent these same operations in one uniform way. And an unfortunate side effect of this is we actually end up transforming multiple times back and forth between equivalent representations. For example, in one unit test case, we converted from Graph to GraphDef seven times, from GraphDef to Graph four times, and then once to HLO. And these are, in a way, not useful transformations. But the problem is we are unable to represent all of these different ops and structures together, so we need to duplicate this.

So with that, the goal of MLIR is to enable global improvement to TensorFlow infrastructure. It's an SSA-based design to generalize and improve ML graphs. We want to add better side effect modeling and control flow representation, improve the generality of the lowering passes-- which I'll explain when we get to the applications-- focus on dramatically increasing the code reuse between all these distinct paths, and fix location tracking and other pervasive issues for a better user experience. So when a failure occurs, the traceability and the debuggability are greatly improved. And at the moment, we believe there are no reasonable existing answers. And we refuse to copy and paste another SSA-based optimizer six more times. And so that led us to MLIR.

A couple of the other goals: well, we want to embrace TensorFlow. TensorFlow is one of our main targets. We don't want to work around it. We want to support the full generality of TF graphs. We want to allow aggressive reuse of infra across multiple hardware paths. And similar to TensorFlow, we want to allow open customization. We want a target to be able to say, implement my JPEG decoder using this block. We want a user to be able to say, hey, I have a custom kernel. For this model I'm running, I want to see its performance, so use my kernel. Beyond that, we want to enable folks to experiment with their own lower-level implementations of operations. So if the compiler is not there yet, or the researcher has a better idea, or we have an ML algorithm generating code, we want to be able to plug that into the same system and see the effect on end-to-end behavior.

But we also want to embrace the limitations of particular backends. So for example, if your backend only supports convolution, we want to provide convolution. If you don't support control flow, well, we don't give you control flow. If you only have static shapes, we only give you graphs with static shapes, et cetera. This includes what floating point precision operations you support, forced quantization, things like that. We want to avoid those big semantic gaps in lowering. We do not want to have these big gaps where one step to the next in the transformation path is bridging completely separate systems, which is difficult to debug and verify. And then, very importantly, we want to improve the traceability and testability for users. So if your model fails compilation, we want to be able to point back to where it failed.

And so with this, what should MLIR provide? Well, it should represent multiple levels of abstraction. It should allow this progressive lowering-- so for any given function, having a progressive set of lowerings that gets you to the destination. It should not be these big jumps between two separate data structures.
We want to be able to lower through multiple different abstractions. This also means that the passes need to be designed to operate on these different levels and properties, rather than looking at fixed ops. And I think this is especially essential for TensorFlow, which has an open ecosystem with ops being added very regularly, at a good pace. We also should make it easy to add abstractions or domain-specific IR constructs. An example here is the affine dialect. In our affine dialect, we have affine loops. This isn't a hard-coded construct in MLIR. An affine loop, which can be used for some polyhedral code optimization, is something added by extension-- the dialect itself defines it.

Beyond that, we're looking at location as a first-class construct. Locations are intrinsically tied to operations and optimizations. You cannot create an op without a location. If you're doing a transformation and you're replacing an op with a new op, you have to specify where this op comes from. So this means we have a path-- hm?

AUDIENCE: What do you mean by the location?

JACQUES PIENAAR: So location in this case could be a file location, could be a name, could be a stack trace. So we have a couple of different locations, and we also have an opaque location that is just interpretable by a dialect, for example-- so if your backend has a different concept. The most common ones for TensorFlow are a name location, corresponding to the name in the GraphDef, or the Python call stack-- so the set of calls that got you to creating this op.

Another problem we also want to work on is, we want this framework to enable us to complete a patchwork of tools. At the moment, a couple of users have run into the problem where we have broken paths. You have a tool A that will get you from representation x to y. And you have a tool B that gets you from y prime to z. But if you actually want to get from x to z, you have to do something else, or you have to stitch them together yourself. We want to try and get a path to complete this patchwork and enable end-to-end workflows of interest.

And so, where are we applying MLIR? So far, I have mostly been talking about the infra. The first application is the TensorFlow to TensorFlow Lite converter. This is, like I said, pre-alpha. I'll discuss it next. And it's working for a couple of models. We have a couple of new features coming in there that enable some new TensorFlow Lite features. The other target we're working on is the TensorFlow to XLA bridge-- looking at the current lowering to XLA, as well as accelerators. We're working with the Grappler team on graph optimizations, shape inference, device partitioning, and placement. And then we also have the TPU and GPU codegen projects going on, to evaluate new approaches to generating code for these different devices.

So I think of MLIR as three different parts. One, you have the graph compiler, which covers things like op expansions, the lowering to TF Lite, automatic outside compilation, parallelism/sharding, that sort of thing. Two, the code generator, which focuses on high-level abstractions for code generation-- so polyhedral loop nests, tiled tensor views, things of this nature. And underlying all of this is the MLIR infrastructure. This is framework-independent IR. You have rewrite generators, like automatic pattern matchers, and the mechanisms to define dialects consisting of types and ops. And that ties all of this together.
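A minimal sketch of the affine dialect mentioned above, showing that the loop is just an op provided by a dialect rather than a construct hard-coded into MLIR core. This is illustrative and uses op spellings from roughly the era of this talk (standard-dialect constant and mulf); newer MLIR versions spell these differently.

```mlir
// Scale every element of a buffer by two using the affine dialect's loop op.
func @scale_by_two(%buf: memref<128xf32>) {
  %two = constant 2.0 : f32
  // affine.for is an op from the affine dialect, with the loop body as a region.
  affine.for %i = 0 to 128 {
    %v = affine.load %buf[%i] : memref<128xf32>
    %s = mulf %v, %two : f32
    affine.store %s, %buf[%i] : memref<128xf32>
  }
  return
}
```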
And that leads me to one of our first applications, which is the TensorFlow Lite converter. So the basic flow for the TensorFlow Lite converter is: a TensorFlow Graph is the input. And here, I'm using "Graph"-- misusing it slightly, because it could be a graph, it could be from a Python script, [INAUDIBLE] a SavedModel. But the goal is to translate it from this representation to an MLIR module consisting of ops in the TF and TF executor dialects, which I'll come to in a second; then to legalize from TF to TensorFlow Lite-- and legalize is just a different way of saying convert-- that is, convert all the ops in TensorFlow to TensorFlow Lite ops that are legal. And the reason we use "legalize" here is we don't necessarily need to convert all the ops. For example, TensorFlow Lite supports flex ops. It supports custom ops. So some ops may actually remain in the TensorFlow dialect. Then beyond that, we have some optimizations, and then we translate it back out to a TensorFlow Lite flatbuffer.

So the converter is there to change between the two different graph representations. We have two different runtimes-- TensorFlow and TensorFlow Lite. They have different constraints and targets. But the graph we want users to run-- the one they trained and run-- we want to be the same. And so we have this converter workflow. The converter actually has overlapping goals with regular compilation, because Edge devices can also have accelerators. And beyond that, I think of TensorFlow Lite, with this converter, as just a weird ISA. You have a weird instruction set that you're targeting, but it's still a compiler problem. Now, MLIR's pluggable type and rewrite system simplifies specifying these transforms and expressing what you need to do to convert between these ops. And as an example here, the quantized type is a first-class citizen in the TensorFlow Lite dialect. So we have the quantized op representation. So it's not just uint8-- it's actually, you could say, a uniform quantized type with these parameters, which allows for some additional checking and verification.

So as I mentioned earlier, one of the focuses is usability. Now, usability is one of the top complaints about TOCO among TFLite users-- CHECK-failing on errors, unsupported cases, confusing error messages-- and this is one of the things we want to improve. One of the ways we want to improve it is by making debugging easier. We want the locations to point back to your source. So when an error is emitted, we want to point back to the TensorFlow Python that caused the error. And for this, we are bolting on and extending the TF debug info work currently ongoing, and we want to track the location origin of instructions as well. So for example, if you have a fused multiply-add, then it has a fused location corresponding to both the multiply and the add. So it shows you this new op was created from these previous ops, and that allows you to track back to the original code.

After that, we also want to be able to say why a model failed to convert. So we want to point to the unsupported ops and types. We also want to say how those types came to be. Oftentimes, the user gets a failure, and they have no idea why this op isn't supported-- because they, in some cases, didn't even mean to have this op there. And so we want to make it easy to find out why it got to a model failing. And as I mentioned, we have dialect types to enable more checking and better reporting.
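A small sketch of what a quantized type as a first-class citizen looks like in the TensorFlow Lite dialect. The scale and zero-point values are illustrative, and the op is written in MLIR's generic form rather than the dialect's custom syntax to avoid depending on its exact spelling.

```mlir
// An add on uniform-quantized tensors: the element type carries the storage
// type (u8), expressed type (f32), scale, and zero point.
func @quantized_add(
    %a: tensor<4x!quant.uniform<u8:f32, 0.015686:128>>,
    %b: tensor<4x!quant.uniform<u8:f32, 0.015686:128>>)
    -> tensor<4x!quant.uniform<u8:f32, 0.015686:128>> {
  %0 = "tfl.add"(%a, %b) {fused_activation_function = "NONE"}
      : (tensor<4x!quant.uniform<u8:f32, 0.015686:128>>,
         tensor<4x!quant.uniform<u8:f32, 0.015686:128>>)
      -> tensor<4x!quant.uniform<u8:f32, 0.015686:128>>
  return %0 : tensor<4x!quant.uniform<u8:f32, 0.015686:128>>
}
```

Because the quantization parameters are part of the type, a mismatch between operands can be caught when the module is verified rather than surfacing at runtime.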
So now, we can have diagnostics that say: oh, you have an add with two nonconforming quantized types. I'm sorry, this add won't work and will fail at runtime. We can do that checking at compile time. And to give an example, if you look at the old TOCO experience for an unexpected value of an attribute, you get a check failure. And the check failure will point you to a stack trace somewhere. And we want to go from that to where we are today, where we say: hey, this node failed to convert, because this TF op actually requires the data format attribute to be either NHWC or NCHW. And this op was inserted by the following call, from your user code and the libraries in between. And this allows the user to go find where the error occurred. And I'll mention this is also evolving-- if you try it today, you'll actually see carets pointing to the errors, as you would see with Clang compilation errors-- so source code interleaved, as long as it's in the same workspace. And so the idea is to make the user experience much easier when debugging these errors.

The next application is the new TensorFlow compiler bridge. So at the moment, the TF2XLA bridge is an interop between TensorFlow and XLA. It consists of rewrite passes, as well as transformations to XLA. Now, XLA also targets everything from multi-node machines to Edge devices. So it's actually not that distinct from TensorFlow Lite. And so the paths we use to lower to these two different backends should not be separate. We should be able to reuse them. As new features become available in one, everyone should be able to take advantage of them. And I'll actually mention that I want to get rid of the word "bridge" here, because one of our goals is not to have this big span between different abstractions. We want to have this be more progressive, with shorter lowering steps that are individually testable. And beyond that, we don't want it to be just XLA-specific; we're looking at custom compiler backends. We want to make it easy for other compiler frameworks to integrate into TensorFlow in a unified way. So custom compilers can have their own dialect, for example. Dialects can integrate into the same framework. You define your operations and types, and you can have pattern rewrites specifying the transformations you need to target your backend. And it also means that custom pipelines can reuse the existing components. We want to make reusing the pipeline as easy as possible. So one of MLIR's goals is to be a reusable set of compiler passes. And one of those is to translate from TensorFlow or the XLA dialect to your dialect. And so folks then have the option of saying which ops are legal in their dialect. And of course, throughout all this, we want to be able to optimize the graph and make it easier for the backends to focus on the parts that are important to them.

I'll give a little example of the current versus the new approach at the end of the TF2XLA bridge, for converting a TensorFlow op to an XLA op. So taking Relu6 as an example-- we have Relu6. You define it using a class. You register it as an XLA op. This op is an XlaOpKernel, which actually derives from the TensorFlow OpKernel. Inside of it, it uses the XlaOpKernelContext to get inputs and set outputs. This is very familiar to folks adding new TensorFlow ops; one of the differences here is that this is actually constructing an XLA expression. This XLA expression is something that is built in a side data structure captured by the context.
And the outputs set here are actually-- the values flowing through the graph are tensors containing pointers to this XLA expression. So what you have is actually a pointer being fed into the output of this op and flowing through the TensorFlow graph. But you can represent it, and it's very familiar, I think, to folks, how to do this. And I think that's one of the problems. Because it's so complicated, if something goes wrong, it's difficult to find out what went wrong.

And the way we're testing this at the moment is by writing Python tests. For example, to [INAUDIBLE] test Relu6, we have a UnaryOpTest deriving from the XLATest classes. And so we have an end-to-end test case starting from TensorFlow and ending with execution on different devices. So per floating point type, per device, we execute Relu6 both compiled and using plain TensorFlow. So we're testing runtime values for different types on different devices and checking approximate equality. But very importantly here, you actually have to construct this test to avoid all the optimizers along the way. One of my favorite examples was, I was once debugging why our [INAUDIBLE] take so long. And looking at the longest running test, I actually found out it was a no-op, because it was being constant-propagated away. And we were just testing constant folding over, and over, and over again, instead of actually running anything. So it can be very difficult to ensure that the test you think you are writing is actually the one you're writing. And another point is, this is testing nn_ops.relu6. In this case, it actually corresponds to a TensorFlow op. But when I talk about TensorFlow ops later, I'm referring to the ops as registered by op registration-- so the C++ ops, versus the Python constructs.

Anyway, the current approach with TF2XLA is a symbolic execution of the TF graph using the executor. We're storing pointers to a side data structure in tensors flowing through the graph. We capture the XLA type in tensors in different data structures, depending on the TensorFlow type flowing through the graph. We're mostly using end-to-end tests, written in Python, for constructing it, as this is the easiest way. And it allows for very complicated tests. But we need to thread these test cases past a lot of optimizers to actually ensure they work.

Now, the new approach we want to have here is, we want to make it so that you can write directed unit tests simply. In MLIR, the source of truth for the operations is the IR. The IR is round-trippable through textual form, which means you can take the IR, run an optimization pass, and dump it, and that's exactly what you would have gotten with the in-memory changes. Beyond that, we want to ensure that there's no optimization run beyond what the developer specified. We want the types to be representable in the IR itself, and in one way. There should not be confusion as to what type is represented, where, and what it is. And also, we want to enable having these multiple abstractions to lower through, without having large jumps between different systems.

And so just to plug at the start-- and this slide's actually out of date-- we have mlir-opt and the equivalent of that, tf-opt, which are optimization tools similar to LLVM's opt tool. It's a tool for testing compiler passes. Effectively, you have IR as input, you run mlir-opt specifying an exact pass, and you get IR out again. So textual in, and textual out.
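A sketch of what such a directed, textual-in/textual-out test looks like in the lit/FileCheck style that MLIR uses. The pass flag (-xla-legalize-tf) and the xla_hlo op name are assumptions based on the tooling of that era; the point is the shape of the test, not the exact names.

```mlir
// RUN: tf-opt %s -xla-legalize-tf | FileCheck %s
// (Pass name assumed for illustration.)

// CHECK-LABEL: func @relu6
func @relu6(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
  // The check is purely structural: we only assert that a clamp appears.
  // CHECK: xla_hlo.clamp
  %0 = "tf.Relu6"(%arg0) : (tensor<1x3xf32>) -> tensor<1x3xf32>
  return %0 : tensor<1x3xf32>
}
```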
This allows for pretty cheap verification of the output, because you're verifying the transformation. You're verifying the structure of the transformation. So you do not need to actually run it on different devices if you just want to verify the transformation. You do not need to compute values; you're verifying the structure.

And so in this case, if we look at the TF Relu6 example, you can create a test case using mlir-translate, which is another tool that goes from a TF graph into MLIR. Or you can write it manually, if you enjoy writing SSA-based IR in textual form. So here is an example of a function that takes an input tensor of a fixed shape. This actually corresponds to the previous example. And I'm not going to get into the details, but you can see the TF dialect RFC for more information about the two different dialects. But here, we have the TF executor dialect consisting of the graph and an island within it. Within the island, you have the ops in the TF dialect. And so in this case, you have the Relu6 operation with the normal attributes of T, device, and name, taking in a tensor of 1x3xf32 and producing the same one again, which gets yielded.

Now from this, we can actually convert to the TF dialect, simply because in this case we have a single island, due to there being no control dependencies and no side-effecting ops. And so this is the pure TF dialect representation, where we don't actually need an island. And you'll see a single Relu6 operation. But you'll actually see duplicate information stored here, because now we have the explicit types. And so with the explicit types, we can actually get rid of all these different attributes-- T, the mapping from T to the result type, all of this-- because we have the types. So these are derived attributes of the op itself, with the type being the source of truth for them. And so in the import and export between GraphDefs and NodeDefs, we can use this information and derive it from the ops. And so you can have this simpler form, where you just specify the op and the result type.

And then from here-- oh, and one thing you might have noticed in the previous slide: it actually highlighted the names, as well as the attributes. And that's because all ops have locations. And from TensorFlow, one of the most common ones is the name being used as the location. If you have the debug info, you can also use the call stack, if that information is provided. And so locations for ops are optionally printed. And in this case, if we actually print the op with locations, you can see the original name from the import as well. Now, names are one kind of location, file and line is another, and a call stack is another. And then, for example, if you look at the fused ones-- and this is from one of the other examples-- you can see that the location of this op was actually due to a fusing of a Relu, a BiasAdd, a convolution, and this constant b. And now you have this single op. That's its location. You can trace back how you got to that op.

And now, if you want to lower from TF ops to XLA ops-- well, we also have-- sorry-- previously, as shown, we have the TF dialect to match the TF ops. So these are ops defined via REGISTER_OP. And so this differentiates them from the Python ops. And similarly, we also have an XLA dialect with ops to match HLOs. Converting between these two, you can use a general legalization framework. And this framework is a general graph rewriting infrastructure.
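A rough reconstruction of the IR being described above, since the slides themselves are not part of the transcript: first the imported form in the tf_executor dialect, then the pure TF-dialect form with derived attributes dropped and the name location printed. Syntax follows MLIR/TF of roughly this era and may differ in detail from the actual slides.

```mlir
// Imported form: a tf_executor.graph wrapping a single island around the TF op.
func @main(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
  %graph = tf_executor.graph {
    %out:2 = tf_executor.island {
      %0 = "tf.Relu6"(%arg0) {T = f32, device = "", name = "Relu6"}
          : (tensor<1x3xf32>) -> tensor<1x3xf32>
      tf_executor.yield %0 : tensor<1x3xf32>
    }
    tf_executor.fetch %out#0 : tensor<1x3xf32>
  }
  return %graph : tensor<1x3xf32>
}

// Pure TF-dialect form, with the location printed: T is derived from the types
// and the name lives in the location, so neither needs to be an attribute.
// A fused op would instead carry something like loc(fused["Conv2D", "BiasAdd", "Relu"]).
func @main(%arg0: tensor<1x3xf32>) -> tensor<1x3xf32> {
  %0 = "tf.Relu6"(%arg0) {device = ""} : (tensor<1x3xf32>) -> tensor<1x3xf32> loc("Relu6")
  return %0 : tensor<1x3xf32>
}
```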
And for example, we have patterns, so you specify patterns to go from a source DAG to a target DAG. So in this case: convert from a TF Relu6 op that takes a tensor of one of those types. If that's true, then convert it to an XLA ClampOp with 0 and 6 around the input. So the XLA ClampOp has as its first argument min, then the operand, then max. These are declarative rules. They're effectively a statement of equality under conditions that we can use to optimize for a given target dialect. And these rules actually also allow for dynamic constraints. So you can say, only fire this rule if the following is true. And that allows you to specialize the rules for certain cost models, certain backends, all these kinds of things. But with that rule added into the system, we can run tf-opt again. We pass -xla-legalize. And now, from our previous input, we get this as output. So now you have a very directed change from the TF Relu6 to the XLA Clamp, with two extra constant ops added in the function. And of course, backends can define different lowerings. But this just means this transformation to clamp can now be simply verified textually. No execution is needed. No explicit values need to be verified. The verification of the correct execution of the different ops can now be done independently of the verification of the transformation.

AUDIENCE: But like, if you wanted to change the way this pattern's implemented to go from a single clamp to two clamps, like 1, 4, 0 and 1, 4, 6, for example, that would break all your unit tests, but it would still be correct.

JACQUES PIENAAR: Correct. Yes. Yes. So I mean it's--

AUDIENCE: You're not worried about that?

JACQUES PIENAAR: It's a question of-- I'm not. No, because what I'm verifying at the moment is the transformation. And you're saying, well, I can get the same equivalent numerical result by multiple different transformations.

AUDIENCE: And if someone one day changes one, a lot of tests are going to break.

JACQUES PIENAAR: But then you should not be verifying this transformation. Depending on where you verify this, yes.

AUDIENCE: OK. Yeah.

JACQUES PIENAAR: OK. And so what I did not show here was also how we autogenerate C++ classes for operations, with helper functions. So in this case, for example, the XLA clamp op in C++ actually has accessors, such as min, operand, and max. So you don't have to say "get operand 3." We also generate docs from the one description, so we actually have one description format from which we generate the exporting to GraphDef, the TFLite flatbuffer, and XLA protos. For all the ops defined in MLIR, you can specify verifications for the ops, as well as structural verifications for regions and functions, which means you can capture the ops and their invariants together. You can verify graphs after every step of every transformation, if you want to. So you can narrow down failure cases. And it also means that passes can actually assume valid graphs, so you operate on a valid graph without having to repeat the same checks everywhere to be defensive. I also didn't speak a lot about the pluggable type system and the code generation. A lot of these, in the future when we actually have more examples and more time, we can definitely go into. But this was just a whirlwind tour of one of the applications we're looking at.

And so, in conclusion, MLIR is a new compiler infrastructure to unify graph and codegen for TensorFlow.
We're looking at representing multiple levels of abstraction, with ops from different dialects coexisting and being optimized together. We want to enable progressive lowering in a testable and verifiable manner, making it easier to add tests verifying behavior. And beyond that, we want to make the infrastructure as unopinionated as possible. We want to get out of the way of the developer and enable them to define their own abstractions for targeting their use cases and their backends. OK, with that, I want to thank everybody. And time for questions.

[APPLAUSE]

AUDIENCE: I'm going to riff on this word, "unopinionated." Do you have an opinion on memory safety for MLIR dialects?

JACQUES PIENAAR: OK, that is a broad question.

AUDIENCE: So I can imagine, in the sense of progressive lowering, lowering to a dialect that uses raw pointers instead of symbolic handles to tensors.

JACQUES PIENAAR: Yes.

AUDIENCE: Is there expected to be infrastructure that will talk about safe and unsafe regions in programs? Because if we're shipping them around to people, it would be unfortunate if this became a vector for--

JACQUES PIENAAR: Yes. So I think, again, sort of [INAUDIBLE] upwards. I think, even in simple cases, if you think about numerical equivalence of transformations, right? So a case where we have one op that has certain [INAUDIBLE] behavior and trapping behavior, and we're converting it to a different dialect which has, I don't know, perhaps none of that. So for example, let's say we go to a dialect which says, well, fast math is fine. So for all the optimizations, division by zero never happens, so you can always do an invert and multiply. And I think that boils down to the rules you use to do the transformations for where you're heading needing to be aware of that. So if you're lowering to a different dialect which has fewer guarantees, I think that is up to the legality of the lowering to determine, right? So meaning, for memory safety verification, I feel like in the dialect where it's safe, we have to insert all the verification and whatnot. If we're going towards a dialect which is unsafe, we have one of two options-- either insert runtime checks to do the verification and any extra sanitization, and [INAUDIBLE] them where we know it's possible, or we have to say, well, now it's unsafe. And sorry, you're taking an unsafe input. I mean, that--

AUDIENCE: No, no--

JACQUES PIENAAR: I haven't thought about this much, but that's-- Well, I will say, it's going to be more fun as we start playing with this. And so I encourage folks to start pushing and prodding it and finding bugs. At the moment, it's very much infrastructure that's probably not in your usage path today. But you know, we want to make it usable for that. Anyway-- and this will be in the TensorFlow GitHub repo later today. Like, everything will be open.

AUDIENCE: Everything is in the TF and TF control flow dialects?

JACQUES PIENAAR: TF control flow dialects, XLA, TF Lite.

AUDIENCE: TF Lite?

JACQUES PIENAAR: Yeah.

AUDIENCE: And there are also going to be the open design meetings [INAUDIBLE]?

AUDIENCE: So can you point us [INAUDIBLE]?

AUDIENCE: Sorry?

AUDIENCE: Yeah, a quick question here. Quick question. Can you share a simple example of, I don't know, like TensorFlow with a fully connected layer? And then we can go, like, step by step? For example, converting it to MLIR, then the optimization step you described, and then converting it to LLO, HLO, all the steps?
I just want to dive deeper [INAUDIBLE].

JACQUES PIENAAR: Sure.

AUDIENCE: Do you have this kind of end-to-end example? Not just where I run a command line, run convert, and then it's converted-- I want to see the intermediate data too.

JACQUES PIENAAR: Yes. And so you can actually do this by-- and I don't have an example offhand to show you, but in our testing directory, you'll see a couple of tests that do some of this manually. But effectively, what you can do is you can link together multiple phases: mlir-translate piping into mlir-opt, piping into mlir-opt, piping into mlir-translate, to see all these phases. Actually, you can specify mlir-opt with multiple different passes, one after the other, and then a translate at the end. And if you want to see all the intermediate data, just slap in a tee in between. I actually do have an example from a TF Lite [INAUDIBLE] phone, but I do not have that slide handy. Because I think that's actually one of the things that is quite nice-- you can see every step as you are progressively changing things and running different passes.

AUDIENCE: And is it covering mobile devices, like a phone?

JACQUES PIENAAR: Well, I mean, our first use case is TF Lite.

AUDIENCE: Can I run it on a mobile phone?

JACQUES PIENAAR: Oh, you mean running it on?

AUDIENCE: Yeah. Sounds good. Yep. Yes.

JACQUES PIENAAR: I mean, I want to run it everywhere.

AUDIENCE: From TensorFlow, right? OK. Cool. Cool. Sounds good.

JACQUES PIENAAR: Thank you.

[MUSIC PLAYING]