EUGENE ZHULENEV: My name is Eugene. I work on Grappler and performance, but today I'm going to talk mostly about how we do graph rewriting for functions. Functions are a very important part of TF 2.0. They're very important for the end user, for how the user thinks about the flow of computation: it's not a graph anymore, it's a composition of functions. But that creates a bunch of problems for us at runtime. Basically, we have to rewrite the whole graph and all of the functions to make them executable and to avoid some specific problems that come with functions.

On the left is some V1 Python code, the kind people usually write. You define your variables and some constants, you do the assignments, and then you want to return the updated variables. If you don't add explicit control dependencies, you're going to read the initial values. So you have to think about what your dependencies are and make everything very explicit.

On the right is a textual GraphDef representation. There's not enough space to put the full GraphDef on the slide, but this is roughly how it looks: if you print the GraphDef's debug string in TensorFlow, you get a short representation of the graph, something like this. Variables are actually VarHandleOp, a single kernel that returns a DT_RESOURCE data type. Constants are just simple Const nodes with their values. When you do the assign-add, there is an AssignAdd TensorFlow kernel that takes the resource handle from the VarHandleOp and the constant, and adds the value to the variable. And when you do the read, there is another TensorFlow kernel, ReadVariableOp, that reads the resource handles for a and c. Because you added explicit control dependencies, these reads depend on both update_a and update_c. And you fetch read_a and read_c.

But explicitly adding control dependencies is very annoying. So in TensorFlow 2.0, the same graph looks something like this. You have your variables defined outside of the function scope, and you have a tf.function-annotated add_and_get: I want to add two different constants to two variables, a and c, and get back the latest values. The Python looks something like the code on the left. You don't have to explicitly add control dependencies, which makes life easier, and you don't have to think about program order. tf.function traces the function, and the trace has all the necessary control dependencies, so when you execute the graph, you get the program order you would expect from Python.

The FunctionDef, the graph representation, looks something like the code on the right. We have a GraphDef, which is basically a sequence of NodeDefs, and we also have a FunctionDef, which is a way to group together a subgraph with input arguments and output arguments. So the function on the left, from the Python, will be transformed into a FunctionDef that is represented something like this. We have two input parameters of resource type, and the return type of the function is a pair of ints. The update — sorry, the AssignAdd — assigns the new value to the resource. Then we do the read, and the read has a control dependency on the previous operation touching the same resource. This is added automatically by TensorFlow 2.0 automatic control dependencies, so we don't have to do anything. And then we have the return values.
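A minimal sketch of the two patterns just described, in Python. The variable names and values are assumed for illustration rather than copied from the slides:

```python
import tensorflow as tf

# TF1-style: explicit control dependencies are required to observe the updates.
g = tf.Graph()
with g.as_default():
    a = tf.Variable(1)           # a VarHandleOp under the hood
    c = tf.Variable(3)
    update_a = a.assign_add(10)  # AssignAdd kernel on the resource handle
    update_c = c.assign_add(20)
    with tf.control_dependencies([update_a, update_c]):
        read_a = a.read_value()  # ReadVariableOp, ordered after both updates
        read_c = c.read_value()

# TF2-style: tf.function adds the same control dependencies automatically.
a2 = tf.Variable(1)
c2 = tf.Variable(3)

@tf.function
def add_and_get(x, y):
    a2.assign_add(x)
    c2.assign_add(y)
    return a2.read_value(), c2.read_value()   # reads see the new values

print(add_and_get(tf.constant(10), tf.constant(20)))  # (11, 23) on the first call
```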
So functions at the graph level, in the FunctionDef and GraphDef, have two types of returns in TensorFlow. You can return data values: here we return two tensors, read_a and read_c. But there is also a special kind of return, the control return. It doesn't return any data, but you can group some of the ops inside your FunctionDef, give them a name, and say that they must always run. When you have a control dependency on the function call, the runtime verifies that every op that is part of the control returns will run.

And the main graph looks something like this. We have a PartitionedCall; it is the TF 2.0 function call mechanism that does partitioning, multi-device function invocation, graph optimization, and all the other things required by the runtime. And the fetched values are just identities that read the first and the second output of the function call. Any questions? Is the notation on the right side clear?

SPEAKER 1: So in this example in particular, the read operations already depend on the updates. So in this case the control dependencies, I guess, are not that required. But in other cases, if we don't return the read values, then those—

EUGENE ZHULENEV: No, the read depends only on the input, a and c.

SPEAKER 1: Yeah, but it has a control dependency on the update.

EUGENE ZHULENEV: Yeah, because this control dependency is added automatically by TF 2.0's tf.function when you trace the function. When you have multiple ops touching the same resource, automatic control dependency tracking adds the control dependencies.

SPEAKER 1: I'm saying in terms of the return values: in this case, if you have read_a and read_c, they automatically capture the— so if you try to fetch read_a and read_c, you make sure that the updates run. So in this case, the control dependencies are not that useful, I guess, or are they—

SPEAKER 2: [INAUDIBLE]

SPEAKER 1: Yeah, yeah, the control returns.

EUGENE ZHULENEV: Oh, yeah. Well, yes.

SPEAKER 1: Just pointing out that it would be more useful if you don't return the read value. Then in that case, the control returns would be very useful.

EUGENE ZHULENEV: Yeah. So if you have some kind of update — for example, a function that doesn't have any returns, a function only for its side effects, updating some counters or statistics — in that case you don't have regular data outputs, and this is why you have control returns. But right now, TF 2.0 will add these ops to the control returns in both cases. This is needed to separate the data path and the control path, because when you inline the function, it might turn out that no one is using output 1, so you could end up not executing your state updates inside the function. But it is guaranteed that everything on the control path, in the control returns, will be executed, under some conditions that I'll mention later.

SPEAKER 3: Question. What's that small sign in front of update_a? You have read_a equals ReadVariableOp of a, and then what's that small sign in front?

EUGENE ZHULENEV: The control dependency, you mean? This one?

SPEAKER 3: Yeah, this one. Why does it have this special—

EUGENE ZHULENEV: Yeah, so a data dependency goes by name. So a is a regular data dependency, and this one is a control dependency. We don't read the actual output; it just means that update_a must execute before we can execute read_a.

SPEAKER 3: Oh, so it's just a special sign.
EUGENE ZHULENEV: Yeah. Yeah.

SPEAKER 4: That's the internal representation of control dependencies in the GraphDef proto.

SPEAKER 3: I see.

EUGENE ZHULENEV: So yeah, if your op has two inputs, you will have a and b as inputs, and then you can add as many control dependencies as you want. At graph execution time, the runtime makes sure that everything in the control dependencies executes before your kernel; that's what the special notation means.

Here's another example. We have variables a and b, which are vectors of length 10, and variable c, which is just a counter of type integer. We want to take strided slices from variables a and b, and we also want to increment the counter: every time we take a slice, we want to know how many slices we've taken, for whatever reason. This graph doesn't make much sense, but anyway, we're only interested in slice_1.

The graph will look something like this. We have three VarHandleOps, constants, an AssignAdd for the counter, and then we read the variables — this one should be b. We read the variables from their resource handles, and then we take the strided slices. And we fetch slice_1. Because we only fetch slice_1, we can remove slice_0; it is not needed for slice_1. We can also remove the read of a from the graph, and that's all. And you see the nice property we get: slice_0 is invalid. If you try to take a slice from 20 to 30 of a variable of size 10, it will fail at runtime. But because you don't actually need slice_0, everything is just fine. As long as you don't fetch slice_0, this graph is totally valid.

But now take the same graph and put the slices inside a function. It's the same kind of thing: you have three variables, a, b, c, where c is a counter. You do the AssignAdd inside the function — nice, you don't have to add an explicit control dependency, your counter gets updated automatically. Then you take slice_0 and slice_1 and return them. And you invoke the function. You're not interested in slice_0; you only want the second value, so you keep just that one and try to print it.

At the GraphDef and FunctionDef level, it looks something like this. There is a get_slice function that takes three resources, a, b, and c. It has the increment for the counter. It reads the values of a and b, takes the slices, and returns slice_0 and slice_1. And it has a control return, the increment, automatically added by automatic control dependency tracking. The slices don't have any control dependencies, because the slices do not depend on variable c; they depend on different variables. Then you have a function call, a PartitionedCall to get_slice. You don't need the first slice; you take the second one, the function call's output 1, and you try to fetch it, or print it, or whatever.

And the problem right now is that this code will fail, because function semantics in TensorFlow currently are strict with regard to inputs and outputs. Before a function can return, it must finish executing all of its outputs. And before a function can start executing, it must finish evaluating all of its inputs. There's another property: all inputs must be evaluated and must not be dead before your function can start running. Note that this matters if you have inputs coming from Switch nodes, which are what TensorFlow uses to represent control flow. So there is a special notion of a dead tensor.
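A rough reconstruction of the get_slice example, as a sketch: tf.slice stands in for the strided slice on the slide, the 20-to-30 bounds come from the talk, and whether the call actually fails depends on whether your TensorFlow build inlines and prunes the function, which is exactly what the rest of the talk is about:

```python
import tensorflow as tf

a = tf.Variable(tf.range(10))
b = tf.Variable(tf.range(10))
c = tf.Variable(0)  # counter

@tf.function
def get_slice():
    c.assign_add(1)                    # side effect -> becomes a control output
    slice_0 = tf.slice(a, [20], [10])  # out of bounds for a length-10 vector
    slice_1 = tf.slice(b, [0], [5])    # perfectly valid
    return slice_0, slice_1

@tf.function
def outer():
    # Only the second output is used; under strict function semantics the
    # invalid slice_0 must still be computed before get_slice can return.
    _, s1 = get_slice()
    return s1

print(outer())
```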
So basically, when you run your graph at runtime and you have dead tensors, you don't execute any kernels; you just update a flag and propagate this deadness across the graph. So you can have a large graph with an if, which is represented as a Switch, and based on the runtime value you execute only half of the graph. Another property of functions is that all outputs must be evaluated and also not dead.

For example, say you have a function with control flow and you want to take the gradient of that function. You have to return all the intermediates from the function, and some of them might be dead, because they might come from intermediate computations in the untaken branch. And then, basically, your function output can be dead. But sometimes that's needed, so there is a special way to avoid it, and I think it might not be used right now. But before, when we generated gradient functions, TensorFlow 2.0 marked those functions as allowed to carry dead inputs, and instead of a dead input the runtime would return an empty tensor and just continue.

But the most confusing thing: take this code without the tf.function annotation, as a regular Python function. get_slice has the same inputs, a, b, and c, and the same increment, but now you have to add an explicit control dependency, because you're not using TF 2.0 automatic control dependency tracking. And we get the graph on the right. It is exactly the same graph you would get if you didn't use functions at all, if you just wrote it line by line, one by one. And it allows the TensorFlow runtime to prune all the nodes that are not used, so your graph is valid again. The only difference is that you don't have the tf.function annotation. Even if we completely remove the counter — say I don't want to count anything — the only difference would be the tf.function annotation. And it is a huge difference at runtime: one graph fails, and the other graph is just fine.

And this is a real example. Alex gave a presentation about functions in TensorFlow, and this is how some people use them in their programs. You have functions that return some environment state, do an action, and do an environment step. You get a, b, c, and you just run the loop on the action, and in the end you run the final function. So if you would not wrap this with the tf.function annotation, on every loop iteration you would get the state, do the action, and step the environment, which is not fine.

Another example, from TensorFlow Probability. You have a tf.function that has three inputs. The first output comes from some constant inside the function, a constant and some computation. Then you take the first, second, and third inputs, do some computation, and output the results at positions 2 and 3. You pass them to exactly the same function, and then to the next function. Now, the first input to the first function is some failing computation. It might be a placeholder that you didn't feed, or a strided slice with invalid bounds, or a failed read from the file system. But if you trace this fetch — we are trying to fetch the second value — if you trace it across the function boundaries, this one, this one, we end up at this constant, and it's totally valid. If we try to fetch the last output of the last function and trace it, we get to the second input of the first function, and it's totally valid.
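For contrast, a compat.v1 sketch of the "same graph without tf.function" case described above, where executor pruning makes the invalid slice harmless as long as it is never fetched (again, tf.slice stands in for the strided slice, and the names are illustrative):

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    a = tf.Variable(tf.range(10), name="a")
    b = tf.Variable(tf.range(10), name="b")
    slice_0 = tf.slice(a, [20], [10])  # invalid, but never fetched
    slice_1 = tf.slice(b, [0], [5])

with tf.compat.v1.Session(graph=g) as sess:
    sess.run([a.initializer, b.initializer])
    # Only slice_1 is fetched, so the executor prunes slice_0 (and the read
    # of `a` feeding it) before running anything; no error is raised.
    print(sess.run(slice_1))
```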
So as long as we don't try to fetch this computation or this computation, we don't really need the first input. And a lot of code in TensorFlow Probability relies on this property: that the function is not really a function call, it's just a Python scoping construct that lets you organize your graph construction. And when people move to TensorFlow eager and annotate all their functions with tf.function, it all breaks, because functions at the runtime level are strict. But that's not how functions were designed, or how people wanted functions to work.

If you open the proto file — and this is from the first commit that added functions to TensorFlow — there are some very important comments. A function call may start execution as soon as some of its inputs are ready. If you want to enforce that all of your inputs are ready, you must use a tuple, because the runtime is allowed to start executing as soon as some inputs are available; it doesn't have to wait for all of them. And the consumer of a function may start executing early, as soon as the function value it needs is ready. So if you want to ensure that your function has strict semantics, you must use tuples; otherwise the runtime is allowed to do whatever it wants and start executing as early as it wants. It just happened that the implementation was strict, and people didn't notice. And there is a lot of code that relies on strict semantics.

But we don't really want strict functions; we'd really love to have lazy functions, where nothing is evaluated until it's really necessary. In TensorFlow, "really necessary" means that we want to execute a primitive TensorFlow kernel. A primitive TensorFlow kernel is any op: add, multiply, convolution, and so on. And composite kernels right now are only functions, I think. There are some proposals to add more composite kernel support, but I don't know how that will look. Currently we want functions to be lazy, and ideally we would love to make Switch lazy as well. If you have a switch, your conditional execution, right now you have to execute all the inputs to the switch, even if you don't use them later. But that's very hard to do in the current executor.

Also, non-strict semantics is very important for performance. Imagine that in TF 2.0 we have a ResNet as a function, and we have a huge model that does 10 ResNet inferences, and each ResNet has 256 input parameters. If these parameters live on remote parameter servers, you can't start executing your function before you fetch all 256 tensors to your local machine. You'll need a lot of memory to keep them all at the same time on the host or the device, and it will take a lot of time to get them all over the network. Even in TensorFlow 2.0, where functions do not get tensor values as inputs — they just get resources and do ReadVariableOp inside — we still have kind of the same problem, because we can't start executing the function before the previous updates to those resources have finished. So before you can start running your ResNet-50 function, even with resources as inputs, you have to wait for all your control dependencies, for all the previous updates to all the previous variables. Even if you don't have a lot of network traffic, you still wait on the parameter servers computing those updates, or maybe something else.
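One way to see the "resources as inputs" behavior mentioned above is to trace a tf.function that captures a variable and inspect its concrete function. A small sketch, assuming the public ConcreteFunction inputs and graph attributes:

```python
import tensorflow as tf

v = tf.Variable(3.0)

@tf.function
def scale(x):
    return x * v  # v is captured as a DT_RESOURCE input, read inside the body

cf = scale.get_concrete_function(tf.TensorSpec([], tf.float32))

# The captured variable shows up as a resource-typed input of the function...
print([t.dtype for t in cf.inputs])      # e.g. [tf.float32, tf.resource]

# ...and the body reads it with a ReadVariableOp.
print([op.name for op in cf.graph.get_operations()
       if op.type == "ReadVariableOp"])
```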
So we really want lazy functions: run them as soon as something is ready, start fetching parameters and running the first layer, and fetch the parameters for the second layer only when we've completed the first layer. We didn't get lazy functions originally because functions were added to the TensorFlow runtime much later than the original executor and GraphDef were created, and the TensorFlow executor is super strict. People often think that TensorFlow is lazy: you fetch a value, and the runtime pulls back only whatever it needs. That's not actually how it works. The TensorFlow runtime is strict and greedy, and it pushes values forward; the perceived laziness is just a product of pruning. First, before execution, we remove all the nodes that are not needed to compute the outputs. Then we look at the nodes that are ready — the ones with no pending inputs — run them, and notify their consumers that they are ready to run. This is a fundamental property of how the TensorFlow executor works, and it's almost impossible to change. You can't really touch it: it's super complex, it's super performance critical, and adding laziness to it is just not feasible. So that's why we ended up with strict functions.

But even in design documents from 2015, people discussed this problem. People wanted lazy functions, or lazy semantics, because even then they thought that if you have some layer represented as a function, eventually you won't want to wait for all of its inputs; it's too critical for performance. But that was too hard to implement at the time, and we ended up with strict functions. And no one really used functions in V1 — it was a very rare thing — so it was not a problem. But in TF 2.0, with the tf.function annotation, we wrap all the functions into FunctionDefs, and now we have hundreds of thousands of functions at runtime, and we have all these problems with strict versus lazy semantics. And the exact semantics is not really defined, so different people have different ideas about what the semantics should be and what the right semantics is. Right now we're kind of in between: as lazy as we can be, but sometimes still strict. So it's a little bit fuzzy.

To get back the lazy semantics, one way is to fix the execution model at runtime, which is impossible without rewriting it from scratch. We might do that later, but it's not something we can do now. So the easiest way is just to inline all the functions into the graph. Instead of a GraphDef with a PartitionedCall to the function, you get one single executable graph, and the TensorFlow runtime can start executing nodes as they become ready; we don't have the function boundary anymore. But you still have to do it very carefully, because if you just splice the function body into the graph, you might get completely different semantics. Some people rely on the control dependencies attached to the function call, or on the strictness, and on side-effect visibility. And TensorFlow 2.0, when it adds automatic control dependency tracking, has some assumptions about the program execution order. If function inlining violated those, it would be a very, very sad situation.
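For anyone who wants to poke at this, Grappler's function optimizer, the pass that does this inlining, can be toggled through the experimental optimizer options. A sketch, assuming the tf.config.optimizer API:

```python
import tensorflow as tf

# Disable Grappler's function optimizer (the pass that inlines function calls)
# when debugging, then restore the default.
tf.config.optimizer.set_experimental_options({"function_optimization": False})
print(tf.config.optimizer.get_experimental_options())

tf.config.optimizer.set_experimental_options({"function_optimization": True})
```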
So TensorFlow 2.0's Python front end has function semantics and graph construction semantics that help a lot in getting this program-order behavior. All the mutable state is represented as resource tensors. A resource is just a handle to a device and some memory location; you can imagine it is a pointer to a buffer on GPU or CPU. And we pass resources as inputs to functions. If a function has an input for resource a, the call will have an incoming control edge from the last op or the last function in the graph that touched the same resource. So if you assign to a variable and then pass the resource for the same variable to a function, you will have a control dependency from that assign. And if anything else touches the same resource after the function call — any other op — TF 2.0 will add an outgoing control edge to that next op. So if you pass a variable to a function and then, outside the function, you have a ReadVariableOp for it, TF 2.0 adds a control dependency from the function call to the read, and when you do the read, you observe all the changes that were made to that variable inside the function body.

The most important thing tf.function does, I guess, is automatic control dependency tracking. When you write your Python function, you have some idea of what the program-order execution semantics should be: you add 1 to a, then you add 1 to b, and you think the add to b should happen after the add to a. If you add 1 to a and then read a, you expect to see the new value of the variable a. That was not the case in TF1. In TF2, when you add the tf.function annotation, it adds all the necessary control dependencies: between all the ops that have the same resource as input, and also between all the stateful ops inside the function body. So all your stateful operations are always executed in the same order, which gives you some sanity. If you have multiple prints inside a function body, you should observe the prints in the same order every time. In TF1 you could observe those prints in any order, and that's confusing.

And all the stateful ops with side effects — ops that can have side effects — are added to the control outputs. Function runtime and function inlining must respect that; we don't want to lose any of the side effects or updates to the state. So these are the rules: all the side effects to resources that happened before the function call must be visible to all the nodes inside the function body, and all the side effects to resources that happened inside the function body must be visible to every op using the same resource after the function completes. Currently this is implemented with control dependencies: you have a control dependency from the op that made the side effect to the function call, and the function call has a control dependency to the next op that might need that side effect.

To enforce these semantics, function inlining has special rules. It adds a special node, an input control node. If your function call has any control dependencies, they are all forwarded to that input control node, and all the function inputs get control dependencies from that node. It basically means that nothing inside the function body can start executing before the call's control dependencies are satisfied, before all of those nodes have executed.
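The automatically added control dependencies are easy to see if you dump a traced function's GraphDef and look for the caret-prefixed control inputs discussed earlier. A small sketch:

```python
import tensorflow as tf

a = tf.Variable(0)

@tf.function
def bump_and_read():
    a.assign_add(1)        # stateful op touching resource `a`
    a.assign_add(1)        # ordered after the first update
    return a.read_value()  # ordered after both updates

gdef = bump_and_read.get_concrete_function().graph.as_graph_def()
for node in gdef.node:
    control_inputs = [i for i in node.input if i.startswith("^")]
    if control_inputs:
        # e.g. the second AssignAddVariableOp and the ReadVariableOp carry
        # "^"-prefixed control inputs on the earlier updates.
        print(node.op, node.name, control_inputs)
```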
It also means that if your function call doesn't have any control dependencies, the runtime is free to start running the function body as soon as anything is ready. Inlining also adds an output control node. This node has control edges from all the side effects, from all the control outputs of the function. If, for some reason, you have a side effect inside a function and that side effect is not connected to one of the control outputs, then once the function is inlined, TensorFlow is free to prune the side effect, and you might end up with partially observed state updates. That should not happen in practice in TensorFlow 2.0, because when you construct a graph from Python, all the side effects should be appropriately connected to control outputs. But that's not the case for all models. I think some libraries don't use TensorFlow 2.0 automatic control dependency tracking in some cases, or weren't using it some time ago, so that might be violated in some models, but I hope that doesn't happen right now.

There's also an assumption that if the function call does not have an outgoing control edge, then no one cares about what happens inside the function, about its side effects. If you have an outgoing data edge, someone needs the data output. But if the function call doesn't have an outgoing control edge, the function might do whatever it wants inside its body — update any variables or counters, print anything, send data over the network — and no one cares what it does. That might lead to some trouble. But I think in TF 2.0 that also never happens in practice, because automatic control dependency tracking for nested function calls will add the required control dependencies, and when you execute the top-level function, it should also add all the control dependencies. Again, it might happen with models that do not use automatic control dependency tracking.

So this is how the function looks after inlining. This is the function from the previous example: it takes three resources as inputs, reads variables a and b, increments the counter, takes the strided slices, one of which is invalid, returns both slices, and its control output is the increment. We had the function call, and this is the GraphDef after inlining. We no longer have a FunctionDef, and we no longer have a function call node; we have just one graph with multiple nodes. We have the incoming input control node, which is a NoOp. The function call node didn't have any control dependencies, so the NoOp has no control dependencies either. We read the variables from the inputs, and those reads depend on the input control node, so any reads of the input variables happen after the input control node has executed. Then we have the increment of the counter. Then we do the two strided slices. Then we have two identity nodes for the function returns: slice_0 for the first return value and slice_1 for the second. And we have the output control node, which has control dependencies from the side effects inside the function; this function has only one side effect, the AssignAdd counter increment. ReadVariableOp is marked stateful, but it is not really a side effect, because ReadVariableOp can't modify anything; it only observes the state. So here we could, in principle, also have connected the ReadVariableOps to the output control node.
But in practice there are many stateful ops that just read state, and ReadVariableOp is only one of them. And the slice: previously we had the function call node, get_slice, and slice was an identity node that read its second output. Now we don't have the function call anymore, and slice is just an identity node that reads directly from the inlined function's return. And we read the variable for the counter, and that read automatically gets a control dependency on the output control node: every time we read the counter, we must observe all the side effects that happened to that counter inside the function body.

And now we can do pruning again. We don't use return 0, so we can prune it. We don't use slice_0 — the invalid 20-to-30 slice — so we can prune it. And we don't need the value of that variable, so we can prune its ReadVariableOp. So we are back to a graph that is valid and can be executed at runtime without exceptions.

There are a few more problems. When you inline a function body and the body does not have any device annotations, you have to decide on what device to place each node. Before TF 2.0, we always had single-device functions: the function call node is placed on a CPU, and the nodes inside the function body are executed on that CPU. In TF 2.0 that's too limited, because if you have a ResNet in your function and the function call is on a CPU, you don't want to run the ResNet on the CPU; you might want to use multiple CPUs and GPUs, or have a function span multiple devices. So there are multiple strategies for how to place the nodes of an inlined function. You can do nothing at all and rely on the placer. You can force the function to be single-device, which we do sometimes for V1 graphs, primarily for compatibility mode. Or, for multi-device functions in TF 2.0, we require that all the nodes inside the function body be placed on the same job, replica, and task, but the device may be left empty, and then we rely on the placer to put them on the correct device. Because imagine you have a function call in a distributed runtime, and the function call happens to be on another machine. When that function call executes on that machine at runtime, it can choose only from the devices available on that machine. If you inline it without any device annotation, the placer could pick very different device placements. So if the user placed a function call — say a ResNet function call — on machine 1, and the ResNet function body doesn't have any device annotations, and we inline it and then run the placer, all the ResNet nodes might be placed on a completely different machine and different GPUs, which would break the user's assumptions. So when we inline functions currently, we overwrite the job, task, and replica, and leave the device untouched, so the placer will pick the right device for each node even after inlining.

We also have a bunch of functions created internally for V1 that do not use control outputs and don't add control dependencies inside the function body at all. After inlining such functions, you might end up with completely different execution semantics and lots of problems. Another fun part of the current runtime is that the function runtime does not prune any stateful ops, which is very different from the execution semantics of the graph: if you have stateful ops — variable updates — inside your graph and you don't have control dependencies on them, the runtime will prune them.
But if you have exactly the same graph inside a function, the runtime will execute all the stateful ops. And this mismatch is very difficult to handle when you inline your function graph into the outer GraphDef, because you have different notions of what should be pruned and when. TF 2.0 always inlines all the functions, because TF 2.0 is guaranteed to have correct control dependencies and function control outputs. But we don't inline functions by default in legacy graphs; Grappler is responsible for inlining functions there, and Grappler does a lot of additional checks — that all the stateful ops have a path to one of the outputs, that there are no side-effecting ops inside the function body that are not connected to anything, because those would be pruned — and a bunch of others. In TF V1 you can get functions with mismatched deadness, which should not be possible, but it happens. So Grappler is very conservative: it inlines only if it can prove that inlining is safe and does not change the semantics. This function inlining is a huge source of bugs. Sometimes people think function inlining is the problem, and often it is the source of a problem, because it is very complicated and the semantics were never defined properly for functions. So I mostly had to come up with semantics that make all the tests pass, plus a bunch of workarounds for different tests. I hope we might be able to get better semantics at some point, but right now it's super messy.

We also have other functional ops. For example, we have a functional If: it takes a predicate, and it has attributes with two functions, a then-function and an else-function. Another op is the functional While. In V1 we have Switch, Merge, NextIteration, Enter, and Exit nodes to represent while loops; in V2 we have a functional While, where you define one function for the body and one for the condition and then run it at runtime as a special op. We also have Case, which is something like an If with multiple branches. Currently, we lower all these functional ops to V1 control flow. Basically, the If node becomes a Switch node and two function call nodes for the then and else functions, and those functions are then inlined, just like any other function call. We do that primarily for the lazy semantics: if you run your While or If as an op with its own subgraph, it has strict semantics, and some models just expect lazy semantics from their ifs. It is also very limiting for concurrency. With V1 control flow while loops, for example, you can run multiple loop iterations at a time; you can pipeline while loop iterations. If you try to do that with a functional While, it's just impossible: you have to wait for every iteration to completely finish before you can start the next one. A lot of people want to move to functional ops, to functional control flow, but in practice it's very difficult, primarily because of performance. It would make a lot of analyses easier, though, because right now we often have to reconstruct what the while loop was from the GraphDef, from all the Switch, NextIteration, Enter, and Exit nodes, and that is very error prone; we have a lot of trouble with it. If we had functional ops at the graph optimization level, that would help, and then at a later stage we could just lower all of them to Switch and Merge to get good performance.

Here is an example of how a functional If looks in a graph. We have a flag, which is a Boolean variable.
We read the flag with a ReadVariableOp, and we have a constant 0. We have two functions: plus_one, which adds 1 to its integer input, and plus_two. So we read the Boolean flag, we have the 0, and the result is If(flag, plus_one, plus_two). If the flag is true, we add 1; if the flag is false, we add 2 to the 0. So the result will be 1 or 2 depending on the flag.

When we lower this functional If to V1 control flow constructs, we have the ReadVariableOp for the flag and the constant 0. We have a Switch node that takes the flag and the 0. The Switch node has two outputs: it forwards the value 0 on output 0 if the flag is true, and on output 1 if the flag is false. The other, untaken output will be dead, which basically prevents execution of one of the branches. The then-function is a function call on the Switch's output 0, and the else-function is another function call on the Switch's output 1. So if your flag is false, the then node will be dead and will not be executed. And then we have the result as a Merge: we merge the results of the then-function and the else-function. One of them is going to be dead and the other one alive, and that will be the final result. So this is after we lower the If node to function calls. Then function inlining kicks in, and we get rid of all the function call nodes. We basically have the then-function return, which is just an add of the Switch output and 1, and the else-function, which should be plus 2, and we merge the return values of the functions. So that's how we get rid of all the functions.

So we have functions as a mental model for the end user, for how you think about your TensorFlow graph and how you build your program. You no longer think in terms of graphs and sessions; you think in terms of functions. But when we get these functions at runtime, we still inline them into the graph, because that's what we have to do for performance and sometimes for correctness. There's kind of a promise of the tf.function annotation in TensorFlow eager mode: if you want graph semantics back, just annotate your function with tf.function and you'll get back a graph. But that's not completely true, because if you have nested function calls annotated with tf.function, you would have multiple FunctionDefs, multiple function call nodes, and you'd get strict semantics, and that is not the semantics of V1. So the only way to provide users what was promised is to inline all the functions, and then we get back to a single graph with the lazy semantics, the pruning, and all the other nice properties of that dataflow graph.
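For reference, the flag example walked through above corresponds roughly to the following TF2 code, where tf.cond produces the functional If node that then gets lowered and inlined as described. A sketch, with the branch functions named for clarity:

```python
import tensorflow as tf

flag = tf.Variable(True)

@tf.function
def result():
    x = tf.constant(0)
    plus_one = lambda: x + 1   # "then" function
    plus_two = lambda: x + 2   # "else" function
    # tf.cond builds a functional If node with the two branch functions;
    # the lowering pass turns it into Switch/Merge plus inlined branches.
    return tf.cond(flag, plus_one, plus_two)

print(result())   # 1 while the flag is True
flag.assign(False)
print(result())   # 2
```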