Inside TensorFlow: Graph rewriting (Macros, not functions)


  • EUGENE ZHULENEV: My name is Eugene.

  • I'm working on Grappler and performance.

  • But today, I'm going to talk mostly

  • about how we do graph rewriting for functions.

  • So functions are a very important part of TF 2.0.

  • It's very important for how the end user thinks about the flow of computation.

  • So it's not a graph anymore; it's a composition of functions.

  • But that creates a bunch of problems for us at runtime.

  • And so basically, we have to rewrite the whole graph and all of the functions to make it executable, and to avoid some specific problems that come with functions later.

  • So this, on the left, is the Python code.

  • This is the kind of V1 Python that people usually write.

  • So you define your variables and some constants.

  • You do the assignment, and then you want to return the updated variables.

  • So if you don't add explicit control dependency,

  • you're going to read the initial values.

  • So you have to think about what your dependencies are, and make everything very explicit.
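
For reference, a minimal sketch of this V1 pattern (the variable names, initial values, and increments are assumptions, since the slide isn't reproduced in the transcript):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

a = tf.Variable(1, use_resource=True)
c = tf.Variable(2, use_resource=True)

update_a = tf.assign_add(a, 1)
update_c = tf.assign_add(c, 1)

# Without this explicit control dependency, the reads below could
# observe the initial values instead of the updated ones.
with tf.control_dependencies([update_a, update_c]):
    read_a = a.read_value()
    read_c = c.read_value()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([read_a, read_c]))  # [2, 3]
```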

  • And on the right is a textual GraphDef representation.

  • There's not enough space to show the full GraphDef in the [INAUDIBLE].

  • So this is roughly how it would look if you print the debug string in TensorFlow.

  • If you convert the GraphDef to its debug string, you get a short representation of the graph, something like on the right.

  • So variables are actually VarHandleOps.

  • This is a single kernel that returns the DT_RESOURCE data type.

  • Constants are just simple Const nodes with their values.

  • And then when you [INAUDIBLE] add, it is an AssignAdd TensorFlow kernel that takes the resource handle from the VarHandleOp and the constant, and adds the value to the variable.

  • And when you do the read, it is another TensorFlow kernel, ReadVariableOp.

  • So you get the TensorFlow [INAUDIBLE] resource handles for a and c.

  • And because you have explicit control dependencies, these reads depend on update a and update c, both of them.

  • And you fetch with a and c.

  • But explicitly adding control dependencies is very annoying.

  • So in TensorFlow 2.0, the same graph

  • will look something like this.

  • So you have your variables defined

  • outside of a function scope.

  • And you have a tf.function annotation on add_and_get.

  • So we want to add two different constants to two variables, a and c, and get back their latest values.

  • So the Python looks something like on the left.

  • You don't have to explicitly add control dependencies, which makes life easier.

  • And you don't have to think about program order.

  • So tf.function traces the function and adds all the necessary control dependencies.

  • So when you execute the graph, you will get the program order you would expect from [INAUDIBLE].
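
A hedged sketch of the TF 2.0 version being described (assuming the traced function is the add_and_get from the slide; exact names and values are assumptions):

```python
import tensorflow as tf

a = tf.Variable(1)
c = tf.Variable(2)

@tf.function
def add_and_get(x, y):
    # No explicit control dependencies: tf.function's automatic control
    # dependency tracking adds them while tracing, because these ops
    # touch the same resources in program order.
    a.assign_add(x)
    c.assign_add(y)
    return a.read_value(), c.read_value()

print(add_and_get(1, 1))  # the reads observe the updates, in program order
```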

  • So the FunctionDef -- the graph representation -- looks something like on the right.

  • So we have a GraphDef, and the GraphDef is basically a sequence of NodeDefs.

  • And we also have a FunctionDef.

  • It's a way to group together some subgraph with input

  • arguments and output arguments.

  • So this function on the left, from the Python, will be transformed into a FunctionDef that is represented something like this.

  • So we have two input parameters of resource type, a and b.

  • And the return type of the function is a pair of ints.

  • So the update a -- I'm sorry -- the AssignAdd: we are assigning the new value to the [INAUDIBLE] resource.

  • And then we do the [INAUDIBLE].

  • And we have the control dependencies

  • from the previous operation, touching the same resource.

  • And this is added automatically by TensorFlow 2.0

  • automatic control dependencies.

  • So we don't have to do anything.

  • And then we have the return values.

  • So functions at the graph level, in FunctionDef and GraphDef in TensorFlow, have two types of returns.

  • You can return data values -- so here you return two tensors, read a and read c.

  • But also, it has a special kind of return notation, control

  • return.

  • So it doesn't return any data.

  • But you might group some of the ops inside your FunctionDef.

  • And you can specify that these ops have a name.

  • And they must always run.

  • And when you have a control dependency on a function call, the runtime guarantees that every op that is part of the control [INAUDIBLE] will run.

  • And the main graph will look something like this.

  • So we have a PartitionedCall.

  • It is the TF 2.0 function call mechanism that does partitioning, multi-device function invocation, graph optimization, and all the other [INAUDIBLE] that are required by the runtime.

  • And the [INAUDIBLE] operations are just identities that read the first and the second output of the function call.

  • So -- any questions?

  • Is the notation on the right side clear?

  • SPEAKER 1: So in this example, in particular,

  • the read operations already depend on the updates.

  • So in this case, the control dependencies, I guess,

  • are not that required.

  • But in other case, if we don't return the read values then

  • those--

  • EUGENE ZHULENEV: No.

  • The read depends only on the input, a and c.

  • SPEAKER 1: Yeah but it has a control

  • dependency on the update.

  • EUGENE ZHULENEV: Yeah, because this control dependency is added automatically by TF 2.0's tf.function when you trace the function.

  • When you have multiple TF ops touching the same resource, automatic [INAUDIBLE] will add control dependencies.

  • SPEAKER 1: I'm saying in terms of the return values,

  • like in this case, if you have the read a and read

  • c, they automatically capture the--

  • so if you tried to fetch read a and read c,

  • you make sure that updates are run.

  • So in this case, the control dependencies

  • are not that useful, I guess, or are they--

  • SPEAKER 2: [INAUDIBLE]

  • SPEAKER 1: Yeah, yeah, the control returns.

  • EUGENE ZHULENEV: Oh, yeah.

  • Well, yes.

  • SPEAKER 1: Just pointing out that it

  • would be more useful if you don't return the right value.

  • Then in this case, the control returns would be very useful.

  • EUGENE ZHULENEV: Yeah.

  • Yeah, so if you have some kind of update -- for example, if you have a function that doesn't have returns, that you use only for its side effects: update some counters, gather some counter statistics.

  • In this case, you don't have regular data outputs.

  • So this is why you have control returns.

  • But right now, TF 2.0 will add these control returns in both cases.

  • This is needed to separate the data path and the control path, because when you inline that function, it might turn out that no one is using output 1.

  • So you might end up not executing your state updates inside the function.

  • But it is guaranteed that everything on the control path, in the control return, will be executed, under some conditions that I'll mention later.
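
A small sketch of such a side-effect-only function (the counter name is an assumption); the point is that its only op is stateful and ends up in the FunctionDef's control outputs rather than a data output:

```python
import tensorflow as tf

counter = tf.Variable(0)

@tf.function
def bump_counter():
    # No data outputs at all; the AssignAdd matters only as a side effect.
    # When traced, the stateful op is wired into the FunctionDef's control
    # outputs (control_ret), so inlining/pruning cannot drop it.
    counter.assign_add(1)

bump_counter()
print(counter.numpy())  # 1
```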

  • SPEAKER 3: Question.

  • What's that small [INAUDIBLE] in front of update [INAUDIBLE] a?

  • You see read a equals ReadVariableOp of a -- what's that small sign in front?

  • EUGENE ZHULENEV: Control dependency, you mean?

  • This one?

  • SPEAKER 3: Yeah, this one.

  • Why it has this special--

  • EUGENE ZHULENEV: Yeah, so the data dependency goes by name.

  • So a is regular data dependency, and this one

  • is a control dependency.

  • So we don't read the actual output.

  • It just means that update a must execute

  • before we can execute read a.

  • SPEAKER 3: Oh, so it's just a special sign [INAUDIBLE]..

  • EUGENE ZHULENEV: Yeah.

  • Yeah.

  • SPEAKER 4: That's the internal representation of control dependencies in the GraphDef proto.

  • SPEAKER 3: I see.

  • EUGENE ZHULENEV: So yeah.

  • If your op has two inputs, you will have a, b as inputs.

  • And then you can add as many control dependencies as you want.

  • At graph execution time, the runtime will make sure that everything in the control dependencies is executed before your kernel, using this special notation.

  • Here's another example.

  • So we have variables a and b, which are vectors of length 10.

  • And variable c is just a counter, of integer type.

  • And we want to get a strided slice from variables a and b.

  • And then we also want to increment the counter: every time we take a slice, we want to know how many slices we took, for whatever reason.

  • So this graph doesn't make much sense.

  • But anyway, and we're only interested in slice 1.
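
A rough reconstruction of this example (shapes, bounds, and names are assumptions; the 20-to-30 slice stands in for the invalid one):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

a = tf.Variable(tf.zeros([10]), use_resource=True)
b = tf.Variable(tf.ones([10]), use_resource=True)
c = tf.Variable(0, use_resource=True)  # slice counter

update_c = tf.assign_add(c, 1)
with tf.control_dependencies([update_c]):
    slice0 = tf.strided_slice(a.read_value(), [20], [30])  # out-of-range bounds
    slice1 = tf.strided_slice(b.read_value(), [0], [5])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Only slice1 is fetched, so slice0 (and the read of a) is pruned
    # away before execution and its invalid bounds never run.
    print(sess.run(slice1))
```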

  • And the graph will look something like this.

  • We have three variables -- VarHandleOps -- a constant, an AssignAdd for the counter, and then we read a variable, this one.

  • So we read variables from--

  • this should be b.

  • So we read variables from their resource, from the handle,

  • and then we take a strided slice.

  • And then we fetch slice 1.

  • And because we fetch slice 1, we can remove slice 0.

  • It is not needed for slice 1.

  • We can also remove from the graph a, and that's all.

  • But you see that the nice property we get, that slice 0

  • is invalid.

  • If you try to take a slice from 20 to 30 in a variable of size 10, you'll get a failure at runtime.

  • But because you don't really need slice 0,

  • everything is just fine.

  • As long as you don't fetch slice 0, this graph is totally valid.

  • But if you do the same graph, but you put your slices

  • inside a function, so the same kind of thing,

  • you have three variables, a, b, c. c is a counter.

  • You do AssignAdd inside a function.

  • So nice, you don't have to add explicit control dependency.

  • You'll get your count updated automatically.

  • Then you take slice 0 and slice 1 and [INAUDIBLE].

  • And you invoke the function.

  • You're not interested in slice 0; you only want to get the second value.

  • So you just keep that one, and you try to print it.
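
And the tf.function version of the same thing might look roughly like this (again, names and bounds are assumptions). Under strict function semantics, before inlining, the unused invalid slice still has to be computed before the call can return:

```python
import tensorflow as tf

a = tf.Variable(tf.zeros([10]))
b = tf.Variable(tf.ones([10]))
c = tf.Variable(0)  # slice counter

@tf.function
def get_slice():
    c.assign_add(1)                            # tracked automatically
    slice0 = tf.strided_slice(a, [20], [30])   # the invalid slice
    slice1 = tf.strided_slice(b, [0], [5])
    return slice0, slice1

# Strict outputs: even though slice0 is never used, it still has to be
# computed before the call returns, so the invalid slice can fail the
# whole call.
_, wanted = get_slice()
print(wanted)
```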

  • So at the GraphDef and FunctionDef level, it will look something like this.

  • It will have a get_slice function that takes three resources: a, b, and c.

  • It will have an increment for the counter.

  • It will read the variables a and b, take the slices, and return slice 0 and slice 1.

  • And it will have a control output, increment, automatically added by automatic [INAUDIBLE] tracking.

  • So slices don't have any control dependencies

  • because slices, they do not depend on variable c.

  • They depend on different variables.

  • And then you have a function call, a PartitionedCall, to get_slice with a and b.

  • And you don't need the first slice; you take the second one, the function call output 1.

  • And you're trying to fetch it, or print it, or whatever.

  • And the problem right now is that this code will fail, because function semantics in TensorFlow currently is strict with regard to inputs and outputs.

  • So before a function can return, it must finish the execution of all the output parameters.

  • And before a function can start executing, it must finish executing all of its inputs.

  • Yeah, so there's another property.

  • All inputs must be evaluated and must not be dead before your function can start running.

  • And note that this matters if you have inputs coming from switches -- Switch is [INAUDIBLE] used to represent control flow in TensorFlow.

  • So there is a special notion of a dead tensor.

  • Basically, when you run your graph at runtime and you have dead tensors, you don't execute the kernels; you just update a flag and propagate this deadness across the graph.

  • So you can have a large graph with an if node, which is represented as a Switch, and based on the runtime value you execute only half of the graph.

  • So another property of the function is that all outputs must be evaluated and also not dead.

  • So, for example, you have a function with the control flow.

  • And you want to take a gradient of that function.

  • You have to return all intermediates

  • from the function, and some of them

  • might be dead because they might be

  • from intermediate computations from untaken branch.

  • And basically, your function can be dead.

  • But sometimes that [INAUDIBLE] so there

  • is a special way to avoid that.

  • And I think it might not be used right now.

  • But before that, when we generated gradient functions, TensorFlow 2.0 marked all the functions for the forward pass as tolerating dead inputs.

  • And instead of a dead input, the runtime would return an empty tensor and just continue.

  • But the most confusing thing: if you write this code without the tf.function annotation, get_slice is just a regular Python function, with the same inputs a, b, and c, and the increment.

  • But then you have to add an explicit control dependency, because you're not using TF 2.0 automatic control dependency tracking.

  • So we will get the graph on the right.

  • And it is exactly the same graph as you would get if you don't use functions at all, if you just write it line by line, one by one.

  • And it allows the TensorFlow runtime to prune all the nodes that are not used.

  • And your graph is valid again.

  • But the only difference is that you don't have the tf.function annotation.

  • And even if we completely remove the counter -- say I don't want to count anything, as you can see -- the only difference would be the tf.function annotation.

  • And it is a huge difference at runtime: one graph fails, and the other graph is just fine.

  • And this is a real example.

  • So Alex gave a presentation about functions in TensorFlow, and that's how some people use them when running their functions.

  • So you have a function that returns some environment state, does an action, and does an environment [INAUDIBLE].

  • You get a, b, c.

  • And you just run the loop on the action.

  • And in the end, you run your z function.

  • So if you annotate this function with the tf.function annotation, on every loop iteration you will get the state, you will do the action, and you will [INAUDIBLE], which is not fine.

  • Another example, from TensorFlow Probability: you have a tf.function that has three inputs.

  • The first output comes from some constant inside the function -- a constant and some computations.

  • Then you take the first, second, and third inputs, do some computation, and output the results at positions 2 and 3.

  • And you pass that to exactly the same function, and then you pass that to the next function.

  • So the first input to the first function is some failing computation.

  • It might be a placeholder that you didn't feed, or a strided slice with invalid bounds, or a failed read from the file system.

  • But if you trace this fetch -- we are trying to [INAUDIBLE] the second value -- across the function boundaries, this one, this one, we end up at this constant.

  • And it's totally valid.

  • If we try to fetch the last output of the last function and trace it, we get to the second input of the first function.

  • And it's totally valid.

  • So as long as we don't try to fetch this computation or this computation, we don't really need the first input.

  • And a lot of [INAUDIBLE] in TensorFlow Probability relies on this property of the function: that the function is not really a function call.

  • It's just some Python scoping language construct that allows you to separate your graph construction.

  • And when people move to TensorFlow eager and annotate all their functions with [INAUDIBLE] function, it all breaks, because functions at the runtime level are strict.

  • But that's not how it was designed, not how people wanted functions to work.

  • So if you open the graph proto -- and this is from the first commit when functions were added to TensorFlow -- there are very important things there.

  • The [INAUDIBLE] may start execution as soon as some of its inputs are ready.

  • And if you want to enforce that all your inputs are ready, you must use tuples, because the runtime is allowed to start executing when the [INAUDIBLE].

  • It doesn't have to wait for all your inputs.

  • And the consumer of the function may start executing early, as soon as the function value is ready.

  • So if you want to ensure that your function has strict semantics, you must use tuples; otherwise the runtime is allowed to do whatever it wants and start executing as early as it wants.

  • It happened to be that the implementation was strict, and people didn't notice that.

  • And there is a lot of code out there that relies on strict semantics.

  • But we don't really want strict functions; we'd really love to get lazy functions, so nothing is evaluated until it's really necessary.

  • And in the context of TensorFlow, [INAUDIBLE] necessary means that we want to execute a primitive TensorFlow kernel.

  • A primitive TensorFlow kernel is any op: add, multiply, convolution, [INAUDIBLE].

  • And composite kernels right now are only functions, I think.

  • There are some proposals to add more composite kernel support, but I don't know what it will look like.

  • But currently, we [INAUDIBLE] functions to be lazy.

  • And ideally, we would love to make Switch lazy as well.

  • So if you have a Switch -- your conditional execution -- right now you have to execute all the inputs to the Switch, even if you don't use them later.

  • But it's very hard to do in the current executor.

  • And also, non-strict semantics is very important for performance.

  • So imagine that in TF 2.0 we have a ResNet function, and we have a huge model that has 10 ResNet inferences, and each ResNet has 256 input parameters.

  • If these parameters live on remote parameter servers, you can't start executing your function before you fetch all these 256 tensors to your local machine and start training.

  • And then, well, it will take a lot of memory to keep them all at the same time on the host or the device, or whatever.

  • And it will take a lot of time until you get them all over the network.

  • Even if we use TensorFlow 2.0, and the functions do not get tensors as inputs -- they just get resources, and they do a ReadVariableOp -- we still have kind of the same problem, because we can't start executing the function before the previous update to [INAUDIBLE] has finished.

  • So before you can start running your ResNet-50 function, even with resources as inputs, you have to wait for all your control dependencies, for all the previous updates to all the previous variables.

  • So even if you don't get a lot of network traffic, you still wait on the parameter [INAUDIBLE] computing those updates, or maybe something else.

  • So we really want lazy functions, and to run them as soon as something [INAUDIBLE]: start fetching parameters, run the first layer, and fetch the parameters for the second layer only when we have completed the first layer.

  • So we didn't get lazy functions originally, because functions were added to the TensorFlow runtime much later than the original executor and the GraphDef were created.

  • And the TensorFlow executor at runtime is super strict.

  • People often think that TensorFlow is lazy: you fetch variable a, and it goes back and pulls only whatever it needs.

  • Well, that's not actually how it works.

  • The TensorFlow runtime is very strict and greedy, and it is push-based.

  • And this perceived laziness [INAUDIBLE] is just a product of pruning.

  • So first, before execution, we remove all the nodes that we don't need to compute the output.

  • And then we start looking at the nodes that are ready -- they don't have pending inputs -- one by one, and run them, and update all the consumers that become ready to run.

  • And this is a fundamental property of how the executor in TensorFlow works.

  • And it's almost impossible to do anything about that.

  • You can't really touch it: it's super complex, and it's super performance critical.

  • And adding laziness to it is just impossible.

  • So that's why we end up with strict functions.

  • But originally, even in design documents from 2015, people discussed this problem.

  • And people wanted to have lazy functions, or lazy semantics, because people at the time thought that if you have something like an [INAUDIBLE] layer as a function, eventually you won't want to wait for all the inputs -- it is too critical for performance.

  • But that was too hard to implement at that time, and we ended up with strict functions.

  • But no one used functions in V1; it was a very rare thing, and it was not a problem.

  • But in TF 2.0, with the tf.function annotation, we wrap all the functions into FunctionDefs.

  • And now we have hundreds of thousands of functions at runtime, and now we have all these problems with strict versus lazy semantics.

  • And the semantics is not really well defined, so different people think different things about what the semantics should be and what the right semantics is.

  • So right now, we're kind of as lazy as we can be, between strict and lazy, but sometimes it's still strict.

  • So it's a little bit of a mess.

  • So, to get back the lazy semantics: one way is to fix the execution model at runtime, which is impossible without rewriting it from scratch.

  • We might do it later, but that's not something we can do now.

  • So the easiest way to do it is just to inline all the functions into the graph.

  • Instead of a GraphDef with a PartitionedCall to the function, you get one single graph, which is executable.

  • And then the TensorFlow runtime can start executing nodes as they become ready.

  • So we don't have this function boundary anymore.

  • But you still have to do it very carefully, because if you just inline the function body inside the graph, you might get completely different semantics.

  • Some people rely on the control dependencies added to the function, or on the strictness, and on side-effect visibility.

  • And TensorFlow 2.0, when it adds automatic control [INAUDIBLE] tracking, has some assumptions about what the program execution order is.

  • And if function inlining would violate that, it would be a very, very sad situation.

  • So the TensorFlow 2.0 Python front end has some function semantics and graph construction semantics that help a lot to get this program order semantics.

  • So all the mutable state is represented as resource tensors.

  • A resource is just a handle to a device and some memory location.

  • You can imagine it is a pointer on a GPU, or a pointer to a buffer on a CPU.

  • And we pass resources as inputs to functions.

  • If a function has an input from resource a, it will have an incoming control edge from the last op in the graph, or the last function, that touched that resource -- that has the same resource as input.

  • So if you have an assign to a variable, and then you pass the resource for the same variable to a function, you will have a control dependency from that assign.

  • And if anything else touches the same resource after the function call -- any other op -- TF 2.0 will add an outgoing control edge to that next op.

  • So if you pass a variable to a function, and then outside of the function you have a ReadVariableOp, TF 2.0 will add a control dependency from your function call to the [INAUDIBLE].

  • So if you do the read, you will observe all the changes that were made to that variable inside the function body.
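
A small sketch of that ordering guarantee (names are assumptions): the assign before the call, the update inside the call, and the read after the call are chained by automatically added control edges:

```python
import tensorflow as tf

v = tf.Variable(0)

@tf.function
def bump(x):
    v.assign_add(x)
    return v.read_value()

@tf.function
def demo():
    v.assign(10)            # last op touching the resource before the call:
                            # the call gets an incoming control edge from it
    out = bump(1)           # the function body sees the assign's effect
    after = v.read_value()  # the call gets an outgoing control edge to this
                            # read, so it observes the update made inside bump
    return out, after

print(demo())  # (11, 11)
```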

  • So the most important thing, I guess, about the tf.function annotation is that it does automatic control dependency tracking.

  • When you write your Python function, you have some idea of what your program order execution semantics should be: you add 1 to a, then you add 1 to b, and you think the add to b should happen after the add to a.

  • And if you add 1 to a and then read a, you expect to see the new value of the variable a.

  • That was not the case in TF1.

  • In TF2, when you add the tf.function annotation, it will add all the necessary control dependencies.

  • It will add control dependencies between all the ops that have the same resource as input.

  • And it will also add control dependencies between all stateful ops inside the function body.

  • So all your stateful operations will always be executed in the same order, and that gives you some sanity.

  • So if you have multiple prints inside a function body, you should observe those prints in the same order every time.

  • In TF V1, you could observe those prints in any order, and that's confusing.
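
For example, a sketch of the print-ordering guarantee:

```python
import tensorflow as tf

@tf.function
def noisy():
    # tf.function chains stateful ops with control dependencies, so these
    # prints run in program order on every call, unlike a V1 graph where
    # unordered stateful ops could interleave arbitrarily.
    tf.print("first")
    tf.print("second")
    tf.print("third")

noisy()
```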

  • And all the stateful ops with side effects -- ops that can have side effects -- will be added to the control outputs.

  • So the function runtime and function inlining must respect that.

  • We don't want to lose any of the side effects or updates to the state.

  • And these are some rules.

  • All the side effects to resources that happened before the function call must be visible to all the nodes inside the function body.

  • And all the side effects to resources that happened inside the function body must be visible to every op or function using the same resource after the function has completed.

  • Currently, it's implemented so that you have a control dependency from the op that made some side effect to the function call, and the function call has a control dependency to the next op that might need that side effect.

  • So to enforce that semantics, function inlining has special rules.

  • It will add a special node, an input control node.

  • If your function call has any control dependencies, all of them will be forwarded to that input control node, and all the function inputs will have control dependencies from that node.

  • So it basically means that you can't start executing your function before your control dependencies are satisfied -- before all those nodes are executed.

  • It also means that if your function call doesn't have any control dependencies, then the TensorFlow runtime is free to start running your function body as soon as anything is ready.

  • Also, it will add an output control node.

  • This node will have control edges from all the side effects, from all the control outputs of the function.

  • So if, for some reason, you have a side effect inside a function, and that side effect is not connected to one of the control outputs, then when the function is inlined, TensorFlow is free to prune the side effect, and you might end up with some partially observed state updates.

  • But that should not happen in practice in TensorFlow 2.0,

  • because all the side effects, when you construct

  • a graph from Python, all the side effects

  • should be appropriately connected to control outputs.

  • But that's not the case for [INAUDIBLE] models.

  • I think [INAUDIBLE] doesn't use TensorFlow 2.0 automatic control dependency tracking in some cases, or it was not using it some time ago.

  • So that might be violated [INAUDIBLE] in some models, but I hope that doesn't happen right now.

  • And also, there's an assumption that if the function call does not have an outgoing control edge, it means that no one [INAUDIBLE] cares about what's happening inside the function, what the side effects are.

  • So if you have an outgoing data edge, someone needs the data output.

  • But if the function call doesn't have an outgoing control edge, it means the function might do whatever it wants inside the function body -- update any variables, counters, print anything, send any data over the network -- and no one cares about what it does.

  • So that might lead to some trouble.

  • But I think in TF 2.0 that also never happens in practice, because automatic control dependency tracking for nested function calls will add all the required control dependencies.

  • And when you execute the top level function,

  • it should also add all the control dependencies.

  • But, again, it might happen with some models

  • that do not use automatic control dependency tracking.

  • So that's how the function will look after inlining.

  • This is the function from the previous example.

  • The function takes three resources as inputs.

  • It reads variables a and b, increments the counter, takes the strided slices, one of which is invalid, and returns both slices; the control output is the increment.

  • So we have a function call, and this is the GraphDef after inlining.

  • We no longer have a FunctionDef, and we no longer have a function call; we just have one graph with multiple nodes.

  • So we have an incoming input control node -- this is a NoOp.

  • The function call node didn't have any control dependencies, so the NoOp has [INAUDIBLE] control dependencies.

  • We read variables from the inputs, and we depend on the input control node.

  • So any reads of the variables from the inputs will happen after the input control node has executed.

  • Then we have the increment [INAUDIBLE].

  • Then we do two strided slices.

  • And then we have two Identity nodes for the function returns.

  • And we return slice 0 for the first [INAUDIBLE] value and slice 1 for the second.

  • And we have an output control node.

  • The output control node has control dependencies from the side effects inside the function, and this function has only one side effect, the AssignAdd counter increment.

  • So ReadVariableOp is marked stateful [INAUDIBLE], but it is not a side effect, because ReadVariableOp can't modify anything, can't have side effects; it just observes the state.

  • So in theory, we could add the ReadVariableOp to this [INAUDIBLE] control node.

  • But in practice, there are many stateful ops that just read state, and ReadVariableOp is just one of them.

  • And slice -- so previously, we had a function call node, get_slice, and slice was an Identity node that reads its second output.

  • Now we don't have a function call node anymore, and slice is just an Identity node that reads directly from the inlined function return.

  • And we have a -- so we read a variable from the counter.

  • And it automatically has a control dependency to the output control node.

  • So every time we read the counter, we must observe all the side effects that happened to that counter inside the function body.

  • And now we can do pruning again.

  • We don't use return 0, so we can prune it.

  • We don't use slice 0 -- it is the invalid one, 20 to 30 -- so we can prune it.

  • And we don't need the value of that variable, so we can prune its ReadVariableOp.

  • And so again, we are back to a graph that is valid and can be executed at runtime without exceptions.

  • So there are a few more problems.

  • So when you have a function, and you inline the function body, and the function body does not have any device annotations, you have to decide on what device to place each node.

  • Before TF 2.0, we always had single-device functions.

  • So if the function call node is placed on CPU, the nodes inside the function body would be executed on CPU.

  • In TF 2.0, that's too limiting, because you might have ResNet in your function and that function call is on CPU.

  • You don't want to run [INAUDIBLE] on CPU; you might want to use multiple CPUs and GPUs, or have a function that spans multiple devices.

  • So there are multiple strategies for how to place nodes within an inlined function.

  • You can do nothing at all and rely on the placer.

  • You can force the function to be single-device -- we do that sometimes for V1 graphs, primarily for compatibility mode.

  • Or, [INAUDIBLE] for multi-device functions in TF 2.0, all the nodes inside the function body must be placed on the same job, replica, and task, but the device might be empty, and then we rely on the placer to place them on the correct device.

  • Because imagine that you have a function call and [INAUDIBLE] runtime, and your function call happens to be on another machine.

  • Then, when you execute that function call on that machine at runtime, it will be able to choose only from the devices available on that [INAUDIBLE] machine.

  • So if you [INAUDIBLE] and don't add any device annotation, the placer [INAUDIBLE] device placements.

  • So if the user placed a function call, like a ResNet function call, on machine 1, and the ResNet function body doesn't have any device annotations, and we inline it and then run the placer, all the ResNet nodes may be placed on a completely different machine and GPUs, and that will break the user's assumptions.

  • So when we [INAUDIBLE] functions currently, we overwrite job, task, and replica, and leave the device untouched.

  • So the placer will pick the right device for the node even after inlining.

  • We also have a bunch of functions created internally for V1 that do not use control outputs and don't add control dependencies inside the function body at all.

  • So after inlining such functions, you might end up with completely different execution semantics and lots of [INAUDIBLE].

  • Another fun part of the current runtime is that the current function runtime does not prune any stateful ops.

  • And that is very different from the execution semantics of the graph, because if you have stateful ops -- variable updates -- inside your graph, and you don't have control dependencies, the runtime will prune them.

  • But if you have exactly the same graph inside a function, the runtime will execute all the stateful ops.

  • And this mismatch is very difficult to handle when you inline your function graph into the outer GraphDef, because you have different notions of what should be pruned and when.

  • So TF2.0 always inlines all the functions because TF2.0 is

  • guaranteed to have correct control dependencies

  • and function control outputs.

  • But we don't inline functions by default in legacy graphs.

  • Grappler is responsible for inlining functions in legacy graphs.

  • But Grappler does a lot of additional checks: that all the stateful ops have a path to one of the outputs, and that we don't have any side-effecting ops inside the function body that are not connected to anything, because they would be pruned.

  • And there are a bunch of other checks also.

  • In TF V1, you can get functions with mismatched deadness, which should not be possible, but it happens.

  • So Grappler is very conservative: it inlines only if it can prove that it is safe and that it does not change the semantics.

  • And this function inlining is a huge source of bugs.

  • Sometimes people think that function inlining is the problem, and often it is the source of a problem, because it is very complicated and the semantics was never defined properly for functions.

  • So I mostly had to come up with the semantics to make all the tests pass, plus a bunch of [INAUDIBLE] workarounds for different tests.

  • I hope that we might be able to get

  • better semantics at some point.

  • But right now, it's super messy.

  • Also, we have a bunch of other functional ops.

  • For example, we have a functional If: it basically takes a predicate, and it has attributes with two functions, a then function and an else function.

  • Another op is the functional While.

  • In V1 we have Switch, NextIteration, Enter, and Exit nodes to represent while loops.

  • In V2, we have a functional While: you define a function for the body and one for the condition, and then you just run it at runtime as a special op.

  • Also, you have a case, something like if with multiple branches.
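
As a sketch of what such a functional loop looks like at the Python level (the particular computation is made up for illustration), tf.while_loop inside tf.function is emitted as a single While op with cond and body functions:

```python
import tensorflow as tf

@tf.function
def sum_below(n):
    i, total = tf.constant(0), tf.constant(0)
    # Inside tf.function this is captured as one functional While op whose
    # cond and body attributes point at FunctionDefs, instead of the V1
    # Enter/Merge/Switch/NextIteration/Exit construction.
    i, total = tf.while_loop(
        cond=lambda i, total: i < n,
        body=lambda i, total: (i + 1, total + i),
        loop_vars=(i, total))
    return total

print(sum_below(tf.constant(5)))  # 0 + 1 + 2 + 3 + 4 = 10
```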

  • Currently, we lower all these functional ops to V1 control flow.

  • So basically, the If node becomes a Switch node and two function call nodes, for the then and else functions.

  • And those functions are then inlined, just like any other function call.

  • We do that primarily for [INAUDIBLE] semantics.

  • If you run your While or If as the op inside a graph, it will have strict semantics, and some models just expect lazy semantics from your if.

  • It's also very limiting for concurrency.

  • With V1 control flow while loops, for example, you can run multiple [INAUDIBLE] iterations at a time; you can [INAUDIBLE] while loop iterations.

  • If you try to do that with a functional While, it's just impossible: you have to wait for every iteration to completely finish before you can start the next iteration.

  • So a lot of people want to move to functional ops, functional control flow.

  • But in practice, it's very difficult, primarily because of the performance.

  • Still, it would make a lot of analysis easier, because we often have to reconstruct what the while loop was from the GraphDef, from all the Switch, NextIteration, Enter, and Exit nodes.

  • And that is very error prone, and we have a lot of trouble with it.

  • So if we had functional ops at the graph optimization level, that would help.

  • And then, at a later stage, we can just lower all of them to Switch [INAUDIBLE] to get the good performance.

  • So here is an example of how a functional If looks in a graph.

  • We have a flag -- that's a Boolean variable -- and we read the flag with a ReadVariableOp.

  • We have some constant 0.

  • And we have two functions: plus 1, which adds 1 to its integer input [INAUDIBLE], and plus 2.

  • So we read the flag, a Boolean flag, then we have the 0, and the result is If(flag) with plus 1 and plus 2.

  • If the flag is true, we add 1; if the flag is false, we add 2 to the 0.

  • So the result will be 1 or 2, depending on the flag.
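
A sketch of that example in TF 2.0 Python (the variable and branch names are assumptions):

```python
import tensorflow as tf

flag = tf.Variable(True)

@tf.function
def branch():
    x = tf.constant(0)
    # Inside tf.function this becomes a functional If/StatelessIf op whose
    # then/else branches are FunctionDefs; the lowering pass later rewrites
    # it into Switch plus the (inlined) branch bodies plus Merge.
    return tf.cond(flag, lambda: x + 1, lambda: x + 2)

print(branch())  # 1 while the flag is True, 2 otherwise
```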

  • So when we lower this functional If to V1 control flow constructs, we will have the ReadVariableOp for the flag, and we will have the 0 for the constant.

  • So we have a Switch node based on the flag and the 0.

  • The Switch node has two outputs: it will output the value 0 on output 0 if the flag is true, and it will output the value 0 on output 1 if the flag is false.

  • The other, unused output will be dead.

  • So it basically prevents execution of one of these branches.

  • So the then function is a function call taking the Switch output 0, and the else function is another function call taking the Switch output 1.

  • So if your flag is false, this node will be dead and will not be executed.

  • And then we have the result as a Merge: we merge the results of the then function and the else function.

  • One of them is going to be dead, and the other one will be alive, and that will be the final result.

  • So this is after we lower the If node to function calls.

  • And then function inlining kicks in, and we get rid of all the function call nodes.

  • We basically have the then function return -- just an add of the Switch output and 1 -- and the else function; this one should be plus 2.

  • And we merge the return values of the functions.

  • Yeah.

  • So that's how we get rid of all the functions.

  • So we have functions as a mental model for the end user: how you think about your TensorFlow graph, how you build your program.

  • You no longer think in terms of graphs and [INAUDIBLE]; you think in terms of functions.

  • But when we get these functions at runtime, we still lower them to [INAUDIBLE], because that's what we have to do for performance and sometimes for correctness.

  • But there's kind of a promise of TF, that with the tf.function annotation in TensorFlow [INAUDIBLE] mode, if you want [INAUDIBLE] semantics back, you just annotate your function with tf.function and you get back a graph.

  • But that's not completely true, because if you have [INAUDIBLE] function calls annotated with tf.function, you would have multiple FunctionDefs, multiple function call nodes, and you'll get strict semantics.

  • And this is not [INAUDIBLE] the semantics of V1.

  • So the only way to provide users what was promised is to inline all the functions.

  • And then we get back to a single graph with the lazy semantics, with the pruning, and all the nice properties of that dataflow graph.

  • [MUSIC PLAYING]
