Inside TensorFlow: Graph rewriting (Macros, not functions)


  • EUGENE ZHULENEV: My name is Eugene.

  • I'm working on Grappler and performance.

  • But today, I'm going to talk mostly

  • about how we do graph rewriting for functions.

  • So functions are a very important part of TF 2.0.

  • It's very important for how the end user thinks about the flow of computation.

  • So it's not a graph anymore; it's a composition of functions.

  • But that creates a bunch of problems for us at runtime.

  • And so basically, we have to rewrite the whole graph and all of the functions to make it executable, and to avoid some specific problems that come with functions later.

  • So this, on the left, is the Python code.

  • This is the kind of V1 Python that people usually write.

  • So you define your variables and some constants.

  • You do the assignment, and then you want to return the updated variables.

  • So if you don't add explicit control dependency,

  • you're going to read the initial values.

  • So you have to think about what your dependencies are, and make everything very explicit.
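
For reference, a minimal sketch of this V1 pattern (the variable names, initial values, and increments are assumptions, since the slide isn't reproduced in the transcript):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

a = tf.Variable(1, use_resource=True)
c = tf.Variable(2, use_resource=True)

update_a = tf.assign_add(a, 1)
update_c = tf.assign_add(c, 1)

# Without this explicit control dependency, the reads below could
# observe the initial values instead of the updated ones.
with tf.control_dependencies([update_a, update_c]):
    read_a = a.read_value()
    read_c = c.read_value()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([read_a, read_c]))  # [2, 3]
```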

  • And on the right is a textual GraphDef representation.

  • There's not enough space to show the full GraphDef in the [INAUDIBLE].

  • So this is roughly how it would look if you print the debug string in TensorFlow.

  • If you convert the GraphDef to its debug string, you get a short representation of the graph, something like on the right.

  • So variables are actually VarHandleOps.

  • This is a single kernel that returns the DT_RESOURCE data type.

  • Constants are just simple Const nodes with their values.

  • And then when you [INAUDIBLE] add, it is an AssignAdd TensorFlow kernel that takes the resource handle from the VarHandleOp and the constant, and adds the value to the variable.

  • And when you do the read, it is another TensorFlow kernel, ReadVariableOp.

  • So you get the TensorFlow [INAUDIBLE] resource handles for a and c.

  • And because you have explicit control dependencies, these reads depend on update a and update c, both of them.

  • And you fetch with a and c.

  • But explicitly adding control dependencies is very annoying.

  • So in TensorFlow 2.0, the same graph

  • will look something like this.

  • So you have your variables defined

  • outside of a function scope.

  • And you have a tf.function annotation on add_and_get.

  • So we want to add two different constants to two variables, a and c, and get back their latest values.

  • So the Python looks something like on the left.

  • You don't have to explicitly add control dependencies, which makes life easier.

  • And you don't have to think about program order.

  • So tf.function traces the function and adds all the necessary control dependencies.

  • So when you execute the graph, you will get the program order you would expect from [INAUDIBLE].
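
A hedged sketch of the TF 2.0 version being described (assuming the traced function is the add_and_get from the slide; exact names and values are assumptions):

```python
import tensorflow as tf

a = tf.Variable(1)
c = tf.Variable(2)

@tf.function
def add_and_get(x, y):
    # No explicit control dependencies: tf.function's automatic control
    # dependency tracking adds them while tracing, because these ops
    # touch the same resources in program order.
    a.assign_add(x)
    c.assign_add(y)
    return a.read_value(), c.read_value()

print(add_and_get(1, 1))  # the reads observe the updates, in program order
```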

  • So the FunctionDef -- the graph representation -- looks something like on the right.

  • So we have a GraphDef, and the GraphDef is basically a sequence of NodeDefs.

  • And we also have a FunctionDef.

  • It's a way to group together some subgraph with input

  • arguments and output arguments.

  • So this function on the left, from the Python, will be transformed into a FunctionDef that is represented something like this.

  • So we have two input parameters of resource type, a and b.

  • And the return type of the function is a pair of ints.

  • So the update a -- I'm sorry -- the AssignAdd: we are assigning the new value to the [INAUDIBLE] resource.

  • And then we do the [INAUDIBLE].

  • And we have the control dependencies

  • from the previous operation, touching the same resource.

  • And this is added automatically by TensorFlow 2.0

  • automatic control dependencies.

  • So we don't have to do anything.

  • And then we have the return values.

  • So functions at the graph level, in FunctionDef and GraphDef in TensorFlow, have two types of returns.

  • You can return data values -- so here you return two tensors, read a and read c.

  • But also, it has a special kind of return notation, control

  • return.

  • So it doesn't return any data.

  • But you might group some of the ops inside your FunctionDef.

  • And you can specify that these ops have a name.

  • And they must always run.

  • And when you have a control dependency on a function call, the runtime guarantees that every op that is part of the control [INAUDIBLE] will run.

  • And the main graph will look something like this.

  • So we have a PartitionedCall.

  • It is the TF 2.0 function call mechanism that does partitioning, multi-device function invocation, graph optimization, and all the other [INAUDIBLE] that are required by the runtime.

  • And the [INAUDIBLE] operations are just identities that read the first and the second output of the function call.

  • So -- any questions?

  • Is the notation on the right side clear?

  • SPEAKER 1: So in this example, in particular,

  • the read operations already depend on the updates.

  • So in this case, the control dependencies, I guess,

  • are not that required.

  • But in other case, if we don't return the read values then

  • those--

  • EUGENE ZHULENEV: No.

  • The read depends only on the input, a and c.

  • SPEAKER 1: Yeah but it has a control

  • dependency on the update.

  • EUGENE ZHULENEV: Yeah, because this control dependency is added automatically by TF 2.0's tf.function when you trace the function.

  • When you have multiple TF ops touching the same resource, automatic [INAUDIBLE] will add control dependencies.

  • SPEAKER 1: I'm saying in terms of the return values,

  • like in this case, if you have the read a and read

  • c, they automatically capture the--

  • so if you tried to fetch read a and read c,

  • you make sure that updates are run.

  • So in this case, the control dependencies

  • are not that useful, I guess, or are they--

  • SPEAKER 2: [INAUDIBLE]

  • SPEAKER 1: Yeah, yeah, the control returns.

  • EUGENE ZHULENEV: Oh, yeah.

  • Well, yes.

  • SPEAKER 1: Just pointing out that it

  • would be more useful if you don't return the right value.

  • Then in this case, the control returns would be very useful.

  • EUGENE ZHULENEV: Yeah.

  • Yeah, so if you have some kind of update -- for example, if you have a function that doesn't have returns, that you use only for its side effects: update some counters, gather some counter statistics.

  • In this case, you don't have regular data outputs.

  • So this is why you have control returns.

  • But right now, TF 2.0 will add these control returns in both cases.

  • This is needed to separate the data path and the control path, because when you inline that function, it might turn out that no one is using output 1.

  • So you might end up not executing your state updates inside the function.

  • But it is guaranteed that everything on the control path, in the control return, will be executed, under some conditions that I'll mention later.
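
A small sketch of such a side-effect-only function (the counter name is an assumption); the point is that its only op is stateful and ends up in the FunctionDef's control outputs rather than a data output:

```python
import tensorflow as tf

counter = tf.Variable(0)

@tf.function
def bump_counter():
    # No data outputs at all; the AssignAdd matters only as a side effect.
    # When traced, the stateful op is wired into the FunctionDef's control
    # outputs (control_ret), so inlining/pruning cannot drop it.
    counter.assign_add(1)

bump_counter()
print(counter.numpy())  # 1
```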

  • SPEAKER 3: Question.

  • What's that small [INAUDIBLE] in front of update [INAUDIBLE] a?

  • You see read a equals ReadVariableOp of a -- what's that small sign in front?

  • EUGENE ZHULENEV: Control dependency, you mean?

  • This one?

  • SPEAKER 3: Yeah, this one.

  • Why it has this special--

  • EUGENE ZHULENEV: Yeah, so the data dependency goes by name.

  • So a is regular data dependency, and this one

  • is a control dependency.

  • So we don't read the actual output.

  • It just means that update a must execute

  • before we can execute read a.

  • SPEAKER 3: Oh, so it's just a special sign [INAUDIBLE]..

  • EUGENE ZHULENEV: Yeah.

  • Yeah.

  • SPEAKER 4: That's the internal representation of control dependencies in the GraphDef proto.

  • SPEAKER 3: I see.

  • EUGENE ZHULENEV: So yeah.

  • If your op has two inputs, you will have a, b as inputs.

  • And then you can add as many control dependencies as you want.

  • At graph execution time, the runtime will make sure that everything in the control dependencies is executed before your kernel, using this special notation.

  • Here's another example.

  • So we have variables a and b, which are vectors of length 10.

  • And variable c is just a counter, of integer type.

  • And we want to get a strided slice from variables a and b.

  • And then we also want to increment the counter: every time we take a slice, we want to know how many slices we took, for whatever reason.

  • So this graph doesn't make much sense.

  • But anyway, and we're only interested in slice 1.
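
A rough reconstruction of this example (shapes, bounds, and names are assumptions; the 20-to-30 slice stands in for the invalid one):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

a = tf.Variable(tf.zeros([10]), use_resource=True)
b = tf.Variable(tf.ones([10]), use_resource=True)
c = tf.Variable(0, use_resource=True)  # slice counter

update_c = tf.assign_add(c, 1)
with tf.control_dependencies([update_c]):
    slice0 = tf.strided_slice(a.read_value(), [20], [30])  # out-of-range bounds
    slice1 = tf.strided_slice(b.read_value(), [0], [5])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Only slice1 is fetched, so slice0 (and the read of a) is pruned
    # away before execution and its invalid bounds never run.
    print(sess.run(slice1))
```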

  • And the graph will look something like this.

  • We have three variables -- VarHandleOps -- a constant, an AssignAdd for the counter, and then we read a variable, this one.

  • So we read variables from--

  • this should be b.

  • So we read variables from their resource, from the handle,

  • and then we take a strided slice.

  • And then we fetch slice 1.

  • And because we fetch slice 1, we can remove slice 0.

  • It is not needed for slice 1.

  • We can also remove from the graph a, and that's all.

  • But you see that the nice property we get, that slice 0

  • is invalid.

  • If you try to take a slice from 20 to 30 in a variable of size 10, you'll get a failure at runtime.

  • But because you don't really need slice 0,

  • everything is just fine.

  • As long as you don't fetch slice 0, this graph is totally valid.

  • But if you do the same graph, but you put your slices

  • inside a function, so the same kind of thing,

  • you have three variables, a, b, c. c is a counter.

  • You do AssignAdd inside a function.

  • So nice, you don't have to add explicit control dependency.

  • You'll get your count updated automatically.

  • Then you take slice 0 and slice 1 and [INAUDIBLE].

  • And you invoke the function.

  • You're not interested in slice 0; you only want to get the second value.

  • So you just keep that one, and you try to print it.
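
And the tf.function version of the same thing might look roughly like this (again, names and bounds are assumptions). Under strict function semantics, before inlining, the unused invalid slice still has to be computed before the call can return:

```python
import tensorflow as tf

a = tf.Variable(tf.zeros([10]))
b = tf.Variable(tf.ones([10]))
c = tf.Variable(0)  # slice counter

@tf.function
def get_slice():
    c.assign_add(1)                            # tracked automatically
    slice0 = tf.strided_slice(a, [20], [30])   # the invalid slice
    slice1 = tf.strided_slice(b, [0], [5])
    return slice0, slice1

# Strict outputs: even though slice0 is never used, it still has to be
# computed before the call returns, so the invalid slice can fail the
# whole call.
_, wanted = get_slice()
print(wanted)
```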

  • So at the GraphDef and FunctionDef level, it will look something like this.

  • It will have a get_slice function that takes three resources: a, b, and c.

  • It will have an increment for the counter.

  • It will read the variables a and b, take the slices, and return slice 0 and slice 1.

  • And it will have a control output, increment, automatically added by automatic [INAUDIBLE] tracking.

  • So slices don't have any control dependencies

  • because slices, they do not depend on variable c.

  • They depend on different variables.

  • And then you have a function call, a PartitionedCall, to get_slice with a and b.

  • And you don't need the first slice; you take the second one, the function call output 1.

  • And you're trying to fetch it, or print it, or whatever.

  • And the problem right now is that this code will fail, because function semantics in TensorFlow currently is strict with regard to inputs and outputs.

  • So before a function can return, it must finish the execution of all the output parameters.

  • And before a function can start executing, it must finish executing all of its inputs.

  • Yeah, so there's another property.

  • All inputs must be evaluated and must not be dead before your function can start running.

  • And note that this matters if you have inputs coming from switches -- Switch is [INAUDIBLE] used to represent control flow in TensorFlow.

  • So there is a special notion of a dead tensor.

  • Basically, when you run your graph at runtime and you have dead tensors, you don't execute the kernels; you just update a flag and propagate this deadness across the graph.

  • So you can have a large graph with an if node, which is represented as a Switch, and based on the runtime value you execute only half of the graph.

  • So another property of the function is that all outputs must be evaluated and also not dead.

  • So, for example, you have a function with the control flow.

  • And you want to take a gradient of that function.

  • You have to return all intermediates

  • from the function, and some of them

  • might be dead because they might be

  • from intermediate computations from untaken branch.

  • And basically, your function can be dead.

  • But sometimes that [INAUDIBLE] so there

  • is a special way to avoid that.

  • And I think it might not be used right now.

  • But before that, when we generated gradient functions, TensorFlow 2.0 marked all the functions for the forward pass as tolerating dead inputs.

  • And instead of a dead input, the runtime would return an empty tensor and just continue.

  • But the most confusing thing: if you write this code without the tf.function annotation, get_slice is just a regular Python function, with the same inputs a, b, and c, and the increment.

  • But then you have to add an explicit control dependency, because you're not using TF 2.0 automatic control dependency tracking.

  • So we will get the graph on the right.

  • And it is exactly the same graph as you would get if you don't use functions at all, if you just write it line by line, one by one.

  • And it allows the TensorFlow runtime to prune all the nodes that are not used.

  • And your graph is valid again.

  • But the only difference is that you don't have the tf.function annotation.

  • And even if we completely remove the counter -- say I don't want to count anything, as you can see -- the only difference would be the tf.function annotation.

  • And it is a huge difference at runtime: one graph fails, and the other graph is just fine.

  • And this is a real example.

  • So Alex gave a presentation about functions in TensorFlow, and that's how some people use them when running their functions.

  • So you have a function that returns some environment state, does an action, and does an environment [INAUDIBLE].

  • You get a, b, c.

  • And you just run the loop on the action.

  • And in the end, you run your z function.

  • So if you annotate this function with the tf.function annotation, on every loop iteration you will get the state, you will do the action, and you will [INAUDIBLE], which is not fine.

  • Another example, from TensorFlow Probability: you have a tf.function that has three inputs.

  • The first output comes from some constant inside the function -- a constant and some computations.

  • Then you take the first, second, and third inputs, do some computation, and output the results at positions 2 and 3.

  • And you pass that to exactly the same function, and then you pass that to the next function.

  • So the first input to the first function is some failing computation.

  • It might be a placeholder that you didn't feed, or a strided slice with invalid bounds, or a failed read from the file system.

  • But if you trace this fetch -- we are trying to [INAUDIBLE] the second value -- across the function boundaries, this one, this one, we end up at this constant.

  • And it's totally valid.

  • If we try to fetch the last output of the last function and trace it, we get to the second input of the first function.

  • And it's totally valid.

  • So as long as we don't try to fetch this computation or this computation, we don't really need the first input.

  • And a lot of [INAUDIBLE] in TensorFlow Probability relies on this property of the function: that the function is not really a function call.

  • It's just some Python scoping language construct that allows you to separate your graph construction.

  • And when people move to TensorFlow eager and annotate all their functions with [INAUDIBLE] function, it all breaks, because functions at the runtime level are strict.

  • But that's not how it was designed, not how people wanted functions to work.

  • So if you open the graph proto -- and this is from the first commit when functions were added to TensorFlow -- there are very important things there.

  • The [INAUDIBLE] may start execution as soon as some of its inputs are ready.

  • And if you want to enforce that all your inputs are ready, you must use tuples, because the runtime is allowed to start executing when the [INAUDIBLE].

  • It doesn't have to wait for all your inputs.

  • And the consumer of the function may start executing early, as soon as the function value is ready.

  • So if you want to ensure that your function has strict semantics, you must use tuples; otherwise the runtime is allowed to do whatever it wants and start executing as early as it wants.

  • It happened to be that the implementation was strict, and people didn't notice that.

  • And there is a lot of code out there that relies on strict semantics.

  • But we don't really want strict functions; we'd really love to get lazy functions, so nothing is evaluated until it's really necessary.

  • And in the context of TensorFlow, [INAUDIBLE] necessary means that we want to execute a primitive TensorFlow kernel.

  • A primitive TensorFlow kernel is any op: add, multiply, convolution, [INAUDIBLE].

  • And composite kernels right now are only functions, I think.

  • There are some proposals to add more composite kernel support, but I don't know what it will look like.

  • But currently, we [INAUDIBLE] functions to be lazy.

  • And ideally, we would love to make Switch lazy as well.

  • So if you have a Switch -- your conditional execution -- right now you have to execute all the inputs to the Switch, even if you don't use them later.

  • But it's very hard to do in the current executor.

  • And also, non-strict semantics is very important for performance.

  • So imagine that in TF 2.0 we have a ResNet function, and we have a huge model that has 10 ResNet inferences, and each ResNet has 256 input parameters.

  • If these parameters live on remote parameter servers, you can't start executing your function before you fetch all these 256 tensors to your local machine and start training.

  • And then, well, it will take a lot of memory to keep them all at the same time on the host or the device, or whatever.

  • And it will take a lot of time until you get them all over the network.

  • Even if we use TensorFlow 2.0, and the functions do not get tensors as inputs -- they just get resources, and they do a ReadVariableOp -- we still have kind of the same problem, because we can't start executing the function before the previous update to [INAUDIBLE] has finished.

  • So before you can start running your ResNet-50 function, even with resources as inputs, you have to wait for all your control dependencies, for all the previous updates to all the previous variables.

  • So even if you don't get a lot of network traffic, you still wait on the parameter [INAUDIBLE] computing those updates, or maybe something else.

  • So we really want lazy functions, and to run them as soon as something [INAUDIBLE]: start fetching parameters, run the first layer, and fetch the parameters for the second layer only when we have completed the first layer.

  • So we didn't get lazy functions originally, because functions were added to the TensorFlow runtime much later than the original executor and the GraphDef were created.

  • And the TensorFlow executor at runtime is super strict.

  • People often think that TensorFlow is lazy: you fetch variable a, and it goes back and pulls only whatever it needs.

  • Well, that's not actually how it works.

  • The TensorFlow runtime is very strict and greedy, and it is push-based.

  • And this perceived laziness [INAUDIBLE] is just a product of pruning.

  • So first, before execution, we remove all the nodes that we don't need to compute the output.

  • And then we start looking at the nodes that are ready -- they don't have pending inputs -- one by one, and run them, and update all the consumers that become ready to run.

  • And this is a fundamental property of how the executor in TensorFlow works.

  • And it's almost impossible to do anything about that.

  • You can't really touch it: it's super complex, and it's super performance critical.

  • And adding laziness to it is just impossible.

  • So that's why we end up with strict functions.

  • But originally, even in design documents from 2015, people discussed this problem.

  • And people wanted to have lazy functions, or lazy semantics, because people at the time thought that if you have something like an [INAUDIBLE] layer as a function, eventually you won't want to wait for all the inputs -- it is too critical for performance.

  • But that was too hard to implement at that time, and we ended up with strict functions.

  • But no one used functions in V1; it was a very rare thing, and it was not a problem.

  • But in TF 2.0, with the tf.function annotation, we wrap all the functions into FunctionDefs.

  • And now we have hundreds of thousands of functions at runtime, and now we have all these problems with strict versus lazy semantics.

  • And the semantics is not really well defined, so different people think different things about what the semantics should be and what the right semantics is.

  • So right now, we're kind of as lazy as we can be, between strict and lazy, but sometimes it's still strict.

  • So it's a little bit of a mess.

  • So, to get back the lazy semantics: one way is to fix the execution model at runtime, which is impossible without rewriting it from scratch.

  • We might do it later, but that's not something we can do now.

  • So the easiest way to do it is just to inline all the functions into the graph.

  • Instead of a GraphDef with a PartitionedCall to the function, you get one single graph, which is executable.

  • And then the TensorFlow runtime can start executing nodes as they become ready.

  • So we don't have this function boundary anymore.

  • But you still have to do it very carefully, because if you just inline the function body inside the graph, you might get completely different semantics.

  • Some people rely on the control dependencies added to the function, or on the strictness, and on side-effect visibility.

  • And TensorFlow 2.0, when it adds automatic control [INAUDIBLE] tracking, has some assumptions about what the program execution order is.

  • And if function inlining would violate that, it would be a very, very sad situation.

  • So the TensorFlow 2.0 Python front end has some function semantics and graph construction semantics that help a lot to get this program order semantics.

  • So all the mutable state is represented as resource tensors.

  • A resource is just a handle to a device and some memory location.

  • You can imagine it is a pointer on a GPU, or a pointer to a buffer on a CPU.

  • And we pass resources as inputs to functions.

  • If a function has an input from resource a, it will have an incoming control edge from the last op in the graph, or the last function, that touched that resource -- that has the same resource as input.

  • So if you have an assign to a variable, and then you pass the resource for the same variable to a function, you will have a control dependency from that assign.

  • And if anything else touches the same resource after the function call -- any other op -- TF 2.0 will add an outgoing control edge to that next op.

  • So if you pass a variable to a function, and then outside of the function you have a ReadVariableOp, TF 2.0 will add a control dependency from your function call to the [INAUDIBLE].

  • So if you do the read, you will observe all the changes that were made to that variable inside the function body.
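
A small sketch of that ordering guarantee (names are assumptions): the assign before the call, the update inside the call, and the read after the call are chained by automatically added control edges:

```python
import tensorflow as tf

v = tf.Variable(0)

@tf.function
def bump(x):
    v.assign_add(x)
    return v.read_value()

@tf.function
def demo():
    v.assign(10)            # last op touching the resource before the call:
                            # the call gets an incoming control edge from it
    out = bump(1)           # the function body sees the assign's effect
    after = v.read_value()  # the call gets an outgoing control edge to this
                            # read, so it observes the update made inside bump
    return out, after

print(demo())  # (11, 11)
```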

  • So the most important thing, I guess, about the tf.function annotation is that it does automatic control dependency tracking.

  • When you write your Python function, you have some idea of what your program order execution semantics should be: you add 1 to a, then you add 1 to b, and you think the add to b should happen after the add to a.

  • And if you add 1 to a and then read a, you expect to see the new value of the variable a.

  • That was not the case in TF1.

  • In TF2, when you add the tf.function annotation, it will add all the necessary control dependencies.

  • It will add control dependencies between all the ops that have the same resource as input.

  • And it will also add control dependencies between all stateful ops inside the function body.

  • So all your stateful operations will always be executed in the same order, and that gives you some sanity.

  • So if you have multiple prints inside a function body, you should observe those prints in the same order every time.

  • In TF V1, you could observe those prints in any order, and that's confusing.
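
For example, a sketch of the print-ordering guarantee:

```python
import tensorflow as tf

@tf.function
def noisy():
    # tf.function chains stateful ops with control dependencies, so these
    # prints run in program order on every call, unlike a V1 graph where
    # unordered stateful ops could interleave arbitrarily.
    tf.print("first")
    tf.print("second")
    tf.print("third")

noisy()
```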

  • And all the stateful ops with side effects -- ops that can have side effects -- will be added to the control outputs.

  • So the function runtime and function inlining must respect that.

  • We don't want to lose any of the side effects or updates to the state.

  • And these are some rules.

  • All the side effects to resources that happened before the function call must be visible to all the nodes inside the function body.

  • And all the side effects to resources that happened inside the function body must be visible to every op or function using the same resource after the function has completed.

  • Currently, it's implemented so that you have a control dependency from the op that made some side effect to the function call, and the function call has a control dependency to the next op that might need that side effect.

  • So to enforce that semantics, function inlining has special rules.

  • It will add a special node, an input control node.

  • If your function call has any control dependencies, all of them will be forwarded to that input control node, and all the function inputs will have control dependencies from that node.

  • So it basically means that you can't start executing your function before your control dependencies are satisfied -- before all those nodes are executed.

  • It also means that if your function call doesn't have any control dependencies, then the TensorFlow runtime is free to start running your function body as soon as anything is ready.

  • Also, it will add an output control node.

  • This node will have control edges from all the side effects, from all the control outputs of the function.

  • So if, for some reason, you have a side effect inside a function, and that side effect is not connected to one of the control outputs, then when the function is inlined, TensorFlow is free to prune the side effect, and you might end up with some partially observed state updates.

  • But that should not happen in practice in TensorFlow 2.0,

  • because all the side effects, when you construct

  • a graph from Python, all the side effects

  • should be appropriately connected to control outputs.

  • But that's not the case for [INAUDIBLE] models.

  • I think [INAUDIBLE] doesn't use TensorFlow 2.0 automatic control dependency tracking in some cases, or it was not using it some time ago.

  • So that might be violated [INAUDIBLE] in some models, but I hope that doesn't happen right now.

  • And also, there's an assumption that if the function call does not have an outgoing control edge, it means that no one [INAUDIBLE] cares about what's happening inside the function, what the side effects are.

  • So if you have an outgoing data edge, someone needs the data output.

  • But if the function call doesn't have an outgoing control edge, it means the function might do whatever it wants inside the function body -- update any variables, counters, print anything, send any data over the network -- and no one cares about what it does.

  • So that might lead to some trouble.

  • But I think in TF 2.0 that also never happens in practice, because automatic control dependency tracking for nested function calls will add all the required control dependencies.

  • And when you execute the top level function,

  • it should also add all the control dependencies.

  • But, again, it might happen with some models

  • that do not use automatic control dependency tracking.

  • So that's how the function will look after inlining.

  • This is the function from the previous example.

  • The function takes three resources as inputs.

  • It reads variables a and b, increments the counter, takes the strided slices, one of which is invalid, and returns both slices; the control output is the increment.

  • So we have a function call, and this is the GraphDef after inlining.

  • We no longer have a FunctionDef, and we no longer have a function call; we just have one graph with multiple nodes.

  • So we have an incoming input control node -- this is a NoOp.

  • The function call node didn't have any control dependencies, so the NoOp has [INAUDIBLE] control dependencies.

  • We read variables from the inputs, and we depend on the input control node.

  • So any reads of the variables from the inputs will happen after the input control node has executed.

  • Then we have the increment [INAUDIBLE].

  • Then we do two strided slices.

  • And then we have two Identity nodes for the function returns.

  • And we return slice 0 for the first [INAUDIBLE] value and slice 1 for the second.

  • And we have an output control node.

  • The output control node has control dependencies from the side effects inside the function, and this function has only one side effect, the AssignAdd counter increment.

  • So ReadVariableOp is marked stateful [INAUDIBLE], but it is not a side effect, because ReadVariableOp can't modify anything, can't have side effects; it just observes the state.

  • So in theory, we could add the ReadVariableOp to this [INAUDIBLE] control node.

  • But in practice, there are many stateful ops that just read state, and ReadVariableOp is just one of them.

  • And slice -- so previously, we had a function call node, get_slice, and slice was an Identity node that reads its second output.

  • Now we don't have a function call node anymore, and slice is just an Identity node that reads directly from the inlined function return.

  • And we have a -- so we read a variable from the counter.

  • And it automatically has a control dependency to the output control node.

  • So every time we read the counter, we must observe all the side effects that happened to that counter inside the function body.

  • And now we can do pruning again.

  • We don't use return 0, so we can prune it.

  • We don't use slice 0 -- it is the invalid one, 20 to 30 -- so we can prune it.

  • And we don't need the value of that variable, so we can prune its ReadVariableOp.

  • And so again, we are back to a graph that is valid and can be executed at runtime without exceptions.

  • So there are a few more problems.

  • So when you have a function, and you inline the function body, and the function body does not have any device annotations, you have to decide on what device to place each node.

  • Before TF 2.0, we always had single-device functions.

  • So if the function call node is placed on CPU, the nodes inside the function body would be executed on CPU.

  • In TF 2.0, that's too limiting, because you might have ResNet in your function and that function call is on CPU.

  • You don't want to run [INAUDIBLE] on CPU; you might want to use multiple CPUs and GPUs, or have a function that spans multiple devices.

  • So there are multiple strategies for how to place nodes within an inlined function.

  • You can do nothing at all and rely on the placer.

  • You can force the function to be single-device -- we do that sometimes for V1 graphs, primarily for compatibility mode.

  • Or, [INAUDIBLE] for multi-device functions in TF 2.0, all the nodes inside the function body must be placed on the same job, replica, and task, but the device might be empty, and then we rely on the placer to place them on the correct device.

  • Because imagine that you have a function call and [INAUDIBLE] runtime, and your function call happens to be on another machine.

  • Then, when you execute that function call on that machine at runtime, it will be able to choose only from the devices available on that [INAUDIBLE] machine.

  • So if you [INAUDIBLE] and don't add any device annotation, the placer [INAUDIBLE] device placements.

  • So if the user placed a function call, like a ResNet function call, on machine 1, and the ResNet function body doesn't have any device annotations, and we inline it and then run the placer, all the ResNet nodes may be placed on a completely different machine and GPUs, and that will break the user's assumptions.

  • So when we [INAUDIBLE] functions currently, we overwrite job, task, and replica, and leave the device untouched.

  • So the placer will pick the right device for the node even after inlining.

  • We also have a bunch of functions created internally for V1 that do not use control outputs and don't add control dependencies inside the function body at all.

  • So after inlining such functions, you might end up with completely different execution semantics and lots of [INAUDIBLE].

  • Another fun part of the current runtime is that the current function runtime does not prune any stateful ops.

  • And that is very different from the execution semantics of the graph, because if you have stateful ops -- variable updates -- inside your graph, and you don't have control dependencies, the runtime will prune them.

  • But if you have exactly the same graph inside a function, the runtime will execute all the stateful ops.

  • And this mismatch is very difficult to handle when you inline your function graph into the outer GraphDef, because you have different notions of what should be pruned and when.

  • So TF2.0 always inlines all the functions because TF2.0 is

  • guaranteed to have correct control dependencies

  • and function control outputs.

  • But we don't inline functions by default in legacy graphs.

  • Grappler is responsible for inlining functions in legacy graphs.

  • But Grappler does a lot of additional checks: that all the stateful ops have a path to one of the outputs, and that we don't have any side-effecting ops inside the function body that are not connected to anything, because they would be pruned.

  • And there are a bunch of other checks also.

  • In TF V1, you can get functions with mismatched deadness, which should not be possible, but it happens.

  • So Grappler is very conservative: it inlines only if it can prove that it is safe and that it does not change the semantics.

  • And this function inlining is a huge source of bugs.

  • Sometimes people think that function inlining is the problem, and often it is the source of a problem, because it is very complicated and the semantics was never defined properly for functions.

  • So I mostly had to come up with the semantics to make all the tests pass, plus a bunch of [INAUDIBLE] workarounds for different tests.

  • I hope that we might be able to get

  • better semantics at some point.

  • But right now, it's super messy.

  • Also, we have a bunch of other functional ops.

  • For example, we have a functional If: it basically takes a predicate, and it has attributes with two functions, a then function and an else function.

  • Another op is the functional While.

  • In V1 we have Switch, NextIteration, Enter, and Exit nodes to represent while loops.

  • In V2, we have a functional While: you define a function for the body and one for the condition, and then you just run it at runtime as a special op.

  • Also, you have a case, something like if with multiple branches.
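
As a sketch of what such a functional loop looks like at the Python level (the particular computation is made up for illustration), tf.while_loop inside tf.function is emitted as a single While op with cond and body functions:

```python
import tensorflow as tf

@tf.function
def sum_below(n):
    i, total = tf.constant(0), tf.constant(0)
    # Inside tf.function this is captured as one functional While op whose
    # cond and body attributes point at FunctionDefs, instead of the V1
    # Enter/Merge/Switch/NextIteration/Exit construction.
    i, total = tf.while_loop(
        cond=lambda i, total: i < n,
        body=lambda i, total: (i + 1, total + i),
        loop_vars=(i, total))
    return total

print(sum_below(tf.constant(5)))  # 0 + 1 + 2 + 3 + 4 = 10
```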

  • Currently, we lower all these functional ops to V1 control flow.

  • So basically, the If node becomes a Switch node and two function call nodes, for the then and else functions.

  • And those functions are then inlined, just like any other function call.

  • We do that primarily for [INAUDIBLE] semantics.

  • If you run your While or If as the op inside a graph, it will have strict semantics, and some models just expect lazy semantics from your if.

  • It's also very limiting for concurrency.

  • With V1 control flow while loops, for example, you can run multiple [INAUDIBLE] iterations at a time; you can [INAUDIBLE] while loop iterations.

  • If you try to do that with a functional While, it's just impossible: you have to wait for every iteration to completely finish before you can start the next iteration.

  • So a lot of people want to move to functional ops, functional control flow.

  • But in practice, it's very difficult, primarily because of the performance.

  • Still, it would make a lot of analysis easier, because we often have to reconstruct what the while loop was from the GraphDef, from all the Switch, NextIteration, Enter, and Exit nodes.

  • And that is very error prone, and we have a lot of trouble with it.

  • So if we had functional ops at the graph optimization level, that would help.

  • And then, at a later stage, we can just lower all of them to Switch [INAUDIBLE] to get the good performance.

  • So here is an example of how a functional If looks in a graph.

  • We have a flag -- that's a Boolean variable -- and we read the flag with a ReadVariableOp.

  • We have some constant 0.

  • And we have two functions: plus 1, which adds 1 to its integer input [INAUDIBLE], and plus 2.

  • So we read the flag, a Boolean flag, then we have the 0, and the result is If(flag) with plus 1 and plus 2.

  • If the flag is true, we add 1; if the flag is false, we add 2 to the 0.

  • So the result will be 1 or 2, depending on the flag.
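
A sketch of that example in TF 2.0 Python (the variable and branch names are assumptions):

```python
import tensorflow as tf

flag = tf.Variable(True)

@tf.function
def branch():
    x = tf.constant(0)
    # Inside tf.function this becomes a functional If/StatelessIf op whose
    # then/else branches are FunctionDefs; the lowering pass later rewrites
    # it into Switch plus the (inlined) branch bodies plus Merge.
    return tf.cond(flag, lambda: x + 1, lambda: x + 2)

print(branch())  # 1 while the flag is True, 2 otherwise
```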

  • So when we lower this functional If to V1 control flow constructs, we will have the ReadVariableOp for the flag, and we will have the 0 for the constant.

  • So we have a Switch node based on the flag and the 0.

  • The Switch node has two outputs: it will output the value 0 on output 0 if the flag is true, and it will output the value 0 on output 1 if the flag is false.

  • The other, unused output will be dead.

  • So it basically prevents execution of one of these branches.

  • So the then function is a function call taking the Switch output 0, and the else function is another function call taking the Switch output 1.

  • So if your flag is false, this node will be dead and will not be executed.

  • And then we have the result as a Merge: we merge the results of the then function and the else function.

  • One of them is going to be dead, and the other one will be alive, and that will be the final result.

  • So this is after we lower the If node to function calls.

  • And then function inlining kicks in, and we get rid of all the function call nodes.

  • We basically have the then function return -- just an add of the Switch output and 1 -- and the else function; this one should be plus 2.

  • And we merge the return values of the functions.

  • Yeah.

  • So that's how we get rid of all the functions.

  • So we have functions as a mental model for the end user: how you think about your TensorFlow graph, how you build your program.

  • You no longer think in terms of graphs and [INAUDIBLE]; you think in terms of functions.

  • But when we get these functions at runtime, we still lower them to [INAUDIBLE], because that's what we have to do for performance and sometimes for correctness.

  • But there's kind of a promise of TF, that with the tf.function annotation in TensorFlow [INAUDIBLE] mode, if you want [INAUDIBLE] semantics back, you just annotate your function with tf.function and you get back a graph.

  • But that's not completely true, because if you have [INAUDIBLE] function calls annotated with tf.function, you would have multiple FunctionDefs, multiple function call nodes, and you'll get strict semantics.

  • And this is not [INAUDIBLE] the semantics of V1.

  • So the only way to provide users what was promised is to inline all the functions.

  • And then we get back to a single graph with the lazy semantics, with the pruning, and all the nice properties of that dataflow graph.

  • [MUSIC PLAYING]
