
  • FRANCOIS CHOLLET: So last time, we

  • talked about a bunch of things.

  • We talked about the functional API

  • for building graphs of layers.

  • We talked about features that are specific to the functional

  • API--

  • things like static input compatibility checks

  • across layers every time you call

  • a layer, whole-model saving, model plotting,

  • and visualization.

  • We talked about how masking works in the functional API

  • and about how masks are propagated.

  • So for instance in this example, the Embedding layer is going

  • to be generating a mask here because you passed this

  • argument mask_zero=True.

  • And this mask is going to be passed

  • to every subsequent layer.

  • And in particular, to layers that consume the mask,

  • like this LSTM layer here.

  • However, in this LSTM layer, because it's

  • a sequence reduction layer, it's not

  • going to be returning the full sequence, but only

  • the last output.

  • This is going to destroy the mask,

  • and so this layer is not going to see a mask anymore.

  • So it's really a way to handle masking

  • that works basically magically.
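
A minimal sketch of the kind of model being described here; the vocabulary size, embedding width, LSTM units, and the final Dense head are illustrative assumptions, not taken from the slide:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Variable-length sequences of integer token ids; 0 means "padding".
inputs = keras.Input(shape=(None,), dtype="int32")

# mask_zero=True makes the Embedding layer generate a mask marking
# which timesteps are real data and which are padding.
x = layers.Embedding(input_dim=10000, output_dim=64, mask_zero=True)(inputs)

# The LSTM consumes the mask implicitly and skips masked timesteps.
# With the default return_sequences=False it returns only the last
# output, so the time dimension -- and the mask -- is gone after it.
x = layers.LSTM(32)(x)

# This Dense layer no longer sees any mask.
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```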

  • Masking is pretty advanced, so most people who need it

  • are not actually going to understand

  • very well how it works.

  • And the idea is to enable them to benefit

  • from masking by just doing, like, hey, this Embedding

  • layer, I want the zeros to mean this is masked,

  • and then everything in their network

  • is going to magically know about this as long as they

  • are using built-in layers.

  • If you're an implementer of layers,

  • you actually need to understand how it works.

  • First of all, you need to understand

  • what you should be doing if you're

  • writing a layer that consumes a mask, like an LSTM layer,

  • for instance.

  • It's very simple.

  • You just make sure you have this mask argument in the call

  • signature, and it expects a structure of tensors

  • that's going to match the structure of your inputs.

  • So here, because the input is a single tensor,

  • the mask is going to be a single tensor as well.

  • And the single tensor is going to be a Boolean tensor, where

  • you have one mask entry per timestep per sample.

  • So it's typically a 2D Boolean tensor.
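
As a hedged sketch of what such a mask-consuming layer might look like (this particular layer is made up for illustration, it is not one of the built-ins):

```python
import tensorflow as tf
from tensorflow import keras

class MaskedTemporalMean(keras.layers.Layer):
    """Averages over the time dimension, ignoring masked timesteps."""

    def call(self, inputs, mask=None):
        # inputs: (batch, timesteps, features)
        # mask:   (batch, timesteps) Boolean tensor -- one entry per
        #         timestep per sample, matching the structure of inputs.
        if mask is None:
            return tf.reduce_mean(inputs, axis=1)
        mask = tf.cast(mask, inputs.dtype)[:, :, tf.newaxis]
        total = tf.reduce_sum(inputs * mask, axis=1)
        count = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)
        return total / count
```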

  • If you have a layer that can safely pass through a mask,

  • for instance, a dense layer or in general any layer that

  • does not affect the time dimension of its inputs,

  • you can just enable your layer to pass through its mask

  • by saying it supports_masking.

  • It is opt-in because a lot of the time,

  • layers might be affecting the time dimension

  • of the inputs, in which case the meaning of the mask

  • would be changed.
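
A minimal sketch of the opt-in pass-through case, assuming a made-up layer that leaves the time dimension untouched:

```python
from tensorflow import keras

class ScaleFeatures(keras.layers.Layer):
    """Scales feature values without touching the time dimension,
    so the incoming mask still means the same thing afterwards."""

    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor
        # Opt in: the default compute_mask now passes the mask through.
        self.supports_masking = True

    def call(self, inputs):
        return inputs * self.factor
```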

  • And if you do have a layer that changes the time

  • dimension or otherwise a layer that creates a mask from input

  • values, it's going to need to implement this compute_mask

  • method that is going to receive the inputs and the mask.

  • If, for instance, you have an Embedding layer,

  • it's going to be doing this--

  • tf.not_equal(inputs, 0).

  • So it's going to be using the input values

  • to generate a Boolean mask.

  • If you have a concatenate layer, for instance,

  • it's not going to be looking at the input values,

  • but it needs to look at the masks--

  • the two masks that are being passed--

  • and concatenate them.

  • And if one of the masks is None, for instance,

  • we're going to have to generate a mask of ones.
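
A hedged sketch of both cases of compute_mask, simplified compared to the real built-in Embedding and Concatenate layers:

```python
import tensorflow as tf
from tensorflow import keras

class ZeroMasking(keras.layers.Layer):
    """Embedding-style masking: builds the mask from the input values,
    treating integer value 0 as 'this timestep is masked'."""

    def call(self, inputs):
        return inputs

    def compute_mask(self, inputs, mask=None):
        return tf.not_equal(inputs, 0)


class ConcatAlongTime(keras.layers.Layer):
    """Concatenate-style masking: combines the incoming masks instead of
    looking at the input values."""

    def call(self, inputs):
        return tf.concat(inputs, axis=1)

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        # If one of the incoming masks is None, substitute a mask of ones
        # ("nothing masked") of the right shape before concatenating.
        masks = [
            m if m is not None else tf.ones(tf.shape(x)[:2], dtype=tf.bool)
            for x, m in zip(inputs, mask)
        ]
        return tf.concat(masks, axis=1)
```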

  • OK, so that's the very detailed view.

  • Yeah?

  • AUDIENCE: So just maybe a little bit more detail about masking--

  • so if you say supports_masking is true like

  • in the lower-left-hand corner, is that just using some default

  • version of--

  • FRANCOIS CHOLLET: Yes, and the default is pass through.

  • Yeah, so it is.

  • If you set this, it enables your layer

  • to use a default [INAUDIBLE] compute_mask, which

  • just says return mask.

  • So it gets the inputs and a mask, and just

  • returns the mask unchanged.

  • AUDIENCE: So that assumes that the mask

  • is like the first or second dimension gets masked?

  • FRANCOIS CHOLLET: The first dimension

  • gets masked, if zero is the batch dimension.

  • AUDIENCE: And then where does this mask argument come from?

  • Like if I look at the previous slide,

  • it's not clear to me at all how this mask is being [INAUDIBLE]..

  • FRANCOIS CHOLLET: So it is generated

  • by the Embedding layer from the values of the integer inputs.

  • AUDIENCE: Right, so Embedding layers has a compute_mask

  • function?

  • FRANCOIS CHOLLET: Yeah, which is exactly this one actually.

  • AUDIENCE: And it returns a mask.

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: So and somehow, the infrastructure

  • knows to call-- because you enabled masking,

  • it knows to call the compute_mask [INAUDIBLE]..

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: The mask gets generated,

  • but I don't know where it gets put.

  • FRANCOIS CHOLLET: Where it gets put--

  • so that's actually something we're

  • going to see in the next slide, which is

  • a deep dive into what happens.

  • When you're in the functional API, you have some inputs.

  • You've created them with the Keras Input call.

  • And now, you're calling a layer on that.

  • Well, the first thing we do is check

  • whether all the inputs are actually symbolic inputs,

  • like coming from this Input call.

  • Because there's two ways you could use a layer.

  • You could call it on the actual value tensors,

  • like EagerTensors, in which case you're just

  • going to run the layer like a function

  • and return the outputs.

  • Or you could call it symbolically,

  • which is what happens in the functional API.

  • Then you run pretty extensive checks

  • about the shape and the type of your inputs

  • to raise helpful error messages

  • in case of a mistake made by the user.

  • Then you check if the layer is built. So the layer being built

  • means its weights are already created.

  • If the layer was not built, you're

  • going to use the shape of the inputs to build the layer,

  • so you call the build method.

  • And after you've done that, you'll

  • actually do a second round of input compatibility checks,

  • because the input spec of the layer

  • is quite likely to have changed during the build process.

  • For instance, if you have a dense layer, when

  • you instantiate it, before it knows its input shape,

  • its input spec is just the fact that its inputs

  • should have rank at least two.

  • But after you've built the layer then

  • you have an additional restriction,

  • which is that now the last dimension of the inputs

  • should have a specific value.
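
A small sketch of how that plays out for a Dense layer; the exact InputSpec representation may differ between versions, but the idea is that the spec gets tighter after build:

```python
import numpy as np
from tensorflow import keras

layer = keras.layers.Dense(4)
print(layer.input_spec)   # before build: only requires rank >= 2

layer.build(input_shape=(None, 16))
print(layer.input_spec)   # after build: also pins the last axis to 16

# The second round of input compatibility checks now rejects inputs
# whose last dimension does not match the built weights.
try:
    layer(np.zeros((2, 8), dtype="float32"))
except ValueError as err:
    print("incompatible input:", err)
```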

  • Then the next step is you are going

  • to check if you should be--

  • if you are in this case, if this layer expects a mask argument.

  • And if it does, you're going to be fetching the masks generated

  • by the parent layer.

  • via its compute_mask method.

  • AUDIENCE: So--

  • FRANCOIS CHOLLET: And--

  • AUDIENCE: And so where does that come from?

  • I mean, like, somehow there is a secret communication generated.

  • FRANCOIS CHOLLET: Yes, which

  • is a bit of metadata set on the tensor itself.

  • There's a _keras_mask property,

  • which holds the mask information.

  • So it is the least error-prone to co-locate the mask

  • information with the tensor that it refers to.
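
A small sketch of where that metadata lives; _keras_mask is a private attribute whose name or presence can change between versions, so this is only for poking around:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(input_dim=1000, output_dim=8, mask_zero=True)(inputs)

# The mask produced by the Embedding layer rides along on the symbolic
# tensor itself as a private attribute.
print(getattr(x, "_keras_mask", None))
```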

  • AUDIENCE: So if you were to do something like--

  • I don't know, like a [INAUDIBLE] or something

  • that's not a layer.

  • FRANCOIS CHOLLET: Yes?

  • AUDIENCE: Would-- you-- so you get a tensor that doesn't

  • have a Keras mask attribute.

  • But then you say-- but I guess you also were going

  • to wrap it into a Lambda layer.

  • FRANCOIS CHOLLET: So what happens

  • when you call ops that are not layers

  • is that they get retroactively cast into layers, essentially.

  • Like, we construct objects that are layers,

  • and they are going to be internally calling these ops.

  • But these automatically generated layers in the general case

  • do not support masking.

  • So if you do this, you are destroying the mask,

  • and you're going to be passing your mask

  • to a layer that does not support it, which is an error.

  • So it's not going to--

  • AUDIENCE: It's not a silent--

  • FRANCOIS CHOLLET: It's not a silent failure.

  • It's an error.

  • If you pass a mask to a layer which

  • is one of these automatic layers,

  • in this case, that does not support masking,

  • it's going to yell at you.

  • AUDIENCE: So, but wait a minute.

  • I feel like lots of layers are going to be

  • like trivial pass-throughs.

  • Like, if there's a mask, we want to pass it through,

  • but if there's not a mask, that's fine.

  • FRANCOIS CHOLLET: Yeah.

  • So it has to be opt-in, again, because any change to the time

  • dimension of the inputs would need a smarter mask

  • computation.

  • And we cannot just always implicitly pass through

  • the mask, because you don't actually know what the layer is

  • doing.

  • AUDIENCE: If you-- couldn't you choose to implicitly pass

  • through the mask if the shape of the outputs

  • matches the shape of the inputs?

  • AUDIENCE: But what about something like a [INAUDIBLE]??

  • AUDIENCE: That has the same mask.

  • That should actually respect the mask.

  • FRANCOIS CHOLLET: That-- I think it's a reasonable default

  • behavior.

  • It's not going to work all the time, actually.

  • You can think of adversarial counterexamples.

  • AUDIENCE: [INAUDIBLE]

  • FRANCOIS CHOLLET: But they're not like,

  • common counterexamples.

  • But yeah.

  • So currently masking with the functional API

  • is something that's really only useful with built-in layers.

  • So it's not really an issue we've run into before.

  • I like the fact that currently it's

  • opt-in, because this actually saves us

  • from precisely people generating a mask in some layer

  • and then passing it to a custom layer

  • or an automatically generated layer that does not support it.

  • And that could potentially do things you don't expect, right?

  • So it's better to--

  • AUDIENCE: And is the mask supported

  • with an eager execution?

  • FRANCOIS CHOLLET: Yes, it is.

  • So in eager execution, essentially

  • what happens is that the call method, which

  • is programmatically generated for one of these functional API

  • models, is basically going to call both--

  • [INAUDIBLE] the call method of the layer and its compute_mask

  • method, and is going to call the next layer

  • with these arguments.

  • So essentially, very much what you

  • would be doing if you were to use

  • masking using subclassing, which is basically,

  • you call your layer.

  • You get the outputs of your sublayer.

  • I don't have a good example in here.

  • So you call a layer, get these outputs.

  • Then you generate the mask using compute_mask explicitly.

  • And for the next layer, you're going to be explicitly passing

  • these [INAUDIBLE].
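
A hedged sketch of that subclassing pattern (the layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

class MaskedClassifier(keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = layers.Embedding(1000, 16, mask_zero=True)
        self.lstm = layers.LSTM(32)
        self.out = layers.Dense(1)

    def call(self, inputs):
        x = self.embedding(inputs)
        # Generate the mask explicitly with compute_mask...
        mask = self.embedding.compute_mask(inputs)
        # ...and pass it explicitly to the next layer via the mask=
        # keyword argument. Nothing is implicit here.
        x = self.lstm(x, mask=mask)
        return self.out(x)
```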

  • AUDIENCE: And is it just a keyword argument, mask equals?

  • FRANCOIS CHOLLET: Yes, that's right.

  • AUDIENCE: So all these layers [INAUDIBLE]..

  • So if I feel like a mask is going to [INAUDIBLE]

  • they can always fix it by passing an explicit one?

  • FRANCOIS CHOLLET: Yes, that's correct.

  • You can pass arbitrary Boolean tensors -- well, 2D Boolean

  • tensors -- as the mask argument.

  • So, in general, if you are doing subclassing,

  • nothing is implicit, and you have

  • freedom to do whatever you want, basically.

  • AUDIENCE: Question about masking in the last--