JARED DUKE: Thanks everybody for showing up. My name is Jared. I'm an engineer on the TensorFlow Lite team. Today I will be giving a very high level overview, with a few deep dives, into the TensorFlow Lite stack: what it is, why we have it, and what it can do for you. Again, this is a very broad topic, so there will be some follow up here. And if you have any questions, feel free to interrupt me. This is meant to be enlightening for you, but it will be a bit of a whirlwind. So let's get started.

First off, I do want to talk about some of the origins of TensorFlow Lite and what motivated its creation: why we have it in the first place, and why we can't just use TensorFlow on devices. I'll briefly review how you actually use TensorFlow Lite, that is, how you use the converter and how you use the runtime. And then I'll talk a little bit about performance considerations: how you can get the best performance on device when you're using TensorFlow Lite.

OK. Why do you need TensorFlow Lite in your life? Well, again, here's some kind of boilerplate motivation for why we need on-device ML. But these are actually important use cases. You don't always have a connection. You can't always be running inference in the cloud and streaming that to your device. A lot of devices, particularly in developing countries, have restrictions on bandwidth. They can't just be streaming live video to get their selfie segmentation. They want that done locally on their phone. There are issues with latency if you need real-time object detection; streaming to the cloud, again, is problematic. And then there are issues with power. On a mobile device, often the radio is using the most power on your device. So if you can do things locally, particularly with a hardware backend like a DSP or an NPU, you will extend your battery life.

But along with mobile ML execution, there are a number of challenges: memory constraints, and the low-powered CPUs that we have on mobile devices. There's also a very fragmented and heterogeneous ecosystem of hardware backends. This isn't like the cloud, where often you have a primary provider of your acceleration backend with, say, NVIDIA GPUs or TPUs. There's a large class of different kinds of accelerators, and there's a question of how we can actually leverage all of these. So again, TensorFlow works great on large, well-powered devices in the cloud and locally on beefy workstation machines. But TensorFlow Lite is not focused on these cases. It's focused on the edge.

So stepping back a bit, we've had TensorFlow for a number of years. Why couldn't we just trim it down and run it on a mobile device? This is actually what we called the TensorFlow Mobile project. And we tried this. After a lot of effort, a lot of hours, and blood, sweat, and tears, we were able to create a reduced variant of TensorFlow with a reduced operator set and a trimmed down runtime. But we were hitting a lower bound on where we could go in terms of the size of the binary. And there were also issues in how we could make that runtime a bit more extensible, how we could map it onto all these different kinds of accelerators that you get in a mobile environment. And while there have been a lot of improvements in the TensorFlow ecosystem with respect to modularity, it wasn't quite where we needed it to be to make that a reality.

AUDIENCE: How small a memory do you need to get to?

JARED DUKE: Memory?

AUDIENCE: Yeah. Three [INAUDIBLE] seem too much.
JARED DUKE: So this is just the binary size.

AUDIENCE: Yeah. Yeah. [INAUDIBLE]

JARED DUKE: So in app size. In terms of memory, it's highly model dependent. If you're using a very large model, then you may be required to use lots of memory. But there are different considerations that we've taken into account with TensorFlow Lite to reduce the memory consumption.

AUDIENCE: But your size, how small is it?

JARED DUKE: With TensorFlow Lite?

AUDIENCE: Yeah.

JARED DUKE: So the core interpreter runtime is 100 kilobytes. And then with our full set of operators, it's less than a megabyte.

So TFMini was a project that shares some of the same origins with TensorFlow Lite. It was, effectively, a toolchain where you could take your frozen model, convert it, and it did some high level operator fusing. Then it would do codegen, and it would bake your model into your actual binary. Then you could run this on your device and deploy it. It was well-tuned for mobile devices. But again, there are problems with portability when you're baking the model into an actual binary. You can't always stream this from the cloud and rely on this being a secure path, and it's often discouraged. And this was more of a first party solution for a lot of vision-based use cases and not a general purpose solution.

So enter TensorFlow Lite: a lightweight machine learning library for mobile and embedded devices. The goals behind this were making ML easier, making it faster, and making the binary size and memory impact smaller. And I'll dive into each of these a bit more in detail in terms of what it looks like in the TensorFlow Lite stack. But again, the chief considerations were reducing the footprint in memory and binary size, making conversion straightforward, and having a set of APIs that are focused primarily on inference. So you've already crafted and authored your models; how can you just run and deploy these on a mobile device? And then taking advantage of mobile-specific hardware like ARM CPUs, and the DSPs and NPUs that are in development.

So let's talk about the actual stack. TensorFlow Lite has a converter where you ingest GraphDefs, SavedModels, frozen graphs, and convert them to a TensorFlow Lite specific model file format. And I'll dig into the specifics there. There's an interpreter for actually executing inference. There's a set of ops, which we call the TensorFlow Lite dialect of operators, which is slightly different from the core TensorFlow operators. And then there's a way to plug in these different hardware accelerators. Just walking through this briefly, again, the converter spits out a TFLite model. You feed it into your runtime. It's got a set of optimized kernels and then some hardware plugins.

So let's talk a little bit more about the converter itself and things that are interesting there. It does things like constant folding. It does operator fusing, where you're baking the activations and the bias computation into these high level operators like convolution, which we found to provide a pretty substantial speedup on mobile devices. Quantization was one of the chief considerations in developing this converter, supporting both quantization-aware training and post-training quantization. And it was based on FlatBuffers. FlatBuffers are an analog to protobufs, which are used extensively in TensorFlow, but they were developed with more real-time considerations in mind, specifically for video games.
And the idea is that you can take a FlatBuffer, map it into memory, and then read and interpret it directly. There's no unpacking step. This has a lot of nice advantages. You can actually map it into a page and it's clean; it's not a dirty page, and you're not dirtying up your heap. That's extremely important in mobile environments where you are constrained on memory, where the app is often going in and out of the foreground, and where there's low-memory pressure. And there's also a smaller binary size impact when you use FlatBuffers relative to protobufs.

So the interpreter, again, was built from the ground up with mobile devices in mind. It has fewer dependencies; we try not to depend on really anything at base, and we have very few absolute dependencies. I already talked about the binary size here. It's quite a bit smaller than-- the minimum binary size we were able to get with TensorFlow Mobile was about three megabytes for just the runtime, and that's without any operators. It was engineered to start up quickly. That's a combination of being able to map your models directly into memory, but then also having a static execution plan, where during conversion we basically map out directly the sequence of nodes that will be executed. And then for the memory planning, there's basically a pass when you're running your model where we prepare each operator, and they queue up a bunch of allocations. Those are all baked into a single pass where we then allocate a single block of memory, and tensors are just placed into that large contiguous block of memory.
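To make the static memory-planning idea above concrete, here is a toy Python sketch, not TFLite's actual implementation, of the general approach: walk the execution plan once, record each intermediate tensor's size and lifetime, and greedily assign offsets inside a single preallocated arena so tensors with disjoint lifetimes can share space. All names and numbers are illustrative.

```python
# Toy illustration of static memory planning (not TFLite's actual code):
# given each intermediate tensor's byte size and the first/last op that uses it,
# assign offsets into one arena, reusing space once a tensor is dead.

def plan_arena(tensors):
    """tensors: list of (name, size, first_use, last_use), ops indexed in execution order."""
    allocations = {}   # name -> (offset, size)
    live = []          # currently live allocations: (offset, size, last_use)
    arena_size = 0
    for name, size, first_use, last_use in sorted(tensors, key=lambda t: t[2]):
        # Drop allocations whose tensor is no longer needed at this point.
        live = [a for a in live if a[2] >= first_use]
        # Greedily find the lowest offset that does not overlap a live allocation.
        offset = 0
        for a_off, a_size, _ in sorted(live):
            if offset + size <= a_off:
                break
            offset = max(offset, a_off + a_size)
        allocations[name] = (offset, size)
        live.append((offset, size, last_use))
        arena_size = max(arena_size, offset + size)
    return allocations, arena_size

# Example: three intermediate tensors in a four-op plan.
allocs, total = plan_arena([("a", 1024, 0, 1), ("b", 2048, 1, 2), ("c", 1024, 2, 3)])
print(allocs, total)  # "a" and "c" share space, so the arena is smaller than the sum of sizes.
```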
We don't yet support control flow, but I will be talking about that later in the talk. It's something that we're thinking about and working on. It's on the near horizon for actual shipping models.

So what about the operator set? We support float and quantized types for most of our operators. A lot of these are backed by hand-tuned NEON and assembly-based kernels that are specifically optimized for ARM devices. Ruy is our newest GEMM backend for TensorFlow Lite, and it was built from the ground up with mobile execution in mind, a [INAUDIBLE] execution. We support about 120 built-in operators right now. You will probably realize that that's quite a bit smaller than the set of TensorFlow ops, which is probably into the thousands by now; I'm not exactly sure. So that can cause problems, but I'll dig into some solutions we have on the table for that. I already talked about some of the benefits of these high level kernels having fused activations and biases. And then we have a way for you to, at conversion time, stub out custom operators that you would like. Maybe we don't yet support them in TFLite, or maybe it's a one-off operator that's not yet supported in TensorFlow. And then you can plug in your operator implementation at runtime.

So the hardware acceleration interface: we call them delegates. This is basically an abstraction that allows you to plug in and accelerate subgraphs of the overall graph. We have NNAPI, GPU, EdgeTPU, and DSP backends on Android. And then on iOS, we have a Metal delegate backend. I'll be digging into some of these and their details here in a few slides.

OK. So what can I do with it? Well, this is largely a lot of the same things that you can do with TensorFlow. There are a lot of speech and vision-related use cases. I think often we think of mobile inference as being image classification and speech recognition. But there are quite a few other use cases that are being used now and are in deployment. We're being used broadly across a number of both first party and third party apps.

OK. So let's start with models. We have a number of models in this model repo that we host online. You can use models that have already been authored in TensorFlow and feed those into the converter. We have a number of tools and tutorials on how you can apply transfer learning to your models to make them more specific to your use case, or you can author models from scratch and then feed those into the conversion pipeline.

So let's dig into conversion and what that actually looks like. Well, here's a brief snippet of how you would take a SavedModel, feed it into our converter, and output a TFLite model. It looks really simple. In practice, we would like to say that this always just works. That's sadly not yet a reality. There are a number of failure points that people run into. I've already highlighted the mismatch in terms of supported operators; that's a big pain point, and we have some things in the pipeline to address it. There are also different semantics in TensorFlow that aren't yet natively supported in TFLite, things like control flow, which we're working on, and things like assets, hash tables, TensorLists, those kinds of concepts. Again, they're not yet natively supported in TensorFlow Lite. And then certain types we just don't support; they haven't been prioritized in TensorFlow Lite. Double-precision execution, bfloat16, even fp16 kernels are not natively supported by the TFLite built-in operators.

So how can we fix that? Well, a number of months ago, we started a project called-- well, the name is a little awkward. It's using select TensorFlow operators in TensorFlow Lite. Effectively, what this does is allow you, as a last resort, to convert your model with the set of operators that we don't yet support. And then at runtime, you can plug in this TensorFlow Select piece of code, and it will let you run these TensorFlow kernels within the TFLite runtime at the expense of a modest increase in your binary size. What does that actually mean? The converter basically recognizes these TensorFlow operators, and if you say you want to use them and there's no TFLite built-in counterpart, it will take that NodeDef and bake it into the TFLite custom operator that's output. And then at runtime, we have a delegate which resolves this custom operator, does some data marshaling into the eager execution of TensorFlow, which again would be built into the TFLite runtime, and then marshals that data back out into the TFLite tensors. There's some more information that I've linked to here. And the way you can actually take advantage of this: here's our original Python conversion script. You drop in this line basically saying the target ops set includes these select TensorFlow ops. So that's one thing that can improve the conversion and runtime experience for models that aren't yet natively supported.
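For reference, a minimal sketch of the conversion flow just described, assuming a SavedModel directory on disk and the TF 2.x converter API (the select-TF-ops setting is the optional fallback for operators with no TFLite builtin counterpart; exact flag spellings have varied slightly across releases):

```python
import tensorflow as tf

# Convert a SavedModel into a TFLite FlatBuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Optional fallback: allow select TensorFlow ops for anything that has no
# TFLite builtin counterpart (costs a modest increase in binary size at runtime).
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```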
Another issue that we've had historically-- our converter was called TOCO. And its roots were in this TFMini project, which was trying to statically compute and bake this graph into your runtime. And it was OK for it to fail, because it would all be happening at build time. But what we saw is that that led to a lot of hard-to-decipher, opaque error messages and crashes. And we've since set out to build a new converter based on MLIR, which basically gives us tooling that feeds into this converter, helping us map from the TensorFlow dialect of operators to the TensorFlow Lite dialect of operators with far more formal mechanisms for translating between the two. And this, we think, will give us far better debugging, error messages, and hints on how you can actually fix conversion. The other reason that motivated the switch to a new converter was to support control flow. This will initially start by supporting functional control flow forms, so if and while ops. We're still considering how we can potentially map legacy control flow forms to these new variants, but this is where we're going to start. And so far, we see that this will unlock a pretty large class of useful models, the RNN-type models that so far have been very difficult to convert to TensorFlow Lite.

TensorFlow 2.0: it's supported. There's not a whole lot that changes on the conversion end, and certainly nothing that changes on the TFLite end, except for maybe the change that SavedModel is now the primary serialization format in TensorFlow. And we've also made a few tweaks and added some sugar for our conversion APIs when using quantization.

OK. So you've converted your model. How do you run it? Here's an example of our API usage in Java. You basically create your input buffer and your output buffer. It doesn't necessarily need to be a ByteBuffer; it could be a single or multidimensional array. You create your interpreter, you feed it your TFLite model, and there are some options that you can give it, which we'll get to later. And then you run inference. And that's about it.

We have different bindings for different platforms. Our first-class bindings are Python, C++, and Java. We also have a set of experimental bindings that we're working on, in various states of both use and stability. But soon we plan to have our Objective-C and Swift bindings be stable, and they'll be available as the normal deployment libraries that you would get on iOS via CocoaPods. And then for Android, you can use our AARs from JCenter/Bintray for Java. But those are primarily focused on third party developers.

There are other ways you can actually reduce the binary size of TFLite. I mentioned that the core runtime is 100 kilobytes; there's about 800 or 900 kilobytes for the full set of operators. But there are ways that you can basically trim that down and only include the operators that you use, and everything else gets stripped by the linker. We expose a few build rules that help with this. You feed them your TFLite model, and they'll parse it and output, basically, a .cc file which does the actual op registration. And then you can rely on your linker to strip the unused kernels.
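The runtime example above is described in Java; to keep the sketches here in one language, below is the equivalent flow with the Python tf.lite.Interpreter, which follows the same create-then-run pattern as the Java Interpreter class. The zero-filled input is just a placeholder, and the short timing loop at the end is only a rough desktop-side latency check, not a substitute for the native on-device benchmark tool described next.

```python
import time
import numpy as np
import tensorflow as tf

# Load the converted model and allocate tensor buffers.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input matching the model's declared shape and dtype.
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

# Run inference and read back the output tensor.
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])

# Rough latency check: average a few invocations after the warm-up run above.
start = time.perf_counter()
for _ in range(50):
    interpreter.invoke()
print("avg latency: %.2f ms" % ((time.perf_counter() - start) / 50 * 1000))
```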
OK. So you've got your model converted. It's up and running. How do you make it run fast? We have a number of tools to help with this. We have a number of backends that I talked about already, and I'll be digging into a few of these to highlight how they can help and how you can use them. So we have a benchmarking tool. It allows you to identify bottlenecks when actually deploying your model on a given device. It can output profiles of which operators are taking the most time. It lets you plug in different backends and explore how this actually affects inference latency. Here's an example of how you would build this benchmark tool. You would push it to your device. You would then run it. You can give it different configuration options, and we have some helper scripts that help do this all automatically for you.

What does the output look like? Well, here you can get a breakdown of timing for each operator in your execution plan. You can isolate bottlenecks here. And then you get a nice summary of where time is actually being spent.

AUDIENCE: In that information, is there just the operation type, or do we also know if it's the earlier convolutions of the network or the later convolutions in the network, or something like that?

JARED DUKE: Yeah. So there's two breakdowns. One is the run order, which is actually every single operator in sequence. And then there's the summary, where it coalesces each operator type into a single class, and you get a nice summary there. So this is useful for, one, identifying bottlenecks. If you have control over the graph and the authoring side of things, then you can maybe tailor the topology of your graph. But otherwise, you can file a bug on the TFLite team, and we can investigate these bottlenecks and identify where there's room for improvement. But it also affords-- it affords you, I guess, the chance to explore some of the more advanced performance techniques, like using these hardware accelerators.

I talked about delegates. The real power, I think, of delegates is that it's a nice way to holistically optimize your graph for a given backend. That is, you're not just delegating each op one by one to this hardware accelerator; you can take an entire subgraph of your graph and run that on an accelerator. And that's particularly advantageous for things like GPUs or neural accelerators, where you want to do as much computation on the device as possible with no CPU interop in between.

So NNAPI is the abstraction in Android for accelerating ML. It was actually developed fairly closely in tandem with TFLite; you'll see a lot of similarities in the high level op definitions found in NNAPI and those found in TFLite. This is effectively an abstraction layer at the platform level that we can hook into on the TensorFlow Lite side, and then vendors can plug in their particular drivers for DSPs, for GPUs. And with Android Q, it's really getting to a nice stable state where it's approaching parity in terms of features and ops with TensorFlow Lite. And there's increased adoption, both in terms of user base and in terms of hardware vendors that are contributing to these drivers.

More recently, we've released our GPU backend, and we've also open sourced it. This can yield a pretty substantial speedup on many floating point convolution models, particularly larger models. There is a small binary size cost that you have to pay, but if it's a good match for your model, then it can be a huge win. And we've found a number of clients that are deploying this with things like face detection and segmentation.

AUDIENCE: Because if you're on top of [INAUDIBLE] GPU.

JARED DUKE: Yeah, so on Android, there's a GLES backend. There's also an OpenCL backend in development that will afford a 2 to 3x speedup over the GLES backend. There's also a Vulkan backend, and then on iOS, it's Metal-based. There are other delegates and accelerators in various states of development. One is for the Edge TPU project, which can either use runtime on-device compilation, or you can take advantage of the conversion step to bake the compiled model into the TFLite graph itself.
We also announced, at Google I/O, support for Qualcomm's Hexagon DSPs, which we'll be releasing publicly soon-ish. And then there are some more exotic optimizations that we're making for the floating point CPU backend.

So how do you take advantage of some of these backends? Well, here is our standard usage of the Java APIs for inference. If you want to use NNAPI, you create your NNAPI delegate, you feed it into your interpreter options, and away you go. And it's quite similar for using the GPU backend. There are some more sophisticated and advanced techniques for both NNAPI and GPU interop. This is one example where you can basically use a GL texture as the input to your graph; that way, you avoid needing to copy-- marshal data back and forth between CPU and GPU.

What are some other things we've been working on? Well, the default out-of-the-box performance is something that's critical, and we recently landed a pretty substantial speedup there with this Ruy library. Historically, we've used what's called gemmlowp for quantized matrix multiplication, and then Eigen for floating point multiplication. Ruy was built from the ground up basically to [INAUDIBLE] throughput much sooner in terms of the size of the inputs to, say, a given matrix multiplication operator, whereas more desktop- and cloud-oriented matrix multiplication libraries are focused on peak performance at larger sizes. And we found that this, for a large class of convolution models, is providing at least a 10% speedup. But then on our multi-threaded floating point models, we see two to three times the speedup, and the same on more recent hardware that has these NEON dot-product intrinsics. There are some more optimizations in the pipeline. We're also looking at different types-- sparse and fp16 tensors-- to take advantage of mobile hardware, and we'll be announcing related tooling and feature support soon-ish.

OK, so a number of best practices here to get the best performance possible. Just pick the right model. We find a lot of developers come to us with Inception, and it's hundreds of megabytes and takes seconds to run inference, when they can get just as good accuracy, sometimes even better, with an equivalent MobileNet model. So that's a really important consideration. We have tools for benchmarking and profiling. Take advantage of quantization where possible; I'm going to dig into how you can actually use quantization in a little bit. It's really a topic in itself, and there will be, I think, a follow-up session about quantization. But it's a cheap way of reducing the size of your model and making it run faster out of the box on CPU. Take advantage of accelerators, and then for some of these accelerators, you can also take advantage of zero copy.

So with this library of accelerators and many different permutations of quantized or floating point models, it can be quite daunting for many developers, probably most developers, to figure out how best to optimize their model and get the best performance. So we're thinking about and working on some projects to make this easy. One is accelerator whitelisting: when is it better to use, say, a GPU or NNAPI versus the CPU? That's both local tooling to identify this for, say, a device you've plugged into your dev machine, and potentially a service, where we can farm this out across a large bank of devices and automatically determine it.
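The delegate usage above is described in terms of the Java API (an NnApiDelegate or GpuDelegate added to the Interpreter options). As a hedged sketch of the same idea in Python, one can attach a delegate with tf.lite.experimental.load_delegate; the shared-library name below is a placeholder, and which delegate binaries are actually available depends on your platform and build.

```python
import tensorflow as tf

# Placeholder name for a delegate shared library (GPU, Hexagon, etc.);
# substitute the delegate binary you actually built for your target.
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# Attach the delegate when constructing the interpreter: supported subgraphs
# run on the accelerator, everything else falls back to the CPU kernels.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
```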
There are also cases where you may want to run parts of your graph on different accelerators. Maybe parts of it map better to a GPU or a DSP. And then there's also the issue of different apps running ML simultaneously: you have hotword detection running at the same time you're running selfie segmentation with a camera feed, and they're both trying to access the same accelerator. How can you coordinate efforts to make sure everyone's playing nicely? So these are things we're working out. We plan on releasing tooling that can improve this over the next quarter or two.

So we talked about quantization. There are a number of tools available now to make this possible, and a number of things being worked on. In fact, yesterday, we just announced our new post-training quantization that does full quantization. I'll be talking about that more here in the next couple of slides. Actually, going back a bit, we've long had what's called our legacy quantized training path, where you would instrument your graph at authoring time with these fake quant nodes, and then you could use that to actually generate a fully quantized model as the output of the TFLite conversion process. And this worked quite well, but it was-- it can be quite painful to use and quite tedious. And we've been working on tooling to make it a lot easier to get the same performance, both in terms of model size reduction and runtime acceleration speedup.

AUDIENCE: Is part about the accuracy-- it seems like training time [INAUDIBLE].

JARED DUKE: Yeah, you generally do. So we first introduced this post-training quantization path, which is hybrid, where we are effectively just quantizing the weights, and then dequantizing them at runtime and running everything in fp32. And there was an accuracy hit here. It depends on the model how bad that is, but sometimes it was far enough off the mark from quantization-aware training that it was not usable. And so that's where-- so again, with the hybrid quantization, there are a number of benefits. I'm flying through slides just in the interest of time. The way to enable that post-training quantization: you just add a flag to the conversion path, and that's it. But on the accuracy side, that's where we came up with some new tooling. We're calling it per-axis or per-channel quantization, where for the weights, you wouldn't just have a single set of quantization parameters for the entire tensor; it would be per channel in the tensor. And we found that that, in combination with feeding it an evaluation data set at conversion time, where you would explore the range of possible quantization parameters, lets us get accuracy that's almost on par with quantization-aware training.

AUDIENCE: I'm curious, are some of these techniques also going to be used for TensorFlow.js, or did they not have this-- do they not have similarities? They use MobileNet, right, for a browser?

JARED DUKE: They do. These aren't yet, as far as I'm aware, used or hooked into the TFJS pipeline. There's no reason they couldn't be. I think part of the problem is just very different toolchains for development. But--

AUDIENCE: How do you do quantized operations in JavaScript? [INAUDIBLE]

JARED DUKE: Yeah, I mean, I think the benefit isn't as clear, probably not as much as if you were just quantizing to fp16. That's where you'd probably get the biggest win for TFJS. In fact, I left it out of these slides, but we are actively working on fp16 quantization. You can reduce the size of your model by half, and then it maps really well to GPU hardware.
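A minimal sketch of the "just add a flag" hybrid post-training quantization path described above, using the TF 2.x converter API (flag names have shifted slightly across releases). The commented-out line shows the fp16 variant mentioned at the end, which roughly halves model size and maps well to GPU hardware.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Hybrid (weight-only) post-training quantization: weights are stored as 8-bit
# and dequantized at runtime, with kernels still computing in float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Optional fp16 variant instead: keep float compute but store weights in half
# precision, which roughly halves the model size.
# converter.target_spec.supported_types = [tf.float16]

tflite_quant_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```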
But I think one thing that we want is for quantization to not just be a TFLite thing, but kind of a universally shared concept in the TensorFlow ecosystem. And how can we take the tools that we already have, which are sort of coupled to TFLite, and make them more generally accessible? So to use this new post-training quantization path, where you can get comparable accuracy to training-time quantization, effectively the only difference is feeding in a representative data set of what the inputs to your graph would look like. It can be a-- for an image-based model, maybe you feed it 30 images. And then it is able to explore the space of quantization and output values that would largely match, or be close to, what you would get with quantization-aware training.

We have lots of documentation available. We have a model repo that we're going to be investing heavily in to expand. What we find is that a lot of TensorFlow developers-- or not even TensorFlow developers-- app developers will find some random graph when they search Google or GitHub. And they try to convert it, and it fails. And a lot of times, either we have a model that's already been converted or a similar model that's better suited for mobile. We would rather have a very robust repository that people start with, and then only if they can't find an equivalent model do they resort to our conversion tools or even authoring tools.

AUDIENCE: Is there a TFLite-compatible section in TF Hub?

JARED DUKE: Yeah, we're working on that. Talked about the model repo, training. So what if you want to do training on device? That is a thing. We have an entire team, the [INAUDIBLE] Federated Learning team, that's focused on this. But we haven't supported this in TensorFlow Lite for a number of reasons, though it's something that we're working on. There are quite a few bits and components that still have yet to land to support this, but it's something that we're thinking about, and there is increasing demand for this kind of on-device tuning or transfer learning scenario. In fact, this is something that was announced at WWDC, so.

So we have a roadmap up. It's now something that we publish publicly to make it clear what we're working on and what our priorities are. I touched on a lot of the things that are in the pipeline, things like control flow and training, and improving our runtime. Another thing that we want to make easier is just using TFLite with the kind of native types that you are used to using. If you're an Android developer, say, and you have a Bitmap, you don't want to convert it to a ByteBuffer. You just want to feed us your Bitmap, and things just work. So that's something that we're working on. A few more links here to authoring apps with TFLite, and different roadmaps for performance and model optimization. That's it. So any questions, any areas you'd like to dive into more deeply?

AUDIENCE: So this [INAUDIBLE]. So what is [INAUDIBLE] has more impact like a fully connected [INAUDIBLE]?

JARED DUKE: Sorry. What's--

AUDIENCE: For a speed-up.

JARED DUKE: Oh. Why does it?

AUDIENCE: Yeah.

JARED DUKE: So certain operators have been, I guess, more optimized to take advantage of quantization than others. And so in the hybrid quantization path, we're not always doing computation in eight-bit types. We're doing it in a mix of floating point and eight-bit types, and that's why you don't always get the same speed-up with, say, an LSTM or an RNN versus a [INAUDIBLE] operator.
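Pulling the quantization pieces together, here is a sketch of the representative-data-set flow described earlier in this section. The random placeholder inputs and the 1x224x224x3 shape are assumptions for illustration; in practice you would yield real samples (for example, around 30 images) in whatever shape and dtype your model expects.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder: yield a few dozen representative inputs. Random data is used
    # here only for illustration; real sample inputs give meaningful ranges.
    for _ in range(30):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# The converter runs the representative inputs through the graph to calibrate
# quantization parameters for a fully quantized model.
tflite_quant_model = converter.convert()
```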
AUDIENCE: So you mentioned that TFLite is on billions of mobile devices. How many apps have you seen added to the Play Store that have TFLite in them?

JARED DUKE: Tim would have the latest numbers. It's-- I want to say it's into the tens of thousands, but I don't know that I can say that. It's certainly in the several thousands, but we've seen a pretty dramatic uptick, though, just tracking Play Store analytics.

AUDIENCE: And in the near term, are you thinking more about trying to increase the number of devices that are using TFLite, or trying to increase the number of developers that are including it in the applications that they build?

JARED DUKE: I think both. I mean, there are projects like TF Micro, where we want to support actual microcontrollers and running TFLite on extremely restricted, low-power ARM devices. So that's one class of efforts. We have also seen demand for actually running TFLite in the cloud. There are a number of benefits with TFLite, like the startup time and the lower memory footprint, that do make it attractive. And some developers actually want to run the same model they're running on device in the cloud, and so there is demand for having a proper x86-optimized backend. But at the same time, I think one of our big focuses is just making it easier to use-- meeting developers where they're at. Part of that is a focus on creating a very robust model repository and more idiomatic APIs they can use on Android or iOS with the types they're familiar with, and then just making conversion easy. Right now, if you take a random model that you found off the web and try to feed it into our converter, chances are that it will probably fail. And some of that is just teaching developers how to convert just the part of the graph they want, not necessarily all of the training that's surrounding it. And part of it is just adding the features and types to TFLite that would match the semantics of TensorFlow. I will say that in the long run, we want to move toward a more unified path with TensorFlow and not live in somewhat disjoint worlds, where we can take advantage of the same core runtime libraries, the same core conversion pipelines, and the same optimization pipelines. So those are things that we're thinking about for the longer term future.

AUDIENCE: Yeah, and also [INAUDIBLE] like the longer term. I'm wondering what's the implication of the ever increasing network speed on the [INAUDIBLE] TFLite? [INAUDIBLE], which maybe [INAUDIBLE] faster than current that we've [INAUDIBLE] take [INAUDIBLE] of this.

JARED DUKE: We haven't thought a whole lot about that, to be honest. I mean, I think we're still betting on the reality that there will always be a need for on-device ML. I do think, though, that 5G probably unlocks some interesting hybrid scenarios, where you're doing some on-device and some cloud-based ML. And I think, for a while now, the fusion of on-device hotword detection, where as soon as the "OK Google" is detected, it starts feeding things into the cloud, has been an example of where there is room for these hybrid solutions. And maybe those will become more and more practical. Everyone is going to run to their desk and start using TensorFlow Lite after this?

AUDIENCE: You probably already are, right? [INAUDIBLE] if you have one of the however many apps that was listed on Tim's slide, right?

JARED DUKE: I mean, yeah. If you've ever said, "OK Google," then you're using TensorFlow Lite.

AUDIENCE: [INAUDIBLE]. Thank you.

JARED DUKE: Thank you.

[APPLAUSE]