JARED DUKE: Thanks everybody for showing up. My name is Jared. I'm an engineer on the TensorFlow Lite team. Today I will be giving a very high level overview, with a few deep dives, into the TensorFlow Lite stack: what it is, why we have it, and what it can do for you. Again, this is a very broad topic, so there will be some follow up here. And if you have any questions, feel free to interrupt me. This is meant to be enlightening for you, but it will be a bit of a whirlwind. So let's get started.

First off, I do want to talk about some of the origins of TensorFlow Lite and what motivated its creation: why we have it in the first place, and why we can't just use TensorFlow on devices. I'll briefly review how you actually use TensorFlow Lite, that is, how you use the converter and how you use the runtime. And then I'll talk a little bit about performance considerations: how you can get the best performance on device when you're using TensorFlow Lite.

OK. Why do you need TensorFlow Lite in your life? Well, again, here's some kind of boilerplate motivation for why we need on-device ML. But these are actually important use cases. You don't always have a connection. You can't always be running inference in the cloud and streaming that to your device. A lot of devices, particularly in developing countries, have restrictions on bandwidth. They can't just be streaming live video to get their selfie segmentation. They want that done locally on their phone. There are issues with latency if you need real-time object detection; streaming to the cloud, again, is problematic. And then there are issues with power. On a mobile device, often the radio is using the most power on your device. So if you can do things locally, particularly with a hardware backend like a DSP or an NPU, you will extend your battery life.

But along with mobile ML execution, there are a number of challenges: memory constraints, and the low-powered CPUs that we have on mobile devices. There's also a very fragmented and heterogeneous ecosystem of hardware backends. This isn't like the cloud, where often you have a primary provider of your acceleration backend with, say, NVIDIA GPUs or TPUs. There's a large class of different kinds of accelerators, and there's a question of how we can actually leverage all of these. So again, TensorFlow works great on large, well-powered devices in the cloud and locally on beefy workstation machines. But TensorFlow Lite is not focused on these cases. It's focused on the edge.

So stepping back a bit, we've had TensorFlow for a number of years. Why couldn't we just trim it down and run it on a mobile device? This is actually what we called the TensorFlow Mobile project. And we tried this. After a lot of effort, a lot of hours, and blood, sweat, and tears, we were able to create a reduced variant of TensorFlow with a reduced operator set and a trimmed down runtime. But we were hitting a lower bound on where we could go in terms of the size of the binary. And there were also issues in how we could make that runtime a bit more extensible, how we could map it onto all these different kinds of accelerators that you get in a mobile environment. And while there have been a lot of improvements in the TensorFlow ecosystem with respect to modularity, it wasn't quite where we needed it to be to make that a reality.

AUDIENCE: How small a memory do you need to get to?

JARED DUKE: Memory?

AUDIENCE: Yeah. Three [INAUDIBLE] seem too much.
JARED DUKE: So this is just the binary size.

AUDIENCE: Yeah. Yeah. [INAUDIBLE]

JARED DUKE: So in app size. In terms of memory, it's highly model dependent. If you're using a very large model, then you may be required to use lots of memory. But there are different considerations that we've taken into account with TensorFlow Lite to reduce the memory consumption.

AUDIENCE: But your size, how small is it?

JARED DUKE: With TensorFlow Lite?

AUDIENCE: Yeah.

JARED DUKE: So the core interpreter runtime is 100 kilobytes. And then with our full set of operators, it's less than a megabyte.

So TFMini was a project that shares some of the same origins with TensorFlow Lite. It was, effectively, a toolchain where you could take your frozen model, convert it, and it did some high level operator fusing. Then it would do codegen, and it would bake your model into your actual binary. Then you could run this on your device and deploy it. It was well-tuned for mobile devices. But again, there are problems with portability when you're baking the model into an actual binary. You can't always stream this from the cloud and rely on this being a secure path, and it's often discouraged. And this was more of a first party solution for a lot of vision-based use cases and not a general purpose solution.

So enter TensorFlow Lite: a lightweight machine learning library for mobile and embedded devices. The goals behind this were making ML easier, making it faster, and making the binary size and memory impact smaller. And I'll dive into each of these a bit more in detail in terms of what it looks like in the TensorFlow Lite stack. But again, the chief considerations were reducing the footprint in memory and binary size, making conversion straightforward, and having a set of APIs that are focused primarily on inference. So you've already crafted and authored your models; how can you just run and deploy these on a mobile device? And then taking advantage of mobile-specific hardware like ARM CPUs, and the DSPs and NPUs that are in development.

So let's talk about the actual stack. TensorFlow Lite has a converter where you ingest GraphDefs, SavedModels, frozen graphs, and convert them to a TensorFlow Lite specific model file format. And I'll dig into the specifics there. There's an interpreter for actually executing inference. There's a set of ops, which we call the TensorFlow Lite dialect of operators, which is slightly different from the core TensorFlow operators. And then there's a way to plug in these different hardware accelerators. Just walking through this briefly, again, the converter spits out a TFLite model. You feed it into your runtime. It's got a set of optimized kernels and then some hardware plugins.

So let's talk a little bit more about the converter itself and things that are interesting there. It does things like constant folding. It does operator fusing, where you're baking the activations and the bias computation into these high level operators like convolution, which we found to provide a pretty substantial speedup on mobile devices. Quantization was one of the chief considerations in developing this converter, supporting both quantization-aware training and post-training quantization. And it was based on FlatBuffers. FlatBuffers are an analog to protobufs, which are used extensively in TensorFlow, but they were developed with more real-time considerations in mind, specifically for video games.
And the idea is that you can take a FlatBuffer, map it into memory, and then read and interpret it directly. There's no unpacking step. This has a lot of nice advantages. You can actually map it into a page and it's clean; it's not a dirty page, and you're not dirtying up your heap. That's extremely important in mobile environments where you are constrained on memory, where the app is often going in and out of the foreground, and where there's low-memory pressure. And there's also a smaller binary size impact when you use FlatBuffers relative to protobufs.

So the interpreter, again, was built from the ground up with mobile devices in mind. It has fewer dependencies; we try not to depend on really anything at base, and we have very few absolute dependencies. I already talked about the binary size here. It's quite a bit smaller than-- the minimum binary size we were able to get with TensorFlow Mobile was about three megabytes for just the runtime, and that's without any operators. It was engineered to start up quickly. That's a combination of being able to map your models directly into memory, but then also having a static execution plan, where during conversion we basically map out directly the sequence of nodes that will be executed. And then for the memory planning, there's basically a pass when you're running your model where we prepare each operator, and they queue up a bunch of allocations. Those are all baked into a single pass where we then allocate a single block of memory, and tensors are just placed into that large contiguous block of memory.
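To make the static memory-planning idea above concrete, here is a toy Python sketch, not TFLite's actual implementation, of the general approach: walk the execution plan once, record each intermediate tensor's size and lifetime, and greedily assign offsets inside a single preallocated arena so tensors with disjoint lifetimes can share space. All names and numbers are illustrative.

```python
# Toy illustration of static memory planning (not TFLite's actual code):
# given each intermediate tensor's byte size and the first/last op that uses it,
# assign offsets into one arena, reusing space once a tensor is dead.

def plan_arena(tensors):
    """tensors: list of (name, size, first_use, last_use), ops indexed in execution order."""
    allocations = {}   # name -> (offset, size)
    live = []          # currently live allocations: (offset, size, last_use)
    arena_size = 0
    for name, size, first_use, last_use in sorted(tensors, key=lambda t: t[2]):
        # Drop allocations whose tensor is no longer needed at this point.
        live = [a for a in live if a[2] >= first_use]
        # Greedily find the lowest offset that does not overlap a live allocation.
        offset = 0
        for a_off, a_size, _ in sorted(live):
            if offset + size <= a_off:
                break
            offset = max(offset, a_off + a_size)
        allocations[name] = (offset, size)
        live.append((offset, size, last_use))
        arena_size = max(arena_size, offset + size)
    return allocations, arena_size

# Example: three intermediate tensors in a four-op plan.
allocs, total = plan_arena([("a", 1024, 0, 1), ("b", 2048, 1, 2), ("c", 1024, 2, 3)])
print(allocs, total)  # "a" and "c" share space, so the arena is smaller than the sum of sizes.
```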
We don't yet support control flow, but I will be talking about that later in the talk. It's something that we're thinking about and working on. It's on the near horizon for actual shipping models.

So what about the operator set? We support float and quantized types for most of our operators. A lot of these are backed by hand-tuned NEON and assembly-based kernels that are specifically optimized for ARM devices. Ruy is our newest GEMM backend for TensorFlow Lite, and it was built from the ground up with mobile execution in mind, a [INAUDIBLE] execution. We support about 120 built-in operators right now. You will probably realize that that's quite a bit smaller than the set of TensorFlow ops, which is probably into the thousands by now; I'm not exactly sure. So that can cause problems, but I'll dig into some solutions we have on the table for that. I already talked about some of the benefits of these high level kernels having fused activations and biases. And then we have a way for you to, at conversion time, stub out custom operators that you would like. Maybe we don't yet support them in TFLite, or maybe it's a one-off operator that's not yet supported in TensorFlow. And then you can plug in your operator implementation at runtime.

So the hardware acceleration interface: we call them delegates. This is basically an abstraction that allows you to plug in and accelerate subgraphs of the overall graph. We have NNAPI, GPU, EdgeTPU, and DSP backends on Android. And then on iOS, we have a Metal delegate backend. I'll be digging into some of these and their details here in a few slides.

OK. So what can I do with it? Well, this is largely a lot of the same things that you can do with TensorFlow. There are a lot of speech and vision-related use cases. I think often we think of mobile inference as being image classification and speech recognition. But there are quite a few other use cases that are being used now and are in deployment. We're being used broadly across a number of both first party and third party apps.

OK. So let's start with models. We have a number of models in this model repo that we host online. You can use models that have already been authored in TensorFlow and feed those into the converter. We have a number of tools and tutorials on how you can apply transfer learning to your models to make them more specific to your use case, or you can author models from scratch and then feed those into the conversion pipeline.

So let's dig into conversion and what that actually looks like. Well, here's a brief snippet of how you would take a SavedModel, feed it into our converter, and output a TFLite model. It looks really simple. In practice, we would like to say that this always just works. That's sadly not yet a reality. There are a number of failure points that people run into. I've already highlighted the mismatch in terms of supported operators; that's a big pain point, and we have some things in the pipeline to address it. There are also different semantics in TensorFlow that aren't yet natively supported in TFLite, things like control flow, which we're working on, and things like assets, hash tables, TensorLists, those kinds of concepts. Again, they're not yet natively supported in TensorFlow Lite. And then certain types we just don't support; they haven't been prioritized in TensorFlow Lite. Double-precision execution, bfloat16, even fp16 kernels are not natively supported by the TFLite built-in operators.

So how can we fix that? Well, a number of months ago, we started a project called-- well, the name is a little awkward. It's using select TensorFlow operators in TensorFlow Lite. Effectively, what this does is allow you, as a last resort, to convert your model with the set of operators that we don't yet support. And then at runtime, you can plug in this TensorFlow Select piece of code, and it will let you run these TensorFlow kernels within the TFLite runtime at the expense of a modest increase in your binary size. What does that actually mean? The converter basically recognizes these TensorFlow operators, and if you say you want to use them and there's no TFLite built-in counterpart, it will take that NodeDef and bake it into the TFLite custom operator that's output. And then at runtime, we have a delegate which resolves this custom operator, does some data marshaling into the eager execution of TensorFlow, which again would be built into the TFLite runtime, and then marshals that data back out into the TFLite tensors. There's some more information that I've linked to here. And the way you can actually take advantage of this: here's our original Python conversion script. You drop in this line basically saying the target ops set includes these select TensorFlow ops. So that's one thing that can improve the conversion and runtime experience for models that aren't yet natively supported.
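For reference, a minimal sketch of the conversion flow just described, assuming a SavedModel directory on disk and the TF 2.x converter API (the select-TF-ops setting is the optional fallback for operators with no TFLite builtin counterpart; exact flag spellings have varied slightly across releases):

```python
import tensorflow as tf

# Convert a SavedModel into a TFLite FlatBuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Optional fallback: allow select TensorFlow ops for anything that has no
# TFLite builtin counterpart (costs a modest increase in binary size at runtime).
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```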
Another issue that we've had historically-- our converter was called TOCO. And its roots were in this TFMini project, which was trying to statically compute and bake this graph into your runtime. And it was OK for it to fail, because it would all be happening at build time. But what we saw is that that led to a lot of hard-to-decipher, opaque error messages and crashes. And we've since set out to build a new converter based on MLIR, which basically gives us tooling that feeds into this converter, helping us map from the TensorFlow dialect of operators to the TensorFlow Lite dialect of operators with far more formal mechanisms for translating between the two. And this, we think, will give us far better debugging, error messages, and hints on how you can actually fix conversion. The other reason that motivated the switch to a new converter was to support control flow. This will initially start by supporting functional control flow forms, so if and while ops. We're still considering how we can potentially map legacy control flow forms to these new variants, but this is where we're going to start. And so far, we see that this will unlock a pretty large class of useful models, the RNN-type models that so far have been very difficult to convert to TensorFlow Lite.

TensorFlow 2.0: it's supported. There's not a whole lot that changes on the conversion end, and certainly nothing that changes on the TFLite end, except for maybe the change that SavedModel is now the primary serialization format in TensorFlow. And we've also made a few tweaks and added some sugar for our conversion APIs when using quantization.

OK. So you've converted your model. How do you run it? Here's an example of our API usage in Java. You basically create your input buffer and your output buffer. It doesn't necessarily need to be a ByteBuffer; it could be a single or multidimensional array. You create your interpreter, you feed it your TFLite model, and there are some options that you can give it, which we'll get to later. And then you run inference. And that's about it.

We have different bindings for different platforms. Our first-class bindings are Python, C++, and Java. We also have a set of experimental bindings that we're working on, in various states of both use and stability. But soon we plan to have our Objective-C and Swift bindings be stable, and they'll be available as the normal deployment libraries that you would get on iOS via CocoaPods. And then for Android, you can use our AARs from JCenter/Bintray for Java. But those are primarily focused on third party developers.

There are other ways you can actually reduce the binary size of TFLite. I mentioned that the core runtime is 100 kilobytes; there's about 800 or 900 kilobytes for the full set of operators. But there are ways that you can basically trim that down and only include the operators that you use, and everything else gets stripped by the linker. We expose a few build rules that help with this. You feed them your TFLite model, and they'll parse it and output, basically, a .cc file which does the actual op registration. And then you can rely on your linker to strip the unused kernels.
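The runtime example above is described in Java; to keep the sketches here in one language, below is the equivalent flow with the Python tf.lite.Interpreter, which follows the same create-then-run pattern as the Java Interpreter class. The zero-filled input is just a placeholder, and the short timing loop at the end is only a rough desktop-side latency check, not a substitute for the native on-device benchmark tool described next.

```python
import time
import numpy as np
import tensorflow as tf

# Load the converted model and allocate tensor buffers.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input matching the model's declared shape and dtype.
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

# Run inference and read back the output tensor.
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])

# Rough latency check: average a few invocations after the warm-up run above.
start = time.perf_counter()
for _ in range(50):
    interpreter.invoke()
print("avg latency: %.2f ms" % ((time.perf_counter() - start) / 50 * 1000))
```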
OK. So you've got your model converted. It's up and running. How do you make it run fast? We have a number of tools to help with this. We have a number of backends that I talked about already, and I'll be digging into a few of these to highlight how they can help and how you can use them. So we have a benchmarking tool. It allows you to identify bottlenecks when actually deploying your model on a given device. It can output profiles of which operators are taking the most time. It lets you plug in different backends and explore how this actually affects inference latency. Here's an example of how you would build this benchmark tool. You would push it to your device. You would then run it. You can give it different configuration options, and we have some helper scripts that help do this all automatically for you.

What does the output look like? Well, here you can get a breakdown of timing for each operator in your execution plan. You can isolate bottlenecks here. And then you get a nice summary of where time is actually being spent.

AUDIENCE: In that information, is there just the operation type, or do we also know if it's the earlier convolutions of the network or the later convolutions in the network, or something like that?

JARED DUKE: Yeah. So there's two breakdowns. One is the run order, which is actually every single operator in sequence. And then there's the summary, where it coalesces each operator type into a single class, and you get a nice summary there. So this is useful for, one, identifying bottlenecks. If you have control over the graph and the authoring side of things, then you can maybe tailor the topology of your graph. But otherwise, you can file a bug on the TFLite team, and we can investigate these bottlenecks and identify where there's room for improvement. But it also affords-- it affords you, I guess, the chance to explore some of the more advanced performance techniques, like using these hardware accelerators.

I talked about delegates. The real power, I think, of delegates is that it's a nice way to holistically optimize your graph for a given backend. That is, you're not just delegating each op one by one to this hardware accelerator; you can take an entire subgraph of your graph and run that on an accelerator. And that's particularly advantageous for things like GPUs or neural accelerators, where you want to do as much computation on the device as possible with no CPU interop in between.

So NNAPI is the abstraction in Android for accelerating ML. It was actually developed fairly closely in tandem with TFLite; you'll see a lot of similarities in the high level op definitions found in NNAPI and those found in TFLite. This is effectively an abstraction layer at the platform level that we can hook into on the TensorFlow Lite side, and then vendors can plug in their particular drivers for DSPs, for GPUs. And with Android Q, it's really getting to a nice stable state where it's approaching parity in terms of features and ops with TensorFlow Lite. And there's increased adoption, both in terms of user base and in terms of hardware vendors that are contributing to these drivers.

More recently, we've released our GPU backend, and we've also open sourced it. This can yield a pretty substantial speedup on many floating point convolution models, particularly larger models. There is a small binary size cost that you have to pay, but if it's a good match for your model, then it can be a huge win. And we've found a number of clients that are deploying this with things like face detection and segmentation.

AUDIENCE: Because if you're on top of [INAUDIBLE] GPU.

JARED DUKE: Yeah, so on Android, there's a GLES backend. There's also an OpenCL backend in development that will afford a 2 to 3x speedup over the GLES backend. There's also a Vulkan backend, and then on iOS, it's Metal-based. There are other delegates and accelerators in various states of development. One is for the Edge TPU project, which can either use runtime on-device compilation, or you can take advantage of the conversion step to bake the compiled model into the TFLite graph itself.
We also announced, at Google I/O, support for Qualcomm's Hexagon DSPs, which we'll be releasing publicly soon-ish. And then there are some more exotic optimizations that we're making for the floating point CPU backend.

So how do you take advantage of some of these backends? Well, here is our standard usage of the Java APIs for inference. If you want to use NNAPI, you create your NNAPI delegate, you feed it into your interpreter options, and away you go. And it's quite similar for using the GPU backend. There are some more sophisticated and advanced techniques for both NNAPI and GPU interop. This is one example where you can basically use a GL texture as the input to your graph; that way, you avoid needing to copy-- marshal data back and forth between CPU and GPU.

What are some other things we've been working on? Well, the default out-of-the-box performance is something that's critical, and we recently landed a pretty substantial speedup there with this Ruy library. Historically, we've used what's called gemmlowp for quantized matrix multiplication, and then Eigen for floating point multiplication. Ruy was built from the ground up basically to [INAUDIBLE] throughput much sooner in terms of the size of the inputs to, say, a given matrix multiplication operator, whereas more desktop- and cloud-oriented matrix multiplication libraries are focused on peak performance at larger sizes. And we found that this, for a large class of convolution models, is providing at least a 10% speedup. But then on our multi-threaded floating point models, we see two to three times the speedup, and the same on more recent hardware that has these NEON dot-product intrinsics. There are some more optimizations in the pipeline. We're also looking at different types-- sparse and fp16 tensors-- to take advantage of mobile hardware, and we'll be announcing related tooling and feature support soon-ish.

OK, so a number of best practices here to get the best performance possible. Just pick the right model. We find a lot of developers come to us with Inception, and it's hundreds of megabytes and takes seconds to run inference, when they can get just as good accuracy, sometimes even better, with an equivalent MobileNet model. So that's a really important consideration. We have tools for benchmarking and profiling. Take advantage of quantization where possible; I'm going to dig into how you can actually use quantization in a little bit. It's really a topic in itself, and there will be, I think, a follow-up session about quantization. But it's a cheap way of reducing the size of your model and making it run faster out of the box on CPU. Take advantage of accelerators, and then for some of these accelerators, you can also take advantage of zero copy.

So with this library of accelerators and many different permutations of quantized or floating point models, it can be quite daunting for many developers, probably most developers, to figure out how best to optimize their model and get the best performance. So we're thinking about and working on some projects to make this easy. One is accelerator whitelisting: when is it better to use, say, a GPU or NNAPI versus the CPU? That's both local tooling to identify this for, say, a device you've plugged into your dev machine, and potentially a service, where we can farm this out across a large bank of devices and automatically determine it.
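The delegate usage above is described in terms of the Java API (an NnApiDelegate or GpuDelegate added to the Interpreter options). As a hedged sketch of the same idea in Python, one can attach a delegate with tf.lite.experimental.load_delegate; the shared-library name below is a placeholder, and which delegate binaries are actually available depends on your platform and build.

```python
import tensorflow as tf

# Placeholder name for a delegate shared library (GPU, Hexagon, etc.);
# substitute the delegate binary you actually built for your target.
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# Attach the delegate when constructing the interpreter: supported subgraphs
# run on the accelerator, everything else falls back to the CPU kernels.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
```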
There are also cases where you may want to run parts of your graph on different accelerators. Maybe parts of it map better to a GPU or a DSP. And then there's also the issue of different apps running ML simultaneously: you have hotword detection running at the same time you're running selfie segmentation with a camera feed, and they're both trying to access the same accelerator. How can you coordinate efforts to make sure everyone's playing nicely? So these are things we're working out. We plan on releasing tooling that can improve this over the next quarter or two.

So we talked about quantization. There are a number of tools available now to make this possible, and a number of things being worked on. In fact, yesterday, we just announced our new post-training quantization that does full quantization. I'll be talking about that more here in the next couple of slides. Actually, going back a bit, we've long had what's called our legacy quantized training path, where you would instrument your graph at authoring time with these fake quant nodes, and then you could use that to actually generate a fully quantized model as the output of the TFLite conversion process. And this worked quite well, but it was-- it can be quite painful to use and quite tedious. And we've been working on tooling to make it a lot easier to get the same performance, both in terms of model size reduction and runtime acceleration speedup.

AUDIENCE: Is part about the accuracy-- it seems like training time [INAUDIBLE].

JARED DUKE: Yeah, you generally do. So we first introduced this post-training quantization path, which is hybrid, where we are effectively just quantizing the weights, and then dequantizing them at runtime and running everything in fp32. And there was an accuracy hit here. It depends on the model how bad that is, but sometimes it was far enough off the mark from quantization-aware training that it was not usable. And so that's where-- so again, with the hybrid quantization, there are a number of benefits. I'm flying through slides just in the interest of time. The way to enable that post-training quantization: you just add a flag to the conversion path, and that's it. But on the accuracy side, that's where we came up with some new tooling. We're calling it per-axis or per-channel quantization, where for the weights, you wouldn't just have a single set of quantization parameters for the entire tensor; it would be per channel in the tensor. And we found that that, in combination with feeding it an evaluation data set at conversion time, where you would explore the range of possible quantization parameters, lets us get accuracy that's almost on par with quantization-aware training.

AUDIENCE: I'm curious, are some of these techniques also going to be used for TensorFlow.js, or did they not have this-- do they not have similarities? They use MobileNet, right, for a browser?

JARED DUKE: They do. These aren't yet, as far as I'm aware, used or hooked into the TFJS pipeline. There's no reason they couldn't be. I think part of the problem is just very different toolchains for development. But--

AUDIENCE: How do you do quantized operations in JavaScript? [INAUDIBLE]

JARED DUKE: Yeah, I mean, I think the benefit isn't as clear, probably not as much as if you were just quantizing to fp16. That's where you'd probably get the biggest win for TFJS. In fact, I left it out of these slides, but we are actively working on fp16 quantization. You can reduce the size of your model by half, and then it maps really well to GPU hardware.
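A minimal sketch of the "just add a flag" hybrid post-training quantization path described above, using the TF 2.x converter API (flag names have shifted slightly across releases). The commented-out line shows the fp16 variant mentioned at the end, which roughly halves model size and maps well to GPU hardware.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Hybrid (weight-only) post-training quantization: weights are stored as 8-bit
# and dequantized at runtime, with kernels still computing in float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Optional fp16 variant instead: keep float compute but store weights in half
# precision, which roughly halves the model size.
# converter.target_spec.supported_types = [tf.float16]

tflite_quant_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```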
But I think one thing that we want is for quantization to not just be a TFLite thing, but kind of a universally shared concept in the TensorFlow ecosystem. And how can we take the tools that we already have, which are sort of coupled to TFLite, and make them more generally accessible? So to use this new post-training quantization path, where you can get comparable accuracy to training-time quantization, effectively the only difference is feeding in a representative data set of what the inputs to your graph would look like. It can be a-- for an image-based model, maybe you feed it 30 images. And then it is able to explore the space of quantization and output values that would largely match, or be close to, what you would get with quantization-aware training.

We have lots of documentation available. We have a model repo that we're going to be investing heavily in to expand. What we find is that a lot of TensorFlow developers-- or not even TensorFlow developers-- app developers will find some random graph when they search Google or GitHub. And they try to convert it, and it fails. And a lot of times, either we have a model that's already been converted or a similar model that's better suited for mobile. We would rather have a very robust repository that people start with, and then only if they can't find an equivalent model do they resort to our conversion tools or even authoring tools.

AUDIENCE: Is there a TFLite-compatible section in TF Hub?

JARED DUKE: Yeah, we're working on that. Talked about the model repo, training. So what if you want to do training on device? That is a thing. We have an entire team, the [INAUDIBLE] Federated Learning team, that's focused on this. But we haven't supported this in TensorFlow Lite for a number of reasons, though it's something that we're working on. There are quite a few bits and components that still have yet to land to support this, but it's something that we're thinking about, and there is increasing demand for this kind of on-device tuning or transfer learning scenario. In fact, this is something that was announced at WWDC, so.

So we have a roadmap up. It's now something that we publish publicly to make it clear what we're working on and what our priorities are. I touched on a lot of the things that are in the pipeline, things like control flow and training, and improving our runtime. Another thing that we want to make easier is just using TFLite with the kind of native types that you are used to using. If you're an Android developer, say, and you have a Bitmap, you don't want to convert it to a ByteBuffer. You just want to feed us your Bitmap, and things just work. So that's something that we're working on. A few more links here to authoring apps with TFLite, and different roadmaps for performance and model optimization. That's it. So any questions, any areas you'd like to dive into more deeply?

AUDIENCE: So this [INAUDIBLE]. So what is [INAUDIBLE] has more impact like a fully connected [INAUDIBLE]?

JARED DUKE: Sorry. What's--

AUDIENCE: For a speed-up.

JARED DUKE: Oh. Why does it?

AUDIENCE: Yeah.

JARED DUKE: So certain operators have been, I guess, more optimized to take advantage of quantization than others. And so in the hybrid quantization path, we're not always doing computation in eight-bit types. We're doing it in a mix of floating point and eight-bit types, and that's why you don't always get the same speed-up with, say, an LSTM or an RNN versus a [INAUDIBLE] operator.
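Pulling the quantization pieces together, here is a sketch of the representative-data-set flow described earlier in this section. The random placeholder inputs and the 1x224x224x3 shape are assumptions for illustration; in practice you would yield real samples (for example, around 30 images) in whatever shape and dtype your model expects.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder: yield a few dozen representative inputs. Random data is used
    # here only for illustration; real sample inputs give meaningful ranges.
    for _ in range(30):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# The converter runs the representative inputs through the graph to calibrate
# quantization parameters for a fully quantized model.
tflite_quant_model = converter.convert()
```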
AUDIENCE: So you mentioned that TFLite is on billions of mobile devices. How many apps have you seen added to the Play Store that have TFLite in them?

JARED DUKE: Tim would have the latest numbers. It's-- I want to say it's into the tens of thousands, but I don't know that I can say that. It's certainly in the several thousands, but we've seen a pretty dramatic uptick, though, just tracking Play Store analytics.

AUDIENCE: And in the near term, are you thinking more about trying to increase the number of devices that are using TFLite, or trying to increase the number of developers that are including it in the applications that they build?

JARED DUKE: I think both. I mean, there are projects like TF Micro, where we want to support actual microcontrollers and running TFLite on extremely restricted, low-power ARM devices. So that's one class of efforts. We have also seen demand for actually running TFLite in the cloud. There are a number of benefits with TFLite, like the startup time and the lower memory footprint, that do make it attractive. And some developers actually want to run the same model they're running on device in the cloud, and so there is demand for having a proper x86-optimized backend. But at the same time, I think one of our big focuses is just making it easier to use-- meeting developers where they're at. Part of that is a focus on creating a very robust model repository and more idiomatic APIs they can use on Android or iOS with the types they're familiar with, and then just making conversion easy. Right now, if you take a random model that you found off the web and try to feed it into our converter, chances are that it will probably fail. And some of that is just teaching developers how to convert just the part of the graph they want, not necessarily all of the training that's surrounding it. And part of it is just adding the features and types to TFLite that would match the semantics of TensorFlow. I will say that in the long run, we want to move toward a more unified path with TensorFlow and not live in somewhat disjoint worlds, where we can take advantage of the same core runtime libraries, the same core conversion pipelines, and the same optimization pipelines. So those are things that we're thinking about for the longer term future.

AUDIENCE: Yeah, and also [INAUDIBLE] like the longer term. I'm wondering what's the implication of the ever increasing network speed on the [INAUDIBLE] TFLite? [INAUDIBLE], which maybe [INAUDIBLE] faster than current that we've [INAUDIBLE] take [INAUDIBLE] of this.

JARED DUKE: We haven't thought a whole lot about that, to be honest. I mean, I think we're still betting on the reality that there will always be a need for on-device ML. I do think, though, that 5G probably unlocks some interesting hybrid scenarios, where you're doing some on-device and some cloud-based ML. And I think, for a while now, the fusion of on-device hotword detection, where as soon as the "OK Google" is detected, it starts feeding things into the cloud, has been an example of where there is room for these hybrid solutions. And maybe those will become more and more practical. Everyone is going to run to their desk and start using TensorFlow Lite after this?

AUDIENCE: You probably already are, right? [INAUDIBLE] if you have one of the however many apps that was listed on Tim's slide, right?

JARED DUKE: I mean, yeah. If you've ever said, "OK Google," then you're using TensorFlow Lite.

AUDIENCE: [INAUDIBLE]. Thank you.

JARED DUKE: Thank you.

[APPLAUSE]