SHANQING CAI: We're going to do this presentation about how to debug TensorFlow programs. We're going to focus specifically on TF 2, because TF 2 is the stable release, and it will have long-term support going forward. But there are also places where we're going to mention TF 1. And when we do, we'll make that clear so you know which version of TensorFlow we're talking about.

So first of all, I want to define the scope of debugging. And the reason why we should do this is because the word "debugging" is an overloaded term in machine learning. Different people use it to refer to different things, sometimes in confusing ways. So in the scope of this talk, debugging refers to specific things that mainly have to do with the correctness of your TensorFlow program, like mathematical implementation bugs. For example, when you are implementing a new [INAUDIBLE] type or a new loss function, you may run into DType issues or shape issues, or just straight bugs in the math. The techniques we'll cover will also be useful for debugging the pre-processing and post-processing parts of your TensorFlow program. And one big focus will be the debugging of issues like NaN and infinity in your models, which happen very frequently during TF model training. [INAUDIBLE] will talk about a specific tool called TensorTracer, which is very useful for catching the root cause of NaNs and infinities on TPUs and other devices.

We're not going to talk about how to debug bugs in op kernels themselves or bugs in hardware, because that's specific to the hardware and the op kernels you're using. However, the methods we'll cover will be useful for debugging models that are affected by those kernel or hardware bugs. At least, they will be useful for narrowing down the cause of a model bug to the level of op kernels or hardware.

The tools and techniques we'll cover will also be useful in case you just want to peek into a model to understand what's going on. One example would be answering a question like, why is my model making a wrong prediction on a [INAUDIBLE], for example? You will be able to peek into the model, look at the layer activations and the intermediate tensors, and answer that question. One use case that's relevant to that is when you are porting a model from one version of a library to another, or from one library to another, like from TensorFlow to TFLite, or from TensorFlow to TF.js, or from TensorFlow to PyTorch, or from PyTorch to TensorFlow. You will often see divergence between the two implementations of the same model, and you want to quickly identify the root cause of the divergence. The tools and techniques we'll cover will also be useful for that purpose.

What we're not going to cover, however, are debugging cases like debugging model performance, or looking at the accuracy of models after training, like model evaluation and analysis. We're not going to cover how to debug fairness issues in models, either. Those are also important kinds of TensorFlow debugging, but they are outside the scope of this talk. There are great tools for those, like some dashboards in TensorBoard, the What-If Tool, Fairness Indicators, and so forth. But I'm not going to talk about those here. Any questions so far?

OK. So here's a brief outline of this presentation. We're going to talk about how to debug tensor values. We're going to talk about how to look at the placement of ops on different devices, like CPUs and GPUs.
It's a very commonly asked question. We're going to look at how to debug the structures of graphs in TensorFlow 2, including the graphs from tf.function and graphs that are optimized for the runtime. And then in section 4, we're going to cover the special topic of step debugging, which is using an IDE to step over your code line by line. In the fifth section, we're going to move from the low-level API to the high-level API. The specific high-level API I will focus on is tf.keras, because tf.keras is the official high-level API in TF 2, and also because I'm not as familiar with other high-level APIs like [INAUDIBLE] and so forth. In section 6, we're going to talk about the debugging of numerical issues like NaNs and infinities. And finally, I'm going to present the work on TensorFlow Debugger, including the existing V1 features and the V2 features that we're currently working on.

So first, let's take a look at how to debug tensor values. Here's a very simple example. It's very straightforward. You are not decorating your functions with the tf.function decorator, so everything is executed eagerly in TF 2. There you can simply use print statements to look at the values of tensors, like in this simple example here. So y is an eager tensor. If you print it, you will see its value in the stdout printout. It's very similar to printing the values of numpy arrays, with the caveat that if the tensor lives on the GPU, then printing it will involve copying from the GPU to your host, which may be a little bit costly. And oftentimes, when the size of the tensor is too big, you probably don't want to look at the entire tensor, because there are going to be millions of elements. What you sometimes want is to do a reduce operation on the tensor and then look at some numerical summaries, like what's the minimum value, what's the maximum value, and what's the mean. That's also a useful technique that's fully supported in eager mode.

EagerTensor's str and repr methods use numpy under the hood, which means that you can use the set_printoptions function from numpy to control the details of how the values of tensors are printed. For instance, if you use the precision [INAUDIBLE] arguments, you can adjust how many decimal places are printed for float-type tensors. You can also adjust the threshold element count beyond which ellipses will be used in the printing, which is useful for cases where you want to look at the values of huge tensors, like thousands of elements.

Of course, the story is not always as straightforward as this. The program is often not executing purely eagerly. Sometimes you have tf.function, and sometimes your function is decorated by TensorFlow itself and converted into a graph. So there, if you use print statements, the result of the printing may not be what you would expect. Here, the user intends to look at the value of n at each iteration of the while loop. So the user naively puts a print statement here. And the result is only one printed line, even though the while loop executes multiple times. The content of the printed text is also kind of confusing to a naive user. The reason is that when tf.function is used, the code here is transformed into a tf.Graph, and the print statement gets executed during that function-to-graph transformation. So the content you see here is actually a node in the graph instead of the value of the tensor at runtime. The correct approach is to use tf.print.
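A minimal sketch of that fix, with an illustrative loop of my own rather than the one on the slide:

```python
import tensorflow as tf

@tf.function
def halve_until_one(n):
    while n > 1:
        n = n // 2
        tf.print("n =", n)  # runs at every iteration, even in graph mode
    return n

halve_until_one(tf.constant(100))  # prints n = 50, 25, 12, ..., 1
```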
So tf.print will modify the graph. It will actually add a couple of nodes to the graph so you can look at the value of n inside the tf.function. Here at the bottom, you can see that the value of n at each iteration of the while loop is printed. So it's quite useful.

So here is a homework problem for you. The examples I've shown so far are all for simple tensors, like a float32 tensor or an integer tensor. What if the tf.print statement is used on a ragged tensor or a sparse tensor? Those are the major composite tensor types in TensorFlow. You can try that. By the way, I inserted a link to the Colab Notebook for all the code examples in this presentation. So you can look at the slides, and if you want to play with the code examples, you can use that Notebook. And here is a second homework problem: you can use the code to see what happens if you use tf.print on sparse tensors.

OK. So sometimes, the user doesn't want to just print the values of tensors. The user wants to programmatically extract the values of tensors so they can be used for programmatic debugging or downstream computation. The code snippet here shows how you can pull out intermediate tensors from a toy implementation of a TF Dense layer. The function originally returns only the final outputs of the Dense layer. But maybe for some reason you want to look at the intermediate steps, like the result of the matmul or the result of adding the bias. What you can do here is append these two tensors to the return values of the tf.function. Then you'll be able to access these intermediate values when you call the layer. What's slightly more complicated is if those tensors are inside control flow structures. For instance, if you want to programmatically access the value of n at every iteration of the while loop, you can't simply add n to the return value here. What you need to do is use a tf.TensorArray and write to that tf.TensorArray at each iteration of the while loop. Then you'll be able to see the full history of how n changes. It's slightly complicated, so the TensorFlow Debugger tool I will present at the end of this talk will hopefully make this simpler.

So having covered tensor values, I'm going to talk about how to debug the placement of ops on different devices, mainly CPUs and GPUs. It's a very frequently asked question, because users want to make sure that their heavy computation is running on the GPU, not on the CPU. Again, if the program is running purely eagerly, then it's pretty straightforward. You can just call tf.debugging.set_log_device_placement with the argument True. Then, when operations are executed eagerly, you will see lines being printed to stdout. For instance, when the multiplication operation here runs, you will see a line that tells you that the op is running on a GPU. And when the print statement here runs, it's actually running a PrintV2 op on the CPU. So you can see clearly where the ops are running, whether it's on CPU or GPU, and if you have multiple GPUs, which GPU is running an op. One thing you need to know here is that it's only going to log information when the op is placed for the first time. If you have the same op executing multiple times eagerly, it's not going to print it multiple times. That mechanism prevents a flood of information to your stdout. A more realistic scenario is where you have tf.functions and graphs.
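Here's a minimal sketch of that scenario, covering both an eager op and a graph op (the exact log text varies by TensorFlow build and available devices):

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Eager op: its placement is logged to stdout the first time it runs.
x = tf.random.uniform([1000, 1000])
y = tf.matmul(x, x)

@tf.function
def f(a):
    return tf.reduce_sum(a * a)

f(x)  # graph ops: placements are logged too, with the caveat below
```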
So there, setting set_log_device_placement to True will still work. You will see not only the placement of the eager ops, like in the green box here, but also the placement of the graph ops, in the purple box on the bottom here. But one caveat is that, even though the eager lines are printed to stdout, currently the graph logs are printed to the info log. The implication is that if you're doing this in a Jupyter Notebook or Colab, you will not be able to see the bottom part of the log. There is actually a way in Colab to capture the log so you can see both. It's just something you need to be aware of. In the graph placement logs, the text inside the parentheses is the op type, and the text to the left of the parentheses is the name of the op.

Here are some other important things to know about set_log_device_placement. It works for both eager operations and graph operations, but they're logged to different places, as I mentioned. Also, be aware that the fact that an op is logged at graph construction time does not guarantee that the op will be executed at runtime. That's because TensorFlow has a built-in graph optimization step called Grappler, and Grappler may change the placement, or it may prune the op away from the graph, or it may fuse the op into a larger op, and so forth. I'm going to talk about that in a coming slide. And also, be aware that set_log_device_placement currently does not work fully for TPUs. So it's mainly useful for debugging CPUs and GPUs currently.

OK. So I'm going to move on to the section about debugging graph structures. Here you have a tf.function. How do you look at the graph compiled from that tf.function? The answer is to use the method called get_concrete_function on that function object. get_concrete_function should be called only after the tf.function has been called for the first time. If you call it before the function is compiled, the method will not even exist. When you call get_concrete_function, you need to pass an argument. The argument can be the same argument as you pass when you are calling the function. The reason why you need to pass that argument is that the same Python function can be compiled into different tf.Graphs, depending on the DTypes and shapes of the [INAUDIBLE] arguments. You can also pass tf.TensorSpecs as the arguments. The return value of get_concrete_function is an object that has a graph property. The graph property is a tf.Graph at the Python level. To see its structure, you can call the as_graph_def method of the graph object. The returned value is a text proto for the graph, as shown on the right here. The text proto is basically a repeated field of nodes. It tells you which nodes there are in the graph and how they're connected to each other. There are properties like name, and op, and attributes, and so forth. If you're not familiar with the format, you should spend some time looking at some examples, because it's very important for TensorFlow work. But the important thing here is that, for any realistic models and realistic tf.functions, the graph is going to be too big to be read as text. And that's why graphical tools like TensorBoard will be important. So you can start TensorBoard, the binary.
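A minimal sketch of that workflow, with a toy tf.function and a file path of my own choosing:

```python
import tensorflow as tf

@tf.function
def f(x):
    return tf.nn.relu(tf.matmul(x, x))

x = tf.random.uniform([4, 4])
f(x)  # call it once so a concrete function exists

graph = f.get_concrete_function(x).graph
graph_def = graph.as_graph_def()
print(graph_def)  # the text proto: repeated NodeDef fields

# Save the text proto so it can be uploaded to TensorBoard's Graph Dashboard.
with open("/tmp/graph.pbtxt", "w") as fp:
    fp.write(str(graph_def))
```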
So even if you have an empty logdir, you can still switch to the Graph Dashboard by using the dropdown menu. Inside the Graph Dashboard, you should be able to see a button called Choose File. You can use Choose File to upload the contents of the pbtxt of the graph, and then TensorBoard will show you the structure of the graph, as in this example here. Some important properties to know about TensorBoard's Graph Dashboard: the information flow is generally from bottom to top, so the inputs are usually at the bottom, and at the top you're seeing the outputs. Also, TensorBoard's graph visualizer groups nodes by name scope by default. That's the reason why you often see those big, colorful boxes. You can double-click each box to expand into it, so it's quite handy for debugging large models. It also handles FunctionDefLibraries correctly. FunctionDefLibraries are basically nested graphs, which are used frequently in TF 2, like in control flow structures. A TF 2 while loop will contain two functions, one for the condition of the while loop and the other for the body of the while loop. Those are also colored boxes that you can double-click to expand into. If you're Google-internal, you should be able to use a special import in Colab that will let you look at the graph inside the Colab Notebook. I find that to be slightly handier than looking at graph structures in TensorBoard itself, because it means I don't have to switch back and forth between two different tabs of the browser.

So here is an example. As we mentioned before, you can append tensors to the return list of the tf.function to access intermediate tensors. In this graph, visualized by TensorBoard, you can see two extra Identity nodes that correspond to the added return values. That's because TF 2 currently uses Identity nodes to mark the return values of tf.functions.

Here's a graph for a function that's slightly more complicated. It involves control flow structures, including while loops and if-else conditions. These boxes are the FunctionDefLibrary functions that I mentioned before. You can see a box for the true branch of the if-else condition, another box for the false branch, and the box here in red is the condition of the while loop, and so forth. If you are very careful and spend some time, you can see the correspondence between the ops in the graph and the operations in the Python code. That's in general how to do it, but it requires expertise to see the correspondence between the graph nodes and the Python operators or Python functions. That's one gap that the TF Debugger tool I will talk about tries to fill. In the TF Debugger tool, you will be able to look at the graph structures and the source code side by side, so it will be easier for you to establish the correspondence between the Python functions or Python operators and the nodes of the graph.

OK. So what if the function is not executing on a single device, but on multiple devices or multiple hosts using a distribution strategy? Before I talk about that, I'm going to tell you about a useful API for mocking out virtual devices.
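Here's a minimal sketch of what that looks like; the four-way split and the 1 GB memory limits are my own illustrative choices (API names as of TF 2.1):

```python
import tensorflow as tf

# Must run before the GPUs are initialized by any other TF call.
gpus = tf.config.experimental.list_physical_devices("GPU")
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)] * 4,
)
print(tf.config.experimental.list_logical_devices("GPU"))  # four logical GPUs

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    v = tf.Variable(1.0)  # mirrored across the four logical GPUs
```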
So for instance, if you have only one physical GPU on your machine, and you want to do some testing or debugging on a distribution strategy that involves four different GPUs, you can use the API called set_virtual_device_configuration to create four logical GPUs. And you can use the API called list_logical_devices to confirm that. It's a very useful technique for testing and debugging TensorFlow functions that involve multiple devices. Once we have set up the four logical GPUs, we can use MirroredStrategy to create a variable on the four GPUs, and we can construct a function that updates that variable on the four GPUs. dist_f here is the function that involves the replication. You can use the get_concrete_function method as before to look at the graph structure, and upload the graph pbtxt to TensorBoard to see it. In the structure, you can see four boxes, and those four boxes correspond to the four GPUs. So the technique here is useful for debugging graphs in mirrored strategies and other distribution strategies.

This slide here shows how tf.print works in terms of the graph. Each time you call tf.print inside a tf.function, it will append a pair of nodes to your graph. The first node will convert your input tensor into a string, and the second one will actually print that string to stdout, or the info log, or whatever output stream the printout is configured for. So here's an example for you. It's also available in the Colab Notebook, so you can play with it a little. It is very interesting. The question here is, what happens if there is no return value from the function? I forgot to mention that the reason why these PrintV2 ops get executed at runtime is that they are attached as control dependencies to the final output Identity node of the graph. Those correspond to the dashed lines in the graph. So the homework problem is about finding out how the print op gets executed when the tf.function does not have a return value.

When you use tf.print, you need to be aware that it may inadvertently change how graph optimization works at runtime. In the code snippet on the left here, we're computing the harmonic mean of a tensor. However, there is a line in the code that constructs an op, a Min op, whose output tensor is not used by any downstream computation. Now, when the tf.function actually runs, Grappler is going to do its job, and it's going to prune out that Min op. So the Min op will not actually get executed at runtime. However, if you use tf.print, you will change the optimization: you're basically going to attach the output tensor of the Min op to the string-formatting node that feeds the PrintV2 op, as I mentioned before. But the important thing to note here is that if you use the get_concrete_function method to debug the graph structure, you will always see a Min op. That's because, at the Python level, TensorFlow AutoGraph faithfully converts the Python function into a graph; it's not trying to do any optimization. Instead, it hands the graph to Grappler for downstream optimization.
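The transcript doesn't include the slide's code, but a hedged reconstruction of the snippet being described might look like this:

```python
import tensorflow as tf

@tf.function
def harmonic_mean(x):
    reciprocal_sum = tf.reduce_sum(1.0 / x)
    minimum = tf.reduce_min(x)  # output unused: Grappler prunes this Min op
    # tf.print("min =", minimum)  # uncommenting keeps the Min op at runtime
    return tf.cast(tf.size(x), tf.float32) / reciprocal_sum

harmonic_mean(tf.constant([1.0, 2.0, 4.0]))
```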
So the question here is, how can we debug the optimized graphs that are generated by Grappler? That leads us to the next section. In order to see the Grappler output graph, you need to use a Bazel build. When you run it, you need to specify an environment variable called TF_DUMP_GRAPH_PREFIX and point it to any directory you have write access to. And then you have to specify the flag called vmodule. That tells the meta_optimizer, which is a part of Grappler, to be verbose and dump information to the folder. After the program runs, you will see a bunch of files in the folder. Those are the outputs from each pass of Grappler. Grappler performs graph optimization in passes, kind of like a compiler. The final output, which is usually called after_MetaOptimizer something, is usually the graph of interest. It will tell you the structure of the graph that gets executed at runtime. Using this technique, you can compare the runtime graphs of the two code snippets we saw before. In the code snippet on the left, and in the graph on the left, you see that the Min op is not present, because it's pruned away by Grappler. However, in the code snippet on the right, and in the graph on the right, you can see that the Min op is present, and it feeds input into the two ops that correspond to tf.print. As you can see, the process here is convoluted and complicated. So TF Debugger will try to present both the Python graph and the runtime graph to you, so you don't have to do any Bazel builds or set any special flags or environment variables.

OK. So now let's talk about the interesting topic of step debugging. By step debugging, I mean using a Python IDE or [INAUDIBLE] to step over lines of the source code one by one. Some people prefer that over print debugging. The useful API here, if you want to do step debugging, is tf.config.experimental_run_functions_eagerly. If you call that function with the argument True, you're basically telling TensorFlow not to compile tf.functions into graphs. All the code will run eagerly, and you will be able to use print, or step debugging, or breakpoints in your favorite IDE. But one important caveat to keep in mind is that it works for all cases except tf.data, because tf.data works in a special way: it always converts Python functions into graphs before it runs them. I'm going to show an example of that in an upcoming slide. This slide here shows an example of using VSCode to do step debugging on a tf.function after you call experimental_run_functions_eagerly(True).

However, if you don't call experimental_run_functions_eagerly, or if you call it with False, it's not a good idea to step-debug your tf.function. In some IDEs, if you add a breakpoint, it's not even going to hit that breakpoint. In other IDEs, it will hit the breakpoint, but the stepping pattern after the breakpoint will be very confusing. The reason for that has to do with the internal details of how AutoGraph works, and I refer you to the presentation made by Dan Moldovan on AutoGraph last year. I think it's publicly available. Once you understand that, it will probably not be too hard for you to understand why the strange behavior here is happening. The strange behavior is that you insert a print statement in both branches of the if-else condition, and you see that when the function is called, both branches get executed.
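For reference, a minimal sketch of flipping that switch (the API name is as of TF 2.1; later releases renamed it tf.config.run_functions_eagerly):

```python
import tensorflow as tf

tf.config.experimental_run_functions_eagerly(True)

@tf.function
def f(x):
    y = x * x  # with eager execution forced, a breakpoint here behaves normally
    return y + 1

f(tf.constant(2.0))
tf.config.experimental_run_functions_eagerly(False)  # back to graph mode
```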
So this next slide shows an example in which experimental_run_functions_eagerly does not work on a map function that you pass to a tf.data.Dataset. Even if you comment out the tf.function decorator for the to_multi_hot function, it's still going to be converted into a graph, and it will run in graph fashion instead of running eagerly. So in order to debug intermediate tensors inside such a function, you must use tf.print. If you use print, you're only going to print the symbolic tensors in the graph. But TensorFlow Debugger will also make it easy for you to debug the values inside a tf.function passed to a tf.data.Dataset.

OK. So far, we have been talking about how to debug the low-level constructs of TensorFlow, including ops, and tensors, and graphs. But many users also use high-level APIs like tf.keras, and they also want to peek into their models. So in the following slides, I'm going to talk about some tools and techniques available for debugging Keras models. One very frequently asked question is, how do I get the intermediate layer outputs, meaning the intermediate layer activations, from a tf.keras model? One way to do it is to construct a second model, which is the debug_model in the example here. The second model has the same inputs as the original model, but its outputs are the original model's outputs plus the outputs from the layers you're interested in. Then, when you call debug_model.predict, or simply call debug_model, you'll be able to see not only the final output of the Keras model but also the intermediate outputs. This approach is useful for looking at the final output of each layer. If you want the intermediate tensors inside each layer, it's not that useful; you have to use the tf.print method or the other techniques mentioned in earlier parts of the presentation. And TF Debugger will also make it easier for you to look at layer-internal tensors.

One other useful thing to know when you are debugging a tf.keras model is the TensorBoard callback. The TensorBoard callback, which is under the tf.keras.callbacks namespace, is a callback you can pass to your model.fit call. What it does is log losses and metrics to the logdir as the model trains. But for debugging purposes, it will also log the graph of the model to the logdir. So you can just open the Graph Dashboard of TensorBoard and look at the graph structure. There you see that the layers of the model are organized into those boxes that you can double-click to expand. That's thanks to the work done by the authors of tf.keras, who have been very careful in specifying the correct name scopes for each layer. The other important and useful thing to note is that the tensors are shown as those arrows that connect the layers. If you look carefully, you can see some very small fonts. Those small fonts are the shapes of the tensors, to the extent known at graph construction time. For instance, the dropout layer here outputs a two-dimensional tensor of shape (?, 5), where the question mark is the undetermined batch dimension.
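Pulling the two Keras tricks from this section into one sketch (the toy model, layer choices, and logdir path are mine):

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(10,))
hidden = tf.keras.layers.Dense(5, activation="relu")(inputs)
dropped = tf.keras.layers.Dropout(0.5)(hidden)  # the (?, 5) edge mentioned above
outputs = tf.keras.layers.Dense(1)(dropped)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Trick 1: a second model that also exposes an intermediate layer's output.
debug_model = tf.keras.Model(inputs, [outputs, hidden])
final_out, hidden_out = debug_model.predict(np.random.rand(8, 10))

# Trick 2: the TensorBoard callback also logs the model graph to the logdir.
x = np.random.rand(32, 10).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=2,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="/tmp/keras_logdir")])
```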
OK. So having covered high-level API debugging, let's move on to the next section, which is about how to debug NaNs and infinities. That's a very frequently occurring debugging task in TensorFlow, and it probably accounts for about half of the questions that we get asked. Before we talk about the tools, I want to show you some common causes of NaNs and infinities in TensorFlow models.

They can be caused by a lack of value clipping. For example, when you have a division operation in your TensorFlow program, like some sort of normalization, if you forget to add an epsilon, a very small positive value, to your denominator, you're likely to run into infinities at runtime, especially in the face of variability in the input data. The same applies to the tf.math.log operation. Sometimes, NaNs and infinities in your TensorFlow model can be caused by bugs in op kernels themselves, or even in hardware. For instance, we have seen a bug recently that involves the [INAUDIBLE] kernel on TPUs outputting infinities and NaNs even when the inputs are totally valid. And sometimes, the NaNs and infinities can be caused by exploding gradients, especially when your learning rate is too high. In that case, to quote the famous meme: just keep calm and decrease the learning rate. And as [INAUDIBLE] will tell us, sometimes the NaNs and infinities can also be caused by problematic training examples.

Debugging the root cause of NaNs and infinities is different from the print debugging and graph structure debugging we have talked about in earlier parts of the presentation. That's because, to find the root cause of NaNs and infinities, you don't know where to look; that's exactly what you're trying to find out. You could insert tf.print statements on every single tensor in your model, but that's not going to work for a realistic model, which can include up to tens of thousands of tensors. That's why we need specialized tools to help you debug the root cause of NaNs and infinities.

I'm going to present two tools here. The first one is a new API called tf.debugging.enable_check_numerics. It's a relatively new API; it came into existence in TF 2.1, which was released about a month ago. What the API does is this: you simply add one line of code to your TF program, and when the TF program runs, like when the model trains, it's going to check every floating-type tensor in your TF program, including eagerly computed tensors and tensors inside graphs and tf.functions. As soon as any floating-type tensor contains a NaN or infinity in its output, the program will error out with a helpful error message, like the one shown on the right here. The error message contains a bunch of useful information for debugging, including the name of the op, the runtime DType and shape of the tensor, as well as a stack trace. We know that the stack traces in TensorFlow error messages are usually very verbose and hard to understand. This API tries to infer, or guess, which frames of the stack trace correspond to the user's own program, and it highlights those frames with an arrow. So hopefully it will be easier for you to find the important frames in the stack trace. The API is also general, in the sense that it works for both the forward pass and the backward pass. It works for low-level APIs and high-level APIs, including Keras. It also works if you are stuck with the old TF 1 API. And it should work on CPUs, GPUs, and TPUs.

One question you might want to ask is, what's the performance overhead of this? It's an important question, because to find the root cause of NaNs and infinities, the overhead needs to be as low as possible. Sometimes, the NaNs and infinities don't happen until a few hours or even a few days into training. Thanks to the work of an intern, Anthony, the overhead of this API is low.
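The one-line usage looks like this in a minimal sketch; the failing function is contrived by me to trigger the check:

```python
import tensorflow as tf

tf.debugging.enable_check_numerics()  # the one added line

@tf.function
def unstable(x):
    return tf.math.log(x)  # log(0.0) yields -inf at runtime

# Errors out on the Log op, with DType, shape, and a highlighted stack trace.
unstable(tf.constant([1.0, 0.0]))
```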
So we have benchmarked the API on a bunch of models. Here's an example from the transformer v2 model. When it's training on the CPU, if you enable [INAUDIBLE] check, you get about 30% overhead. If the model is trained on a GPU, the overhead is slightly higher, about 75%. But it's not that high. So it may even be a good idea for you to turn this API on in your tests for [INAUDIBLE] checks. The API here is useful, but it's also limited in the sense that it only tells you what happens at the moment the NaN or infinity appears. It tells you about the op, but it has no information about the moments before, or the history of the execution leading up to that moment.

OK. So TensorFlow Debugger can be thought of as a combined tool that will help you achieve almost all of the debugging tasks we have mentioned in earlier parts of the presentation, including looking at tensor values, the placement of ops on devices, graph structures, and also step debugging and numerical issues like NaNs and infinities. So far, we haven't put a lot of thought into high-level API support like Keras, but it's on our radar.

There are two different versions of TF Debugger, V1 and V2. V1 was a part of TF 1, so it's centered around the old tf.Session API. It's basically a set of wrappers for your sessions. It's still available in TensorFlow, so if you are still using TF 1 APIs, it might be useful to you. There are two different wrappers: the command-line interface wrapper and the TensorBoard wrapper. When you wrap the session objects, you don't have to make any other changes to your TensorFlow code. When Session.run is called, it's going to present you with debugging information. If you use the command-line interface wrapper, Session.run will basically drop you into an interactive terminal-based program. These screenshots show that the command-line interface will show you the list of tensors that are executed, and you can click those tensor names to look at the details of the tensors, like the op placement, the values of the tensors, and so on. It will also show you the source code and annotate each line in the source code with the ops that are created at that line.

Currently, we're working on V2 of TF Debugger. The reasons why we want to invest in it are, obviously, first, that we want to bring the tool up to speed with the current API, which has no tf.Session but involves eager execution and tf.function. Also, in earlier parts of the presentation, you have seen that print debugging and tf.print are useful for a lot of debugging cases, but not in all cases, especially when you want to debug some code deep inside the TensorFlow codebase itself. We also want to incorporate some lessons we learned from V1 of the tool. First, we want the tool to be general enough to work on all hardware types. TF Debugger V1 predates TPU support in TensorFlow, so it does not work for TPUs; it only works for CPUs and GPUs. But TF Debugger V2 will work for all the major hardware types: CPUs, GPUs, and TPUs. Second, we want the overhead to be as low as possible. And we also learned that there are some improvements we can make to the UX of the frontend.

So TF Debugger V2, in a nutshell, will involve this process. The user has a TF program that he or she wants to debug. They can just insert a one-line call into their TF program and specify a logdir. The logdir can be the same logdir as your TensorBoard logdir.
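In TF releases after this talk (around TF 2.2 and later, an assumption worth checking against your version), that one-line call landed as tf.debugging.experimental.enable_dump_debug_info. A sketch:

```python
import tensorflow as tf

tf.debugging.experimental.enable_dump_debug_info(
    "/tmp/tfdbg2_logdir",             # can be the same as your TensorBoard logdir
    tensor_debug_mode="FULL_HEALTH",  # record per-tensor health statistics
    circular_buffer_size=-1,          # keep all events, not just the most recent
)
# ...then run the TF program as usual and point TensorBoard at the logdir.
```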
Then, once TensorBoard has started, you can switch to the Debugger V2 Dashboard in TensorBoard to look at the debug information. The frontend work is currently underway. I've been reaching out to various people, like people on the TensorFlow team and outside the TensorFlow team at Google, to get their feedback. If you're interested in [INAUDIBLE] this, or in telling us about your specific debugging use case, please reach out to me, and I will be more than happy to work with you to make sure that the new tool will be useful for your problem.

Here are some UI mocks that UX researchers helped us make. This is going to be the look of the new Debugger V2 plugin in TensorBoard. It's going to show you the execution history on the top, including both eagerly executed ops and tf.functions. You can zoom into tf.functions to look at the graph structure and the list of tensors that are computed inside the tf.function. The top-left section will highlight important events, like the generation of NaNs and infinities, and repeated function compilations, which might hurt your performance, and so forth. And more importantly, in the bottom section, you will be able to associate your graph ops with your source code, or associate eager execution events with your source code. This will make it easier for you to find the way back from your bug to your source code, and it should speed up your bug-fixing process.

And finally, some advice. The authors of TensorFlow have done a lot of work recently to improve the error messages. So next time you get an error message in TensorFlow, be patient and read through the error message, especially the sections labeled as "in user code." It may contain some surprisingly useful information for debugging your problem. And lastly, some bugs in machine learning programs are not machine learning bugs; they're just general programming bugs. So here is a puzzle for you to chew on. It's a small problem. Here, the user is trying to load two files, one for the features and one for the labels. The user feeds them into a function to construct a dataset, and the dataset is fed into the fit call to train the model. But for some reason, the model training is not very good; the accuracy is much worse than expected. What's the reason for that? It's a puzzle for you. If you're interested in the answer, reach out to me, and I'll be happy to tell you the answer. But the point is that some bugs are just general programming bugs, not machine learning bugs per se. Thank you very much for your attention. [MUSIC PLAYING]