[MUSIC PLAYING]

THORSTEN KURTH: Hello, and thank you, everybody, for attending the afternoon sessions. My name is Thorsten Kurth, and I'm an application performance specialist at NERSC. My day-to-day work is helping scientists optimize their codes for contemporary supercomputer systems. Today I'm going to talk about a project I care about because it combines three different things I'm excited about: big computers, so exascale; deep learning; and climate change, because it will affect every one of us sooner or later. This is a team effort, and I want to thank, at this point, everybody in this collaboration between NERSC, Nvidia, UC Berkeley, and Oak Ridge for making this a success.

So I want to talk about extreme weather phenomena. Why are they important? They're important because they can incur a lot of damage and loss of life. For example, in 2017 the damage to the US economy was about 200 billion dollars for the combined extreme weather events. These can be hurricanes, or tropical cyclones, and also atmospheric rivers, because they can cause heavy flooding and major disruption. So we want to understand these events better.

What does a typical climate data analysis look like? You have these simulations, which look into the future up to 100 years. You run different models and get these outputs. On your left, you see the output of the simulations. They contain about 14 million observables for a three-hour interval, and you have 100 years' worth of that. What people usually do, when you look at the IPCC report, for example, or in popular magazines, is boil it down to a couple of numbers: temperature rise, sea level rise, these kinds of things. However, if the temperature increases by one degree or two, that matters, but it might not matter to you if you live in the middle of the Sahara, right? It might matter to you, though, if you are in a different region of the globe -- and the same goes for sea level rise.

So what you really want is a geospatial analysis of climate change: how does climate change impact your life where you live? We want to answer questions like, will there be more hurricanes? And if yes, will they be more intense? Will they make more landfalls? If they stay over the sea, it's usually not as bad as when they hit the coastline. And for atmospheric rivers, for example, 50% of all rain in California is due to atmospheric rivers, so it's an important question whether we will get more rain due to these. And if you think about forest fires, like the Camp Fire we had last year -- in the Bay Area we had a hard time breathing for two weeks -- it's really a question whether you get more or fewer of these, and that also depends on these atmospheric rivers. The insurance industry, water planners -- a lot of different people need to know what they have to gear up for.

So how can we do this? We have these high-fidelity climate simulations, and we can start by picking out these events -- for example, hurricanes and atmospheric rivers. Image segmentation techniques offer pixel-level resolution: they can do a per-pixel classification to pick these events out and then correlate them geospatially with the underlying region.
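As a toy illustration of what that per-pixel classification produces (a minimal sketch with made-up shapes, class indices, and region boundaries -- not the project's code), the segmentation network emits a class score for every pixel, the highest-scoring class becomes that pixel's label, and the resulting mask can then be intersected with a geographic region:

```python
import numpy as np

# Illustrative only: per-pixel classification of a multi-channel climate snapshot.
# Hypothetical class indices: 0 = background, 1 = tropical cyclone, 2 = atmospheric river.
H, W = 768, 1152                      # snapshot height and width (illustrative)
logits = np.random.randn(H, W, 3)     # stand-in for the segmentation network's output

# Per-pixel classification: each pixel gets the class with the highest score.
mask = logits.argmax(axis=-1)         # shape (H, W), values in {0, 1, 2}

# Simplest possible geospatial correlation: count event pixels inside a
# made-up bounding box standing in for a region of interest.
region = mask[100:300, 200:400]
ar_fraction = (region == 2).mean()
print(f"fraction of region pixels labelled atmospheric river: {ar_fraction:.3f}")
```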
And deep learning, as you know, is very successful here, because, for example, the whole autonomous driving industry is doing this day in, day out, and there's a lot of research going on in this direction.

The data set we have is 20 terabytes. We have about 400 terabytes in storage, but for this work we used 20 terabytes of it. And what I call an image here is really a tensor -- a three-dimensional tensor of 1152 by 768 by 16. The channels are not RGB; they represent observables like wind speed, temperature, and pressure at different altitudes, these kinds of things. So they are general observables. We have three classes: background, which is where nothing interesting is going on; tropical cyclones, or hurricanes; and atmospheric rivers. Fortunately, these events are still rare in the future, so 95% of the pixels are background, which is good for us, but it is harder to train a model on that because of this high class imbalance. Another thing that makes this different from classical street-scene segmentation is that, first, there's a lot of stuff going on in the background -- it's not static or slow moving -- and also the objects themselves change rapidly in size and shape. Even when you look at this satellite image of a hurricane, even as an expert you don't actually know where you want to say this hurricane starts or ends, right? So the labels are pretty fuzzy.

Talking about that, how did we get the labels? Of course, the best would be human-annotated labels, but for this data we didn't have those at the time. We are currently working on that, though. For this effort we used algorithmic labeling, which is an old-school approach in the sense that it's based on feature engineering together with some thresholding to get the binary masks. One can ask, OK, why don't you do the predictions with these algorithms, then? Because these algorithms have a lot of shortcomings. They are region dependent, and for different thresholds you get vastly different labels. However, they are still good enough to train a network with, and the network can pick up better features, as I will show you later.

For the image segmentation architecture, we picked a DeepLabv3+ variant, which was developed by Google. Like all these segmentation networks, it has an encoder, which extracts the features; a decoder part, which then makes the predictions; and skip connections that feed the features at different levels from the encoder stage into the decoder to improve the prediction quality. The original DeepLab had a bilinear interpolation as the decoder, and we replaced this with a fully deconvolutional decoder. I think the original choice was made for training reasons, because the bilinear interpolator is easier to train since it doesn't have a lot of weights.

Our model has 44.7 million parameters, and the training cost for a single step on a single sample -- so forward and backward -- is 14.4 teraflop, which is 14.4 times 10 to the 12 floating point operations. On a modern GPU like the Nvidia V100, you can only fit a batch of two in half precision, or a batch of one in single precision. So what you need to do is train it in parallel, and we took a purely data-parallel approach here, using Horovod. Horovod is basically a framework that hooks into the TensorFlow graph and, in a synchronous fashion, reduces tensors across all the workers as they become ready to be reduced. It does this using MPI, the Message Passing Interface, which is a very common framework for exchanging messages in a distributed-memory system such as an HPC system; Horovod basically provides the MPI callback functions for you.
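To make that concrete, here is a minimal sketch of what a Horovod data-parallel setup of this kind typically looks like in today's Keras API -- this is not the team's training code, and the model and dataset here are placeholders. The pattern is: initialize Horovod, pin one GPU per process, wrap the optimizer with Horovod's distributed optimizer, scale the learning rate with the number of workers, and broadcast the initial weights from rank 0.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU; world size and ranks come from MPI

# Pin each worker to a single GPU (one rank per GPU on the node).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model standing in for the DeepLabv3+ variant from the talk.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                           input_shape=(768, 1152, 16)),
    tf.keras.layers.Conv2D(3, 1),  # per-pixel logits for the three classes
])

# Scale the learning rate by the number of workers, as is common for
# synchronous data-parallel training, and let Horovod allreduce the gradients.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

callbacks = [
    # Make sure every worker starts from the same initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# `dataset` would be this worker's shard of the climate snapshots.
# model.fit(dataset, epochs=..., callbacks=callbacks)
```

Launched with something like `mpirun -np 6 python train.py` on each node, every rank then trains on its own shard of the data while Horovod keeps the gradient reductions synchronous.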
The good thing is that, since a lot of people in HPC use MPI, it is usually very highly optimized for these supercomputers. You are still, of course, responsible for sharding your data set, distributing the data, and all these kinds of things.

We ran on the Summit supercomputer. This is the number one supercomputer in the world -- there's this TOP500 list, which is updated twice a year -- and it's the system at Oak Ridge National Laboratory. It consists of 4,600 nodes. Each node has two IBM POWER9 CPUs and six Nvidia V100 GPUs with Tensor Cores, connected with this high-speed NVLink interconnect, which is very nice because we can do allreductions within the node very efficiently. It also features 800 gigabytes of nonvolatile memory per node, which is quite cool because you can stage part of your data set into that and read it at almost DRAM speed. So it's almost as fast as reading from main memory, but it's much bigger. The network is also pretty fast and low latency.

What I want to point out here, though, is that we talk a lot about exascale computing -- the capability of 10 to the 18 floating point operations per second in double precision. That is the next generation of systems we want to develop and deploy. But really, look at it: if you can stick with half precision -- if you have an application that can use half precision for most of the computations -- you have an exascale system available right now. It's there, it's in Oak Ridge, you can just go and use it.

There are some performance optimizations necessary, of course. When you think about deep learning, you have to optimize the whole pipeline, starting from the data: where do you read it from, where do you stage it, and then how do you feed it efficiently to the accelerators? The accelerators are so fast that you need to feed them efficiently so they don't stall waiting for data. For the computational part, you want to minimize data reorganization, for example. And the reductions also need to be very efficient, because you want to reduce the gradients at a very high frequency. One thing we also used was an overlapping, or gradient-pipelining, or asynchronous approach, call it what you want, where you do not compute the fresh gradients, reduce them, and then integrate them. Instead, on the GPU you compute the fresh gradients, and meanwhile, driven from the CPU, you read the gradients from the last step from a buffer, reduce those asynchronously to the computation of the new gradients, and integrate them into the model. By that, you can overlap these two steps very nicely.
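A minimal sketch of that lagged-gradient overlap, just to make the idea concrete. This is not the team's implementation: a background thread and a trivial averaging function stand in for the real allreduce, and numpy arrays stand in for the model and its gradients. The point is that the reduction of step t-1's gradients runs while step t's gradients are being computed, and the weight update always uses gradients that are one step old.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
lr = 0.01

def compute_gradients(w):
    """Stand-in for the forward/backward pass on the GPU."""
    return rng.normal(size=w.shape)

def allreduce(grads):
    """Stand-in for the MPI/Horovod allreduce across workers."""
    return grads  # a single 'worker' here, so the average is the gradient itself

executor = ThreadPoolExecutor(max_workers=1)
prev_grads = compute_gradients(weights)      # step 0: nothing to overlap yet

for step in range(1, 10):
    # Kick off the reduction of the *previous* step's gradients ...
    pending = executor.submit(allreduce, prev_grads)
    # ... while the fresh gradients for this step are being computed.
    new_grads = compute_gradients(weights)
    # Apply the (now reduced) old gradients: the update lags by one step.
    weights -= lr * pending.result()
    prev_grads = new_grads
```

Because the update always lags by one step, the optimization is slightly asynchronous, which is exactly the initial-instability-then-catch-up behavior described for the lagged version a bit later in the talk.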
This is a plot of the performance we got. You see the throughput metric of images per second -- or call it samples per second -- versus the number of GPUs; if you divide it by 6, you get the number of nodes. The other y-axis is basically a translation of this image throughput into the more HPC metric of petaflops per second, so 10 to the 15 operations per second. The FP32, single-precision points are the blue ones; I don't want to talk about these. What you can see is that FP16, half precision, performs much, much better. The Tensor Cores can, in theory, deliver 125 teraflops per card, and that is why you see this vast performance difference. The dashed line represents the ideal case: in the ideal case, where you don't have any loss due to communication, you would basically be on this line. We are a bit below that with the solid red line, but not far off -- I think it's 70-something percent, 79% scaling efficiency. And you also see that the lagged version -- where you can overlap the computation with the communication very nicely -- is really crucial here, because the GPUs are so fast that otherwise they just sit and wait for the allreduce.

After we saw this, we thought, OK, we can go to a couple more nodes, but we might still not hit the exaflop mark, which is 1,000 petaflops per second. So we restructured the decoder a little bit -- not in terms of its predictive power, but we removed some additional data transpositions -- ran it on a couple more nodes, and actually got there. The performance number we got at that scale was 1.13 exaflops in FP16, so half precision, on 27,360 GPUs. As far as I'm aware, that is the biggest deep learning calculation so far.

This is the training loss, at a slightly smaller scale; we don't have the full history for the big scale. However, the case I want to make here is that the lagged version, although it's partially asynchronous -- but predictably asynchronous, in a way -- makes the network a bit unstable at the beginning. The training loss grows and oscillates heavily. But if you just wait long enough, it will outperform the unlagged version. That, of course, is not true for every arbitrary deep learning network, but for us it's definitely true, and I think it's worth a try if you have a problem like that.

Talking about the results, I have a video for this. On the left-hand side, you see the weather patterns predicted by the model; on the right-hand side, you see the ground truth. I have three things to say. First, there is qualitative and also quantitative agreement, which is satisfactory. What you also see is that there are more predicted events than there are in the labels, and that is mainly because the aggressive thresholding sometimes forgets to label things. When we show some of the samples where we over-predict atmospheric rivers, for example, to experts, they say, yes, actually the model picked up an atmospheric river which was not present in the ground truth. And then you can see that the ground truth in the video is flickering: there is a frame before and after where it, for example, picked up an atmospheric river, but a frame in between where it did not. Of course, it should be continuous; it should not be like this. The model actually predicts something which is much more continuous and much more smooth, even though it does not take the temporal dependence into account. So that is quite interesting.

So my conclusions: TensorFlow is one of the first applications which reached exascale performance, although only in FP16, but it's still remarkable, and I think this is a community achievement. And HPC systems are suitable for these workloads.
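As a rough sense of what that headline number means per device -- simple back-of-the-envelope arithmetic on the figures quoted above, not numbers taken from the paper:

```python
# Back-of-the-envelope arithmetic on the numbers quoted in the talk (illustrative only).
sustained_flops = 1.13e18      # 1.13 exaflops sustained in FP16
num_gpus = 27_360
peak_per_gpu = 125e12          # theoretical Tensor Core peak per V100

per_gpu = sustained_flops / num_gpus        # ~4.1e13 FLOP/s per GPU
fraction_of_peak = per_gpu / peak_per_gpu   # ~0.33

print(f"sustained per GPU: {per_gpu / 1e12:.1f} TFLOP/s "
      f"({fraction_of_peak:.0%} of Tensor Core peak)")
```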
Of course, there are some insufficiencies -- for example, the file system. We needed this large node-local storage in order to feed the data efficiently. If you try to read from a distributed file system, it's very bad, because HPC file systems are optimized for writing large chunks of data, not for doing random reads. So if you want to design an HPC system in the future that is well suited for deep learning, you need to take this into account; this is also very important. And we want to talk to storage people to help us develop better distributed storage which can cope with these workloads.

This work was awarded the ACM Gordon Bell Prize at the last Supercomputing conference. This prize is usually awarded for an interesting and challenging science problem for which you need a massive amount of compute to solve it, and where you can show that you actually used this massive amount of compute efficiently. So this is the paper link. Thank you very much for your attention.

[MUSIC PLAYING]