Let's just see how we would add together two floating point numbers. Say we've got 42, so in floating point representation that would be 1.01010 times 2 to the, 1, 2, 3, 4, 5, so 2 to the 5. And let's add on 6, so 1.10 is 6, and that's times 2 to the 2. So we now need to add these two numbers together.

Before, we would just add them together by going that plus that, that plus that, and that's it. We can't do that with floating point numbers, because the bit patterns for these things are going to look very different. What we have to do first of all is line them up so the bits are in the same place. So we need to shift this one down so that the bit here which represents 4 is in the same position as the bit that represents 4 here. The number of places we need to shift this right is the difference between the big exponent and the little one. In this case it's three places, 1, 2, 3, so we shift it three places to the right.

So rather than just adding them together, we have an extra step first. We've got to expand them out of their bit representations, because remember that this would actually be stored as 0 for the sign, then the leading 1, which is encoded implicitly, then 0 1 0 1 0. And the 8-bit exponent is going to be 127 plus 5, which is 128 plus 4, so that's going to be 1 0 0 0 0 1 0 0. So that's what that's represented by. And this one is going to be similar. It's going to be represented by 0 for the sign, then we've got 1, 2, 3, 4, 5, 6, 7, 8 bits of exponent. The leading 1 is already encoded implicitly, so we store 1 0 and zeros down there, which we can ignore for now. And for the exponent we're going to store 127 plus 2.
This is 1 0 0 0 0 0 0 1, which is 129 for the exponent. So the numbers we've actually got in memory in our computer are represented like this. The first thing we have to do is get them to a point where we can add them, because we can't just add these two bit patterns together any more. We can see that simply by looking over here: if we had 1 plus 1 we'd get 1 0 as the answer, which means the answer would have a 1 here, in the sign bit, which means we'd go from two positive numbers to a negative number, which is definitely wrong.

So we need to unpack this representation into a form that we can add together. One way we could do that is just work out how many bits we would need, assign the bits into the right place and do that, but we can actually use some tricks. We know, for example, that if we're adding two numbers together with a certain number of bits, in this case 24 bits, the biggest result we could get would have a value of roughly 2 to the 25. The other thing we know is that one of these numbers is going to have a greater exponent than the other. So what we can do is say, okay, let's keep that one where it is and shift this one, that is, divide this one by two, until the exponent on it is the same. So if we shift this one place to the right we'd end up with 0.11 times 2 to the 3, another place to the right it would be 0.011 times 2 to the 4, and so on until we end up with that one lined up there, and that becomes times 2 to the 5, and then we have 0.0011 there.

So we did the first step. We need to unpack them from the representations into forms that we can add together, and then we need to shift this one so that the exponents are the same. So we take the smaller one and shift it so the exponents align. Now we can add those numbers together. We can add these because logically we can only produce a number one bit bigger than this if we add them together. One plus one is two, for example.
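The unpack and align steps described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names `unpack_float32` and `align` are made up for this example, only normal numbers are handled (no zero, subnormals, infinities or NaNs), and bits shifted off the right-hand end are simply dropped.

```python
import struct

def unpack_float32(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, significand)."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 sign bit
    biased_exp = (bits >> 23) & 0xFF   # 8 exponent bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits
    # Put the implicit leading 1 back, giving a 24-bit significand
    # (assumption: normal numbers only).
    significand = fraction | (1 << 23)
    return sign, biased_exp - 127, significand

def align(exp_a, sig_a, exp_b, sig_b):
    """Shift the significand with the smaller exponent right until both exponents match."""
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    shift = exp_a - exp_b              # difference between the big and the little exponent
    return exp_a, sig_a, sig_b >> shift

_, e42, s42 = unpack_float32(42.0)
_, e6, s6 = unpack_float32(6.0)
print(e42, format(s42, "024b"))        # 5 101010000000000000000000
print(e6, format(s6, "024b"))          # 2 110000000000000000000000
exp, big, small = align(e42, s42, e6, s6)
print(exp, format(small, "024b"))      # 5 000110000000000000000000
```

The last line shows 6 after being shifted three places to the right, so that its bits sit in the same positions as the bits of 42.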
So 0 plus 0 is 0; 1 plus 1 is 0, carry 1; 0 plus 1 plus 1 is 0, carry 1; 1 plus 0 plus 1 is 0, carry 1; 0 plus 0 plus 1 is 1; and then 1, times 2 to the 5. So we end up with 1.1 times 2 to the 5: we've added 6 on to 42 and got 48 as a result.

So we've done the maths and I could write that back now, but potentially we could have ended up with a 2 here. If we added up 1 and 1, for example, we'd get 2. So we need to do a final step once we've done the addition, which is to normalize this back into the normal form, which in this case would be 1.10000 times 2 to the 5.

So the reason that floating point numbers take much longer to process is that, as well as doing the addition, which you can do in exactly the same way, you also have to take the bits, unpack them from the representation, shift them along so they match up, then do the addition, and then potentially shift them back to get the result into the normalized form, the standard scientific representation.

The other problem you get is that even though we can pack all these numbers into 32 bits, when we slide them along we may end up needing more than 32 bits, as many as 48, to represent things. If we have to slide this one along to the point here when we're doing the maths, then we actually need 48 bits to do the calculation. Of course, that means that on a 32-bit CPU you then have to do two additions, for that half and then that half, and carry the value over from one to the other, which again would slow things down. In hardware, you can build your representations to take care of this. If you've got 64-bit doubles, you know that you perhaps don't need more than a certain number of bits to represent them, and you can build the hardware to take all this into account, and it ends up being much faster.

That must be quite fiddly to do with standard hardware. So is that why we end up with this custom hardware, this floating-point unit?

It's not that much fiddlier.
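Putting the four steps from the explanation together (unpack, align, add, normalize), a minimal software sketch might look like the following. It is not how any particular CPU or library does it. The function name `fp_add` is made up for this example, and it assumes two positive, normal 32-bit floats, drops the bits that are shifted out instead of rounding, and does not check for exponent overflow.

```python
import struct

def fp_add(a, b):
    """Add two positive, normal float32 values using only integer operations."""
    # 1. Unpack: reinterpret each float as a 32-bit integer and split the fields.
    bits_a = struct.unpack(">I", struct.pack(">f", a))[0]
    bits_b = struct.unpack(">I", struct.pack(">f", b))[0]
    exp_a, sig_a = (bits_a >> 23) & 0xFF, (bits_a & 0x7FFFFF) | 0x800000
    exp_b, sig_b = (bits_b >> 23) & 0xFF, (bits_b & 0x7FFFFF) | 0x800000

    # 2. Align: keep the larger exponent, shift the other significand right.
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= exp_a - exp_b        # shifted-out bits are dropped (assumption: no rounding)

    # 3. Add: two 24-bit significands give at most a 25-bit result.
    total = sig_a + sig_b

    # 4. Normalize: if the sum reached 2.0 or more, shift back one place.
    if total >> 24:
        total >>= 1
        exp_a += 1                 # assumption: no check for exponent overflow

    # Repack: drop the implicit leading 1 and rebuild the bit pattern (sign bit 0).
    bits = (exp_a << 23) | (total & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_add(42.0, 6.0))   # 48.0
print(fp_add(1.5, 1.5))    # 3.0, which exercises the normalize step
```

The second call is the "1 plus 1 gives 2" case: the sum of the significands needs 25 bits, so it is shifted back one place and the exponent is increased by one.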
I mean, most computers preserve the carry when they add two values together, so if you add two 32-bit numbers that produce a value greater than 32 bits, they preserve that bit and let you add it on, so you can use multiple registers to do it. But you then have to do two operations, two add operations, one after the other. If you know the operations are going to do this, you can build your hardware to do that in one go. So we could build hardware that would add these together.

There are lots of things you can spot where you could early out. For example, if the exponents were such that these end up so far apart that you know adding this onto this, where there are all zero bits assumed along here, isn't going to make any difference, you can say, well, actually I don't need to do that, I'll just ignore it. If you know the number is zero you can ignore it, and so on. So there are ways you can speed things up when writing the software, and I suspect the hardware does some of these things as well.

It does lead to an interesting thing if you think about the way the mathematics works. With integer numbers, multiplying is trickier than addition, because you end up having to do lots of shifts and adds. Multiplying two floating point numbers is relatively straightforward compared to addition, because we just have to multiply the two mantissas, adding the extra bit back in if it's there, and then add the exponents together. So multiplication actually becomes much simpler to do with floating point numbers than addition, because the addition requires us to unpack everything and push the bits around to get things in the right place.

Now I've got the token, so I can load a value in, add the value from a register onto it, store it back and hand the token on. And now I've got the token again, so I can load something into my register, add something onto it, store it back and pass the token on. And I've got it, so I can load the value in, add the value from a register, store it back.
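Going back to the remark that multiplication only needs the mantissas multiplied and the exponents added, that can be sketched as follows. As before this is an illustration under assumptions, not a real implementation: the name `fp_mul` is made up, the inputs are taken to be positive, normal 32-bit floats, low bits are dropped instead of rounded, and exponent overflow or underflow is not checked.

```python
import struct

def fp_mul(a, b):
    """Multiply two positive, normal float32 values using only integer operations."""
    bits_a = struct.unpack(">I", struct.pack(">f", a))[0]
    bits_b = struct.unpack(">I", struct.pack(">f", b))[0]
    # Unpack, adding the implicit leading 1 back into each 24-bit significand.
    exp_a, sig_a = (bits_a >> 23) & 0xFF, (bits_a & 0x7FFFFF) | 0x800000
    exp_b, sig_b = (bits_b >> 23) & 0xFF, (bits_b & 0x7FFFFF) | 0x800000

    # Add the exponents. Both are stored with a bias of 127, so remove one copy.
    exp = exp_a + exp_b - 127

    # Multiply the significands: no alignment is needed. The product of two
    # 24-bit values is up to 48 bits wide, with 46 bits after the binary point.
    product = sig_a * sig_b

    # Normalize: if bit 47 is set the product is 2.0 or more, so shift back once.
    if product >> 47:
        product >>= 1
        exp += 1

    # Reduce to 23 fraction bits (assumption: extra bits are dropped, not rounded).
    sig = product >> 23
    bits = (exp << 23) | (sig & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_mul(42.0, 6.0))   # 252.0
print(fp_mul(6.0, 6.0))    # 36.0, which exercises the normalize step
```

Compared with the addition steps, there is no shifting of one operand to line it up with the other: the only shift is the optional one-place normalization at the end.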
Floating Point Numbers (Part2: Fp Addition) - Computerphile. Posted by 林宜悉 on 14 January 2021.