We were talking a few weeks ago about how we can add additional processors into a computer to do specialist tasks. One of the things we talked about was floating-point processors. These days they're built directly onto your CPU, but some CPUs (some of the variants of the ARM CPU, and certainly, if you go back in history, some of the other CPUs you could get) didn't have a floating-point unit. They could do maths, but they could only do integer maths. So they could add 1 and 2, but they couldn't add 1.5 and 2.5. Well, you "could", but you had to write the software to handle the fractional part of the number, do the maths, and stick it back together in the right order.

So I was wondering: what difference in speed would a floating-point unit actually make? As I said, most computers we have these days have floating-point units of some sort built in, so I went back to one of my old ones and decided to write a program to test it. I wrote a very simple program which draws a 3D spinning cube. The program is relatively straightforward: it's got a series of eight points which it stores as its representation, it does a series of matrix transformations on them to get them into screen coordinates, and then we draw the lines (there's a sketch of this kind of test at the end of this section). I did this using floating-point maths, and with the program running here we can see it's reasonably quick: it takes about 0.2 of a second to calculate where all the screen coordinates need to be for each frame. It sometimes varies, but in general that's what it takes.

So I then went off onto a popular auction site beginning with the letter E and ordered myself a floating-point chip for the Falcon, inserted it into the machine, and recompiled the program, this time to use the floating-point unit. The first version is doing floating-point maths, it's using the fractions, but it's all being done in software: it's machine code instructions for the 68030 chip in there calculating all those different floating-point numbers. We then compiled the program to actually use the floating-point unit, and this version runs about 4.5 times faster: it takes 0.045 seconds to do exactly the same calculations. This is exactly the same source code; I just recompiled it using GCC to produce a version that used the floating-point unit, and you can actually see that the graphics are slightly smoother and the time is much less.

So the fact that we can speed things up by doing them in hardware perhaps isn't that surprising. There are lots of tasks where you could either implement something in software or implement it in hardware, and if you implement it in hardware it's often faster. But it's actually worth thinking about what's involved in adding up floating-point numbers. Tom did a good video right back at the beginning of Computerphile looking at how floating-point numbers are represented, at a sort of high level: why the computer will say 0.9999999 and so on until after a while you stop it. But when you get down to the level of the computer actually having to deal with them, it's quite interesting to see how they're stored, and how manipulating them, even writing software to do something simple like adding two numbers together, actually ends up being quite an involved task compared to adding together two binary numbers.
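As a concrete illustration, here is a minimal sketch in C of the kind of benchmark described above. It is not the actual program from the video: the cube data, the rotate_y helper, and the clock() timing are my own assumptions. On m68k GCC the two builds would differ only in compiler flags: -msoft-float for the software floating-point version, -m68881 to use the floating-point unit (link with -lm).

```c
/* Minimal sketch of a cube-transform benchmark, not the video's code. */
#include <stdio.h>
#include <math.h>
#include <time.h>

#define FRAMES 100

static float cube[8][3] = {            /* the eight corners of a cube */
    {-1,-1,-1}, { 1,-1,-1}, { 1, 1,-1}, {-1, 1,-1},
    {-1,-1, 1}, { 1,-1, 1}, { 1, 1, 1}, {-1, 1, 1}
};

/* Rotate all eight points about the Y axis by `angle` radians. */
static void rotate_y(float angle, float out[8][3])
{
    float c = cosf(angle), s = sinf(angle);
    for (int i = 0; i < 8; i++) {
        out[i][0] =  c * cube[i][0] + s * cube[i][2];
        out[i][1] =  cube[i][1];
        out[i][2] = -s * cube[i][0] + c * cube[i][2];
    }
}

int main(void)
{
    float out[8][3];
    clock_t start = clock();
    for (int f = 0; f < FRAMES; f++)
        rotate_y(f * 0.05f, out);      /* one "frame" of transforms */
    clock_t end = clock();
    printf("%.4f seconds per frame\n",
           (double)(end - start) / CLOCKS_PER_SEC / FRAMES);
    return 0;
}
```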
So let's look at how we add numbers together. Let's take a number, and to save time I've printed out the bits. Let's take the number 42, because, well, everyone uses that. So we've got here the number 42, which is 101010, and then we need to fill the rest of these bits with zeros; we'll ignore those for now. That's bit 0 over on the right-hand side, through bit 1, 2, 3 and so on, and these of course correspond to the powers of 2: 2 to the 0 is one, 2 to the 1 is two, then four, eight, just like we have powers of 10 when we do decimal numbers.

So let's add together 42 and 23. I've got another binary number here, 23, with the same bits, and we'll just basically do addition. 0 plus 1, Sean, is 1. Good, yeah. OK, 1 plus 1 is 0 and we have to carry the 1. 0 plus 1 plus 1... OK, yeah. 1 plus 0 plus 1... So yeah, we've worked it up: this number's 42, this number's 23, 42 plus 23 is 65, and we've produced 65 in binary as the result. So adding up in binary is a relatively straightforward thing. What we do is take each pair of bits from the right, add them together, produce a sum bit, and on occasion we also produce a carry bit, and then we add the carry on in the next column, just like we do in decimal arithmetic.

And you can build systems that represent decimals, or "bicimals" I guess they'd be called, fractional numbers, using this. You can use a system, which is quite common (it was used in Adobe Acrobat, and was used on iOS for 3D graphics at one point), called fixed-point numbers, where you say that out of, say, 32 bits, the top 16 bits are going to represent the integer part and the bottom 16 bits are going to represent the fractional part. The basic way to think about that is that you've multiplied every number by 65,536, which shifts everything along, and then when you want to produce the final result you divide it all by 65,536 (there's a short code sketch of this scheme below).

Now, the problem with fixed-point numbers is that they have a fixed scale; "fixed" is the key in the name. For example, if we use 32-bit fixed-point numbers split into 16 bits and 16 bits, that's great: we can go up to 65,000 or so in the integer part. But if we need to get to 70,000, we can't store it. Likewise, we can go down to 1/65,536, but if we needed to go to 1/131,072, we can't, because we don't have that resolution. On occasion we need the bits down here to represent very small quantities, and on occasion we want them to represent very large quantities. For something like 3D graphics, or graphics in general, fixed-point numbers can work well; for general-purpose things, they don't work that well.

So what people tend to do is use floating-point numbers, which is, as Tom said, like writing things in scientific notation. Rather than writing 1024, we write it as 1.024 × 10³. We can do the same in binary: rather than writing 10101, we can write 1.0101 × 2⁴, this time a power of 2 rather than a power of 10. So what floating-point numbers do is say: OK, rather than representing numbers using a fixed number of bits for each part, we're going to represent them in scientific notation, effectively.
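Here is a short sketch of that 16.16 fixed-point scheme, assuming nothing from the video beyond the split just described; the fix16 type and the function names are hypothetical.

```c
/* 16.16 fixed point: the top 16 bits hold the integer part, the bottom
 * 16 the fraction, i.e. every value is stored multiplied by 65,536. */
#include <stdio.h>
#include <stdint.h>

typedef int32_t fix16;

#define FIX_ONE 65536                       /* 1.0 in 16.16 */

static fix16  fix_from_int(int n)    { return n * FIX_ONE; }
static double fix_to_double(fix16 a) { return a / 65536.0; }

/* Addition is plain integer addition: both operands carry the same
 * scale factor. */
static fix16 fix_add(fix16 a, fix16 b) { return a + b; }

/* Multiplication needs a 64-bit intermediate and a shift back down,
 * since (a*65536)*(b*65536) = (a*b)*65536*65536. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

int main(void)
{
    fix16 a = fix_from_int(3) + FIX_ONE / 2;   /* 3.5  */
    fix16 b = FIX_ONE / 4;                     /* 0.25 */
    printf("3.5 + 0.25 = %f\n", fix_to_double(fix_add(a, b)));
    printf("3.5 * 0.25 = %f\n", fix_to_double(fix_mul(a, b)));
    return 0;
}
```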
We store the sort of number that we're then going to multiply by 2-to-the-something to shift the point to the right place. To make things absolutely clear, I'm going to use decimal digits here to write the exponent, the 2 to the 4; I'll cheat and write it like that, but of course inside the machine it would be in binary, 2 to the 100.

So I guess the question that remains is: how do we represent this in a computer? We've got to take this notation, which we can write nicely on a piece of paper, a binary number multiplied by a power of 2, and find an encoding which represents it as a series of bits that the computer can then deal with.

We're going to look at 32-bit floating-point numbers, mainly because the number of digits I have to fill in becomes relatively small compared to doing 64-bit. We could have done 16-bit and similar sizes; they use the same scheme, it's just that the way they break it down changes slightly, in how many bits are assigned to each section.

So we've got our 32 bits and we need to represent this number in there. We start off by splitting the bits into a few different things. The first bit, the most significant bit in the number, the one on the left over here, is the sign bit, and that says whether it's a positive number, in which case it will be 0, or a negative number, in which case it will be 1. So unlike two's complement, which David's looked at in the past (two's complement is equivalent to the one's complement with one added to it), the sign is represented purely as one bit: 0 means positive, 1 means negative.

They then say we're going to have eight bits to represent the exponent, i.e. what power of 2, which gives us 255 or so different powers of two we can use; we'll come back to how that's represented in a second. And then the rest is used to represent the mantissa, as it's referred to: the remaining 23 bits of the 32 are used to represent the remaining part of the number.

OK, so they've got 23 bits to represent the number which is then going to be multiplied by 2 to the 8-bit exponent. And they noticed that every single possible floating-point number you're going to write down has a 1 as its most significant digit, except 0. So they say: OK, we'll treat 0 as a special case, and to represent 0 we just set all the bits to zero. That means we know this leading digit is going to be 1, so we don't need to encode it; it's always going to be 1. So actually these 23 bits here are the bits that come after the 1: it's 1-point-something, and these are all the bits that come after the point. We sort of don't encode that bit, because we know it's there.

One way to think about floating-point numbers is as a sort of lossy compression mechanism for real numbers, fractional numbers, because we're taking a number in some representation and compressing it into these bits. But we lose some information, and we'll see that in a second when we run a little demo: it can't represent all numbers, and it's surprising sometimes which numbers it can't represent and which it can. So we can then start writing numbers in this form.
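A quick way to see this layout is to pull a float's bits apart in code. This is a minimal sketch assuming only the standard IEEE 754 single-precision layout just described (1 sign bit, 8 exponent bits stored with 127 added on, 23 mantissa bits); the dissect helper is my own.

```c
/* Decompose a 32-bit float into sign, biased exponent, and mantissa. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static void dissect(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* reinterpret the float's bits */

    unsigned sign     = bits >> 31;            /* 1 bit                */
    unsigned exponent = (bits >> 23) & 0xFF;   /* 8 bits, bias of 127  */
    unsigned mantissa = bits & 0x7FFFFF;       /* 23 bits after the 1. */

    printf("%12f  sign=%u  exponent=%3u (2^%d)  mantissa=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void)
{
    dissect(1.0f);     /* sign 0, exponent 127 (i.e. 2^0), mantissa 0 */
    dissect(-2.5f);
    dissect(0.0f);     /* the all-zeros special case */
    return 0;
}
```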
To simplify things, I've printed out a form like this. So if you want to write out the number one: it's 1.0000... times 2 to the power of 0. One point nought, nought, nought, all the way along, times 2 to the power of 0, which is 1, so it's 1 times 1, which is 1. And of course the sign bit, because it's positive, would be 0. So we can start assigning these things to the different bits. We put a 0 there because it's positive, and the mantissa is all zeros, so we just fill those up with zeros, and that leaves us with these 8 bits here, where we've got to represent 2 to the power of 0.

Now, they could have decided to just put 0 in there, but then the number 1 would have exactly the same bit pattern as the number 0, and they decided that's potentially not a good idea. So what they actually do is take the power, which can go from minus 127 through to 127, and add 127 onto it. Our exponent here, our power of 2, is 0, and 0 plus 127 is obviously 127, so we encode 127 into those remaining bits: 01111111. So to encode the number 1, we encode it into the binary representation: 0 for the sign bit, 01111111 (127) for the exponent, and then, because we know the leading 1 is already accounted for, the rest becomes 0. This is a lossy system. We can encode some numbers, but we're only encoding 24 significant bits; where those bits sit within the number changes with the encoding, but we're only ever encoding 24 significant bits.

So let's just write a program that takes the number 16,777,215, an integer number, and adds one to it, and we'll do this in a loop: we'll add one to the result, add one to that, and print out the values. So we'd expect 16,777,216, then 16,777,217, and so on. And we'll do this with both an integer variable, a 32-bit integer, and also a 32-bit float.

So I've got the program written here on the computer. We set up a float y, and we set up the variable i to be 16,777,215 (that's twenty-four ones in binary), and we set y to equal i, so they both start off with the same value. We're then going to print them out, both the decimal and the floating point; I'm also going to print out the hexadecimal representations of the bits so we can see what's going on. We're then going to add 1 to the value of y and add 1 to the value of i, so we're going to increment them both.

So let's compile this program. No mistakes; that's always a good sign. And let's run it. We get 16777215 and we get 16777215.000000, what we'd expect. Add one and we get 16,777,216, and the same there. So now we add one to it again, and for the integer value we get 16,777,217, but the float still says 16777216.000000. That's not right. OK, so what's going on there?

Well, let's think about how we represent this. Think about the number 16,777,216: that number is 1 times 2 to the 24, and I sort of tricked you by generating it this way at the beginning. It's a one with lots of zeros after it, times 2 to the 24. We have only 23 bits to represent this part in here; if we want to add on an extra bit at the bottom, we would need 24 bits here, and we've only got 23. We can't do it, so we can't represent that number. If we added 2 each time, it would work fine. So actually, as we get to larger numbers, we still have the same number of significant bits, or significant digits, but we can't store certain values in the way we can with integers.
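Here is a reconstruction of that demo. The actual source isn't shown on screen, so beyond the variables i and y and the behaviour described above, the details are my own guesses.

```c
/* Start an int and a float at 16,777,215 (2^24 - 1), add one to each,
 * and print the values along with the raw bit patterns in hex. */
#include <stdio.h>
#include <string.h>

static unsigned bits_of(float f)
{
    unsigned b;
    memcpy(&b, &f, sizeof b);   /* grab the float's raw bit pattern */
    return b;
}

int main(void)
{
    int   i = 16777215;         /* 2^24 - 1: twenty-four ones in binary */
    float y = (float)i;

    for (int step = 0; step < 3; step++) {
        printf("i = %d (0x%08X)   y = %f (0x%08X)\n",
               i, (unsigned)i, y, bits_of(y));
        i = i + 1;              /* the integer increments happily        */
        y = y + 1.0f;           /* once y hits 2^24 the +1 is lost: the  */
                                /* mantissa has no bit left to hold it   */
    }
    return 0;
}
```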
So it's a lossy compression system, basically. We can store a large range of values, anything from minus 2 to the power of 127 through to 2 to the power of 127, or we can go very, very low and have numbers as small as 2 to the minus 127, but we only ever have a certain amount of precision. If we deal with very, very large numbers, we've still only got 23 bits of precision, and if we deal with very small numbers, we've still got 23 bits' worth of precision. And that's fine; we can cope with that, because often when you're dealing with big numbers you're not worried about the small fiddly decimal places, and with small numbers you only need so many significant figures. If you're measuring how far it is from the Earth to Alpha Centauri in millimetres, plus or minus a few millimetres, a few tenths of a millimetre isn't going to make much difference, that sort of thing. So it's a compression, but it's a lossy system.

Oh, the rest of this video is going to mean writing zeroes 23 times... maybe I should have done 16-bit numbers.
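If it's useful, here is a small sketch showing how coarse the gaps between neighbouring floats get at large magnitudes, using the standard nextafterf() from <math.h> (link with -lm); the example values are my own.

```c
/* Print the gap between a float and the next representable float. */
#include <stdio.h>
#include <math.h>

static void gap(float x)
{
    printf("near %g the next float is %g away\n",
           x, nextafterf(x, INFINITY) - x);
}

int main(void)
{
    gap(1.0f);         /* gaps of about 1.2e-7                         */
    gap(16777216.0f);  /* 2^24: gaps of 2, which is why the +1 was lost */
    gap(4.0e19f);      /* roughly Earth-to-Alpha-Centauri in millimetres */
    return 0;
}
```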
Floating Point Numbers (Part 1: Fp vs Fixed) - Computerphile