Ordinary integers can only represent integral values. "Floating-point numbers" can represent non-integral values. This is useful for engineering, science, statistics, graphics, and any time you need to represent numbers from the real world, which are rarely integral!

Floats store numbers in an odd way--they're really storing the number in scientific notation, like

x = + 3.785746 * 10^e

Note that:

- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer.
- The 3.785746 part, called the "mantissa", can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!)

In binary, you can represent a non-integer like "two and three-eighths" as "10.011". That is, there's:

- a 1 in the 2's place (2 = 2^1),
- a 0 in the 1's place (1 = 2^0),
- a 0 in the 1/2's place (1/2 = 2^-1), just past the "binary point",
- a 1 in the 1/4's place (1/4 = 2^-2), and
- a 1 in the 1/8's place (1/8 = 2^-3).

x = + 3.785746 * 10^e

It's common to "normalize" a number in scientific notation so that:

- There's exactly one digit to the left of the decimal point.
- And that digit ain't zero.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So sometimes there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1. This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)

For example, adding a small value to a big value in decimal scientific notation, the exact answer is

1.234 * 10^4 + 7.654 * 10^0 = 1.2347654 * 10^4

But to three decimal places,

1.234 * 10^4 + 7.654 * 10^0 = 1.234 * 10^4

which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places. This "roundoff" has implications.

For example, adding one repeatedly will eventually stop doing anything:

```cpp
float f=0.73;
while (1) {
	volatile float g=f+1;
	if (g==f) { /* adding 1 no longer changes f */
		printf("f+1 == f at f=%.3f, or 2^%.3f\n",
			f,log(f)/log(2.0));
		return 0;
	}
	else f=g;
}
```

Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

For another example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result (up to roundoff):

(1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0 = 1.2355308 * 10^4

1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0) = 1.2355308 * 10^4

In other words, parentheses don't matter if you're computing the exact result. But to three decimal places,

1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0) = 1.235 * 10^4

(1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0 = 1.234 * 10^4

In the first line, the small values get added together, and together they're enough to move the big value. But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!

```cpp
double lil=1.0;
double big=pow(2.0,64);
printf(" big+(lil+lil) -big = %.0f\n", big+(lil+lil) -big);
printf("(big+lil)+lil  -big = %.0f\n",(big+lil)+lil -big);
```

```cpp
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
	windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
```

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:

```cpp
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
	gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
```

Roundoff can be very annoying, but it doesn't matter if you don't care about exact answers, like in simulation (where "exact" means the same as the real world, which you'll never get anyway) or games.

One very frustrating fact is that roundoff depends on the precision you keep in your numbers, which in turn depends on the size of the numbers. For example, a "float" is just 4 bytes, but it's not very precise--only about 7 decimal digits. A "double" is 8 bytes, and good for about 15-16 digits. A "long double" is 12 bytes (or more!), but it's got tons of precision.

```cpp
for (int i=1;i<1000000000;i*=10) {
	double mul01=i*0.1;
	double div10=i/10.0;
	double diff=mul01-div10;
	std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
```

On the NetRun Pentium4 CPU, this gives:

```
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
```

That is, there's a factor of 10^-18 difference between double-precision 0.1 and the true 1/10!

```cpp
for (double k=0.0;k<1.0;k+=1.0/6.0) {
	printf("k=%a (about %.15f)\n",k,k);
}
```

The trouble is of course that 1/6 can't be represented exactly in floating-point, so if we add our approximation for 1/6 six times, we haven't quite hit 1.0, and the loop executes one additional time. There are several possible fixes for this:

- Don't use floating-point as your loop variable. Loop over an integer i (without roundoff), and divide by six to get k.
- Or you could adjust the loop termination condition to "k<1.0-0.00001", where the "0.00001" provides some safety margin for roundoff.
- Or you could use a lower-precision comparison, like "(float)k<1.0f". This also provides roundoff margin, because the comparison takes place at the lower "float" precision.