IEEE P754 Arithmetic in Hi-Tech C 3.09

Representation

In general, the format used to represent floating point numbers in computers is the one dictated by the IEEE 754 standard. Floating point numbers form a discrete subset of the real field.

The IEEE 754 standard defines a form of representation of the real number: the number is first split into its integer part and its fractional part and converted to binary; then, once normalized ¹ , it is put in the form ± 1.mantissa × 2 ^E .

For example, 1000.0 can be written + 1 × 10 ³ in base ten and + 1.953125 × 2 ⁹ in base two.
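The normalization step can be sketched in a few lines of C. This is only an illustration on ordinary `double` values, not the compiler's own routine; the function name `normalize2` is hypothetical. It halves or doubles a positive, finite number until a single 1 is left before the point, counting the power of two along the way.

```c
#include <assert.h>

/* Normalize a positive, finite x into m * 2^(*e) with 1 <= m < 2.
   Hypothetical helper written for this example only. */
double normalize2(double x, int *e)
{
    assert(x > 0.0);
    *e = 0;
    while (x >= 2.0) { x /= 2.0; (*e)++; }   /* too large: halve, exponent up   */
    while (x < 1.0)  { x *= 2.0; (*e)--; }   /* too small: double, exponent down */
    return x;
}

/* Usage check: 1000.0 = 1.953125 * 2^9, because 1.953125 * 512 = 1000. */
void normalize2_example(void)
{
    int e;
    double m = normalize2(1000.0, &e);
    assert(m == 1.953125 && e == 9);
}
```

Dividing and multiplying by 2 is exact in binary floating point, so the comparison with `==` is safe here.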

In our computer the numbers in Floating Point are written in this form:

One bit for the sign of the mantissa, 7 bits for the exponent and 24 bits for the mantissa. This is because the designers of the Hi-tech C compiler slightly modified the IEEE P754 single-precision format to better handle FP numbers with the machine registers of the Z80 processor. Using the two 16-bit register pairs HL and DE, they could hold the entire FP number in registers.

Let’s now see an example of a number in FP. The number 3489.0 in decimal base is encoded in hexadecimal as 4C DA 10 00 . The compiler places these values in the machine registers: in HL we will have 4C DA and in DE we will have 10 00 . So in the machine registers HL and DE we have:

H	L	D	E
4C	DA	10	00
0100 1100	1101 1010	0001 0000	0000 0000

In H, bit 7 is the sign (1 negative, 0 positive) and the remaining 7 bits hold the exponent, which can take 128 values. 2 ⁰ is fixed at the decimal value 65 ( 100 0001 binary, 41 hexadecimal): values greater than 65 are positive powers; values below 65 are negative powers. In this case the sign is positive (bit 7 = 0) and the exponent 100 1100 is 76 decimal. Since 65 decimal stands for the power 2 ⁰ , the difference 76 – 65 = 11 gives the exponent of two, so 2 ¹¹ .

Let us now see the mantissa which is contained in the 3 bytes of the L and DE registers.

1101 1010 0001 0000 0000 0000

The first bit on the left has the weight 2 ⁰ , the second 2 ⁻¹ , the third 2 ⁻² and so on. Multiplying each bit of the mantissa by the power of two for its position, and omitting the null terms of the sum, we obtain:

1 + 1/2 + 1/8 + 1/16 + 1/64 + 1/2048 = 1.70361328125

So we have 1.70361328125 * 2 ¹¹ or 3489.0 in decimal base.
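The decoding just described can be sketched in C as follows. The function name `ht_decode` is hypothetical, and the code assumes only what the text states: sign in bit 7 of H, exponent stored with 2 ⁰ at 65, and a 24-bit mantissa in L, D, E whose leftmost bit weighs 2 ⁰ .

```c
#include <assert.h>

/* Decode the four bytes H, L, D, E of a Hi-Tech C float into a double.
   Hypothetical helper for illustration; not part of any library. */
double ht_decode(unsigned char h, unsigned char l,
                 unsigned char d, unsigned char e)
{
    int sign = (h >> 7) & 1;            /* bit 7 of H                   */
    int exponent = (h & 0x7F) - 65;     /* 65 stands for 2^0            */
    unsigned long mantissa = ((unsigned long)l << 16) |
                             ((unsigned long)d << 8) | e;
    /* leftmost mantissa bit weighs 2^0, so the integer counts 2^-23 units */
    double value = (double)mantissa / 8388608.0;

    while (exponent > 0) { value *= 2.0; exponent--; }
    while (exponent < 0) { value /= 2.0; exponent++; }
    return sign ? -value : value;
}

/* Usage check against the worked example: 4C DA 10 00 is 3489.0. */
void ht_decode_example(void)
{
    assert(ht_decode(0x4C, 0xDA, 0x10, 0x00) == 3489.0);
}
```

The scaling uses repeated multiplication by 2 rather than a library call so that the sketch depends on nothing beyond the language itself.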

Sum and difference

Now let’s see how two numbers can be added in FP form.

Let’s consider the two numbers:

4C DA 10 00		0100	1100	1101	1010	0001	0000	0000	0000
CD DB 01 01		1100	1101	1101	1011	0000	0001	0000	0001

The first number starts with 0 so the sign of the mantissa is positive, the second starts with 1 so the sign of the mantissa is negative.

In the second FP number the exponent (the seven bits that follow the sign bit) is 77 decimal. We know that the offset is 65 decimal, so the exponent of 2 is 12 decimal. With the calculation shown earlier, the mantissa gives us 1.71096813678.

So 1.71096813678 * 2 ¹² gives -7008.1255 (approximately), taking the sign into account.

Let’s now see an algorithm that allows us to add two numbers in FP format.

- Compare the values of the exponents.
- If the exponents differ, denormalize the FP number with the smaller exponent: make its exponent equal to the larger one and shift the bits of its mantissa to the right accordingly ² .
- Take the 2’s complement of the mantissa of the negative FP number.
- Add the mantissas.
- Normalize the result.

Let’s clarify this with an example, using the two previous FP numbers: 1.70361328125 x 2 ¹¹ and -1.71096813678 x 2 ¹² .

The representation in hexadecimal and in binary of the two numbers is:

4C DA 10 00		0100 1100 1101 1010 0001 0000 0000 0000
CD DB 01 01		1100 1101 1101 1011 0000 0001 0000 0001

We see that the exponents differ by 1 (4C and 4D, leaving the sign bit aside), so the FP number with the lower exponent, 2 ¹¹ , must be brought to 2 ¹² .

After incrementing the exponent by one and shifting the mantissa one place to the right, the first FP number becomes:

4D 6D 08 00

0100 1101 0110 1101 0000 1000 0000 0000

which is now 0.851806640625 x 2 ¹² or again 3489.0 in decimal base (note that 0.851806640625 is half of 1.70361328125). Now the exponents are the same (both 4D), apart from the 1 bit of the sign of the second FP number.
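As an illustration, the alignment step can be mimicked in C byte by byte, following the register-level description in note 2 (a logical shift on L, then rotations through the carry on D and E). The name `ht_align_step` is hypothetical and the sketch assumes 8-bit bytes; it does not check for exponent overflow.

```c
#include <assert.h>

/* One alignment step on the bytes H, L, D, E (f[0]..f[3]) of a Hi-Tech C
   float: exponent + 1, 24-bit mantissa shifted one place to the right.
   Hypothetical helper; exponent overflow is not handled in this sketch. */
void ht_align_step(unsigned char f[4])
{
    unsigned char carry, next;

    f[0] = (unsigned char)(f[0] + 1);   /* increment the exponent in H */

    carry = f[1] & 1;                   /* like SRL L: zero enters the MSB */
    f[1] = (unsigned char)(f[1] >> 1);

    next = f[2] & 1;                    /* like RR D: old carry enters the MSB */
    f[2] = (unsigned char)((f[2] >> 1) | (carry << 7));
    carry = next;

    /* like RR E: the bit shifted out of E is lost */
    f[3] = (unsigned char)((f[3] >> 1) | (carry << 7));
}

/* Usage check: 4C DA 10 00 becomes 4D 6D 08 00. */
void ht_align_example(void)
{
    unsigned char f[4] = { 0x4C, 0xDA, 0x10, 0x00 };
    ht_align_step(f);
    assert(f[0] == 0x4D && f[1] == 0x6D && f[2] == 0x08 && f[3] == 0x00);
}
```

When the exponents differ by more than one, the step is simply repeated until they match.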

To add the mantissas we must take the 2’s complement ³ of the mantissa of the second FP number, which is negative. The 2’s complement operation is necessary because the addition machine instructions do not consider the sign of numbers; a single digital electronic circuit, the adder, is used for both addition and subtraction. Let’s now take the second FP number:

CD DB 01 01

1100 1101 1101 1011 0000 0001 0000 0001

By inverting the bits of the mantissa and adding 1 we have:

CD 24 FE FF

1100 1101 0010 0100 1111 1110 1111 1111
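A minimal C sketch of this step on the 24-bit mantissa alone, leaving the sign and exponent byte untouched. The function name `mant24_complement` is hypothetical.

```c
#include <assert.h>

/* 2's complement of a 24-bit mantissa: invert the bits, add 1 and keep
   only the low 24 bits. Hypothetical helper for illustration. */
unsigned long mant24_complement(unsigned long m)
{
    return (~m + 1UL) & 0xFFFFFFUL;
}

/* Usage check: the mantissa DB 01 01 becomes 24 FE FF. */
void mant24_example(void)
{
    assert(mant24_complement(0xDB0101UL) == 0x24FEFFUL);
    /* applying the operation twice gives back the original mantissa */
    assert(mant24_complement(0x24FEFFUL) == 0xDB0101UL);
}
```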

Now we can add the two FP numbers:

4D 6D 08 00	0100 1101 0110 1101 0000 1000 0000 0000	+
CD 24 FE FF	1100 1101 0010 0100 1111 1110 1111 1111	=
CD 92 06 FF	1100 1101 1001 0010 0000 0110 1111 1111

The first bit of the mantissa has the value 1, which indicates that the sum has produced a negative mantissa in 2’s complement. We must therefore take the 2’s complement of the mantissa only (the last three bytes), obtaining:

CD 6D F9 01

1100 1101 0110 1101 1111 1001 0000 0001

-0.859161496162415 x 2 ¹² = -3519.1255

Finally we normalize the result:

CC DB F2 02

1100 1100 1101 1011 1111 0010 0000 0010

That is -1.71832299232483 x 2 ¹¹ equal to -3519.1255 that is the result of the sum between 3489.0 and -7008.1255.
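The whole procedure can be sketched in C on floats packed as 0xHHLLDDEE. This is an illustration under stated assumptions, not the Hi-Tech C library routine: the name `ht_add` is hypothetical, the signed mantissas are added with C's `long` arithmetic instead of an explicit 24-bit complement, zero is assumed to be encoded as all-zero bytes, shifted-out bits are truncated, and exponent overflow or underflow is not handled.

```c
#include <assert.h>

/* Add two Hi-Tech C floats packed as 0xHHLLDDEE (sign bit, 7-bit exponent
   with 2^0 stored as 65, 24-bit mantissa). Hypothetical sketch. */
unsigned long ht_add(unsigned long a, unsigned long b)
{
    int ea = (int)((a >> 24) & 0x7F), eb = (int)((b >> 24) & 0x7F);
    long ma = (long)(a & 0xFFFFFFUL), mb = (long)(b & 0xFFFFFFUL);
    unsigned long sign = 0;
    long sum;

    /* 1. align: raise the smaller exponent, shifting its mantissa right */
    while (ea < eb) { ma >>= 1; ea++; }
    while (eb < ea) { mb >>= 1; eb++; }

    /* 2. give each mantissa its sign (stands in for the 2's complement) */
    if (a & 0x80000000UL) ma = -ma;
    if (b & 0x80000000UL) mb = -mb;

    /* 3. add, then split the result into sign and magnitude */
    sum = ma + mb;
    if (sum < 0) { sign = 0x80000000UL; sum = -sum; }
    if (sum == 0) return 0;              /* assumption: zero is all-zero bytes */

    /* 4. normalize: bring the leading 1 back to bit 23 */
    while (sum > 0xFFFFFFL) { sum >>= 1; ea++; }
    while (!(sum & 0x800000L)) { sum <<= 1; ea--; }

    /* exponent overflow/underflow is not handled: the mask only keeps it
       from spilling into the sign bit */
    return sign | ((unsigned long)(ea & 0x7F) << 24) | (unsigned long)sum;
}

/* Usage check: 3489.0 + (-7008.1255) gives CC DB F2 02, as in the text. */
void ht_add_example(void)
{
    assert(ht_add(0x4CDA1000UL, 0xCDDB0101UL) == 0xCCDBF202UL);
}
```

Working on signed `long` values avoids the ambiguity of reading bit 23 as a sign when the mantissa already uses all 24 bits for the magnitude.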

2’s Complement

The representation of the sign in binary arithmetic is given by the complement method. Now let’s see the representation in 2’s complement with 8 bits.

1	1	0	0	1	0	0	0	Binary
128	64	32	16	8	4	2	1	Normal
-128	64	32	16	8	4	2	1	2’s Complement

With k bits available, the 2’s complement representation covers the values from -2 ^(k-1) to 2 ^(k-1) - 1 (from -128 to 127 with 8 bits). If we wanted to write -56 in 2’s complement with 8 bits available:

256 – 56 = 200 or 128 + 64 + 0 + 0 + 8 + 0 + 0 + 0 → 1100 1000

Interpreting 1100 1000 with the 2’s complement weights in the third row of the previous table:

-128 + 64 + 0 + 0 + 8 + 0 + 0 + 0 = -56

Summarizing, 1100 1000 is worth 200 decimal under the normal positional interpretation of powers of 2, while it is worth -56 decimal under the 2’s complement positional interpretation.

Now let’s see how to calculate the binary representation of an assigned negative number:

-74

-128 + X = -74

X = 128 – 74 = 54

54 = 0 + 0 + 32 + 16 + 0 + 4 + 2 + 0 = 0011 0110

To have the negative value we set bit 7 to 1, therefore

1011 0110 (which in decimal base is 182).

As a test, if we add 74 to 182 we have a result 256, so 74 and 182 are complementary in 8 bits.
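Both directions of the 8-bit conversion can be sketched in C as below. The function names `tc8_value` and `tc8_pattern` are hypothetical, and the sketch assumes 8-bit bytes.

```c
#include <assert.h>

/* Value of an 8-bit pattern read as 2's complement: bit 7 weighs -128. */
int tc8_value(unsigned char bits)
{
    return (bits & 0x7F) - ((bits & 0x80) ? 128 : 0);
}

/* 8-bit 2's complement pattern of n, for -128 <= n <= 127. */
unsigned char tc8_pattern(int n)
{
    assert(n >= -128 && n <= 127);
    return (unsigned char)(n < 0 ? 256 + n : n);
}

/* Usage check against the two examples in the text. */
void tc8_example(void)
{
    assert(tc8_value(0xC8) == -56);     /* 1100 1000                  */
    assert(tc8_pattern(-56) == 0xC8);   /* 256 - 56 = 200             */
    assert(tc8_pattern(-74) == 0xB6);   /* 1011 0110, 182 in decimal  */
}
```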

We now show an example of subtraction. We make the subtrahend negative with the 2’s complement method. Binary numbers are added by adding the corresponding bits of the addends and propagating the carry (1 + 1 = 0 with a carry of 1). Let’s take the following values to do the subtraction:

14 – 12 = 2

The value 14 expressed with 6 bits is 00 1110 while the value 12 is 00 1100.

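We invert the bits of the second value, which is the first step of the 2’s complement:

00 1100 → 11 0011

Let’s add 14 and the inverted value:

00 1110 +

11 0011 =

100 0001

Now we add 1, completing the 2’s complement, and discard the carry in the seventh bit: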

00 0010 → 2

Note that with an 8-bit representation we would only have added two zeros on the left of each value, and the carry would have ended up in the ninth bit; for the purposes of the calculation the result would have been the same.
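The same subtraction can be sketched in C for an arbitrary word width. The function name `sub_by_complement` is hypothetical; the width is limited to 15 bits so that the intermediate sum fits even in a 16-bit `unsigned int`.

```c
#include <assert.h>

/* a - b on `bits`-bit values, done as in the text: invert b, add it to a,
   add 1, then drop the carry that falls outside the word.
   Hypothetical helper for illustration. */
unsigned int sub_by_complement(unsigned int a, unsigned int b, int bits)
{
    unsigned int mask;
    assert(bits >= 1 && bits <= 15);
    mask = (1u << bits) - 1u;
    return (a + (~b & mask) + 1u) & mask;
}

/* Usage check: 14 - 12 = 2, with 6 bits and with 8 bits. */
void sub_example(void)
{
    assert(sub_by_complement(14, 12, 6) == 2);
    assert(sub_by_complement(14, 12, 8) == 2);
}
```

A negative result comes back as its 2’s complement pattern: 12 - 14 on 6 bits gives 11 1110, which reads as -2.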

Overflow

When adding two values gives a result outside the range that can be represented in two’s complement, the result overwrites the sign bit and there is an overflow: the number of bits is not sufficient to represent the number.

Basically to see if there is an overflow in the calculation we have to consider:

- sums of numbers with different signs are always correct, since the result cannot leave the representable range;
- sums of numbers with the same sign are correct if the result keeps the sign of the addends.

Now let’s see how to recognize whether an addition has generated an overflow:

(-9) + (–28) = -37

1	1	1	0	0	1	0	0	0	carry over
	1	1	1	1	0	1	1	1	-9
	1	1	1	0	0	1	0	0	-28
	1	1	0	1	1	0	1	1	-37

Correct result, from two negative numbers we have a negative number (similarly if we had two positive numbers and one positive result).

(-73) + (-73) = -146

1	0	1	1	0	1	1	1	0	carry over
	1	0	1	1	0	1	1	1	-73
	1	0	1	1	0	1	1	1	-73
	0	1	1	0	1	1	1	0	110

In this case, however, the sum of two negative numbers gave an incorrect positive result.
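The sign rule can be sketched in C for 8-bit values. The function name `tc8_add` is hypothetical and the code assumes 8-bit bytes: overflow is reported when both addends have the same sign bit and the result has the opposite one.

```c
#include <assert.h>

/* Add two 8-bit 2's complement patterns. *overflow is set to 1 when both
   addends have the same sign and the result has the other sign.
   Hypothetical helper for illustration. */
unsigned char tc8_add(unsigned char a, unsigned char b, int *overflow)
{
    unsigned char sum = (unsigned char)(a + b);   /* carry out of bit 7 is dropped */
    *overflow = ((a ^ sum) & (b ^ sum) & 0x80) != 0;
    return sum;
}

/* Usage check against the two examples in the text. */
void tc8_add_example(void)
{
    int ovf;
    assert(tc8_add(0xF7, 0xE4, &ovf) == 0xDB && ovf == 0);  /* -9 + -28 = -37       */
    assert(tc8_add(0xB7, 0xB7, &ovf) == 0x6E && ovf == 1);  /* -73 + -73 reads 110  */
}
```

The expression `(a ^ sum) & (b ^ sum) & 0x80` is nonzero only when the sign bit of the result differs from the sign bit of both addends, which is exactly the second case in the list above.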

Below there is a listing of a C program that shows what we have explained. This program accepts a floating point number as input (within the limits of the compiler representation) and displays the binary representation, the Hi-tech modified IEEE754P hexadecimal form and more.

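#include <stdio.h>
#include <libv.h>   /* author's library; assumed here to provide power() */

/* Print the i least significant bits of n, most significant first. */
void printBinary(unsigned int n, int i)
{
      int k;
      for (k = i - 1; k >= 0; k--)
      {
            if ((n >> k) & 1)
               printf("1");
            else
               printf("0");
      }
}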

typedef union
{
      float f;
      struct
      {
            unsigned int mantissa0:8 ;
            unsigned int mantissa1:8 ;
            unsigned int mantissa2:8 ;
            unsigned int exponent: 7;
            unsigned int sign: 1;
      } raw;
} myfloat;

void printIEEE(myfloat var)
{
      printf("%d | ", var.raw.sign);
      printBinary(var.raw.exponent, 7);
      printf(" |");
      printBinary(var.raw.mantissa2, 8);
      printf("|");
      printBinary(var.raw.mantissa1, 8);
      printf("|");
      printBinary(var.raw.mantissa0, 8);
      printf("|\n");
}

int main()
{
      int vett[24];
      int j, k, num, espo;
      float numero;
      float vero;
      myfloat var;
      printf("Hi-tech C modified IEEE P754 representation\n");
      printf("CP/M version January 2021 - Vito BLASI\n");
      printf("\n");

      printf("Real number :");
      scanf("%f", &var.f);
      printf("\n");

      printf("S |7 bit exp|  24 bit mantissa         |\n");
      printIEEE(var);

      printf("\n");

      printf("Hexadecimal form  : ");

      j = var.raw.sign*128+var.raw.exponent;
      printf("%02x ",j);
      printf("%02x ",var.raw.mantissa2);
      printf("%02x ",var.raw.mantissa1);
      printf("%02x ",var.raw.mantissa0);
      printf("\n");

      /* unpack the 24 mantissa bits into vett[]: vett[23] is the leftmost bit */
      num = var.raw.mantissa2;
      for (k = 7; k >= 0; k--)
      {
            if ((num >> k) & 1)
               vett[k+16]=1;
            else
               vett[k+16]=0;
      }

      num = var.raw.mantissa1;
      for (k = 7; k >= 0; k--)
      {
            if ((num >> k) & 1)
               vett[k+8]=1;
            else
               vett[k+8]=0;
      }

      num = var.raw.mantissa0;
      for (k = 7; k >= 0; k--)
      {
            if ((num >> k) & 1)
               vett[k]=1;
            else
               vett[k]=0;
      }

      /* sum the bits weighted 2^0, 2^-1, 2^-2, ... from left to right */
      numero = 0;
      j=0;
      for (k = 23; k >= 0; k--)
      {
            numero = numero + vett[k]*1.0/power(2,j);

            j++;
      }

      printf("Power-of-2 form   : ");

      if (var.raw.sign==1)
            printf("-");

      printf("%1.12f * ",numero);

      printf("2^%d \n",var.raw.exponent-65);

      espo = var.raw.exponent-65;
      vero = numero*power(2.0,(float)(espo));
      if (var.raw.sign==1)
            vero = -vero;     /* carry the sign over to the decimal form */
      printf("Power-of-10 form  : ");
      printf("%8.7e ",vero);

      return 0;
}

1 Normalization: the number is represented with a single nonzero digit before the point (in binary, a 1), multiplied by 2 raised to the appropriate power.

2 Since the right shift must run across three registers (L, D and E), we start with an SRL on the L register, which shifts its bits one place to the right, puts the LSB in the Carry and a zero in the MSB. Next, the D register is rotated one place to the right through the Carry (RR D): the Carry is copied into the MSB and the LSB goes into the Carry. Finally, the same operation is done on the E register. The result is a 24-bit right shift in which the carry from L enters the DE pair from the left and the least significant bit is lost.

3 2’s complement: invert each bit of the binary value and then add 1 to the LSB. For example: 1001 0111 becomes 0110 1000, to which we add 1 obtaining 0110 1001. That is, 151 decimal becomes 104, which with 1 added gives 105.