Topic: Project VTQ-13c - Part 3 - The MATH Libraries
3. The MATH Libraries
The 8bit microcontroller used in my calculators does not support any floating-point operations at all. The floating-point math provided by the controller manufacturer's library is restricted to 32bit IEEE numbers with only about 6 to 7 decimal digits of accuracy. Obviously I had to implement my own math routines to meet the desired range of 10^1000 and an accuracy of at least 12 decimal digits.
In the article "What number representation to choose for a homebrew calculator?" the advantages and drawbacks of the different formats have been discussed thoroughly, including the arguments for a binary-type representation. It turned out that it is hard to decide which binary-compatible base to choose. In fact, which base is optimal depends on the operation:
Base 256: Addition, Subtraction, Multiplication, Square-Root
Base 16: Logarithms, Exponentials, Trigonometry (all functions based on Meggitt's algorithms).
Base 2: Division
Since I started with basic arithmetic, I settled on the base-256 system as the calculator's main format. If I were to start again today, I would probably take the base-16 system instead, for the following reasons: (1) I could get rid of the format conversions needed to compute transcendentals. (2) The base-16 system is better balanced in terms of average precision: in the base-256 system, if there is a "1" to the left of the radix point, you actually have only 41 bits of precision instead of 48, due to the 7 leading zero bits. In a base-16 system, the degradation goes to 45 bits only in the worst case.
The VTQ-13’s MATH-libraries are divided into 2 portions:
MATH-0: Arithmetics & auxiliaries
MATH-1: Transcendentals & base-conversion & more auxiliaries
It would go far beyond the scope of this article to discuss all math routines in depth, so I will restrict myself to some special points of interest. But feel free to ask me if you want to know more about subjects not covered deeply in this article!
3.1 Auxiliaries: Base Operations
It may seem strange to start with the auxiliaries of all things, but I am doing this on purpose: the basic operations shown here give a hint of how a simple 8bit controller has to deal with 64bit registers.
Let's first compare an original classic HP CPU to a modern byte-oriented controller. Have you ever wondered why an HP-35 needs only 768 instructions for all its functionality? The answer can be found in its processor's architecture: an HP CPU is specially designed to perform all types of elementary operations on a whole decimal register, or on a part of it (shifts, adds, subtracts etc.), in one single microcode operation!
If you would like to get some experience with HP processors, play with the emulators provided by Jacques. Or even better: try to get Eric Smith's Java implementation of the HP emulator. By studying this piece of code, you will be able to reverse-engineer the whole instruction set of the HP classics in a few hours.
Now let's get back to the VTQ's auxiliaries: what do they do? Well, nothing but emulate a subset of the HP instruction set, for the binary world obviously. Unfortunately, a call to such a function is much more expensive than a single microcode instruction on an HP processor: you have to budget 12 to 30 bytes for each call, depending on the size of the parameter list. This is why 768 bytes will never be enough on a standard non-HP controller...
My set of auxiliaries contains the following operations (a C sketch of one of them follows the list):
shifting a register to the right or to the left by bits, nibbles or bytes.
comparing mantissae of two floating-point registers
getting & setting exponents and signs
swapping, copying and clearing registers
basic arithmetics on mantissae: increment, addition, subtraction
operations on BCD-registers (getting, setting & incrementing nibbles)
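Since these auxiliaries are implemented in assembler (see below), the following is only a C sketch of what the bit variant of the shift operation does; the name bitShiftLeftSketch, the typedef and the little-endian byte order are my assumptions for illustration:

#include <stdint.h>

typedef uint8_t uint08;

// Hypothetical sketch: shift an 8-byte register (least significant byte
// first) left by one bit, propagating the carry from byte to byte.
void bitShiftLeftSketch(uint08* r){
    uint08 i, carry = 0;
    for (i = 0; i < 8; ++i){
        uint08 next = (r[i] & 0x80) ? 1 : 0;   // bit falling out of this byte
        r[i] = (uint08)((r[i] << 1) | carry);  // shift in the carry from below
        carry = next;
    }
}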
The programming language C is considered to be quite low-level, but it proved not low-level enough to run the basic auxiliaries in an optimal way in terms of speed and memory footprint. This is why I decided to implement most of the auxiliaries in raw assembler. The result of my first tests shocked me: the footprint could be reduced by a factor of 2 to 3 compared to the early C versions, and the speed increased by up to a factor of 14!
Let’s have a detailed look at a case study (adding mantissae):
The original C code reads like this. We can see that we have to propagate the carry from one elementary addition to the next manually, since the elementary 8bit additions provided by C do not signal any overflow at all:
void addMantissa(uint08* a, uint08* b){
    uint08 i, carry = 0;
    for (i=0; i<=7; ++i){              // 8 bytes, least significant first
        *b += *a + carry;              // elementary 8bit addition
        carry = (!carry && *b < *a) || (carry && *b <= *a) ? 1 : 0;  // wrap-around means overflow
        ++a;
        ++b;
    }
}
In assembler we obviously get the carry as a carry flag that can be fed into the next operation directly. For my assembler implementation I even decided against constructing a real loop, because it is not worthwhile for only 8 iterations. My solution uses a fully inlined sequence:
void addMantissa(uint08* a, uint08* b){
_asm
MOVFF 0xFD9, 0xFF3 // save old address pointer
MOVFF 0xFDA, 0xFF4
MOVLW 0xFD // copy pointer to register a
MOVFF 0xFDB, 0xFE9
MOVLW 0xFE
MOVFF 0xFDB, 0xFEA
MOVLW 0xFB // copy pointer to register b
MOVFF 0xFDB, 0xFE6
MOVLW 0xFC
MOVFF 0xFDB, 0xFDA
MOVF 0xE5, 1, 0
MOVFF 0xFE7, 0xFD9
MOVLW 0x00 // clear carry
MOVWF 0xD8, 0
MOVF 0xEE, 0, 0 // addition chain
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVF 0xEE, 0, 0
ADDWFC 0xDE, 1, 0
MOVFF 0xFF3, 0xFD9 // restore address pointer
MOVFF 0xFF4, 0xFDA
_endasm
}
The assembler version occupies 82 bytes instead of 200 and runs in 6.4 µs instead of 61 µs.
3.2 Multiplication
As we all know, multiplying two floating-point numbers takes 3 steps:
1. add the exponents
2. multiply the mantissae
3. normalize the result, if necessary
We will take a look at step 2 only: HP calculators do it by a shift-and-add scheme, as described very well in Jacques's articles about arithmetic. I could have done it the same way, but I didn't. Why? My controller happens to provide a special feature: there is an unsigned 8x8 bit hardware multiplier inside, yielding a 16bit result, and it operates very fast, using just one single instruction cycle! It would have been regrettable not to exploit the advantage of a fast multiplier.
Unfortunately we again suffer from problems with the C compiler, which doesn't let us take full advantage of the hardware multiplier; we don't even know whether it uses the multiplier at all. If you multiply two 8bit numbers, you get an 8bit result (the upper byte is truncated). If you use 16bit numbers for the multiplication, you get a 16bit result, but you have to wait four times as long.
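For illustration, this is the usual C idiom for requesting the full product; whether a given 8bit compiler maps it onto the single-cycle hardware multiplier is exactly the uncertainty described above (a fully standards-conforming compiler widens 8bit operands anyway via the integer promotions, but compilers for this class of controllers often optimize the promotions away):

#include <stdint.h>

// Sketch: the casts make the full 16bit product explicit. Without a wide
// target type, the upper byte of the product is lost on assignment.
uint16_t mul8x8(uint8_t a, uint8_t b){
    return (uint16_t)a * (uint16_t)b;
}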
Once more we have to escape into the assembler world: I implemented a special 8x56 bit multiplier that processes one line of a shift-and-add scheme, i.e. a single byte is multiplied by a whole register's extended mantissa (56 bits instead of 48). In this assembler routine I took full advantage of the hardware multiplier: while the original code without hardware support ran in about 1250 µs, the final version takes only 216 µs!
Basically, the line multiplier performs
A x B6 B5 B4 B3 B2 B1 B0
where B is a 56bit number and A is one byte. If we simply multiplied A by each Bi in turn and added the 16bit results ABi:H ABi:L immediately, we would have to deal with unnecessarily many carry bits in each step. That's why my algorithm uses a different scheme:
Multiply A x B6 00 B4 00 B2 00 B0 to P
Multiply A x 00 B5 00 B3 00 B1 00 to Q
Then add P and Q.
This way there are no carries in the two partial multiplications (the 16bit partial results never overlap within P or within Q). In the third step, the carry bits have to be processed only once. The line multiplier completes in only 15 µs, yielding a quite fast floating-point multiplication (0.28 to 0.40 ms, depending on the normalization).
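Since the real line multiplier lives in assembler, the following is only a C sketch of the interleaved scheme under my assumptions (little-endian byte order; lineMultiplySketch is an illustrative name, not the routine called below):

#include <stdint.h>

typedef uint8_t uint08;

// Sketch: multiply one byte a by a 56bit mantissa b[0..6] into an 8-byte
// product p[0..7]. The even-indexed and odd-indexed partial products each
// land in disjoint byte pairs, so neither accumulator needs carry handling;
// all carries are resolved once, in the final addition.
void lineMultiplySketch(const uint08* b, uint08* p, uint08 a){
    uint08 even[8] = {0}, odd[8] = {0};
    uint16_t t, s, carry = 0;
    int i;
    for (i = 0; i < 7; i += 2){            // P: a*B0, a*B2, a*B4, a*B6
        t = (uint16_t)a * b[i];
        even[i]   = (uint08)t;
        even[i+1] = (uint08)(t >> 8);
    }
    for (i = 1; i < 7; i += 2){            // Q: a*B1, a*B3, a*B5
        t = (uint16_t)a * b[i];
        odd[i]   = (uint08)t;
        odd[i+1] = (uint08)(t >> 8);
    }
    for (i = 0; i < 8; ++i){               // single carry-propagating add P+Q
        s = (uint16_t)even[i] + odd[i] + carry;
        p[i] = (uint08)s;
        carry = s >> 8;
    }
}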
Implementation of the final mantissa multiplier:
void mulMantissa(uint08* a, uint08* b){
    sint08 pa;
    uint08 line[8];                  // one 8x56 bit partial product ("line")
    uint08 akku[8];                  // accumulator for the running sum
    clearRegister(akku);
    shiftLeft(b, 0);                 // extend the mantissa to 56 bits
    for (pa=5; pa>=0; --pa){         // one line per byte of a
        lineMultiply(b, line, a[pa]);
        addMantissa(line, akku);     // accumulate the partial product
        fastShiftRight(b);           // weight the next line one byte lower
    }
    fastShiftRight(akku);            // final byte alignment
    copy(akku, b);                   // deliver the product in b
}
3.3 Division
Again, division is done the classic way as described by Jacques: using the subtract-and-shift method (pseudo-division) on the mantissae, and subtracting the exponents.
Now we have to think about number systems once more. If we divide in base 256, the registers have 6 digits each; since the average digit value is 128, an average case takes 128 x 6 = 768 subtractions. If we used base 16, we would have 12 digits and 8 subtractions per digit: 96 subtractions in the average case. And in base 2? 48 digits with at most one subtraction each.
As we know, it is trivial to reinterpret a base-256 number in base 2. So why not take advantage of the better performance?
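As a self-contained illustration (not the calculator's routine), here is the classic restoring subtract-and-shift scheme on plain 32bit integers; it needs at most one comparison and one subtraction per result bit, exactly as the base-2 count above promises:

#include <stdint.h>

// Hypothetical demo of binary pseudo-division: one quotient bit per step.
// Requires d != 0.
uint32_t longDivide(uint32_t n, uint32_t d, uint32_t* rem){
    uint32_t q = 0;
    uint64_t r = 0;                        // running remainder
    int i;
    for (i = 31; i >= 0; --i){
        r = (r << 1) | ((n >> i) & 1);     // bring down the next dividend bit
        q <<= 1;
        if (r >= d){ r -= d; q |= 1; }     // at most one subtraction per bit
    }
    if (rem) *rem = (uint32_t)r;
    return q;
}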
Description:
First we adjust a so that it is at least as big as b. If a were too small compared to b, we would have to perform 255 subtractions before the first shift in the worst case. So we shift a to the left to prevent this case: we don't want more than one subtraction per shift.
Before performing the pseudo-division, we shift both numbers one byte to the left to increase accuracy. In effect we perform a 56bit division.
void divMantissa(uint08* a, uint08* b){
    uint08 result[8];
    uint08 offset = 0;
    int i;
    clearRegister(result);
    if (compareMantissa(a,b) < 0){    // adjust a to prevent spin-out
        shiftLeft(a, 0);
        ++offset;
    }
    shiftLeft(a, 0);                  // shift to get more space for accuracy
    shiftLeft(b, 0);
    for (i=0; i<56; ++i){             // subtract & shift (pseudo division)
        bitShiftLeft(result);
        if (compareMantissa(a,b) <= 0){
            *result |= 0x01;
            subMantissa2(b, a);
        }
        bitShiftRight(a);
    }
    bitShiftLeft(result);             // re-adjust result
    shiftRight(result, 1-offset);
    copy(result, b);
}
3.4 Logarithms
Thanks to Jacques, the algorithms for logarithms and exponentials are described thoroughly and very well, so it was feasible to port an implementation to the VTQ-13 in a reasonable amount of time. The main difference between HP calculators and my own device is the choice of number format: HPs compute the natural logarithm, but since I use a binary-type format, a logarithm to base 2, 16 or 256 is preferable. For the reasons explained in my article "What number representation to choose for a homebrew calculator?", base 16 yields the best balance between performance and the memory footprint of the constants. This is why my calculator fundamentally computes log16 and 16^x. Other logarithms can then easily be obtained by multiplying by a suitable constant, and exponentials of all kinds (b^e) are computed as 16^(e · log16(b)).
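For illustration (in double, just to make the constants visible; the calculator of course works in its own 48bit format), deriving other logarithms from log16 costs one multiplication each:

// Sketch: log_b(x) = log16(x) * log_b(16), one precomputed constant per base.
double log2_from_log16(double l16)  { return l16 * 4.0;                } // log2(16) = 4
double ln_from_log16(double l16)    { return l16 * 2.772588722239781;  } // ln(16)
double log10_from_log16(double l16) { return l16 * 1.2041199826559248; } // log10(16)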
For an implementation, "only" two things have to be done:
Understand Jacques’s explanations about HP-35’s algorithms.
Migrate the concept to base 16.
Since the VTQ-13 stores numbers in base 256, they have to be converted to base 16 prior to the calculation (a C sketch follows these two steps):
1. Double the exponent (because 256^e = 16^2e).
2. If mantissa >= 16: increment exponent by one and shift mantissa to the right by one nibble.
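A minimal C sketch of this adjustment, assuming a little-endian 6-byte mantissa with its integer digit in the top byte, a 16bit signed exponent, and an auxiliary nibbleShiftRight as listed in section 3.1 (all names and signatures here are my assumptions):

#include <stdint.h>

typedef uint8_t uint08;
typedef int16_t sint16;

void nibbleShiftRight(uint08* mant);    // assumed auxiliary (see section 3.1)

// Hypothetical sketch of the base-256 -> base-16 adjustment described above.
void toBase16(uint08* mant, sint16* exp){
    *exp *= 2;                          // 256^e == 16^(2e)
    if (mant[5] >= 0x10){               // integer part of the mantissa >= 16?
        nibbleShiftRight(mant);         // one hex digit to the right
        *exp += 1;                      // compensate the shift in the exponent
    }
}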
The original algorithm is based on the formula
-c[n+1] · 10^i = -c[n] · 10^i · (1+10^-i) - 10
My adapted version for base 16 uses a slightly modified formula:
-c[n+1] · 16^i = -c[n] · 16^i · (1+16^-i) - 16
Obviously, I had to use a different set of constants, which are stored with 56 bits of precision and without exponents:
const static uint08 lgA[14][8] = {
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, // A0
0x58, 0x4C, 0x24, 0xEB, 0xDB, 0x8F, 0x59, 0x00, // A1
0xE4, 0xB1, 0xEA, 0xB5, 0x11, 0x27, 0x5C, 0x00, // A2
0x24, 0xFF, 0x86, 0x0A, 0x3B, 0x52, 0x5C, 0x00, // A3
0x88, 0x08, 0x3E, 0x6A, 0xEF, 0x54, 0x5C, 0x00, // A4
0x3C, 0x3E, 0x05, 0xB2, 0x1A, 0x55, 0x5C, 0x00, // A5
0x40, 0x7D, 0x83, 0x66, 0x1D, 0x55, 0x5C, 0x00, // A6
0xF0, 0x62, 0xCB, 0x91, 0x1D, 0x55, 0x5C, 0x00, // A7
0x30, 0xE1, 0x7F, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A8
0x30, 0x29, 0xAB, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A9
0xB0, 0xDD, 0xAD, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A10
0x30, 0x09, 0xAE, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A11
0xB0, 0x0B, 0xAE, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A12
0xB0, 0x0B, 0xAE, 0x94, 0x1D, 0x55, 0x5C, 0x00, // A13
};
The final algorithm consists of four steps:
1. Adjust to base 16 and filter out error conditions.
2. Perform the Meggitt analysis (pseudo-multiplications).
3. Perform the Meggitt synthesis: synthesize the logarithm's mantissa.
4. Finalization (insert the exponent & normalize).
The last step is also the weakest point of this algorithm: although you have a highly precise mantissa after step 3 (about 48 bits of precision), the normalization destroys some of the least significant bits, because the former exponent of the argument has to be placed as the integer part of the final result. This is not a problem specific to my version of the algorithm, you will find this effect in all HP devices, but base 256 is not optimal in this respect (as described at the beginning of section 3).
It might be worthwhile to provide an extra version of the log16() function that returns just the full-precision mantissa. It could be used to avoid this loss of accuracy, e.g. when computing exponentials.
3.5 Trigonometry
Tangent
As in the original HPs, my MATH library provides two basic functions: tangent and arctangent. Based on the excellent articles of Jacques, I migrated the original CORDIC algorithms to base 16, for the same reasons as given in section 3.4. All trigonometric calculations are done with arguments (angles) in radians; conversions from or to other representations of the argument are performed before or after the trigonometric calculation.
To compute tan(), my algorithm basically takes 4 steps (a sketch of step 1 follows this list):
1. Reduce the argument to the range 0 .. 2pi.
2. Meggitt analysis.
3. Meggitt synthesis (CORDIC rotations).
4. Compute y/x of the rotated vector, process the sign and normalize.
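A sketch of the range reduction, in double just for illustration (the calculator performs it in its own 48bit format):

#include <math.h>

// Hypothetical sketch of step 1: fold the angle into [0, 2*pi).
double reduceAngle(double x){
    const double TWO_PI = 6.283185307179586;
    double r = fmod(x, TWO_PI);
    return (r < 0.0) ? r + TWO_PI : r;
}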
There is one weak point in this algorithm and its inverse:
In tan() as well as in arctan(), the CORDIC rotations are not carried out on true floating-point numbers, but on high-precision fixed-point numbers. If tan() is used with arguments near multiples of pi/2, one of the two components of the rotated vector approaches zero, which costs significant digits, since there is no exponent to compensate: arguments near 0 degrees lead to a poor y-component, while arguments near 90 degrees erode the precision of the x-component.
This is also true for arctan(), because it uses the same algorithm in the opposite direction. I checked the quality of the results for increasing arguments of arctan(), thus simulating the region near 90 degrees:
arctan(2) : 48 bits of precision (14.5 decimals), 63.43 degrees
arctan(25) : 42 bits of precision (12.5 decimals), 87.71 degrees
arctan(200) : 34 bits of precision (10.2 decimals), 89.71 degrees
arctan(7500) : 31 bits of precision (9.3 decimals), 89.992 degrees
Bigger arguments stay in the precision range of 9 to 10 digits (not listed here).
Sine & Cosine
Sine and cosine are obtained by transforming the respective tangent with a set of formulas. Due to rounding errors, and due to the fact that the tan() function has poles at 90 and 270 degrees, sign errors can also be observed in sin() and cos(). The set of formulas for obtaining sin() and cos() as given by Jacques Laporte in his article about trigonometry proved to be sensitive to this type of error.
In my implementation I therefore used a different set of formulas:
t := tan(x/2),
sin(x) = 2t / (1 + t^2),
cos(x) = (1 - t^2) / (1 + t^2)
Using this set, I could reduce the number of critical cases significantly. Furthermore, it is no longer necessary to compute square roots and reciprocals.
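The half-angle set, in double just for illustration:

#include <math.h>

// Sketch: one tangent evaluation yields both sine and cosine; no square
// root and no reciprocal is needed.
void sinCosFromTan(double x, double* s, double* c){
    double t = tan(x / 2.0);
    double d = 1.0 + t * t;
    *s = 2.0 * t / d;
    *c = (1.0 - t * t) / d;
}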
Arcsin & Arccos
Jacques Laporte shows in his article two formulas which give arcsine and arccosine when the arctangent is known, but I could not get them to work properly, so I side-stepped the problem by using an alternative set:
arcsin(x) = sgn(x) · arctan(sqrt(x^2 / (1 - x^2)))
arccos(x) = pi/2 - arcsin(x)
Now it's easy to get the arcus functions. There is just one case, |x| = 1, that has to be treated separately.
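A sketch of the alternative set, again in double for illustration; the special case |x| = 1 is caught up front because the fraction's denominator vanishes there:

#include <math.h>

double arcsine(double x){                  // defined for -1 <= x <= 1
    const double PI_2 = 1.5707963267948966;
    if (x >= 1.0)  return  PI_2;           // arcsin(1)  =  pi/2
    if (x <= -1.0) return -PI_2;           // arcsin(-1) = -pi/2
    return (x < 0.0 ? -1.0 : 1.0) * atan(sqrt(x * x / (1.0 - x * x)));
}

double arccosine(double x){
    return 1.5707963267948966 - arcsine(x);
}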
3.6 Conclusions
The whole MATH library is implemented using binary-type numbers, which allows the routines to profit from an extended precision of 48 bits; had I used decimal numbers, the theoretical maximum would have been 40 bits. Since the controller is a binary machine (as opposed to the HP processors), I was furthermore able to reach maximum processing speed by incorporating the controller's hardware multiplier (even logarithms and trigonometry profit from this feature).
Nevertheless, some unusual effects may be observed when using a VTQ device:
(1)
If, for example, 11 - 6 - 5 is computed, one would expect zero as the result. In fact, my calculator shows 3.7942 · 10^-12 in this situation. Why?
Well, the major disadvantage of using binary numbers is that they have to be converted from decimal representation on input and back to decimal representation for display. Rounding errors are thus unavoidable.
(2)
Numbers like 1.3 can be represented exactly in a decimal format, but not as a binary number. In fact, our example 1.3 results in a periodic mantissa, reading 1.4CCCCC... in hexadecimal representation; it can never be fully accurate.
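The repeating digits are easy to reproduce: multiply the fractional part by 16 and take the integer part as the next hex digit. A small demo, in double for illustration only:

#include <stdio.h>

// Prints the first hexadecimal digits of 1.3: 1.4CCCCCCC...
int main(void){
    double f = 0.3;
    int i, d;
    printf("1.");
    for (i = 0; i < 8; ++i){
        f *= 16.0;                 // next digit moves in front of the point
        d = (int)f;
        printf("%X", d);
        f -= d;                    // keep only the fractional part
    }
    printf("...\n");
    return 0;
}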
This does not necessarily mean that the VTQ calculator has bad libraries. Let us have a look at another computation:
10 / 3 * 3 - 10.
A classic HP (e.g. HP-27, HP-97) shows: -1.0 · 10^-9
A modern HP (e.g. HP-42S) shows: -1.0 · 10^-11
The VTQ-13c shows: -1.07 · 10^-12
The result will never be zero on any of these calculators, since 10 / 3 cannot be represented exactly with floating-point numbers. The VTQ-13c yields by far the best result. See?
There is still work to be done, as mentioned in section 3.4: improve y^x by using the mantissa's full precision after computing a logarithm. There is also potential in the field of trigonometry: the drain of precision could probably be reduced by saving the bits shifted out during the CORDIC rotations; they could then be re-introduced before dividing the vector's components. - Work for later days.