Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results



#21

48GX/hp48xgcc results:

RESULT: 876
TIME: 3.371399 SEC

#include <hp48/object.h>
#include <hp48/core.h>
#include <math.h>

int main()
{
    int x, y, r, t, n, a[9];
    hp_object *o;
    double s;

    for (n = 10; n > 0; --n) {
        r = 8;
        s = 0;
        x = 0;
        do {
            a[++x] = r;
            do {
                ++s;
                y = x;
                while (y > 1)
                    if (!(t = a[x] - a[--y]) || x - y == abs(t)) {
                        y = 0;
                        while (!--a[x])
                            --x;
                    }
            } while (y != 1);
        } while (x != r);
    }

    o = sys_malloc (5 + 2 * sizeof (double));
    if (!o)
        exit (1);

    o->prolog = 0x2933;
    o->_hide.real = s;

    sys_exit (o);
}

50g/HPGCC3 beta 192 MHz results:

RESULT:  876
TIME: 0.000331 SEC

#include <hpgcc49.h>

int main()
{
    int x, y, r, s, t, n, a[9];
    cpu_setspeed(192 * 1000000);

    for (n = 100000; n > 0; --n) {
        r = 8;
        s = 0;
        x = 0;
        do {
            a[++x] = r;
            do {
                ++s;
                y = x;
                while (y > 1)
                    if (!(t = a[x] - a[--y]) || x - y == abs(t)) {
                        y = 0;
                        while (!--a[x])
                            --x;
                    }
            } while (y != 1);
        } while (x != r);
    }
    sat3_push_dbl_real(s);
    return (0);
}


Edited: 5 Feb 2008, 2:30 a.m.


#22

Looks like comparing apples with oranges, especially given the CPU clock difference;-)

Would be nice to know how much overhead the cross-compiled C stuff actually produced on the HP-48.


#23

I was not trying to state anything or draw any conclusions; I was just supplying data for Xerxes' list.

That said, if you want apples to apples, the GX result above is faster than any other GX on Xerxes' list.

How would you like me to measure overhead? What type of overhead?


#24

>How would you like me to measure overhead? What type of overhead?


The type of overhead can be derived from the type of object produced by the cross-compiler,

and the dimension of the overhead can be (roughly) seen after decompiling the code.

Does the cross-compiler produce pure machine code, SysRPL, UserRPL code, or a mixture?

How big is the object? Does it call XGCC library functions? And so on...

Could you send me the binary of the HP-48 object ?


#25

Quote:
The type of overhead can be derived from the type of object produced by the cross-compiler, and the dimension of the overhead can be (roughly) seen after decompiling the code.

I'll leave the decompiling of the code to you :-)

Quote:
Does the cross-compiler produce pure machine code, SysRPL, UserRPL code, or a mixture?

AFAIK, machine code only. The binaries require the use of shared libraries.

Quote:
How big is the object? Does it call XGCC library functions? And so on...

The following objects are required to run this benchmark:
object       size (bytes)
----------   ------------
nqueens      1367
GCCLDD        292
libcore.sl    103
libgcc.sl     741

GCCLDD, libcore.sl, libgcc.sl, and other .sls can be used by other C programs, minimizing RAM usage.

Quote:
Could you send me the binary of the HP-48 object ?

http://sense.net/~egan/hp48xgcc/xgcc.hp

This object is a directory with everything you need. There are multiple nqueens binaries: NQ1 (one iteration), NQ10 (ten iterations), and NQS (solution).

Use at your own risk :-), you can test with EMU48 first.

#26

Hello Egan,

Thank you for these results.

Please allow me to ask you for the result of the 50G at 75 MHz, so that the speed-up factor relative to 192 MHz is available for completeness. Maybe the result is the same as the one already tested with HPGCC2 and unstructured code, but I'm not sure about it.


#27

Completeness.

HPGCC3/50g

MHz   Iterations   Time(s)/iteration
---   ----------   -----------------
  6         3125   0.01081921875
 12         6250   0.00543117188
 48        25000   0.00132467285
 75        39062   0.00086321105
120        62500   0.00053928320
152        79166   0.00042776882
192       100000   0.00033105103

Edited: 6 Feb 2008, 1:09 p.m.


#28

;-)

Interesting is the difference of the effective speed up factors:

x1.3 for UserRPL @ 203 MHz

x2.6 for HPGCC @ 192 MHz


The hp48xgcc seems to be not very efficient for a native compiler considering the result of calculators of the same category.


#29

Quote:
Interesting is the difference of the effective speed up factors:
x1.3 for UserRPL @ 203 MHz
x2.6 for HPGCC @ 192 MHz

It may take more than increasing the clock rate to increase the speed of UserRPL under Saturn emulation. Memory speed may be a factor too. The HPGCC version is very small, perhaps it fits in cache. All speculation.

Quote:
The hp48xgcc seems to be not very efficient for a native compiler considering the result of calculators of the same category.

What other calculators are you comparing to?

From the comparison below I'd say hp48xgcc was very efficient:

4:02   HP-48GX   UserRPL / Ver.P
1:30   HP-50G    UserRPL
1:07   HP-50G    UserRPL / Fast Mode x1.3 (75->203 MHz)
35.2   HP-48GX   SysRPL / Ver.R
3.37   HP-48GX   C / Structured / HP48XGCC / Cross Compiler

#30

Quote:
What other calculators are you comparing to?

4.28     PC-G850V    (Z80 @ 8.0 MHz)        C / Unstructured / Bytecode
3.37     HP-48GX     (Saturn @ ~4 MHz)      C / Structured / HP48XGCC / Cross Compiler
2.92     Series 3a   (V30 @ 7.68 MHz)       OPL / Bytecode
1.27     PB-2000C    (HD61700 @ 0.91 MHz)   Pascal / DL-Pascal-ROM-Card 1.2 / Compiler
0.136    HP-200LX    (80186 @ 7.9 MHz)      Basic / DEFINT / QuickBasic 4.5 / Compiler
0.0886   HP-200LX    (80186 @ 7.9 MHz)      C / Unstructured / Turbo C 2.01 / Compiler


I can't assess the speed of the Saturn CPU for assembly programs. Probably the Saturn CPU is not very effective for integer-only problems. I will have to study the instruction set more closely to find out, e.g., whether the registers alone can be used to store the board indices.


#31

Quote:
I can't assess the speed of the Saturn CPU for assembly programs.

The only way to rate the efficiency of hp48xgcc is to write a Saturn assembly version of the benchmark. I think I know someone that may be able to do that.

#32

Hi,

I don't know if you meant me...

However, I just wrote a real native Saturn assembly version of that benchmark.

I simply had to know how much could still be gained;-)

Conclusion: The xgcc version is not bad, but _way_ off regarding speed!

I haven't disassembled the xgcc output you sent to me yet (thanks for that:),

but it seems that either the lib calls or the chosen data structures,

or a mixture of both produces the relatively huge overhead in run time.


My real native version runs in about 0.9699108 seconds on Emu48 in 'Authentic' speed,

and in about 0.803724365234 seconds on my real HP-48GX revR !

In each case the figure is the average over 100 runs of the program.

So it seems my solution runs circles around the xgcc generic assembly code:-)

Should I post the listing here?

Raymond


#33

Yes!

#34

Ok, here's the beef!

The code is a translation of the generic BASIC listing given by 'Xerxes'

into pure Saturn machine language, and thus the algo is the same.

Xerxes' bench listings

There may be some places for slight improvements,

especially the ASLC and CSLC parts, but I think it's not bad so far...

...and it's another example for the speed and efficiency of the real HP-48;-)

The (updated) program listed below returns the solved board in stack level 3,

the execution time in seconds in stack level 2,

and the count of evaluated nodes in stack level 1.

The board matrix is kept in CPU register C[9:1] . Nib C[0] is used as scratch.

One of the goals was to reduce the total count of CPU cycles,

and that's why the index pointers X and Y (in B[0] and D[0])

were accessed using P (D=D-1 P) instead of using (D=D-1 A),

where the latter has a shorter opcode but needs more CPU cycles.

Have fun!

Raymond


::  CK0
CLKTICKS

* A B C D Dn Rn P Cyc
CODE
GOSBVL =SAVPTR

L10 LAHEX 888888888
R0=A 0:R
C=0 W
D1=C 1:S

B=0 A X=0
* D=0 A Y=0

L40 P= 0 2
A=R0 R

?A#B P
GOYES L50
GOTO L180

L50 B=B+1 P R X' AAAAAAAAx * X=X+1

L60 P= 0 0 2 * A(X)=R
C=B P AAAAAAAAX
P=C 0 X
C=A P AAAAAAAAX * Pushed R to AAAAAAAA at pos X

L70 CD1EX S * S=S+1
C=C+1 A
CD1EX

L80 P= 0 0 2 * Y=X
C=B P X
D=C P Y=X

L90 D=D-1 P Y' * Y=Y-1

L100 ?D=0 P
GOYES L40

L110 P= 0 2 * T=A(X)-A(Y)
C=D P AAAAAAAAY
P=C 0 Y

A=C P 'A(Y)'

L110asl ASLC
P=P+1 3
GONC L110asl * On exit: P=0

* Here: P=0 A[0]='A(Y)' 2

C=A P * Backup of 'A(Y)'
A=C W * Full backup of AAAAAAAA incl A(Y)

C=B P AAAAAAAAX
P=C 0 X 6

L110csl CSLC
P=P+1 3
GONC L110csl * On exit: P=0

ACEX W A[0]='A(X)' C[0]='A(Y)'
?A>=C P
GOYES NoSwp

ACEX P

NoSwp A=A-C P ABS(T)

L120 ?A=0 P * IF T=0 THEN 140
GOYES L140

L130 C=B P X * IF X-Y<>ABS T THEN 90
C=C-D P X-Y

?A#C P
GOYES L90

L140 P= 0 2
C=B P X * A(X)=A(X)-1
P=C 0 X 6
C=C-1 P

L150 ?C#0 P * IF A(X)<>0 THEN 70
GOYES L70

L160 P= 0 * X=X-1
B=B-1 P

L170 ?B#0 P * IF X<>0 THEN 140
GOYES L140

L180

* AD1EX * PRINT S
* P= 0 2
* GOVLNG =PUSH#ALOOP


CSR W * Shift right one nib
AD1EX
P= 7
ACEX WP

RSTK=C * Save count

GOSBVL =PUSHhxs
GOSBVL =SAVPTR

C=RSTK
A=C A
GOVLNG =PUSH#ALOOP

ENDCODE

CLKTICKS ( *Ticks1 #Board #LCnt Ticks2* )
* ROT
4ROLL

bit- #>% # 2000 UNCOERCE %/
SWAP UNCOERCE

;


Edited: 7 Feb 2008, 6:38 p.m.


#35

Thank you, Raymond, for this interesting implementation; it is an enrichment for the list.

With your permission I have inserted your listing without the comments, like the other assembly examples. The BASIC listing was used as the pattern for all assembly versions.

Now the execution speed of HP48XGCC is not surprising any more. It seems that my suspicion is confirmed: integer-only problems are not the strong point of the Saturn CPU. On the other hand, the speed of the 71B BASIC interpreter shows the advantage of its instruction set.

The information about the clock speed of the 48GX is not really clear. I have found 4.0 MHz, ~4 MHz and 3.7-4.0 MHz. But which is correct?


#36

I'm glad I could contribute :-)

About the Saturn CPU: I'm not sure whether the Saturn is weak regarding integer handling,

but the developers' main goal seems to have been good BCD handling.

<OT>
I think the combination of Saturn CPU and HP-71B OS was _very_ efficient. Hats off for the developers!

</OT>


About the HP-48G series clock speed: AFAIK the last of your options (3.7-4.0 MHz) comes nearest to reality.

#37

BTW did I mention that I just slightly improved the code,

it now needs only about 0.3363 (ZERO POINT THREE) seconds!

I replaced the biggest time wasters by some more efficient code.

Here's the story:

As mentioned in my earlier post, the ASLC and CSLC loops were the ones which could use some refinement.


** The two loops L110asl and L110csl from the original listing take the biggest amount of time.
** In the best case, when P is 9, the loop is run 16-9 times, -> 5 times,
** which sums up to 5*21 + 5*3 + 4*10 (NC case) + 1*3 (C case) = 105 + 15 + 40 + 3 = 163 cycles
** In the worst case, when P =1, the loop is run 16-1 times, 15 times,
** which sums up to 15*21 + 15*3 + 14*10 (NC case) + 1*3 (C case) = 315 + 45 + 140 + 3 = 503 cycles !!!
**
** The newer method (using ASR W), shown below,
** runs 1 time in the best case, and 8 times in the worst case
** summing up to 2 + 3+d + 6 + 1*3+d + 1*2 + 3 = 2 + 4 + 6 + 19 + 2 + 3 = 36
** or worst 2 + 3+d + 6 + 8*3+d + 8*2 + 7*10 + 3 = 250 cycles !

P=C 0 Y 6
A=C P 'A(Y)' 3+d
P= 0 2
C=-C P 3+d
P=C 0 6
*
L110asr ASR W 3+d
P=P+1 2
GONC L110asr 10/3
**********
C=A P 3+d * Backup of 'A(Y)'
A=C W 3+d * Full backup of AAAAAAAA incl A(Y)
C=B P AAAAAAAAX 3+d

C=-C P 3+d
P=C 0 6

L110csr CSR W 3+d
P=P+1 3
GONC L110csr 10/3

ACEX W A[0]='A(X)' C[0]='A(Y)' 3+d

So the second variant was remarkably faster (runtime was about 0.5 seconds!) than the easy and elegant, but slower initial version.

But the worst case (250 cycles) still looked not too good, so I tried some other methods, including self-modifying code in temporary memory.

Not the kind of self-modifying code that actually modifies itself, but a default code slice in a RAM buffer, outside of the main program, which had one parameter patched on demand and was then called from the main program.

Not bad either, but the management overhead was slightly too large.

With this method the run time was between 0.37 seconds and 0.38 seconds. Nothing more to gain.

So I finally used the discrete dispatcher version, which I actually had before the 'self-modifying' one.

This version is the fastest one up to now. It has a run time of 0.33630 seconds :-)

The current listing is shown below.

Have fun:-)

Raymond


::  CK0
CLKTICKS

* A B C D Dn Rn P Cyc
CODE
GOSBVL =SAVPTR

L10 LAHEX 888888888
R0=A 0:R 19
C=0 W 3+d
D1=C 1:S 8
B=0 A X=0 7

L40 P= 0 2
A=R0 R 19
?A=B P 13+d/6+d
GOYES L180

L50 B=B+1 P R X' AAAAAAAAx 3+d * X=X+1

L60 P= 0 0 2 * A(X)=R
C=B P AAAAAAAAX 3+d
P=C 0 X 6
C=A P AAAAAAAAX 3+d * Pushed R to AAAAAAAA at pos X

L70 CD1EX S 8 * S=S+1
C=C+1 A 7
CD1EX 8

L80 P= 0 0 2 * Y=X
C=B P X 3+d
D=C P Y=X 3+d

L90 D=D-1 P Y' 3+d * Y=Y-1

L100 ?D=0 P 13+d/6+d
GOYES L40

L110 P= 0 2 * T=A(X)-A(Y)
C=D P AAAAAAAAY 3+d
GOSUB Ptst
P= 0
A=C P
C=B P
GOSUB Ptst
P= 0
?A>=C P 13+d/6+d
GOYES NoSwp

ACEX P 3+d

NoSwp A=A-C P ABS(T) 3+d

L120 ?A=0 P 13+d/6+d * IF T=0 THEN 140
GOYES L140

L130 C=B P X 3+d * IF X-Y<>ABS T THEN 90
C=C-D P X-Y 3+d
?A#C P 13+d/6+d
GOYES L90

L140 P= 0 2
C=B P X 3+d * A(X)=A(X)-1
P=C 0 X 6
C=C-1 P 3+d

L150 ?C#0 P 13+d/6+d * IF A(X)<>0 THEN 70
GOYES L70

L160 P= 0 2 * X=X-1
B=B-1 P 3+d

L170 ?B#0 P 13+d/6+d * IF X<>0 THEN 140
GOYES L140

L180 CSR W * Shift right one nib 3+d
AD1EX 8
P= 7 2
ACEX WP 3+d

RSTK=C * Save count 8

GOSBVL =PUSHhxs
GOSBVL =SAVPTR

C=RSTK 8
A=C A 7
GOVLNG =PUSH#ALOOP

*********
Ptst P=C 0 6
?P# 1 13/6
GOYES tP2

CPEX 1
C=P 0
CPEX 1
RTNCC

tP2 ?P# 2 13/6
GOYES tP3

CPEX 2
C=P 0
CPEX 2
RTNCC

tP3 ?P# 3 13/6
GOYES tP4

CPEX 3
C=P 0
CPEX 3
RTNCC

tP4 ?P# 4 13/6
GOYES tP5

CPEX 4
C=P 0
CPEX 4
RTNCC

tP5 ?P# 5 13/6
GOYES tP6

CPEX 5
C=P 0
CPEX 5
RTNCC

tP6 ?P# 6 13/6
GOYES tP7

CPEX 6
C=P 0
CPEX 6
RTNCC

tP7 ?P# 7 13/6
GOYES tP8

CPEX 7
C=P 0
CPEX 7
RTNCC

tP8 CPEX 8
C=P 0
CPEX 8
RTNCC
ENDCODE

CLKTICKS ( *Ticks1 #Board #LCnt Ticks2* )
4ROLL
bit- #>% # 2000 UNCOERCE %/
SWAP UNCOERCE
;




Edited: 8 Feb 2008, 7:06 p.m.


#38

I can't believe it. Your latest creation is about 10(!) times faster than hp48xgcc, which surely doesn't use a register to hold the array.
Your program shows clearly the advantage of efficient hand-coded assembly. Thanks for the digression to Saturn. ;-)

#39

Quote:
I don't know if you meant me...

However, I just wrote a real native Saturn assembly version of that benchmark.


Thanks. I was hoping you'd do it.
Quote:
Conclusion: The xgcc version is not bad, but _way_ off regarding speed!

Ah, well, king for a day...
Quote:
My real native version runs in about 0.9699108 seconds on Emu48 in 'Authentic' speed, and in about 0.803724365234 seconds on my real HP-48GX revR !

Your stellar results are no surprise to me. I was expecting native assembly to be at least 2x faster. More so since hp48xgcc is half-baked.
Quote:
Should I post the listing here?

Yes. Can you have your version return the solution as well?

Edited: 7 Feb 2008, 5:50 p.m.


#40

Hi again,

my current version runs in about 0.3363 seconds!

Now the run time factor is about 10 (TEN) between the hp48xgcc version and my native solution;-)

In other words, the xgcc version needs ten times as long as my version.

Now we could say further optimizations will be somewhat difficult...

Have a nice weekend:-)

