Benchmarks?



#8

In a recent post about the Solver differences amongst various models, it was mentioned that certain models were faster than others. That made me notice that there are no actual benchmarks available on this site.

I realize that benchmarking is a difficult science, but perhaps as a group we could come up with something reasonable and practical. Maybe Dave would consider adding the results to his features lists.

Some comments:

- The comparison would only really be useful for long running operations, either programs or solver / integrate like functions. I personally don't think the speed of keyboard operations is all that critical, although the speed with which the 48G responds to x! is breathtaking after using the 15C for a while. Actually, I take that back... it takes WAY longer to find AND execute x! on the 48G than it does on the 15C!

- Solver speed would be a very reasonable benchmark, since it exercises many of the machine's capabilities, if it weren't for the fact that the solver does not appear on that many models.

- For programming, I'm not sure how we'd address the RPN vs RPL issue. In many ways this is apples to oranges.


#9

There are these.


#10

Oops.

Didn't see these... my bad.

However, do I believe that the HP-11C is at 6% of the 9100A? Yikes. Then again, the 41C is only at 13% so maybe...

There are a few missing machines in the list. I suppose the 15C is pretty much the same as the 11C -- no change in the processor between these two?

#11

I found that the 48GX is very fast compared to older calculators. The problem with the 48 is that it can be very slow with the stack.


#12

"The problem with the 48 is that it can be very slow with the stack"

Well, the RPL stack can be very deep. I don't think that I'd call it
slow if I restricted myself to using only 4 levels. On a 48GX, 4 ROLL or
4 ROLLD takes only about 21 ticks (0.0026 seconds). Can we blink that
fast?

I have noticed that ROLL or ROLLD over a lot of levels can be relatively
slow, about 186 ticks (0.023 second) on a 48GX if I have the numbers 1
through 1000 on the stack and execute 1000 ROLL or 1000 ROLLD. How long
does that take on a non-RPL calculator?


For reference, here's my general purpose timing program, which I used
for the above and for the total times in the programs below. Note that
the "correction value" varies by model. In these particular calculators
today, I used these values for correction, but the best value varies
among units and even over time (Temperature? Battery condition? Phase of
the moon?). To find the value, I put the number 1 on the otherwise empty
stack, temporarily disable the last stack, arguments, and commands
saves, run the timer program, and adjust the value until it *usually*
returns "Ticks: 0" and "0_s". Well, on a 48 series that is; on a 49G I
have to settle for *usually* +/- 10 ticks (0.0012 seconds). This program
isn't 100% repeatable, but I expect that it's about the best that can be
done in UserRPL, and plenty good enough for my purposes. If I care to
get picky about routines that are very nearly the same (or very fast),
then I run them in loops. Note that this program times the evaluation of
the object on the stack; if that's a name, the time will include
whatever's needed for name resolution.

48SX Checksum (using 104 below): # C2F4h
48GX Checksum (using 72 below): # 5F0h
49G Checksum (using 252. below): # CF3Eh
Bytes: 136

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  MEM \-> t               @ Force GC, use MEM result for dummy 't'.
  \<<
    TICKS 't' STO         @ Get start time, store it in 't'.
    EVAL                  @ Evaluate the object.
    TICKS                 @ Get stop time.
    RCWS                  @ Get current wordsize.
    64. STWS              @ Set wordsize to 64.
    SWAP t -              @ Elapsed time.
    B\->R                 @ Convert binary to real.
                          @ Correction for time to do TICKS 't' STO.
    @104 48SX value
    72                    @ 48GX value
    @252. 49G value
    -
    "Ticks" \->TAG        @ Tag result.
    DUP 8192. / '1_s' *   @ Also show in seconds.
    ROT STWS              @ Restore original wordsize.
  \>>
\>>

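For routines too fast to time individually, the loop approach mentioned
above might look like the following. This is only a minimal sketch, not
one of the programs used for the timings in this post: the count of 100
passes and the DUP DROP body are placeholders, no correction value is
subtracted, and the result still includes the overhead of the
START...NEXT loop itself. It assumes a wordsize of 64.

```rpl
%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  TICKS                   @ Get the start time.
  1. 100.                 @ Placeholder: repeat 100 times.
  START
    DUP DROP              @ Placeholder for the routine under test.
  NEXT
  TICKS                   @ Get the finish time.
  SWAP -                  @ Elapsed time.
  B\->R                   @ Convert binary to real number.
  100. /                  @ Average ticks per pass, loop overhead included.
  "Ticks" \->TAG          @ Tag result.
\>>
```

Running the same program with an empty loop body gives an estimate of
the loop overhead to subtract.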

"Garbage collection" (but I still say that it should be called "memory
packing") gets slower as the stack gets deeper, and often slows down
drastically if the stack (which is really a stack of pointers) has many
pointers into a large composite that's in temporary memory (that is, the
composite isn't stored in a global or port variable). An example of such
a pointer into a composite would be to build a list on the stack and use
GET to extract an element from it. It "looks like" the element is on the
stack, but really the pointer is into the list, which is still in
temporary memory. The original list is still referenced and kept in
temporary memory for the sake of the pointer until the element is
dropped, stored in a global or port variable, combined into another list
or a vector, or has the NEWOB command executed on it. (This also affects
how much free memory there is.) The garbage collection routine
automatically runs when the unused memory gets too low. If the garbage
collection routine runs and finds a pointer into a composite, then it
has to check all of the following objects in the composite one at a time
to find the end of the composite and its size field so that it knows
exactly which block of memory to move.
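
A small illustration of such a pointer into a composite, as a sketch
only (the list size of 1000 and the index 500 are arbitrary choices for
illustration):

```rpl
\<<
  1. 1000.                @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  1000. \->LIST           @ Combine them into a list in temporary memory.
  500. GET                @ "Looks like" 500. on the stack, but it's
                          @ really a pointer into the list.
  NEWOB                   @ Make an independent copy of the element, so
                          @ the pointer no longer refers into the list.
\>>
```

Without the NEWOB, the whole 1000-element list stays in temporary memory
for as long as that one element remains on the stack.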


Usually, when speed is the most important criterion, I recommend using
the stack instead of variables, and if that gets unmanageable, then
local variables in preference to global variables. But it can be a huge
advantage to store a large list in a global variable (to get it out of
temporary memory) before taking it apart if there's an expectation that
a garbage collection may occur with the "list elements" on the "stack".
For example, compare the execution times for the following programs.
Note that MEM forces a garbage collection (to make it easy to determine
how much memory is available), and it's the easiest way to get an idea
of how long garbage collection takes. In the following programs I've
included internal timing routines for the MEM command itself. All of the
timings are with the stack empty and with last stack, arguments, and
commands saving enabled. Wordsize is 64. The 49G is in "approximate"
mode. Note that the "correction value" varies by model; they were found
experimentally and may not be "best" for all calculators. Also note that
the times would vary with repeated trials.


This first one is just to time MEM on a nearly empty stack.

48SX Checksum (using 32 below): # 6255h
48GX Checksum (using 16 below): # 85AFh
49G Checksum (using 189. below): # F96Dh
Bytes: 84.5

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  TICKS                   @ Get the start time.
  MEM                     @ Force a garbage collection.
  TICKS                   @ Get the finish time.
  ROT -                   @ Elapsed time.
  B\->R                   @ Convert binary to real number.
                          @ Correction for time to execute TICKS.
  @32 48SX value
  16                      @ 48GX value
  @189. 49G value
  -
  "Ticks" \->TAG          @ Tag result.
  DUP 8192. / '1_s' *     @ Also show in seconds.
\>>

48SX: 378 ticks (0.046 second) for MEM,
      1510 ticks (0.18 second) total.
48GX: 254 ticks (0.031 second) for MEM,
      1031 ticks (0.13 second) total.
49G:  594 ticks (0.072 second) for MEM,
      1738 ticks (0.21 second) total.


Now with the reals 1 through 1000 on the stack:

48SX Checksum (using 32 below): # C99Fh
48GX Checksum (using 16 below): # 2E65h
49G Checksum (using 189. below): # 62B9h
Bytes: 111.5

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  1. 1000.                @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  TICKS                   @ Get the start time.
  MEM                     @ Force a garbage collection.
  TICKS                   @ Get the finish time.
  ROT -                   @ Elapsed time.
  B\->R                   @ Convert binary to real number.
                          @ Correction for time to execute TICKS.
  @32 48SX value
  16                      @ 48GX value
  @189. 49G value
  -
  "Ticks" \->TAG          @ Tag result.
  DUP 8192. / '1_s' *     @ Also show in seconds.
\>>

48SX: 12751 ticks (1.56 seconds) for MEM,
      56681 ticks (6.92 seconds) total.
48GX: 8866 ticks (1.08 seconds) for MEM,
      39169 ticks (4.78 seconds) total.
49G:  9052 ticks (1.10 seconds) for MEM,
      39057 ticks (4.77 seconds) total.

No list involved.


Now we'll build a list and explode it onto the stack:

48SX Checksum (using 32 below): # 428Dh
48GX Checksum (using 16 below): # A577h
49G Checksum (using 189. below): # CBB8h
Bytes: 121.5

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  1. 1000.                @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  DUP \->LIST             @ Put the numbers into a list.
  LIST\->                 @ Explode the list onto the stack.
  DROP                    @ Discard the count.
  TICKS                   @ Get the start time.
  MEM                     @ Force a garbage collection.
  TICKS                   @ Get the finish time.
  ROT -                   @ Elapsed time.
  B\->R                   @ Convert binary to real number.
                          @ Correction for time to execute TICKS.
  @32 48SX value
  16                      @ 48GX value
  @189. 49G value
  -
  "Ticks" \->TAG          @ Tag result.
  DUP 8192. / '1_s' *     @ Also show in seconds.
\>>

48SX: 1818259 ticks (221.96 seconds) for MEM,
      1870881 ticks (228.38 seconds) total.
48GX: 1255937 ticks (153.31 seconds) for MEM,
      1292199 ticks (157.74 seconds) total.
49G:  1180957 ticks (144.15 seconds) for MEM,
      1216584 ticks (148.51 seconds) total.

It "looks like" the same thing on the stack for MEM, but because of the
list, it's much slower. Oh well, that gives us time to start a fresh pot
of coffee.


But if the list is stored in a global variable before we explode it:

48SX Checksum (using 32 below): # C0E2h
48GX Checksum (using 16 below): # 5AB7h
49G Checksum (using 189. below): # 4C07h
Bytes: 156.5

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
@ Note well! This will purge a global variable named 'TEMP'. If you
@ already have something named 'TEMP', then choose a different name!
\<<
  'TEMP'                  @ Place the name on the stack.
  1. 1000.                @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  DUP \->LIST             @ Put the numbers into a list.
  OVER STO                @ Store the list in a global variable.
  RCL                     @ Put the list back on the stack.
  LIST\->                 @ Explode the list onto the stack.
  DROP                    @ Discard the count.
  TICKS                   @ Get the start time.
  MEM                     @ Force a garbage collection.
  TICKS                   @ Get the finish time.
  ROT -                   @ Elapsed time.
  B\->R                   @ Convert binary to real number.
                          @ Correction for time to execute TICKS.
  @32 48SX value
  16                      @ 48GX value
  @189. 49G value
  -
  "Ticks" \->TAG          @ Tag result.
  DUP 8192. / '1_s' *     @ Also show in seconds.
  'TEMP' PURGE            @ Purge the variable.
\>>

48SX: 6556 ticks (0.80 second) for MEM,
      68400 ticks (8.35 seconds) total.
48GX: 4508 ticks (0.55 second) for MEM,
      67779 ticks (8.27 seconds) total.
49G:  4996 ticks (0.61 second) for MEM,
      51629 ticks (6.30 seconds) total.

Much better, even with the extra overhead of handling the global
variable. I'm willing to trade the extra size for speed.


And no, using a local variable doesn't have this effect; local variables
are still in temporary memory. Try:

48SX Checksum (using 32 below): # C035h
48GX Checksum (using 16 below): # 5590h
49G Checksum (using 189. below): # 4178h
Bytes: 144

%%HP: T(3)A(R)F(.);       @ Header for ASCII download.
\<<
  1. 1000.                @ Place the numbers 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  DUP \->LIST             @ Put the numbers into a list.
  \-> TEMP                @ Store the list in a local variable.
  \<<
    TEMP                  @ Put the list back on the stack.
    LIST\->               @ Explode the list onto the stack.
    DROP                  @ Discard the count.
    TICKS                 @ Get the start time.
    MEM                   @ Force a garbage collection.
    TICKS                 @ Get the finish time.
    ROT -                 @ Elapsed time.
    B\->R                 @ Convert binary to real number.
                          @ Correction for time to execute TICKS.
    @32 48SX value
    16                    @ 48GX value
    @189. 49G value
    -
    "Ticks" \->TAG        @ Tag result.
    DUP 8192. / '1_s' *   @ Also show in seconds.
  \>>
\>>

48SX: 1820005 ticks (222.17 seconds) for MEM,
      1872915 ticks (228.63 seconds) total.
48GX: 1257259 ticks (153.47 seconds) for MEM,
      1293734 ticks (157.93 seconds) total.
49G:  1181800 ticks (144.26 seconds) for MEM,
      1217617 ticks (148.63 seconds) total.

The local variable didn't help a bit.


These timings were done on a 48SX Version D, a 48GX Version R, and
a 49G Revision #1.19-6.

Another technique is to force a garbage collection with MEM DROP just
before these conditions occur, in the hope that garbage collections
won't be needed at particularly bad times.
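
A minimal sketch of that technique, assuming the same 1000-element list
as in the programs above. Whether it actually prevents a collection
later on depends on how much memory is free and on what the rest of the
program does, so treat it as an illustration rather than a guarantee:

```rpl
\<<
  1. 1000.                @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  DUP \->LIST             @ Put the numbers into a list.
  MEM DROP                @ Force a garbage collection now, while only
                          @ the list is on the stack; discard the result.
  LIST\->                 @ Explode the list onto the stack.
  DROP                    @ Discard the count.
\>>
```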

By the way, there's supposed to be a new 49G Revision 1.19-7 flash ROM
already written, just waiting for HP to allow it to be released. The
plan is (was?) that this will have an improved garbage collection
routine, particularly when it comes to handling this issue with
composite objects.

I hope that someone (maybe just RPL users?) will find this helpful or
interesting. I enjoyed playing around with the calculators, and some
things that I already "sort of knew" have soaked into my brain a bit
deeper. The biggest surprise for me was how well the 49G did.

Regards,
James


#13

James --

Good Lord, what an effort!

I suggest that the post be preserved in the Articles section.


#14

Thank you.

It really wasn't that much of an effort; as I wrote, it was playing
around. All very simple, straightforward RPL programming. Probably the
biggest effort was adding comments for the benefit of those not familiar
with RPL whilst avoiding adding typos to the programs. Just in case I
did make typos, or someone manually keying in a program made a typo,
I included the outputs from the BYTES command. As you may have noticed,
the programs share a good deal of code, so the various programs were a
matter of editing a previous program, not of writing an entirely new
program. And of course, I didn't actually key in the programs on each
calculator; I transferred them via their serial ports.

After posting, I wished that I had included the 28 series calculators.
But of course, that would entail keying them in; no serial port
available. And the 28 series doesn't have a built-in TICKS command; I'd
have to use SYSEVAL, and with a different address for each ROM version.
The 28 series also lacks the BYTES or any other built-in checksum type
of command, though I do have one copied from a book that I use for
verifying that a program is keyed in correctly. And of course, on the
28C, with its very limited memory, I'm pretty sure that I wouldn't be
able to put so many numbers on the stack, even with the last stack,
arguments, and commands saves disabled. Still, I may decide to add
something to the Articles section, and if so, I'll include the 28
series, but using a shallower stack.

I wrote:

The original list is still referenced and kept in temporary memory for
the sake of the pointer until the element is dropped, stored in a global
or port variable, combined into another list or a vector, or has the
NEWOB command executed on it. (This also affects how much free memory
there is.)

I should've mentioned that there are exceptions to this. If the element
is an integer value from -9 through 9, the composite isn't kept in
temporary memory for the sake of the element. These numbers (and many
other objects) each have their own address in ROM; I guess that when one
of these numbers (and perhaps other objects with ROM addresses?) is on
the stack as a list element, what's actually on the stack is a pointer
to the ROM address, rather than a pointer into the composite in
temporary memory.

For example, if you replace:

  1. 1000.        @ Place the reals 1 through 1000 on the stack.
  FOR n
    n
  NEXT
  DUP \->LIST     @ Put the numbers into a list.
  LIST\->         @ Explode the list onto the stack.

in one of the slow programs with:

  1. 1000.        @ Place the real number zero onto the stack 1000
  FOR n           @ times.
    0.
  NEXT
  1000. \->LIST   @ Combine the zeros into a list.
  LIST\->         @ Explode the list onto the stack.

then you'll find that it runs much faster.

Regards,
James

