Optimization questions - HP-15C



#36

About a year ago, as a personal challenge to myself, I decided to write a Sudoku solver for the HP-15C. It was quite a task and it took me several months to whittle it down to 34 registers of memory and 31 registers of code. When I did finally cram it into memory earlier this year and get it working, I felt as if I had achieved quite a victory. I thought I would document it, get some satisfaction from posting it to the Articles section and be done with it.



Unfortunately it was somewhat of a Pyrrhic victory. While the program will solve a Sudoku puzzle in a flash if I run it in the HP-15C emulator on my PC, it runs mind numbingly slowly on the actual calculator hardware (even on LE). It does work, but odds are that the calculator will run out of batteries before it has solved any given puzzle.



For a few months I did not touch it, burned out on the project. Now however, I though I would give it one more push to try to optimize the code to run at a more palatable speed.



In order to cram the logic into available memory, I had to resort to using a lot of subroutine calls as a form of common code elimination. A lot! I suspect that the root of the slowness is in the fact that the program is paying a large penalty for all those nested call.



The purpose of this post is to see if someone could give me some basic information on how subroutines are implemented on the hardware and what the performance costs are:



1) Is there a performance penalty for calling a label that is far away (in terms of number of instructions) versus a label that is close. (And if there is, what is it?) The documentation talks about searching for labels which makes it sounds as if there might be. The program is 200 steps and some of the subroutines can be quite far away from the caller.



2) Is there a performance penalty for calling a label that is backwards in the code?



3) Is there a performance penalty for nested calls? During the execution the call stack is almost always several calls deep.



3) Is there some trick for efficiently calling subroutines from tight loops?


#37

Hi, Marcel.

I once had some performance measurements made in an HP41CX by using its internal stopwatch and found some slightly differences while having backward and forward jumping, and also when they were single GTO's or subroutine calls (XEQ). As expected, the need of at least pushing the return address into a return stack demanded some machine cycles, even when the jump had already been compiled.

I have never measured the HP15C performance because I can only think of doing it by naked eye observation and an actual hand operated stopwatch, and I guess the new HP15C LE will make things even harder because it is too fast. Unless, of course, you can connect some tracking device (logic analyzer) and check results. That would be a precise way of doing it.

All I can think of is that the HP15C 'O.S.' seems to search for labels in a 'forward first' scheme, so if the label is backwards, the O.S. will always go ahead till the last existing program step then looping to the beginning of the program in order to search for the corresponding label. This way, based on what is written in the HP15C manual, backwards jumping will always consume more time than forward, even if the backward label is just a few steps back. So, questions #1), #2) and #4) sound coherent to me.

About question #3): again, I have no actual evidence to reason about, but if the stack memory has fixed size - and also based on the fact that the HP15C stores only 7 return addresses - then its addresses are possibly fixed as well, and if this is taken into account when pushing and popping return addresses in and out of it, I'd guess that there is no extra penalty (added after last edit:) for each nested subroutine call in this case.

But these are just conjectures of mine, cannot affirm any of these based on concrete evidences.

Sorry not helping the way you need, though. I guess that if you deal with photography, slightly changes in time measuring while capturing or revealing images are critical, right?

Cheers.

Luiz (Brazil)

Edited: 28 Oct 2012, 10:28 a.m.


#38

Hi Luiz,



The functions are called many times and so even minor changes in timing will make a big difference. Because the program is manipulating individual integers that are packed into the registers, it uses the modulus operator frequently; since the 15c does not have a modulus operator built in, that subroutine is one of my worst offenders.



I guess I will have to resort to writing some tests to see how much of a difference the distance of the jump makes. I was thinking of just writing a loop with a counter that calls a subroutine. With a stopwatch I can let it run for a minute and then interrupt it and check the counter to see how many it did in that time. While that won't give me the exact cost of a subroutine call, since the overhead of the loop is built into that time, it will allow me to compare the results of different distance jumps. I was hoping someone had already done something like that and kept the data.



Marcel

#39

In time: will you share the code? Please? 8^)

Luiz (Brazil)


#40

Sure. I haven't wanted to post it to the articles section because it is not good enough yet. Here is a link to a spreadsheet with the current code on SkyDrive:

http://sdrv.ms/UVrmXh

To actually use it:

1) Dimension your memory to use 34 registers

2) Manually stuff the rows of the input sudoku into registers 8-16. Each row is input with as one number, with 0 representing a blank.

3) GSB A to run the program

4) When the program finishes, the solution will be found in registers 17-25, which you will unfortunately have to manually use the indirect register to retrieve.

I know this is not particularly user friendly. I would like to squeeze the program down by a couple of registers in order to be able to put in an input and output loop. I have some ideas on how to do that but thought I would deal with the speed issue first.


#41

Why not use the matrix functionality to input and output the data?


#42

Yes, I will have to look into that. Whatever takes the least memory. Right now it has exactly one byte free, so some more memory needs to be freed up before I can even think about I/O.

#43

The XML-file below can be loaded into the nonpareil emulator so you don't have to type in the program. I was curious how long it would take your program to solve a simple example:


The program is still running.

I'm very much impressed by your effort!

Thomas



<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE state SYSTEM "nonpareil.dtd">
<state version="1.00" model="15C" platform="voyager" arch="nut">
<ui/>
<chip name="Nut">
<registers>
<reg name="a" data="000000fffff000"/>
<reg name="b" data="000000fffff000"/>
<reg name="c" data="00000000000eae"/>
<reg name="m" data="00000000000000"/>
<reg name="n" data="00000000000000"/>
<reg name="g" data="04"/>
<reg name="p" data="c"/>
<reg name="q" data="4"/>
<reg name="q_sel" data="0"/>
<reg name="fo" data="00"/>
<reg name="s" data="0800"/>
<reg name="pc" data="0000"/>
<reg name="stack" index="0" data="006d"/>
<reg name="stack" index="1" data="0042"/>
<reg name="stack" index="2" data="0000"/>
<reg name="stack" index="3" data="0000"/>
<reg name="decimal" data="0"/>
<reg name="carry" data="0"/>
<reg name="awake" data="0"/>
<reg name="pf_addr" data="00"/>
<reg name="ram_addr" data="007"/>
<reg name="active_bank" index="0" data="0"/>
<reg name="active_bank" index="1" data="0"/>
<reg name="active_bank" index="2" data="0"/>
<reg name="active_bank" index="3" data="0"/>
<reg name="active_bank" index="4" data="0"/>
<reg name="active_bank" index="5" data="0"/>
<reg name="active_bank" index="6" data="0"/>
<reg name="active_bank" index="7" data="0"/>
<reg name="active_bank" index="8" data="0"/>
<reg name="active_bank" index="9" data="0"/>
<reg name="active_bank" index="a" data="0"/>
<reg name="active_bank" index="b" data="0"/>
<reg name="active_bank" index="c" data="0"/>
<reg name="active_bank" index="d" data="0"/>
<reg name="active_bank" index="e" data="0"/>
<reg name="active_bank" index="f" data="0"/>
</registers>
</chip>
<chip name="Voyager LCD">
<registers>
<reg name="enable" data="1"/>
<reg name="blink" data="0"/>
</registers>
</chip>
<memory as="ram">
<loc addr="000" data="00000000000000"/>
<loc addr="001" data="00000000000000"/>
<loc addr="002" data="00000000000000"/>
<loc addr="003" data="00000000000000"/>
<loc addr="004" data="000000fffff000"/>
<loc addr="005" data="00000000000008"/>
<loc addr="006" data="0000000000000c"/>
<loc addr="007" data="00000000000eae"/>
<loc addr="008" data="00000000000000"/>
<loc addr="009" data="2faf8befbe2280"/>
<loc addr="00a" data="00000000000000"/>
<loc addr="010" data="00000000000000"/>
<loc addr="011" data="00000000000000"/>
<loc addr="012" data="00000000000000"/>
<loc addr="013" data="00000000000000"/>
<loc addr="014" data="00000000000000"/>
<loc addr="015" data="c0d2d2d2d2d2d2"/>
<loc addr="016" data="000000000006e1"/>
<loc addr="017" data="00000000000000"/>
<loc addr="018" data="00000000000000"/>
<loc addr="019" data="00000000000000"/>
<loc addr="01a" data="00000000a00000"/>
<loc addr="0c0" data="00000000000000"/>
<loc addr="0c1" data="00000000000000"/>
<loc addr="0c2" data="00000000000000"/>
<loc addr="0c3" data="00000000000000"/>
<loc addr="0c4" data="00000000000000"/>
<loc addr="0c5" data="00000000000000"/>
<loc addr="0c6" data="00000000000000"/>
<loc addr="0c7" data="00000000000000"/>
<loc addr="0c8" data="00000000000000"/>
<loc addr="0c9" data="00000000000000"/>
<loc addr="0ca" data="00000000000000"/>
<loc addr="0cb" data="00000000000000"/>
<loc addr="0cc" data="00000000000000"/>
<loc addr="0cd" data="00000000000000"/>
<loc addr="0ce" data="00000000000000"/>
<loc addr="0cf" data="00000000000000"/>
<loc addr="0d0" data="00000000000000"/>
<loc addr="0d1" data="00000000000000"/>
<loc addr="0d2" data="00000000000000"/>
<loc addr="0d3" data="00000000000000"/>
<loc addr="0d4" data="00000000000000"/>
<loc addr="0d5" data="00000000000000"/>
<loc addr="0d6" data="00000000000000"/>
<loc addr="0d7" data="00000000000000"/>
<loc addr="0d8" data="00000000000000"/>
<loc addr="0d9" data="00000000000000"/>
<loc addr="0da" data="00000000000000"/>
<loc addr="0db" data="00000000000000"/>
<loc addr="0dc" data="00000000000000"/>
<loc addr="0dd" data="00000000000000"/>
<loc addr="0de" data="00000000000000"/>
<loc addr="0df" data="00000000000000"/>
<loc addr="0e0" data="00000000000000"/>
<loc addr="0e1" data="00b225faf9c4c4"/>
<loc addr="0e2" data="32ff7120f2ebfd"/>
<loc addr="0e3" data="30210919ff43ff"/>
<loc addr="0e4" data="2d33ff361c7126"/>
<loc addr="0e5" data="c3f10c1e2d3619"/>
<loc addr="0e6" data="ff52ff29342933"/>
<loc addr="0e7" data="293242ff253627"/>
<loc addr="0e8" data="faf11c7536f909"/>
<loc addr="0e9" data="ff271e713726f1"/>
<loc addr="0ea" data="b27531f0f80e41"/>
<loc addr="0eb" data="c3f110ff7631f0"/>
<loc addr="0ec" data="f82d7137273726"/>
<loc addr="0ed" data="f100ff41c3f143"/>
<loc addr="0ee" data="ff42ff0ab220f0"/>
<loc addr="0ef" data="f1ebfd35862304"/>
<loc addr="0f0" data="b24724f8324624"/>
<loc addr="0f1" data="f7f13245ccfb33"/>
<loc addr="0f2" data="f844fafcf3ebfd"/>
<loc addr="0f3" data="f332ebfdf34320"/>
<loc addr="0f4" data="f93142ebfdf931"/>
<loc addr="0f5" data="81df06b296fafc"/>
<loc addr="0f6" data="35fb30368623f7"/>
<loc addr="0f7" data="f1324056ef07b2"/>
<loc addr="0f8" data="25faf9c49623f6"/>
<loc addr="0f9" data="f2c1c5fac353ff"/>
<loc addr="0fa" data="30210bb22b342b"/>
<loc addr="0fb" data="332b32250db280"/>
<loc addr="0fc" data="cdc5f2fbf14005"/>
<loc addr="0fd" data="b2c497fa03b286"/>
<loc addr="0fe" data="23f6f2c1c101b2"/>
<loc addr="0ff" data="b5fca3c5b1fd00"/>
</memory>
</state>


001 - 42,21, 0   LBL 0
002 - 10 ÷
003 - 43 36 LASTx
004 - 34 x<>y
005 - 42 44 FRAC
006 - 20 ×
007 - 43 34 RND
008 - 43 32 RTN
009 - 42,21, 1 LBL 1
010 - 36 ENTER
011 - 36 ENTER
012 - 2 2
013 - 6 6
014 - 32 3 GSB 3
015 - 45 24 RCL (i)
016 - 43 32 RTN
017 - 42,21, 3 LBL 3
018 - 40 +
019 - 44 25 STO I
020 - 33 R-v
021 - 43 32 RTN
022 - 42,21, 5 LBL 5
023 - 44 0 STO 0
024 - 1 1
025 - 30 -
026 - 2 2
027 - 34 x<>y
028 - 14 y^x
029 - 42, 4, 0 x<> 0
030 - 43 32 RTN
031 - 42,21,14 LBL D
032 - 32 5 GSB 5
033 - 45 2 RCL 2
034 - 32 12 GSB B
035 - 45 3 RCL 3
036 - 32 12 GSB B
037 - 45 4 RCL 4
038 - 32 12 GSB B
039 - 43 32 RTN
040 - 42,21,12 LBL B
041 - 32 1 GSB 1
042 - 45 0 RCL 0
043 - 43, 6, 3 F? 3
044 - 16 CHS
045 - 40 +
046 - 34 x<>y
047 - 36 ENTER
048 - 2 2
049 - 6 6
050 - 32 3 GSB 3
051 - 44 24 STO (i)
052 - 33 R-v
053 - 9 9
054 - 40 +
055 - 32 5 GSB 5
056 - 43 32 RTN
057 - 42,21, 7 LBL 7
058 - 42, 4, 6 x<> 6
059 - 44 0 STO 0
060 - 45 2 RCL 2
061 - 1 1
062 - 7 7
063 - 32 3 GSB 3
064 - 45 24 RCL (i)
065 - 45 6 RCL 6
066 - 45 0 RCL 0
067 - 30 -
068 - 45 5 RCL 5
069 - 20 ×
070 - 40 +
071 - 44 24 STO (i)
072 - 43 32 RTN
073 - 42,21, 6 LBL 6
074 - 44,40, 1 STO + 1
075 - 45 1 RCL 1
076 - 9 9
077 - 10 ÷
078 - 43 44 INT
079 - 44 2 STO 2
080 - 45 1 RCL 1
081 - 9 9
082 - 32 0 GSB 0
083 - 44 3 STO 3
084 - 3 3
085 - 10 ÷
086 - 43 44 INT
087 - 45 2 RCL 2
088 - 3 3
089 - 10 ÷
090 - 43 44 INT
091 - 3 3
092 - 20 ×
093 - 40 +
094 - 44 4 STO 4
095 - 8 8
096 - 45 3 RCL 3
097 - 30 -
098 - 13 10^x
099 - 44 5 STO 5
100 - 45 2 RCL 2
101 - 1 1
102 - 7 7
103 - 32 4 GSB 4
104 - 44 6 STO 6
105 - 45 2 RCL 2
106 - 8 8
107 - 32 4 GSB 4
108 - 44 7 STO 7
109 - 43 32 RTN
110 - 42,21, 4 LBL 4
111 - 32 3 GSB 3
112 - 45 24 RCL (i)
113 - 45 5 RCL 5
114 - 10 ÷
115 - 43 44 INT
116 - 1 1
117 - 0 0
118 - 32 0 GSB 0
119 - 43 32 RTN
120 - 42,21,11 LBL A
121 - 43, 5, 2 CF 2
122 - 43, 5, 3 CF 3
123 - 1 1
124 - 16 CHS
125 - 44 1 STO 1
126 - 42,42,.0 LBL .0
127 - 1 1
128 - 32 6 GSB 6
129 - 45 7 RCL 7
130 - 32 7 GSB 7
131 - 45 7 RCL 7
132 - 43,30, 1 TEST 1
133 - 32 14 GSB D
134 - 8 8
135 - 0 0
136 - 45 1 RCL 1
137 - 43,30, 6 TEST 6
138 - 22 .0 GTO .0
139 - 1 1
140 - 16 CHS
141 - 44 1 STO 1
142 - 42,21,15 LBL E
143 - 8 8
144 - 0 0
145 - 45 1 RCL 1
146 - 43,30, 5 TEST 5
147 - 43 32 RTN
148 - 1 1
149 - 32 6 GSB 6
150 - 45 7 RCL 7
151 - 43,30, 1 TEST 1
152 - 22 15 GTO E
153 - 32 7 GSB 7
154 - 42,42,.9 LBL .9
155 - 9 9
156 - 45 6 RCL 6
157 - 43,30, 5 TEST 5
158 - 22 13 GTO C
159 - 1 1
160 - 40 +
161 - 32 7 GSB 7
162 - 45 6 RCL 6
163 - 32 5 GSB 5
164 - 43, 5, 2 CF 2
165 - 45 2 RCL 2
166 - 32 9 GSB 9
167 - 45 3 RCL 3
168 - 32 9 GSB 9
169 - 45 4 RCL 4
170 - 32 9 GSB 9
171 - 43, 6, 2 F? 2
172 - 22 .9 GTO .9
173 - 45 6 RCL 6
174 - 32 14 GSB D
175 - 22 15 GTO E
176 - 42,21,13 LBL C
177 - 1 1
178 - 16 CHS
179 - 32 6 GSB 6
180 - 43,30, 1 TEST 1
181 - 22 13 GTO C
182 - 45 6 RCL 6
183 - 43, 4, 3 SF 3
184 - 32 14 GSB D
185 - 43, 5, 3 CF 3
186 - 22 .9 GTO .9
187 - 42,21, 9 LBL 9
188 - 32 1 GSB 1
189 - 45 0 RCL 0
190 - 10 ÷
191 - 43 44 INT
192 - 2 2
193 - 32 0 GSB 0
194 - 43,30, 1 TEST 1
195 - 43, 4, 2 SF 2
196 - 33 R-v
197 - 33 R-v
198 - 9 9
199 - 40 +
200 - 32 5 GSB 5
201 - 43 32 RTN

#44

I have never used nonpareil. I tried the example below in the HP emulator (that came on CD with LE) and it took less than 1 second to solve.

One thing I happened to notice is that in the key codes that you posted below, the numbered labels that are above 9 show an odd key code: For example, "LBL .9" shows up as "42,42,.9" instead of "42,21,.9" - is that acceptable for nonpareil or could it be a bug that is impacting the execution of the code?


#45

An easy way to make sure the program is working correctly is to input data of all 0. Thus, do a CLEAR REG and a GSB A.

The program will return the first solution it finds, which in this case will be:

123456789
456789123
789123456
214365897
365897214
897214365
531642978
642978531
978531642

It finds this solution very quickly compared to even a simple sudoku puzzle.

If the nonpareil version does not ever return a proper solution, let me know and I will give it a whirl. I have become pretty good at debugging this program :)

#46

Hi Marcel

You found a bug in the script that generates the listing. For further details, please consult the following article:
Effective Computer-aided Calculator Programming - Part 1 - Voyager

The following patch can be applied to fix the script:

2812,2821c2812,2821
< ff00 | LBL .0 | - 42,42,.0 |
< ff01 | LBL .1 | - 42,42,.1 |
< ff02 | LBL .2 | - 42,42,.2 |
< ff03 | LBL .3 | - 42,42,.3 |
< ff04 | LBL .4 | - 42,42,.4 |
< ff05 | LBL .5 | - 42,42,.5 |
< ff06 | LBL .6 | - 42,42,.6 |
< ff07 | LBL .7 | - 42,42,.7 |
< ff08 | LBL .8 | - 42,42,.8 |
< ff09 | LBL .9 | - 42,42,.9 |
---
> ff00 | LBL .0 | - 42,21,.0 |
> ff01 | LBL .1 | - 42,21,.1 |
> ff02 | LBL .2 | - 42,21,.2 |
> ff03 | LBL .3 | - 42,21,.3 |
> ff04 | LBL .4 | - 42,21,.4 |
> ff05 | LBL .5 | - 42,21,.5 |
> ff06 | LBL .6 | - 42,21,.6 |
> ff07 | LBL .7 | - 42,21,.7 |
> ff08 | LBL .8 | - 42,21,.8 |
> ff09 | LBL .9 | - 42,21,.9 |

As far as I know the HP emulator is based on Nonpareil so it might be that you can save the state oft the calculator to a file and retrieve it from there. Nonpareil uses a gziped XML-file but you can just use the plain XML-file as well. The script "vcomp" allows to compile programs for the voyagers to this format and to list them as well. And there are some other features like macros.


Meanwhile your program has finished and produced the correct result! Wow, I'm impressed! Congratulations!.

Cheers

Thomas


This is the listing of your sudoku-program after correcting the script:
001 - 42,21, 0  LBL 0        042 -    45  0  RCL 0        083 -    44  3  STO 3        124 -       16  CHS          165 -    45  2  RCL 2        
002 - 10 / 043 - 43, 6, 3 F? 3 084 - 3 3 125 - 44 1 STO 1 166 - 32 9 GSB 9
003 - 43 36 LSTx 044 - 16 CHS 085 - 10 / 126 - 42,21,.0 LBL .0 167 - 45 3 RCL 3
004 - 34 x<>y 045 - 40 + 086 - 43 44 INT 127 - 1 1 168 - 32 9 GSB 9
005 - 42 44 FRAC 046 - 34 x<>y 087 - 45 2 RCL 2 128 - 32 6 GSB 6 169 - 45 4 RCL 4
006 - 20 x 047 - 36 ENTER 088 - 3 3 129 - 45 7 RCL 7 170 - 32 9 GSB 9
007 - 43 34 RND 048 - 2 2 089 - 10 / 130 - 32 7 GSB 7 171 - 43, 6, 2 F? 2
008 - 43 32 RTN 049 - 6 6 090 - 43 44 INT 131 - 45 7 RCL 7 172 - 22 .9 GTO .9
009 - 42,21, 1 LBL 1 050 - 32 3 GSB 3 091 - 3 3 132 - 43,30, 1 TEST 1 173 - 45 6 RCL 6
010 - 36 ENTER 051 - 44 24 STO (i) 092 - 20 x 133 - 32 14 GSB D 174 - 32 14 GSB D
011 - 36 ENTER 052 - 33 Rv 093 - 40 + 134 - 8 8 175 - 22 15 GTO E
012 - 2 2 053 - 9 9 094 - 44 4 STO 4 135 - 0 0 176 - 42,21,13 LBL C
013 - 6 6 054 - 40 + 095 - 8 8 136 - 45 1 RCL 1 177 - 1 1
014 - 32 3 GSB 3 055 - 32 5 GSB 5 096 - 45 3 RCL 3 137 - 43,30, 6 TEST 6 178 - 16 CHS
015 - 45 24 RCL (i) 056 - 43 32 RTN 097 - 30 - 138 - 22 .0 GTO .0 179 - 32 6 GSB 6
016 - 43 32 RTN 057 - 42,21, 7 LBL 7 098 - 13 10^x 139 - 1 1 180 - 43,30, 1 TEST 1
017 - 42,21, 3 LBL 3 058 - 42, 4, 6 x<>6 099 - 44 5 STO 5 140 - 16 CHS 181 - 22 13 GTO C
018 - 40 + 059 - 44 0 STO 0 100 - 45 2 RCL 2 141 - 44 1 STO 1 182 - 45 6 RCL 6
019 - 44 25 STO I 060 - 45 2 RCL 2 101 - 1 1 142 - 42,21,15 LBL E 183 - 43, 4, 3 SF 3
020 - 33 Rv 061 - 1 1 102 - 7 7 143 - 8 8 184 - 32 14 GSB D
021 - 43 32 RTN 062 - 7 7 103 - 32 4 GSB 4 144 - 0 0 185 - 43, 5, 3 CF 3
022 - 42,21, 5 LBL 5 063 - 32 3 GSB 3 104 - 44 6 STO 6 145 - 45 1 RCL 1 186 - 22 .9 GTO .9
023 - 44 0 STO 0 064 - 45 24 RCL (i) 105 - 45 2 RCL 2 146 - 43,30, 5 TEST 5 187 - 42,21, 9 LBL 9
024 - 1 1 065 - 45 6 RCL 6 106 - 8 8 147 - 43 32 RTN 188 - 32 1 GSB 1
025 - 30 - 066 - 45 0 RCL 0 107 - 32 4 GSB 4 148 - 1 1 189 - 45 0 RCL 0
026 - 2 2 067 - 30 - 108 - 44 7 STO 7 149 - 32 6 GSB 6 190 - 10 /
027 - 34 x<>y 068 - 45 5 RCL 5 109 - 43 32 RTN 150 - 45 7 RCL 7 191 - 43 44 INT
028 - 14 y^x 069 - 20 x 110 - 42,21, 4 LBL 4 151 - 43,30, 1 TEST 1 192 - 2 2
029 - 42, 4, 0 x<>0 070 - 40 + 111 - 32 3 GSB 3 152 - 22 15 GTO E 193 - 32 0 GSB 0
030 - 43 32 RTN 071 - 44 24 STO (i) 112 - 45 24 RCL (i) 153 - 32 7 GSB 7 194 - 43,30, 1 TEST 1
031 - 42,21,14 LBL D 072 - 43 32 RTN 113 - 45 5 RCL 5 154 - 42,21,.9 LBL .9 195 - 43, 4, 2 SF 2
032 - 32 5 GSB 5 073 - 42,21, 6 LBL 6 114 - 10 / 155 - 9 9 196 - 33 Rv
033 - 45 2 RCL 2 074 - 44,40, 1 STO+ 1 115 - 43 44 INT 156 - 45 6 RCL 6 197 - 33 Rv
034 - 32 12 GSB B 075 - 45 1 RCL 1 116 - 1 1 157 - 43,30, 5 TEST 5 198 - 9 9
035 - 45 3 RCL 3 076 - 9 9 117 - 0 0 158 - 22 13 GTO C 199 - 40 +
036 - 32 12 GSB B 077 - 10 / 118 - 32 0 GSB 0 159 - 1 1 200 - 32 5 GSB 5
037 - 45 4 RCL 4 078 - 43 44 INT 119 - 43 32 RTN 160 - 40 + 201 - 43 32 RTN
038 - 32 12 GSB B 079 - 44 2 STO 2 120 - 42,21,11 LBL A 161 - 32 7 GSB 7
039 - 43 32 RTN 080 - 45 1 RCL 1 121 - 43, 5, 2 CF 2 162 - 45 6 RCL 6
040 - 42,21,12 LBL B 081 - 9 9 122 - 43, 5, 3 CF 3 163 - 32 5 GSB 5
041 - 32 1 GSB 1 082 - 32 0 GSB 0 123 - 1 1 164 - 43, 5, 2 CF 2

Edited: 30 Oct 2012, 5:53 a.m.


#47

Quote:
You found a bug in the script that generates the listing.

Fixed. Thanks for the bug report.

#48

Thanks a lot for the quick fix!

#49

Hi Thomas,

Thanks for the reference to the article. I wish I had known about vcomp when I started on this project. It would have saved me some time.

I'm actually relatively new to the world of HP calculators. I had a 32sII for years but used it infrequently and never wrote any meaningful programs for it. It was only in the last year when on a whim I bought a 15c LE that I decided to take on a challenging program as a way of familiarizing myself with it.

Where can I download nonpareil for the windows with the 15c rom? I poked around google for a bit and could not find it.


#50

Quote:
Where can I download nonpareil for the windows with the 15c rom?


High-Fidelity Calculator Simulator Download page
in

the Wayback Machine

If the first direct link doesn't work you mighty search for "http://nonpareil.brouhaha.com/download" and go back to Jan. 16th 2007. You need revision 0.78 or earlier.

Have fun

Thomas

Edited: 30 Oct 2012, 1:26 p.m.


#51

I really wanted to try the nonpareil emulator. It seems like a cool workflow with the compiler.

I had no end of problems getting the windows version working (a lot of GTK issues) and I eventually gave up.

So I built the .78 code on a linux box and that went smoothly. It took me a while to figure out how to load a file, so in the end I replaced the 15c.nst file with the XML file you had posted. That seemed to do the trick and when I fired up nonpareil 15c, the program seemed to be in memory in tact.

I ran a simple test case, however, and it failed after several minutes.

During execution, nonpareil would occasionally print out "warning: stray READ RAM 0bf PF 00 reg 0". The numbers on the warning were always the same.

Since you were able to run a test case in non-pareil, I am assuming that something in my configuration is wrong.

Does anyone have any thoughts on what my issue might be? Any common pitfalls to bringing a build of nonpareil up and running?

p.s

The one piece of information I left out is that I originally did an scons and "scons install" on the newest version of nonpareil. Only then did I notice that it did not come with 15c ROM. I then got version .7 and did a scons and scons install with it. Is there any chance that having had a newer version installed previously is somehow installing the version .78?


#52

Quote:
It took me a while to figure out how to load a file

File -> Open

Quote:
warning: stray READ RAM 0bf PF 00 reg 0

I get that warning as well. I didn't care too much. It might be that we initialize something wrong in 'vcomp'. Does it happen only after loading a compiled nst-file?

Quote:
I am assuming that something in my configuration is wrong.

Probably not. Or at least I'm having the same wrong configuration.

I think you can just copy the 15c ROM and the other corresponding files (kml, image) from release 0.78 to the newest version of nonpareil. Just in case you want to start from scratch again.

Cheers

Thomas


#53

I honestly don't see a file menu anywhere.....


#54

That's strange.


#55

If the low "r" character is what puzzles you, some time ago it was mentionesd here as sort of a signature from Nonpareil. It has no further effects. HTH


#56

No, the strange thing is that Marcel's emulator seems to lack the File-menu.

Edited: 1 Nov 2012, 8:55 a.m.


#57

I figured it out. As I had mentioned, I could not get things working under windows and so I installed it under Linux.

Although I am an old unix hack, I almost never use the Linux desktop. I was using the latest Ubuntu distribution and the desktop not only puts the application's menu to a global position at the top of the screen (instead of at each window), but it also auto-hides it so that it does not show up unless you mouse over it. It was this latter behavior that was confusing me.

As soon as I found some time to try to do something else on the system besides just run nonpareil, I came to the realization that all applications were missing their menu and I eventually figured out what was going on.

So.... sorry - my ignorance.

#58

Ok. I finally worked out the issue

When I had loaded your file into nonpareil, I had assumed that it would take care of setting the memory partition, and I never bothered to check it.

Once I manually did the DIM 34, everything started working and I am ready to start focusing on the program and everyone suggestions.

I really wanted to get nonpareil working because it seems suitable for performing timing tests. For development I having been using the HP15C emulator that came from HP with the LE. It is great for testing if things work because it is blindingly fast. Test cases that take hours on the hardware will take less than 1 second in that emulator. Unfortunately, for the same reason, it is is not practical for running timing tests.

#59

Quote:
During execution, nonpareil would occasionally print out "warning: stray READ RAM 0bf PF 00 reg 0". The numbers on the warning were always the same.

There are quirks in at least the 15c firmware where it
will perform read access of address 0xbf. When I
discovered this developing KEMU, I found the reads
to be innocuous and IIRC occur in more than one
routine in the firmware.

My dusty notes indicate it is possible to force an
access to 0xbf manually via the key sequence:

ON
ENTER
Rv (roll down)
#60

Quote:
As far as I know the HP emulator is based on Nonpareil

To the best of my knowledge the HP emulator is not based on any code from Nonpareil, though I did provide Cyrille with some minor assistance a few times as the new 12C and 15C LE were in development.

I have no idea what HP's state file format is, but it's probably not compatible with Nonpareil.

#61

I haven't taken the time to take a good look at the program, but just by eyeballing I came across the following:

At steps 55, 118, and 200, there's a GSB immediately followed by a RTN. They can be replaced by a GTO.

Unless they're needed for the stack lift, the RCL 0 ; - at 66/67 can be replaced by RCL - 0, the RCL 5 ; x at 68/69 by RCL x 5, the RCL 3 ; - at 96/97 by RCL - 3, the RCL 5 ; / at 113/114 by RCL / 5, the RCL 0 ; / at 189/190 by RCL / 0.

Probably won't make any appreciable difference, but every byte and every step may help.


#62

Quote:
At steps 55, 118, and 200, there's a GSB immediately followed by a RTN. They can be replaced by a GTO.

For the same reason lines 038 and 039 can be removed. First they can be replaced by a GTO B which isn't used since the next step is LBL B.

My guess is a lot of performance could be gained by simply rearinging the code to avoid backward-jumps.

Kind regards

Thomas

#63

Thanks very much for taking the time to look at the code!

Quote:
At steps 55, 118, and 200, there's a GSB immediately followed by a RTN. They can be replaced by a GTO.

The C++ developer in me had not thought of that (and cringes at the thought). However, that's a great suggestion.

Quote:
Unless they're needed for the stack lift, the RCL 0 ; - at 66/67 can be replaced by RCL - 0, the RCL 5 ; x at 68/69 by RCL x 5, the RCL 3 ; - at 96/97 by RCL - 3, the RCL 5 ; / at 113/114 by RCL / 5, the RCL 0 ; / at 189/190 by RCL / 0.

Unfortunately, I believe that the suggested replacements are all two byte instructions, so the memory usage would remain the same.


#64

Quote:
the memory usage would remain the same

But it's probably still faster since it's only one instruction. You might run a benchmark test to find out.


#65

You are absolutely correct. Based on Namir's performance document it seems that stack lift/drop is also expensive and reducing it should improve performance.

#66

This is the call-chain of your program:

I tried to rearrange your subroutines to avoid backward-calls. In addition to the other suggestions I've renamed LBL .0 to LBL 2 and LBL .9 to LBL 8. These use only 1 byte instead of 2.

If the return value of sub-routine 7 isn't used you could replace

078 RCL+ (i)
079 STO (i)
by a simple statement
078 STO+ (i)

And then I suggest to rename LBL {B, D} <-> {2, 8}. So all the alphanumeric labels are used for loops. But that's for aesthetic reasons only.

Just keep in mind: I made these changes without understanding a single line of your code.

Cheers

Thomas

PS: The program is currently running with previous clearing the registers.


001 - 42,21,11  LBL A         039 -    22 13  GTO C         077 - 45,20, 5  RCLx 5        115 -        9  9             153 -        1  1             
002 - 43, 5, 2 CF 2 040 - 1 1 078 - 45,40,24 RCL+ (i) 116 - 40 + 154 - 7 7
003 - 43, 5, 3 CF 3 041 - 40 + 079 - 44 24 STO (i) 117 - 42,21, 5 LBL 5 155 - 32 4 GSB 4
004 - 1 1 042 - 32 7 GSB 7 080 - 43 32 RTN 118 - 44 0 STO 0 156 - 44 6 STO 6
005 - 16 CHS 043 - 45 6 RCL 6 081 - 42,21,14 LBL D 119 - 1 1 157 - 45 2 RCL 2
006 - 44 1 STO 1 044 - 32 5 GSB 5 082 - 32 5 GSB 5 120 - 30 - 158 - 8 8
007 - 42,21, 2 LBL 2 045 - 43, 5, 2 CF 2 083 - 45 2 RCL 2 121 - 2 2 159 - 32 4 GSB 4
008 - 1 1 046 - 45 2 RCL 2 084 - 32 12 GSB B 122 - 34 x<>y 160 - 44 7 STO 7
009 - 32 6 GSB 6 047 - 32 9 GSB 9 085 - 45 3 RCL 3 123 - 14 y^x 161 - 43 32 RTN
010 - 45 7 RCL 7 048 - 45 3 RCL 3 086 - 32 12 GSB B 124 - 42, 4, 0 x<>0 162 - 42,21, 4 LBL 4
011 - 32 7 GSB 7 049 - 32 9 GSB 9 087 - 45 4 RCL 4 125 - 43 32 RTN 163 - 32 3 GSB 3
012 - 45 7 RCL 7 050 - 45 4 RCL 4 088 - 42,21,12 LBL B 126 - 42,21, 6 LBL 6 164 - 45 24 RCL (i)
013 - 43,30, 1 TEST 1 051 - 32 9 GSB 9 089 - 32 1 GSB 1 127 - 44,40, 1 STO+ 1 165 - 45,10, 5 RCL/ 5
014 - 32 14 GSB D 052 - 43, 6, 2 F? 2 090 - 45 0 RCL 0 128 - 45 1 RCL 1 166 - 43 44 INT
015 - 8 8 053 - 22 8 GTO 8 091 - 43, 6, 3 F? 3 129 - 9 9 167 - 1 1
016 - 0 0 054 - 45 6 RCL 6 092 - 16 CHS 130 - 10 / 168 - 0 0
017 - 45 1 RCL 1 055 - 32 14 GSB D 093 - 40 + 131 - 43 44 INT 169 - 42,21, 0 LBL 0
018 - 43,30, 6 TEST 6 056 - 22 15 GTO E 094 - 34 x<>y 132 - 44 2 STO 2 170 - 10 /
019 - 22 2 GTO 2 057 - 42,21,13 LBL C 095 - 36 ENTER 133 - 45 1 RCL 1 171 - 43 36 LSTx
020 - 1 1 058 - 1 1 096 - 2 2 134 - 9 9 172 - 34 x<>y
021 - 16 CHS 059 - 16 CHS 097 - 6 6 135 - 32 0 GSB 0 173 - 42 44 FRAC
022 - 44 1 STO 1 060 - 32 6 GSB 6 098 - 32 3 GSB 3 136 - 44 3 STO 3 174 - 20 x
023 - 42,21,15 LBL E 061 - 43,30, 1 TEST 1 099 - 44 24 STO (i) 137 - 3 3 175 - 43 34 RND
024 - 8 8 062 - 22 13 GTO C 100 - 33 Rv 138 - 10 / 176 - 43 32 RTN
025 - 0 0 063 - 45 6 RCL 6 101 - 9 9 139 - 43 44 INT 177 - 42,21, 1 LBL 1
026 - 45 1 RCL 1 064 - 43, 4, 3 SF 3 102 - 40 + 140 - 45 2 RCL 2 178 - 36 ENTER
027 - 43,30, 5 TEST 5 065 - 32 14 GSB D 103 - 32 5 GSB 5 141 - 3 3 179 - 36 ENTER
028 - 43 32 RTN 066 - 43, 5, 3 CF 3 104 - 43 32 RTN 142 - 10 / 180 - 2 2
029 - 1 1 067 - 22 8 GTO 8 105 - 42,21, 9 LBL 9 143 - 43 44 INT 181 - 6 6
030 - 32 6 GSB 6 068 - 42,21, 7 LBL 7 106 - 32 1 GSB 1 144 - 3 3 182 - 32 3 GSB 3
031 - 45 7 RCL 7 069 - 42, 4, 6 x<>6 107 - 45,10, 0 RCL/ 0 145 - 20 x 183 - 45 24 RCL (i)
032 - 43,30, 1 TEST 1 070 - 44 0 STO 0 108 - 43 44 INT 146 - 40 + 184 - 43 32 RTN
033 - 22 15 GTO E 071 - 45 2 RCL 2 109 - 2 2 147 - 44 4 STO 4 185 - 42,21, 3 LBL 3
034 - 32 7 GSB 7 072 - 1 1 110 - 32 0 GSB 0 148 - 8 8 186 - 40 +
035 - 42,21, 8 LBL 8 073 - 7 7 111 - 43,30, 1 TEST 1 149 - 45,30, 3 RCL- 3 187 - 44 25 STO I
036 - 9 9 074 - 32 3 GSB 3 112 - 43, 4, 2 SF 2 150 - 13 10^x 188 - 33 Rv
037 - 45 6 RCL 6 075 - 45 6 RCL 6 113 - 33 Rv 151 - 44 5 STO 5 189 - 43 32 RTN
038 - 43,30, 5 TEST 5 076 - 45,30, 0 RCL- 0 114 - 33 Rv 152 - 45 2 RCL 2

#67

Hi, Thomas.

Just out of curiosity: did you use any specific graph-generator program for the call-chain representation or it is just hand-made using a regular graphic program? I'm asking about this because the final view is very harmonic and easy to understand.

Cheers.

Luiz (Brazil)


#68

I used "dot" to create the image "sudoku.png":

dot -Tpng -o sudoku.png < sudoku.dot

You need a configuration file "sudoku.dot":


digraph config {
size ="10,13";
rankdir = "LR";
ranksep =.5;
nodesep = .1;
ratio = "fill";
subgraph SUDOKU {
node [style=filled, color="#bccec2", shape=rectangle, fontname=Verdana, fontsize=24];
edge [style=dashed];
label=GTO;
"A" -> ".0";
"A" -> "E";
"A" -> "C";
"A" -> ".9";
edge [style=solid];
label=GSB;
"1" -> "3";
"D" -> "5";
"D" -> "B";
"D" -> "B";
"D" -> "B";
"B" -> "1";
"B" -> "3";
"B" -> "5";
"7" -> "3";
"6" -> "0";
"6" -> "4";
"6" -> "4";
"4" -> "3";
"4" -> "0";
"A" -> "6";
"A" -> "7";
"A" -> "D";
"A" -> "6";
"A" -> "7";
"A" -> "7";
"A" -> "5";
"A" -> "9";
"A" -> "9";
"A" -> "9";
"A" -> "D";
"A" -> "6";
"A" -> "D";
"9" -> "1";
"9" -> "0";
"9" -> "5";
}
}

It's just part of my Linux distribution so I don't know where to get it.

HTH

Thomas


#69

Very interesting!

Thanks a lot!

In time: which Linux distribution/version do you use?

Luiz (Brazil)


#70

/etc/SuSE-release:

openSUSE 11.3 (i586)
VERSION = 11.3

That's rather old I guess but not my decision.

Cheers

Thomas

#71

Meanwhile I found this article: DOT language

It appears that dot is a filter based on Graphviz for drawing directed graphs.

#72

Thanks for taking the time with the code!! That's a cool visualization.

The labeling is a mess. As I mentioned earlier, I spent a long time trying to squeeze the code down and during that time I went through many iterations of labeling.

One consideration that I dealt with was looking at the number of GSB/GTOs targeting a particular label and then using that to determine whether to use a two byte label or a single byte one.

I am going to incorporate the various changes that have been suggested and that you have incorporated into the code below, but I was thinking of doing it step by step and running some timing tests as I do that. I am curious as to how much of a performance increase each of the changes gets me on a real world test rather than a benchmark.

I'm trying to get nonpareil working for me so I can try using it to run the timing tests. (All I'm interested in is relative times, not actual times, so an emulator seems fine for that).


#73

In your utility subroutine for setting flag matrix values (LBL B) the ENTER is dispensable:

    XY SWITCH   ; bring the row index back into x
; ENTER ; duplicate it
2 ; 26 is the starting register for the flag matrix
6

ENTER disables the stack lift so the consecutive 26 will just overwrite the value X of the stack.

Quote:
The labeling is a mess.

Rest assured, your code is well structured. The main problem I see is that you declared the subroutines before using them.

Kind regards

Thomas

#74

Just another suggestion for change(x):

Original:

LBL 6   
STO+ 1 ; x holds +1 or -1; Register 1 is the current index

RCL 1 ; get the current index (0 to 80)
9 ; integer divide by 9 to get the row index (0 to 8)
/ ; no integer divide on 15c so do a floating point divide
INT ; use the INT operator to finish of the integer divide
STO 2 ; register 2 contains the current row index

RCL 1 ; start with the current index again
9 ; we are going to do a modulus 9 to get the column index
GSB 0 ; do the modulus operation
STO 3 ; register 3 holds the current column index

Suggestion:

LBL 6   
STO+ 1 ; x holds +1 or -1; Register 1 is the current index

RCL 1 ; get the current index (0 to 80)
RCL 1 ; copy to calculate col later
9 ; integer divide by 9 to get the row index (0 to 8)
/ ; no integer divide on 15c so do a floating point divide
INT ; use the INT operator to finish of the integer divide
STO 2 ; register 2 contains the current row index

9 ;
* ;
- ; col = index - 9 * row
STO 3 ; register 3 holds the current column index

This avoids the call to mod(x, y).

Did you consider to just keep
row and col? The drawback is that you have to handle over- and underflow of col. But most of the time it's just to increment or decrement col. That might be faster than the calculation of row and col from index in all cases.


In the utility routine to extract the value at the current column from the matrix indirectly specified by x&y, I suggest to reverse the use of INT and FRAC:

LBL 4   
GSB 3 ; set the indirect register based on x & y
RCL (i) ; get the row from the matrix passed in
RCL/ 5 ; get the power of 10 value for the current position
; divide the complete row value by the power of 10 (equivalent to a shift down)
FRAC ; trim off the digits on the left of the decimal
1 ; now shift the first digit to the left of the decimal
0
*
INT ; cut off what is left on the right of the decimal

I assume you'd have to use 9 instead of 8 in the calculation of the power of 10 of current column index:

    9           ; now calculate the power of 10 of the current column
RCL- 3 ; Get the current column index into x
; subtract 9 because we want to go right to left for integer math to work

I haven't tested that though.


Let's see how we can get rid of the call to mod(x, y) in the subroutine 9:
LBL 9           ; note that this subroutine label is reused
GSB 1 ; get the appropriate row (x) from the flag matrix
RCL/ 0 ; get the temp register stuffed by the caller
; do the integer divide
INT
2 ; do a modulus 2 to see if the bit is set
/
FRAC
TEST 1 ; if the bit is set, set user flag 2

So in the end the subroutine mod(x, y) can be removed altogether. This saves more bytes than we added but most of all it saves an expensive subroutine call which is repeated often.

Kind regards

Thomas


#75

Thomas,

Thank you very much for your all your truly excellent suggestions. I am not sure you realize how much you helped!

I ran some instrumented profiles on the algorithm earlier today and it turns out that roughly 80% of the subroutine calls to mod() were coming from the main loop (LBL 9) and the other 20% from the change() subroutine (LBL 6).

Your suggestions for removing the calls to MOD in the LBL 9 subroutine immediately eliminated roughly 80% of the calls to MOD

Furthermore, the few bytes of memory saved by yours and others various suggestions allowed me to put in an optimization in the main loop that reduced the amount of work that it performs by roughly 40%. Previously, it would check whether a digit was found in the current row, column, and block with each of those checks setting a flag if it was found. Only after all three checks had been made, was there a conditional to check the flag and try the next digit if the one being tried was already found somewhere. With a few extra bytes of memory available, I was able to put that conditional check after each check, so, for example, it will not do the work to check the column or the block, if the digit was already found in the row.

I have not included your reordering of the subroutines yet. When I did the instrumented profiles I got data on how often each subroutine is called from each different place in the code. I am now pondering how to use that data to optimally arrange the subroutines that are called from multiple places.

I have not yet included your suggestion on changing the LBL 4 subroutine yet, but I will investigate it.

I am attaching a new annotated listing and a nonpareil state file which reflect the current state of the code I am working with.

One question: When I use nonpareil, I am always having to issue the DIM command. It never seems to be sticky. Is there something I can do about that or do I have to live with it?

Current annotated listing:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;	
; mod(x, y)
; Returns y % x. Note the RND appears necessary because of precision issues.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 0
/ ; y / x – start with the floating point division
LASTX ; put x back onto the stack
X<>Y ; switch x & y so that we can operate on y
FRAC ; take the fractional part of y
* ; and multiply it with the original x to get the modulus
RND ; have to round on the 15c to ensure it's an integer
RTN

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; getPart(x)
; Returns the integer representing the entire Xth row of the flag matrix
; Row numbers start at 0.
; returns value in x - input parameter x ends up in y
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 1
ENTER
ENTER ; duplicate the parameter so we can leave it for the caller
2 ; 26 is the starting register for the flag matrix
6
GSB 3 ; set the indirect register to the row specified by x
RCL (i) ; retrieve the entire row from the flag matrix
RTN

; subroutine to set the indirect register and remove the parameters from the stack
LBL 3
+ ; x+y is the memory offset we want
STO I ; put it in the indirect register
RDN ; get rid of the sum from the stack
RTN

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; setPow2(x)
; Sets the utility temp register to 2^(x-1). Leaves x in place.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 5
STO 0 ; store the input X in the temp register
1 ; we want to subtract 1 from the exponent
- ; calculate x-1
2 ; set the base as 2
X<>Y ; the y^x function wants x and y reversed
y^x ; calculate the value
X<>0 ; stuff the result in the temp register and restore the input x
RTN

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; setU(x)
; Sets/clears the flag matrix values (used to indicate that x is set at the current row/column/block)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL D
GSB 5 ; calculate the bit value (for the row) we need to set or clear in the existing values
RCL 2 ; Get the current row index into x
GSB B ; set the appropriate flag matrix values and calculate the new bit value for the column
RCL 3 ; Get the current column index into x
GSB B ; set the appropriate flag matrix values and calculate the new bit value for the block
RCL 4 ; Get the current block index into x

; MUST IMMEDIATELY FOLLOW PRECEEDING SUBROUTINE
; utility subroutine for setting flag matrix values

LBL B
GSB 1 ; get the current flag matrix row at index x

RCL 0 ; get the temp register which holds the bit value we will be setting
F? 3 ; flag 3 indicates if we are setting or clearing the flag
CHS ; if we are clearing, we will do a subtraction instead
+ ; add the values which is equivalent to a boolean operation either setting or clearing the flag

X<>Y ; bring the row index back into x
2 ; 26 is the starting register for the flag matrix
6
GSB 3 ; set the indirect register so that we are ready to store the new value
STO (i) ; store the new value into the flag matrix
RDN ; get rid of the new value to restore the stack
9 ; just for convenience, the next bit value will be 9 bits to the left
+ ; set the next bit index
GTO 5 ; calculate the value with that bit set
; we GTO instead of GSB and it will take care of returning for us

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; putA(x)
; Set the value x into the current row/column in the trial solution.
; Does it by subtracting the previous value and adding the new one.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 7
X <> 6 ; swap new value with register that holds current value
STO 0 ; store the old value in the temp register
RCL 2 ; Get the current row index into x
1 ; 17 is the starting register for the current trial solution
7
GSB 3 ; Set the indirect register
RCL (i) ; Get the current value for the entire row
RCL 6 ; Get the new value
RCL- 0 ; subtract the old value from the new value
RCL* 5 ; shift the power of 10 to the appropriate column
+ ; add to the old value
STO (i) ; store the new row value from where we got it
RTN

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; change(x)
; Increments or decrements the current position in the trial solution.
; Updates the registers containing the current row, column and block index,
; as well as those containing the power of 10 factor for the current column and others
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 6
STO+ 1 ; x holds +1 or -1; Register 1 is the current index

RCL 1 ; get the current index (0 to 80)
RCL 1 ; get the current index (0 to 80)
9 ; integer divide by 9 to get the row index (0 to 8)
/ ; no integer divide on 15c so do a floating point divide
INT ; use the INT operator to finish of the integer divide
STO 2 ; register 2 contains the current row index

9
*
- ; col = index – 9 * row
STO 3 ; register 3 contains the current column index

3 ; calculate the block index from the row & column indexes
/ ; block = INT(col/3) + INT(row/3)*3
INT ; it would be nice to save a couple of bytes in this section of code
RCL 2
3
/
INT
3
*
+
STO 4 ; register 4 holds the block index

8 ; now calculate the power of 10 of the current column
RCL- 3 ; Get the digit (from right) based on the column
10^X ; calculate the exponent
STO 5 ; save in register 5 which is used throughout the code

RCL 2 ; get the current row
1 ; 17 is the start register of the current trial solution
7
GSB 4 ; extract the value at the current column
STO 6 ; register 6 contains the current trial value at the current row/column

RCL 2 ; get the current row
8 ; 8 is the start register of the input data from the user
GSB 4 ; extract the value at the current column
STO 7 ; register 7 contains the starting value at the current row/column (0 if none)
RTN

; Utility routing to extract the value at the current column from the matrix indirectly specified by x&y
LBL 4
GSB 3 ; set the indirect register based on x & y
RCL (i) ; get the row from the matrix passed in
RCL / 5 ; shift the row to the right so that digit we are interested in is least significant
INT ; trim off the digits shifted to the right of the decimal
1 ; we will do a modulus 10 to extract the last digit
0
GTO 0 ; do the modulus operation
; do a GTO instead of GSB and it will return for us

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; main()
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL A
CF 2 ; make sure flag 2 is unset – CLR REG does not do this
CF 3 ; make sure flag 3 is unset – CLR REG does not do this
1 ; start with a index in register 1 of -1 (0 to 80)
CHS ; that way we can start with an increment operation
STO 1 ; and actually start at 0 where we want.

LBL 2 ; we are going to loop through all input values and set the flags to show they are set
1 ; go forward one position at a time
GSB 6 ; go to the next position in the trial solution and set all temp variables

RCL 7 ; get the starting input value at this row/col
GSB 7 ; set the value in the trial solution
RCL 7 ; get the starting input value again because the last call destroyed it
TEST 1 ; if it is > 0 then the user actually input a value for this row/col
GSB D ; set the flags to indicate this value is set

8 ; 80 is the upper bound of the indexes (9x9 = 80 = 0:80)
0
RCL 1 ; get the current index
TEST 6 ; if the current index hasn't reached 80
GTO 2 ; do the next value
1 ; reset the starting value
CHS ; to -1 as we did at the beginning of the program
STO 1 ; register 1 holds the current index

LBL E ; m1 ; main solution loop
8 ; when we reach the last index (80) we are done
0
RCL 1 ; register 1 holds the current index
TEST 5 ; see if we are at the end
RTN ; finished ; woohoo – we are done!
1 ; Go forward one spot
GSB 6 ; Do the position increment
RCL 7 ; get the starting input value at this row/col
TEST 1 ; if it's > 0, the user specified a value here
GTO E ; go forward, since this value was explicitly specified by the user and doesn't need to be solved for
GSB 7 ; Set the value in the trial solution

LBL 8 ; m2
9 ; we check the possible digits in order 1-9. If we are on nine, we've tried them all at this location
RCL 6 ; Get the current trial solution value
TEST 5 ; Check to see if it is 9
GTO C ; If it is, backup one step
1 ; We weren't at 9 yet, so increment the value by 1
+
GSB 7 ; Set the value in the trial solution

RCL 6 ; Get the current trial solution value
GSB 5 ; Calc 2^x-1 to get the bit mask
CF 2 ; Clear the flag we will use to see if the test value has already been used in row/column/block
RCL 2 ; Get the current row index into x
GSB 9 ; see if the current trial solution value has already been used in the row
F? 2 ; If the number has been used in the block, try the next value
GTO 8
RCL 3 ; Get the current column index into x
GSB 9 ; see if the current trial solution value has already been used in the column
F? 2 ; If the number has been used in the block, try the next value
GTO 8
RCL 4 ; Get the current block index into x
GSB 9 ; see if the current trial solution value has already been used in the block
F? 2 ; If the number has been used in the block, try the next value
GTO 8
RCL 6 ; Get the current trial solution value
GSB D ; set the flags to indicate this value is set
GTO E ; move on to the next position in the puzzle

LBL C ; m3 ; Come here to back up to the previous position
1 ; We will go one spot backwards
CHS
GSB 6 ; Set the new current position and all temp values
TEST 1 ; the previous call leaves the starting value at this position in X
GTO C ; if that value is > 0, meaning the value was set, backup one more spot
RCL 6 ; Get the current trial solution value

SF 3 ; Set the 3 flag to indicate we will be clearing the flag matrix bits, not setting them.
GSB D ; Set/Clear the flag matrix bits
CF 3 ; unset the 3 flag
GTO 8 ; check the next digit

LBL 9
GSB 1 ; get the appropriate row (x) from the flag matrix
RCL/ 0 ; get the temp register stuffed by the caller
INT
2 ; if the bit is set, there will be a remainder when dividing by two
/
FRAC
TEST 1 ; if the bit is set, set user flag 2 which is used as a return value
SF 2
RDN ; move the stack down to prepare the caller for the next call
RDN ; move the stack down to prepare the caller for the next call
9 ; the bits flags for row/column/block are shifted by 9 from each other
+ ; calculates the appropriate bit offset for the next call
GTO 5 ; calc 2^x-1 to get the bit mask
; do a GTO instead of GSB and it will return for us

Nonpareil state file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE state SYSTEM "nonpareil.dtd">
<state version="1.00" model="15C" platform="voyager" arch="nut">
<ui/>
<chip name="Nut">
<registers>
<reg name="a" data="00000000000000"/>
<reg name="b" data="00000000000000"/>
<reg name="c" data="00000000000eae"/>
<reg name="m" data="00000000000000"/>
<reg name="n" data="00000000000000"/>
<reg name="g" data="09"/>
<reg name="p" data="c"/>
<reg name="q" data="4"/>
<reg name="q_sel" data="0"/>
<reg name="fo" data="00"/>
<reg name="s" data="0800"/>
<reg name="pc" data="0000"/>
<reg name="stack" index="0" data="009b"/>
<reg name="stack" index="1" data="0042"/>
<reg name="stack" index="2" data="0000"/>
<reg name="stack" index="3" data="0000"/>
<reg name="decimal" data="0"/>
<reg name="carry" data="0"/>
<reg name="awake" data="0"/>
<reg name="pf_addr" data="00"/>
<reg name="ram_addr" data="007"/>
<reg name="active_bank" index="0" data="0"/>
<reg name="active_bank" index="1" data="0"/>
<reg name="active_bank" index="2" data="0"/>
<reg name="active_bank" index="3" data="0"/>
<reg name="active_bank" index="4" data="0"/>
<reg name="active_bank" index="5" data="0"/>
<reg name="active_bank" index="6" data="0"/>
<reg name="active_bank" index="7" data="0"/>
<reg name="active_bank" index="8" data="0"/>
<reg name="active_bank" index="9" data="0"/>
<reg name="active_bank" index="a" data="0"/>
<reg name="active_bank" index="b" data="0"/>
<reg name="active_bank" index="c" data="0"/>
<reg name="active_bank" index="d" data="0"/>
<reg name="active_bank" index="e" data="0"/>
<reg name="active_bank" index="f" data="0"/>
</registers>
</chip>
<chip name="Voyager LCD">
<registers>
<reg name="enable" data="1"/>
<reg name="blink" data="0"/>
</registers>
</chip>
<memory as="ram">
<loc addr="000" data="00000000000000"/>
<loc addr="001" data="00000000000000"/>
<loc addr="002" data="00000000000000"/>
<loc addr="003" data="00000000000000"/>
<loc addr="004" data="00000000000000"/>
<loc addr="005" data="00000000000008"/>
<loc addr="006" data="0000000000000c"/>
<loc addr="007" data="00000000000eae"/>
<loc addr="008" data="00000000000000"/>
<loc addr="009" data="2faf8befbe2280"/>
<loc addr="00a" data="bef282bcbcaf80"/>
<loc addr="010" data="00000000000000"/>
<loc addr="011" data="00000000000000"/>
<loc addr="012" data="00000000000000"/>
<loc addr="013" data="00000000000000"/>
<loc addr="014" data="00000000000000"/>
<loc addr="015" data="c0d2d2d2d2d2d2"/>
<loc addr="016" data="00000000000000"/>
<loc addr="016" data="000000000003e1"/>
<loc addr="017" data="00000000000000"/>
<loc addr="018" data="00000000000000"/>
<loc addr="019" data="00000000000000"/>
<loc addr="01a" data="00000000a00000"/>
<loc addr="0c0" data="00000000000000"/>
<loc addr="0c1" data="00000000000000"/>
<loc addr="0c2" data="00000000000000"/>
<loc addr="0c3" data="00000000000000"/>
<loc addr="0c4" data="00000000000000"/>
<loc addr="0c5" data="00000000000000"/>
<loc addr="0c6" data="00000000000000"/>
<loc addr="0c7" data="00000000000000"/>
<loc addr="0c8" data="00000000000000"/>
<loc addr="0c9" data="00000000000000"/>
<loc addr="0ca" data="00000000000000"/>
<loc addr="0cb" data="00000000000000"/>
<loc addr="0cc" data="00000000000000"/>
<loc addr="0cd" data="00000000000000"/>
<loc addr="0ce" data="00000000000000"/>
<loc addr="0cf" data="00000000000000"/>
<loc addr="0d0" data="00000000000000"/>
<loc addr="0d1" data="00000000000000"/>
<loc addr="0d2" data="00000000000000"/>
<loc addr="0d3" data="00000000000000"/>
<loc addr="0d4" data="00000000000000"/>
<loc addr="0d5" data="00000000000000"/>
<loc addr="0d6" data="00000000000000"/>
<loc addr="0d7" data="00000000000000"/>
<loc addr="0d8" data="00000000000000"/>
<loc addr="0d9" data="00000000000000"/>
<loc addr="0da" data="00000000000000"/>
<loc addr="0db" data="00000000000000"/>
<loc addr="0dc" data="00000000000000"/>
<loc addr="0dd" data="00000000000000"/>
<loc addr="0de" data="00000000000000"/>
<loc addr="0df" data="00000000000000"/>
<loc addr="0e0" data="00000000000000"/>
<loc addr="0e1" data="0000000015faf9"/>
<loc addr="0e2" data="c4c432ff71a3fd"/>
<loc addr="0e3" data="f2ebe0cf210918"/>
<loc addr="0e4" data="43ff2d33ff361c"/>
<loc addr="0e5" data="7126c3f10c1e2d"/>
<loc addr="0e6" data="361852ff293418"/>
<loc addr="0e7" data="52ff29331852ff"/>
<loc addr="0e8" data="293242ff253627"/>
<loc addr="0e9" data="faf11c7536f908"/>
<loc addr="0ea" data="271e713726f1b2"/>
<loc addr="0eb" data="7531f0f80e41c3"/>
<loc addr="0ec" data="f1127631f0f82d"/>
<loc addr="0ed" data="7137273726f102"/>
<loc addr="0ee" data="41c3f143ff42ff"/>
<loc addr="0ef" data="0a10f0f1ebe5cf"/>
<loc addr="0f0" data="862304b24724f8"/>
<loc addr="0f1" data="324624f7f13245"/>
<loc addr="0f2" data="cca3cff844fafc"/>
<loc addr="0f3" data="f3ebfdf332ebfd"/>
<loc addr="0f4" data="f343fbfcf942eb"/>
<loc addr="0f5" data="fdf9313181df06"/>
<loc addr="0f6" data="b296fac5cfa0cf"/>
<loc addr="0f7" data="368623f7f13240"/>
<loc addr="0f8" data="56ef0715faf9c4"/>
<loc addr="0f9" data="9623f6f2c5fac3"/>
<loc addr="0fa" data="53ff30210b342b"/>
<loc addr="0fb" data="332b32250db280"/>
<loc addr="0fc" data="cdc5f2fbf14005"/>
<loc addr="0fd" data="b2c497fa03b286"/>
<loc addr="0fe" data="23f6f2c1c101b2"/>
<loc addr="0ff" data="b5fca3c5b1fd00"/>
</memory>
</state>


#76

Quote:
One question: When I use nonpareil, I am always having to issue the DIM command. It never seems to be sticky. Is there something I can do about that or do I have to live with it?

If you don't specify the option -new when compiling the program with vcomp the nst-file that is already there is used instead of the template of the script. So you could save the state after changing the dimension and use this file as template. I don't know whether that works properly and can't test that at the moment.

cf. Effective Computer-aided Calculator Programming - Part 1 - Voyager

Quote:
When compiling you have the option of creating a new out-of-the-box state file or inserting your code into an existing state file preserving the contents of the stack and storage registers.

NOTE: Inserting into an existing state file is experimental. Backup your data first.

NOTE: New state files are not exactly "out-of-the-box", the 11C, 12C, and 15C are out-of-the-box + FIX 9.


I don't know where the DIM-information is stored within the nst-file. It could be "pf_addr" or "ram_addr" but I might be completely wrong. You could try to find out by comparing different nst-files after only changing the dimension and modify the template that is used for the 15C model. Or if you prefer add an additional option to specify that.

HTH

Thomas

#77

Meanwhile I found out how to do it. Replace line 3036 of vcomp:

<loc addr="015" data="c0d2d2d2d2d2d2"/>
with:
<loc addr="015" data="c0e1e1e1e1e1e1"/>

The formula is: e116 = bf16 + 3410.

Cheers

Thomas

#78

Quote:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;	
; mod(x, y)
; Returns y % x. Note the RND appears necessary because of precision issues.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 0
/ ; y / x – start with the floating point division
LASTX ; put x back onto the stack
X<>Y ; switch x & y so that we can operate on y
FRAC ; take the fractional part of y
* ; and multiply it with the original x to get the modulus
RND ; have to round on the 15c to ensure it's an integer
RTN

This will not always work. For instance,

1234567891 ENTER 123 GSB 0    -->    40.59000000     (FIX 8)   or
41. (FIX 0)

Correct answer: 40.00000000

For other applications, I would suggest either

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;	
; mod(x, y)
; Returns y % x.
; (x, y: positive integers)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 0
X<>Y ;
STO 0 ;
X<>Y ;
/ ;
LASTX ;
X<>Y ;
INT ;
* ;
RCL- 0 ;
CHS ;
RTN
or
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;	
; mod(x, y)
; Returns y % x.
; (x, y: positive integers)
; (t is not preserved)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
LBL 0
X<>Y ;
ENTER ;
R^ ;
R^ ;
X<>Y ;
Rv ;
/ ;
LASTX ;
X<>Y ;
INT ;
* ;
- ;
RTN

Regards,

Gerson.


#79

Gerson,

Thanks for pointing that out. I had stuck that RND in a long time ago without really thinking it through.

In my quest for speed, I have actually removed the need for a mod() subroutine entirely. However it's good to know that the code was flawed so that I don't reuse it anywhere.

Marcel

#80

Out of curiosity, I ran some tests. My main program looped 1000 times. Each loop does a GSB 2 and label 2 simply returns. Then I inserted different numbers of ENTER instructions between the main program and LBL 2 to increase the distance between the two.

I ran the program in a 15C LE and timed the execution with the second hand of my watch. To my surprise, there was essentially no difference in execution time, whether LBL 2 was 4 instructions away or 75, the program always took about 4 seconds to run.

I suspect that 15C reserves room in the GSB instruction to store the actual location of the target label. The first time you run the program, it probably stores the location so subsequent GSB's will jump right to the correct spot.

This is total conjecture of course.


#81

Thanks! I had been planning on running the same kind of tests when I found a bit of spare time.

This is really interesting (and, unfortunately, not particularly encouraging).

If your 1000 item loop has a similar overhead to the one that Namir used in his testing in his great article on performance (http://tinyurl.com/9xtok7k) then the loop took about 2.5 seconds and the GSB & RTN took the remaining 1.5 seconds.

I may have to spend my energies on algorithmic optimization - which is quite difficult given the tight memory. In fact, one could argue that I went through a process of intentional algorithmic deoptimization just to get the code small enough....

#82

I just ran a test which did not involve a GSB, but a GTO and it did seem to show that there is a speed difference depending on the distance.

The calculator had a 200 step program in it. I deleted 10 steps and inserted the following at the very beginning of memory:

LBL 1
1
STO+ 0
GTO 1

I did a CLEAR REG followed by a GSB 1 and I let it run for 30 seconds and then did a R/S and examined register 0.

I did a couple runs, averaging the results which were consistent and averaged about 215 loops per second.

Because I had inserted the program at the beginning of memory, there were about 190 program steps including about 15 labels after this loop.

I then deleted all the program steps after the above loop, so that it was the only thing in memory and reran it several times.

Once again, the results were relatively consistent but showd a speed of about 565 loops per second.

This seems to imply that the distance did have some impact on the performance. It also implies that the if you have any loops, the larger your program, the slower all your loops will be...

Edited: 29 Oct 2012, 8:16 p.m.


#83

Hi.

Your results agree with the way the O.S. is described as for searching labels, i.e., always forward. No compilation, as it happens with the HP41. I am not sure if you know about it, but in the HP41 the first execution of a program is the slowest one if it has many backward jumps, as the one you show here. At the first jump, the O.S. identifies the direction and # of bytes and stores it in the GTO itself (modifies the bytes composing the GTO instruction). There are some particularities as for short-range and long-range jump, but this is the main 'modus operandi'.

I liked your approach for measuring the calculator speed.

Cheers.

Luiz (Brazil)


#84

Quote:
At the first jump, the O.S. identifies the direction and # of bytes and stores it in the GTO itself (modifies the bytes composing the GTO instruction).

It occurs to me that this has implications for programs stored in ROM and, to a lesser extent, programs stored on cards or tape. If the program is in ROM then the OS can't modify the GTO instruction, so you'd want all those GTO/GSB instructions to be modified already before copying them to ROM. But how would OS know which GTO/GSB targets are part of the program being copied and thus safe to modify?

If programs in ROM don't have the GTO/GSB instructions modified, then they might run faster if copied to RAM first.

Does anyone know how this works?

Dave


#85

Hi.

You are correct in your assumptions, but consider that the HP41-based, user-coded programs stored in ROM modules cannot be changed once recorded, so they are pre-compiled and the distance (# of bytes and direction) from GTO/XEQ (GSB) to their corresponding LBL's is already stored accordingly. So they are executed in their 'best speed' anytime.

This is not true with programs stored in recordable media (magnetic cards or tape). Even if you have your program already compiled in RAM (after being executed as may times as needed to compute all distances OR with specific programs to compile them), these data are lost when they are copied into cards or magnetic tape.

Cheers.

Luiz (Brazil)


Possibly Related Threads…
Thread Author Replies Views Last Post
  HP Prime questions: I/O and Meta programming Andy Gryc 2 1,564 10-31-2013, 11:22 PM
Last Post: Andy Gryc
  HP-41 questions x34 4 1,787 09-28-2013, 05:15 PM
Last Post: x34
  HP Prime two questions about function app dg1969 0 888 09-27-2013, 11:14 AM
Last Post: dg1969
  HP Prime Questions Clayton Workman 23 6,159 09-25-2013, 11:19 AM
Last Post: Clayton Workman
  Linear Optimization/Programming fhub 14 4,306 08-04-2013, 06:27 AM
Last Post: fhub
  Questions about building a RAM card for hp 48 Waon Shinyoe (China) 0 1,147 07-09-2013, 09:53 PM
Last Post: Waon Shinyoe (China)
  HP-71b questions jerome ibanes 6 2,305 07-01-2013, 05:03 PM
Last Post: Namir
  HP 41 Mcode related Questions Michael Fehlhammer 4 2,056 05-10-2013, 07:09 PM
Last Post: Michael Fehlhammer
  An unusual HP 71B - questions Doug (NYC) 8 2,765 01-16-2013, 08:14 AM
Last Post: Doug (NYC)
  HP-15C/DM-15CC Memory questions x34 2 1,545 07-24-2012, 06:28 PM
Last Post: Jeroen Van Nieuwenhove

Forum Jump: