On most hardware, multiplication is significantly faster than division. So if you have to divide many numbers by the same divisor, it is usually faster to determine the reciprocal of the divisor once and multiply the numbers with the reciprocal. For integers, this is tricky, so Gforth packages this work into the words described in this section.
Let's start with an example: You want to divide all elements of an array of cells by the same number n. A straightforward way to implement this is:
: array/ ( addr u n -- )
-rot cells bounds u+do
i @ over / i !
1 cells +loop
drop ;
A more efficient version looks like this:
: array/ ( addr u n -- )
{: | reci[ staged/-size ] :}
reci[ /f-stage1m
cells bounds u+do
i @ reci[ /f-stage2m i !
1 cells +loop ;
This example first creates a local buffer reci[ with size
staged/-size for storing the reciprocal data. Then
/f-stage1m computes the reciprocal of n and stores it in
reci[. Finally, inside the loop /f-stage2m uses the
data in reci[ to compute the quotient.
There are some limitations: Only positive divisors are supported for
/f-stage1m; for u/-stage1m you can use a divisor of 2 or
higher. You get an error if you try to use an unsupported divisor.
You must initalize the reciprocal buffer for the floored second-stage
words with /f-stage1m and for the unsigned second-stage words
with u/-stage1m. You must not modify the reciprocal buffer
between the first stage and the second stage; basically, don't treat
it as a memory buffer, but as something that is only mutable by the
first stage; the point of this rule is that future versions of Gforth
will not consider aliasing of this buffer.
The words are:
staged/-size ( – u ) gforth-1.0 “staged-slash-size”
Size of buffer for u/-stage1m or /f-stage1m.
/f-stage1m ( n addr-reci – ) gforth-1.0 “slash-f-stage1m”
Compute the reciprocal of n and store it in the buffer
addr-reci of size staged/-size. Throws an error if
n<1.
/f-stage2m ( n1 a-reci – nquotient ) gforth-1.0 “slash-f-stage2m”
Nquotient is the result of dividing n1 by the divisor represented
by a-reci, which was computed by /f-stage1m.
modf-stage2m ( n1 a-reci – umodulus ) gforth-1.0 “mod-f-stage2m”
Umodulus is the remainder of dividing n1 by the divisor represented
by a-reci, which was computed by /f-stage1m.
/modf-stage2m ( n1 a-reci – umodulus nquotient ) gforth-1.0 “slash-mod-f-stage2m”
Nquotient is the quotient and umodulus is the remainder of
dividing n1 by the divisor represented by a-reci, which was
computed by /f-stage1m.
u/-stage1m ( u addr-reci – ) gforth-1.0 “u-slash-stage1m”
Compute the reciprocal of u and store it in the buffer
addr-reci of size staged/-size. Throws an error if
u<2.
count trailing zeros in binary representation of x
u/-stage2m ( u1 a-reci – uquotient ) gforth-1.0 “u-slash-stage2m”
Uquotient is the result of dividing u1 by the divisor represented
by a-reci, which was computed by u/-stage1m.
umod-stage2m ( u1 a-reci – umodulus ) gforth-1.0 “u-mod-stage2m”
Umodulus is the remainder of dividing u1 by the divisor represented
by a-reci, which was computed by u/-stage1m.
u/mod-stage2m ( u1 a-reci – umodulus uquotient ) gforth-1.0 “u-slash-mod-stage2m”
Uquotient is the quotient and umodulus is the remainder of
dividing u1 by the divisor represented by a-reci, which was
computed by u/-stage1m.
Gforth currently does not support staged symmetrical division.
You can recover the divisor from (the address of) a reciprocal with
staged/-divisor @:
staged/-divisor ( addr1 – addr2 ) gforth-1.0 “staged-slash-divisor”
Addr1 is the address of a reciprocal, addr2 is the address containing the divisor from which the reciprocal was computed.
This can be useful when looking at the decompiler output of Gforth: a division by a constant is often compiled to a literal containing the address of a reciprocal followed by a second-stage word.
The performance impact of using these words strongly depends on the
architecture (does it have hardware division?) and the specific
implementation (how fast is hardware division?), but just to give you
an idea about the relative performance of these words, here are the
cycles per iteration of a microbenchmark (which performs the mentioned
word once per iteration) on two AMD64 implementations; the norm
column shows the normal division word (e.g., u/), while the
stg2 column shows the corresponding stage2 word (e.g.,
u/-stage2m):
Skylake Zen2
norm stg2 norm stg2
41.3 15.8 u/ 35.2 21.4 u/
39.8 19.7 umod 36.9 25.8 umod
44.0 25.3 u/mod 43.0 33.9 u/mod
48.7 16.9 /f 36.2 22.5 /f
47.9 20.5 modf 37.9 27.1 modf
53.0 24.6 /modf 45.8 35.4 /modf
227.2 u/stage1 101.9 u/stage1
159.8 /fstage1 97.7 /fstage1