So if anyone wants to have a go at it on the Falcon (or another retro box with a RISC chip inside), here's how the mapper works:
First you need to build a table of coefficients for quadratic equations, offline.
This involves solving and storing A,B,C terms in order to later query y=(Ax^2+Bx+C) for any (x), where (x) is essentially the scene (z) term for a given pixel, and the result (y) approximates (1/z). I'm using a table of 1024 equations but you can optimize in either direction to save space or gain range/accuracy.
It should be possible to use linear equations if the table is big enough, and perhaps save some cycles while still getting a decent(ish) approximation - but for the Falcon's DSP it's not much trouble to just do it properly.
Note that the table stores equations - triplets, not single values. This means you're performing a lookup on a set of curves - not single value samples.
Generating the table is a bit hard - it involves performing a best-fit on equations using a set of sample points on each curve. I use subdivision but random sampling may work. For most of the table entries, the same 3 points will converge to the best fit, but for entries near the ends of the table the choices will move due to clamping effects enforced on A,B,C for the legal fixedpoint range. It's important to be aware of this detail or you'll get stuck. There are a few gotchas involved in generating the table and due to the nature of best-fit algorithms, you can end up with a broken solver that looks like it is nearly working - beware.
Despite those problems, It's relatively easy to understand/test in floating point because A,B,C can be kept in their natural range. A fixed-point version however is much more difficult since the terms need to be normalized to maximize use of available bits, and for optimal precision they must be differently normalized. This part is a challenge but it can be shown to (just) work with as few as 23 bits + sign for all source terms.
2) Implement the runtime part, which efficiently performs y=(Ax^2+Bx+C).
For this to be efficient, you really need a RISC device with a multiply-accumulate and fast shifting capability. Or at the very least, a very fast multiplier and careful coding. Unfortunately the Falcon's DSP is terrible at shifting and does present some problems of its own here, getting it to work fast. Left as an exercise for the reader
The transform looks a bit like this:
normbits = 23; // for Falcon's 24/48 DSP accumulator - 1 bit auto denorm on this device. should be 32 for a 64bit RISC accumulator.
qbits = 13; // 10 table bits + 13 precision bits == 23
tbits = 8; // arbitrary fraction retained for texture u,v, multiply precision
// during setup, get z, uz, vz normalized into fixedpoint range
z *= (int)(1<<normbits);
x = (int)pixel_z;
ix = (x >> qbits);
A = qtab[ix].A;
B = qtab[ix].B;
Q = qtab[ix].C; // C already shifted by (tbits)
Q += (B*x) >> (normbits-tbits);
Q += (((A*A)>>normbits) * x) >> (normbits-tbits);
return Q;
On the DSP it looks a bit like this (not optimized, not scheduled and missing some details):
Code:
; x0 z
..
schedule these moves elswhere, fuse across >1 iteration
..
move y:qtab_ptr,a
move y:rshft12,y0
mac x0,y0,a y:c_FFFFFE,y0 ; &qtab[(X>>12)]
and y0,a ; &qtab[(X>>13)*2]
move a,r4
..
schedule u,v part here, overlap x/y access if table overlaps low memory etc. etc.
..
move x:(r4),b ; C
mpy x0,x0,a y:(r4)+,y0 ; B
mac y0,x0,b a,x1 y:(r4)+,y0 ; A
mac y0,x1,b ;
..
move b,x0 ; 1/z
...
now multiply x0 (1/z) by uz,vz and combine into texture address. uz,vz should be pre-normalized.