Re: Quake Engine on the Jaguar (Scrummy posts)
Posted: Tue Nov 11, 2014 2:38 am
And here we go. Preliminary texture mapping tests:
dml wrote: So if anyone wants to have a go at it on the Falcon (or another retro box with a RISC chip inside), here's how the mapper works:
1) First you need to build a table of coefficients for quadratic equations, offline.
This involves solving and storing A,B,C terms in order to later query y=(Ax^2+Bx+C) for any (x), where (x) is essentially the scene (z) term for a given pixel, and the result (y) approximates (1/z). I'm using a table of 1024 equations but you can optimize in either direction to save space or gain range/accuracy.
It should be possible to use linear equations if the table is big enough, and perhaps save some cycles while still getting a decent(ish) approximation - but for the Falcon's DSP it's not much trouble to just do it properly.
Note that the table stores equations - triplets, not single values. This means you're performing a lookup on a set of curves - not single value samples.
Generating the table is a bit hard - it involves performing a best-fit on equations using a set of sample points on each curve. I use subdivision but random sampling may work. For most of the table entries, the same 3 points will converge to the best fit, but for entries near the ends of the table the choices will move due to clamping effects enforced on A,B,C for the legal fixedpoint range. It's important to be aware of this detail or you'll get stuck. There are a few gotchas involved in generating the table and due to the nature of best-fit algorithms, you can end up with a broken solver that looks like it is nearly working - beware.
Despite those problems, it's relatively easy to understand/test in floating point because A,B,C can be kept in their natural range. A fixed-point version, however, is much more difficult since the terms need to be normalized to maximize use of available bits, and for optimal precision they must be differently normalized. This part is a challenge but it can be shown to (just) work with as few as 23 bits + sign for all source terms.
2) Implement the runtime part, which efficiently performs y=(Ax^2+Bx+C).
For this to be efficient, you really need a RISC device with a multiply-accumulate and fast shifting capability. Or at the very least, a very fast multiplier and careful coding. Unfortunately the Falcon's DSP is terrible at shifting, which presents some problems of its own in getting this to run fast. Left as an exercise for the reader.
The transform looks a bit like this:
normbits = 23; // for Falcon's 24/48 DSP accumulator - 1 bit auto denorm on this device. should be 32 for a 64bit RISC accumulator.
qbits = 13; // 10 table bits + 13 precision bits == 23
tbits = 8; // arbitrary fraction retained for texture u,v, multiply precision
// during setup, get z, uz, vz normalized into fixedpoint range
z *= (int)(1<<normbits);
x = (int)pixel_z;
ix = (x >> qbits);
A = qtab[ix].A;
B = qtab[ix].B;
Q = qtab[ix].C; // C already shifted by (tbits)
Q += (B*x) >> (normbits-tbits);
Q += (((x*x)>>normbits) * A) >> (normbits-tbits);
On the DSP it looks a bit like this (not optimized, not scheduled and missing some details):
Code:
; x0 z
; .. schedule these moves elsewhere, fuse across >1 iteration ..
    move    y:qtab_ptr,a
    move    y:rshft12,y0
    mac     x0,y0,a     y:c_FFFFFE,y0       ; &qtab[(X>>12)]
    and     y0,a                            ; &qtab[(X>>13)*2]
    move    a,r4
; .. schedule u,v part here, overlap x/y access if table overlaps low memory etc. etc. ..
    move    x:(r4),b                        ; C
    mpy     x0,x0,a     y:(r4)+,y0          ; B
    mac     y0,x0,b     a,x1    y:(r4)+,y0  ; A
    mac     y0,x1,b                         ; ..
    move    b,x0                            ; 1/z
Now multiply x0 (1/z) by uz,vz and combine into texture address. uz,vz should be pre-normalized.
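For anyone who wants to try the idea in floating point before tackling the fixed-point normalization, here is a minimal sketch in C. It is not dml's solver: instead of a best-fit it simply passes a parabola through three equally spaced samples of 1/z in each segment, and the names `Quad`, `build_qtab`, `approx_recip`, `texture_u` and the near limit `Z_MIN` are all assumptions made for illustration. It assumes z has already been normalized into [Z_MIN, 1.0).

```c
#include <assert.h>

#define QTAB_SIZE 1024       /* number of quadratic segments, as in the post */
#define Z_MIN (1.0 / 16.0)   /* assumed near limit; entries below it are clamped */

typedef struct { double A, B, C; } Quad;   /* one equation per table entry */

static Quad qtab[QTAB_SIZE];

static double recip(double z) { return 1.0 / z; }

/* Offline step: fit each segment with the parabola through three
 * equally spaced samples of 1/z (simple interpolation, not a best-fit). */
void build_qtab(void)
{
    const double h = 0.5 / QTAB_SIZE;      /* half a segment width */
    for (int i = 0; i < QTAB_SIZE; i++) {
        double x0 = (double)i / QTAB_SIZE;
        if (x0 < Z_MIN) {                  /* clamp: constant below the near limit */
            qtab[i].A = 0.0;
            qtab[i].B = 0.0;
            qtab[i].C = recip(Z_MIN);
            continue;
        }
        double x1 = x0 + h;
        double y0 = recip(x0), y1 = recip(x1), y2 = recip(x0 + 2.0 * h);
        double A = (y0 - 2.0 * y1 + y2) / (2.0 * h * h);
        double B = (y1 - y0) / h - A * (x0 + x1);
        qtab[i].A = A;
        qtab[i].B = B;
        qtab[i].C = y0 - A * x0 * x0 - B * x0;
    }
}

/* Runtime step: pick the equation for this z and evaluate A*z^2 + B*z + C. */
double approx_recip(double z)
{
    assert(z >= Z_MIN && z < 1.0);
    const Quad *q = &qtab[(int)(z * QTAB_SIZE)];
    return (q->A * z + q->B) * z + q->C;   /* Horner form */
}

/* Perspective-correct texture coordinate: u = (u/z) * z_term lookup. */
double texture_u(double uz, double z)
{
    return uz * approx_recip(z);
}
```

Call `build_qtab()` once, then `approx_recip(z)` per pixel. The lookup selects a curve, not a single sample, which is why the result stays smooth inside each table entry.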
Mike if you are reading this thread, do you have any idea how on the PSX version of Quake II they sidestepped the complex geometry lighting issue? DML hazarded a guess:
MikeFulton wrote: Interesting thing... that thread says the PSX has dedicated 3D hardware but the Jaguar didn't. That's actually not really true. The PSX does not actually have what most people would consider to be "dedicated 3D hardware". At least, not in a way that would distinguish it from Jaguar.
a31chris wrote: While I was referencing his name I found this interesting thread:
http://atariage.com/forums/topic/43179- ... ke-engine/
The main processor in the PSX is a MIPS R3000, which is a basic, general purpose RISC processor otherwise known for being used in early Silicon Graphics workstations. Sony added a co-processor chip (the "GTE") that implements matrix math functions. This is used for doing your basic 3D graphics transformations on polygon vertices but has nothing to do with the actual pixel pushing. By comparison, the Jaguar GPU has similar matrix math instructions, so the two machines are on fairly even ground at that point.
The rendering loop of a PSX game is basically this:
I guess it's probably quite clever and PSX specific - maybe they subdivide the faces efficiently and vertex-light it (although it doesn't appear that way to me, I didn't notice mach banding) or maybe they use the hardware triangle engine to compose lightmaps with textures in VRAM on the fly.
What map is Fatal1 startpoint on Quake II?
DML wrote: FPS on 'fatal1' startpoint rose from 12 to over 16fps.
I'm going to keep hacking at this until I get completely stuck for ways to speed it up in TC mode, and then will switch back to texturing performance.
https://www.youtube.com/watch?v=2KbDM-Bw80U
DML wrote: I compiled a new video last night, focused mainly on outdoor or sprawling maps / big geometry / angled, non-boxy scenery. Deep stress testing for recent optimization work. (I think my Falcon actually squeaked - so much violence after 15 years asleep)
There are no textures in this one - it is flat-fill @ 320x160 / 16bit TC. This is the format I use for performance testing the engine code.
Anyway I think it's starting to run up against some hardware limits at 16/32, or at least design limits for what I've done with the program. I'm sure there are still ways to optimize it, but it's getting harder and taking longer with each try. The last optimization I tested was nasty, complicated and didn't really make much difference in the end... 1-2%. So I'm going to finally stop with this and fix the newly added bugs before looking at textures again.
Trivia: ARMA5 is a map I used to play at lunchtimes while I was working on PC games. I had a 450MHz PII (P3?) at the time, with an early NV graphics card. It's quite heavy going but the old bird just about copes.
Quick test demonstrating working transparency without a z-buffer, and z-clipping of transparencies with exaggerated nearplane.
and then...
DML wrote: After spending a bit of time with the texturing routines I had two minor breakthroughs which will probably result in better texture fillrate.
So I expect the following will likely become possible soon:
- 192x120 resolution with textures (fullscreen+overscan chunky mode on RGB)
- 16bit surfaces (a bit like BadMooD - better lighting and less fuzz. currently all texture and lighting is 8bit)
- colour lightmaps, like HW/OpenGL (software Q2 used monochrome lighting)
This guy is amazing.
Another speedup is on the way. After fixing most of the correctness issues and getting a stable render at all z-distances, I tried dropping the span arithmetic from 48bit to 24bit effective, and got nearly the same result. So there will soon be a 6x reduction in the amount of code needed to set up each span, and a 50% reduction in data transmitted to DSP per face - which is nice.
Imagine him or someone else doing these optimizations on the 32x or Jaguar where the caches are much larger and things won't get 'bumped out' so easily.
I made a few more simple improvements that actually got rid of all of the padding nops and doubled the size of the jumptower. The impact of this on speed of complex scenery is actually quite good - it spends more time drawing pixels and less flyback time on very short spans.
For now it's partly a negative trade because it meant pushing some other code out of DSP fast memory - slowing other areas down - and causing more cache misses on the CPU side (bad!) but these can be bought back later with other changes. The fact that it is faster everywhere despite this is a good sign.
https://www.youtube.com/watch?v=LHsmzo0 ... e=youtu.be
Vid #1 mainly covers continuity with big data. The camera is no longer stuck in one place carefully viewing the same stuff.
https://www.youtube.com/watch?v=nk3UMXWgiVo
Vid #2 covers breadth - different kinds of environment and complicated scenery.
DML wrote: So the other thing I had been working on is an alternate way to perform square-root operations for realtime 3D.
These are very expensive to perform via 882 FPU and even more so using algorithms on the CPU. Tables help but need a lot of space to make a real difference, which makes them useless on a DSP 56k with very little RAM. The excellent Carmack/3DLabs sqrt() trick - which exploits floating point bit representation - deserves a mention. But it requires an FPU and is therefore still expensive (and limiting) on a Falcon, and useless on the DSP.
(I will point out that square-root is highly valuable for 3D graphics. Having access to a fast sqrt() makes a real difference to what is possible!)
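For readers who haven't met the floating-point bit trick mentioned above, here is a minimal C sketch of the widely circulated version. It is not DML's method: it approximates 1/sqrt(x) by manipulating the IEEE 754 bit pattern of a 32-bit float and then refining with one Newton-Raphson step. The constant `0x5f3759df` is the one commonly quoted for this trick, and the helper `normalize3` is a hypothetical name added to show the typical use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 using the float bit pattern.
 * Assumes float is a 32-bit IEEE 754 single. */
float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);      /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess from the exponent/mantissa bits */
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); /* one Newton-Raphson refinement */
}

/* Scale a 3D vector to (approximately) unit length. */
void normalize3(float v[3])
{
    float s = fast_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= s;
    v[1] *= s;
    v[2] *= s;
}
```

The trick depends entirely on having float hardware (or emulation) to hand, which is the limitation the post points out for the Falcon and its DSP.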
So far I had been using a modified/improved bitwise algorithm on the DSP, both integer and fixedpoint versions. This works quite well but requires 23 iterations of a 5-instruction sequence. That's 23*5*2 = 230+ cycles (!!!). I tried translating other algorithms to DSP but this remained the general winner for speed/accuracy. There is a partial-table solution which should be faster but it didn't save much and consumed a lot more space and registers. In any case all methods tried are either so slow (or so inaccurate) that they have limited use.
But I didn't give up!
After some experiments I developed a solution which closely approximates a 23bit fixedpoint sqrt() in just 10 cycles.
A modified/compound version can also approximate 1.0 / sqrt(x) - albeit less accurately - which can then be used to normalize 3D vectors very very quickly. I wouldn't use this for important math (!) but I think it should suffice for most graphics uses.
The fun part - this method is continuous, accurate enough to replace other methods and fast enough to use per-pixel.
There is some other stuff going on which ties in with this, but it is early stages and I'm not close to describing it yet.
Below is a dump from random samples using this integer-only sqrt() approximation. Only results deviating by >= 0.01% from the expected value are reported. Accuracy decreases with small source values, which turns out to be ok for most common cases of sqrt() in graphics problems and isn't too much of a surprise for integer-based formulas anyway: unlike floats, fewer bits are available for smaller numbers.
Code:
[x=realvalue] [y=expected] [y=actual] [error >= 0.01%]
r:0.4277 ye:0.6540 ya:0.653809 e:0.02%
r:0.0586 ye:0.2421 ya:0.241943 e:0.06%
r:0.1701 ye:0.4124 ya:0.412109 e:0.08%
r:0.2395 ye:0.4893 ya:0.489258 e:0.02%
r:0.3707 ye:0.6088 ya:0.608398 e:0.07%
r:0.3865 ye:0.6217 ya:0.621826 e:0.02%
r:0.1175 ye:0.3427 ya:0.342529 e:0.06%
r:0.3985 ye:0.6313 ya:0.631104 e:0.03%
r:0.4556 ye:0.6750 ya:0.675293 e:0.04%
r:0.3674 ye:0.6061 ya:0.605957 e:0.03%
r:0.2812 ye:0.5302 ya:0.530029 e:0.04%
r:0.1479 ye:0.3846 ya:0.384521 e:0.01%
r:0.3074 ye:0.5544 ya:0.554199 e:0.04%
r:0.0485 ye:0.2203 ya:0.219971 e:0.16%
r:0.4198 ye:0.6479 ya:0.647705 e:0.03%
r:0.0109 ye:0.1045 ya:0.104248 e:0.19%
r:0.4058 ye:0.6370 ya:0.636475 e:0.08%
r:0.1180 ye:0.3434 ya:0.343262 e:0.05%
r:0.3377 ye:0.5812 ya:0.580811 e:0.06%
r:0.2195 ye:0.4685 ya:0.468262 e:0.05%
r:0.3707 ye:0.6088 ya:0.608398 e:0.07%
r:0.0187 ye:0.1369 ya:0.136719 e:0.12%
etot: 0.000649
DML wrote: Maybe the DSP one can be improved but 5 ops and 23bit result was as near as I got for the traditional way. Note that it operates on 23bit fractions so in/out values are shifted by 1 bit, as is typical for DSP.
Code:
sqrt    macro   xysqr,xyroot,Txy
        move    y:<cy_point5,b
        tfr     b,a         #<0,xyroot      ; : pattern-accumulator
        do      #<23,_loop
        lsr     b           a,Txy           ; shift trial bit : new trial pattern
        mpy     Txy,Txy,a                   ; trial (x*x)
        cmp     xysqr,a     xyroot,a        ; (x*x)>a? : restore pattern-acc for update
        tle     Txy,a                       ; condition update pattern-acc
        add     b,a         a,xyroot        ; combine bit : save updated pattern-acc
_loop:
        endm
Sorry about the tabs.
It is possible to get rid of the lsr shift but seemingly not the parallel move - so 5 ops it is for now. BTW unrolling it a bit can remove a few ops I think from the final iter but I didn't bother, kept it small. It takes forever anyway.
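For anyone following along without a 56k assembler, the macro above is the classic bit-by-bit trial square root. A rough C rendering might look like the sketch below, assuming unsigned Q23 fractions (value = x / 2^23) and a 64-bit intermediate for the trial square. The name `sqrt_q23` and the use of a widened compare instead of the DSP's accumulator are assumptions for illustration, not a cycle-accurate translation.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 23   /* 23-bit fractions, as in the DSP macro */

/* Bitwise square root of a Q23 fraction in [0, 1).
 * Returns floor(sqrt(x / 2^23) * 2^23). */
uint32_t sqrt_q23(uint32_t x)
{
    assert(x < (1u << FRAC_BITS));

    uint32_t root = 0;                        /* pattern accumulator */
    uint32_t bit = 1u << (FRAC_BITS - 1);     /* trial bit, starts at 0.5 */
    uint64_t target = (uint64_t)x << FRAC_BITS;

    for (int i = 0; i < FRAC_BITS; i++) {     /* one pass per result bit */
        uint32_t trial = root + bit;          /* new trial pattern */
        if ((uint64_t)trial * trial <= target)
            root = trial;                     /* keep the bit if trial^2 <= x */
        bit >>= 1;                            /* shift trial bit */
    }
    return root;
}
```

For example, `sqrt_q23(1u << 21)` (0.25) returns `1u << 22` (0.5). The loop runs once per result bit, which is where the 23 iterations in the cycle count come from.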
DML wrote: I just got the DSP version of approximate sqrt() working, have tested it and I am now pretty certain that it will be accurate enough to replace the other one in most cases.
The body of the calculation is 8 cycles (4 ops) but there is a 12bit normalizing shift involved afterwards and the fastest I could do this was 10 cycles (5 ops - beating my previous impl for a 48bit dynamic shift by 2 ops). So the full arithmetic takes 18 cycles on DSP after all...
There is also some addressing setup code - which can be amortized into a loop (same as was done for the texturemapper) but standalone it's another bunch of cycles. So let's say the first pass on the DSP is 18 cycles best case, up to 30 worst case if just called once. More than the 10 cycles I had sketched out but I won't complain. Definitely better than 230 though.
For the texturemapper I was able to play with normalization of each term, at the expense of accuracy and removed nearly all of the shifting from the original version. Not sure I can do that here but it's only the first iteration. Maybe another day.
It's definitely nice to see my test running waaaaay faster with this upgrade.
http://www.jaguar64.eu/viewtopic.php?f=5&t=21
VladR wrote: I finally reorganized my dev set-up, and connected jag (including skunk) to a small tv that fit along the wall next to my PC, so I can deploy the builds within 10 seconds to jag and test it right away.
I obviously tried the H.E.R.O. first and have been optimizing and refactoring it since. You guys didn't mention that my last public 30-fps build ran more like 24 fps on real HW.
But that's fixed now. I crossed the 30-fps about 3 weeks ago, then jumped to 45 fps about 2 weeks ago.
Getting over 55 fps was quite challenging, but I came up with a new line drawing algorithm that is incredibly fast compared to Bresenham (and others).
Yesterday I finally crossed the 60 fps barrier. On an actual jag. No GPU, No DSP, No ASM - just the 'slow' C compiled to bus-hogging 68k driving the Blitter/OP.
This code, when rewritten to GPU, should be able to handle something like 640x480 at an acceptably smooth framerate, I believe.