Posting by proxy for Douglas Little
This makes me wonder what the real performance bottleneck for Quake on Falcon030 might be. I haven't had time to do more than briefly preview Doug Little's videos.
The main bottleneck - the area causing most difficulty in the current version - is the sheer number of edges in the scene, coupled with the 'lumpy' PVS, which switches on big chunks of geometry as you move around. The Quake engine needs to process/classify all edges yielded by the PVS (first against the frustum planes, hierarchically, then against the viewport, individually), and that involves a lot of organizing/reindexing, plus transferring global edge and vertex information from system RAM to a local, packed representation on the DSP in small batches. There isn't enough DSP RAM to hold it all at once, so it has to be batched at <=256 faces at a time.
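As a rough illustration of the batching constraint (the 256-face cap comes from the description above; the function names and structure are my own invention, not the engine's), the streaming loop might look like:

```c
#include <assert.h>

#define BATCH_MAX 256  /* DSP local RAM can hold at most 256 faces at once */

/* Ceiling division: number of host-port round trips for a visible set. */
static int batches_needed(int n_faces)
{
    return (n_faces + BATCH_MAX - 1) / BATCH_MAX;
}

/* Stream the PVS output to the DSP in batches (transfer details elided). */
static void process_visible_faces(int n_faces)
{
    for (int base = 0; base < n_faces; base += BATCH_MAX) {
        int count = n_faces - base;
        if (count > BATCH_MAX)
            count = BATCH_MAX;
        /* 1) pack edge/vertex data for faces [base, base+count)      */
        /* 2) push the packed batch through the narrow host port      */
        /* 3) DSP classifies edges; results are collected per batch   */
        (void)count;
    }
}
```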
There are basically three parts to the game loop...
1) Game logic
2) scene preparation
3) scene rendering
So far this codebase just draws maps - but it also processes events for game objects and handles collision detection for the player. It doesn't support enemies just now and doesn't draw any of the dynamic objects (pickups, doors etc.).
I presume the division of labor would have the 68030 doing stages 1 and 3 while stage 2 would be handled, at least in part, by the DSP.
This would probably be a nice scenario for decent concurrency, but the Falcon is not so well equipped for it. The DSP has 32kwords of local RAM (really, 16kwords of paired/wide memory), and much of that is used to generate/collect global scene information (surfaces, clipped edges, drawing spans) before the drawing phase. It can't access system RAM itself; there is only a narrow host port for moving data between the CPU and DSP. A small amount of local RAM is used as a processing buffer for incoming geometry (in batches). But the 32MHz DSP is fast within its local memory, at 2 clocks for most operations including mul.
The DSP is needed to accelerate most stages in the system, and is fully responsible for the geometry pipeline after the PVS & BSP stage, up to the production of spans for drawing. Some long-duration concurrency can be found during stages such as scan-conversion into spans, but most of the concurrency is short duration and cooperative with the CPU.
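The cooperative overlap can be sketched as a toy pipeline (all function names hypothetical - the real handshake runs over the DSP host port; the counters here exist only to verify the ordering): while the DSP transforms the geometry for frame f, the CPU ticks game logic for frame f+1, then joins and draws frame f.

```c
#include <assert.h>

static int logic_done, transforms_done, renders_done;

static void dsp_kickoff_transform(int f) { (void)f; transforms_done++; }
static void cpu_game_logic(int f)        { (void)f; logic_done++; }
static void dsp_wait_spans(int f)        { (void)f; /* join point */ }
static void cpu_render(int f)            { (void)f; renders_done++; }

static void run_frames(int frames)
{
    cpu_game_logic(0);                 /* prime the pipe                  */
    for (int f = 0; f < frames; f++) {
        dsp_kickoff_transform(f);      /* DSP starts geometry for frame f */
        cpu_game_logic(f + 1);         /* CPU overlaps logic for f+1      */
        dsp_wait_spans(f);             /* spans for frame f are ready     */
        cpu_render(f);                 /* CPU draws frame f               */
    }
}
```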
This does present some challenges for making it run quickly.
Regarding the game logic, even if the rendering is taken out of the equation, I have to wonder what sort of frame rate you could get out of the Falcon030's 16MHz processor, particularly in the slowed-down 16-bit graphics mode.
I had exactly this problem when I bolted my own Doom 3D engine (for Falcon) onto Id's Doom game code. The game code took as long as the rendering. In some cases - at least briefly - the game code would actually take longer than the rendering. Hardly any of the code would fit in the tiny CPU cache, and thrashed badly from the main bus.
Correcting that was painful - it involved changing the collision detection system (implementing a DSP-based BSP raycast, among many other changes) and the game object tick management system and its base rate, and it was a lot of hassle. In the end, though, it was possible to push the game code costs into the background, and the game ended up playable.
The same problems would occur in Quake2, plus a few new ones (edge density of enemy objects, Z-buffering those faces or inserting them into the spanbuffer, the cost of full collision detection for moving map objects). For this reason I'm doubtful the Quake2 singleplayer mode will be practical to run in its original form. But I'm still interested in drawing the worlds in any case, and I think drawing 2 players plus map objects is doable, for example.
With regards to scene preparation, I would presume one would have the 68030 queue up vertex values that need the 3D math stuff and then hand that off to the DSP. Then the DSP would spit back a transformed list of polygon vertices ready for the rendering stage. What does the 68030 do while it's waiting for the DSP? In order to minimize idle time, one might want to have the 68030 cranking away on the game logic for the next frame while the DSP is cooking stuff to be rendered.
Almost, yes. The DSP is also responsible for a later stage - the spanbuffer - which sorts and clips polygons against each other and issues non-overlapping spans back to the CPU for drawing (as part of a texture chain).
This would let the CPU perform drawing while the DSP does something else, so additional concurrency could be set up, but this doesn't work for texturing because the DSP is again needed to generate texture uvs for each pixel. The CPU could do this on its own for affine texturing but not for perspective-correct texturing.
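To make the 'texture chain' idea concrete, here is a hypothetical shape for the spans handed back to the CPU (the struct layout and names are mine - the real representation lives in DSP memory): spans are already clipped so they never overlap, and they are chained per texture so the CPU can draw all of one texture's spans together.

```c
#include <assert.h>

typedef struct Span {
    short y, x0, x1;        /* scanline and inclusive pixel range        */
    struct Span *next;      /* next span using the same texture          */
} Span;

typedef struct {
    Span *head;             /* chain of spans sharing one texture        */
} TextureChain;

static void chain_push(TextureChain *tc, Span *s)
{
    s->next = tc->head;
    tc->head = s;
}

/* Drawing: walk each chain; the texture stays "hot" for a whole chain. */
static int draw_chain(const TextureChain *tc)
{
    int pixels = 0;
    for (const Span *s = tc->head; s; s = s->next)
        pixels += s->x1 - s->x0 + 1;   /* here the CPU would copy texels */
    return pixels;
}
```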
As far as rendering goes, the first question I have is, does your OpenGL (-ish) library for Falcon030 do real 3D polygon rendering, with the texture scaling factor interpolated between vertices?
The Falcon version currently only fills flat surfaces, mainly because I'm still optimizing the upper and middle-lower levels of the 3D pipe. Most recently optimizing the spanbuffer scan conversion pass.
However, yes. I have prototype code in the PC version of the same engine which implements z-correction with the integer unit only (24bits, to make it DSP friendly), no floating point involved. It generates per-pixel perspective correct uvs.
It does not use edge subdivision, and does not use spanlets (divide-in-flight) as with the original Quake. It uses an approximation of the divide curve as the driver and avoids divide instructions at all stages, since division is way too slow on the CPU and very awkward on the DSP (it has an iterative divide, but not a concurrent one - it needs 24 iterations for a 24-bit result!).
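One way to approximate the divide curve in pure integer math is a small reciprocal table with linear interpolation - a sketch under assumptions (the 8.16 format, 256-entry table, and normalization of z into [1,2) are my choices for illustration, not the engine's actual scheme):

```c
#include <assert.h>
#include <stdint.h>

#define FIX_ONE   (1 << 16)            /* 8.16 fixed point                 */
#define RCP_BITS  8                    /* 256-entry reciprocal table       */
#define RCP_SIZE  (1 << RCP_BITS)

static uint32_t rcp_tab[RCP_SIZE + 1]; /* rcp_tab[i] ~= 1/(1 + i/RCP_SIZE) */

static void rcp_init(void)
{
    for (int i = 0; i <= RCP_SIZE; i++)
        rcp_tab[i] = (uint32_t)(((uint64_t)FIX_ONE * FIX_ONE)
                                / (uint32_t)(FIX_ONE + (i << (16 - RCP_BITS))));
}

/* Approximate FIX_ONE/z (in 8.16) for z normalized into [FIX_ONE, 2*FIX_ONE),
   using table lookup + lerp - no divide in the inner loop. */
static uint32_t fix_recip(uint32_t z)
{
    uint32_t frac = z - FIX_ONE;                          /* 0..FIX_ONE-1  */
    uint32_t i    = frac >> (16 - RCP_BITS);              /* table index   */
    uint32_t t    = frac & ((1 << (16 - RCP_BITS)) - 1);  /* lerp weight   */
    uint32_t a = rcp_tab[i], b = rcp_tab[i + 1];          /* a >= b        */
    return a - (((a - b) * t) >> (16 - RCP_BITS));
}
```

The table is built once; per pixel (or per correction interval) only a shift, mask, lookup, multiply and subtract remain.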
If so, then the Falcon030's blitter is pretty much a paperweight because each pixel will need to be output individually.
For this engine, the Falcon's blitter is a paperweight, yes. It is far less sophisticated than the Jaguar blitter - it is a two-source, integer-addressing block transfer device with some logic operations and bitplane scrolling support. Aside from a few interesting quirks and tricks that have been found to work with it, the blitter is really not capable of 3D work, except where it is needed to fill flat colour or copy contiguous data in rows or rectangles.
The Jaguar blitter is a lot faster - wider bus, fixed-point addressing IIRC - and I'm sure I had it affine-texturing cubes back in the day as one of my early tests with the devkit. (I think I recently found the COF file for that and gave it to the Jaguar emulator guy to help improve emulation.)
Because of the setup required for each blitter operation, there's a certain minimum number of pixels you have to be pushing at one time before it becomes cheaper to use it instead of just accessing the frame buffer directly.
Yes, and that is even true when filling flat colour on the Falcon using the blitter. I do use it for that in the current version, but the gain is small over the CPU, and can go slightly negative if the scene is dense enough with many tall thin surfaces.
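The break-even point falls out of a simple cost model - all cycle numbers below are invented for illustration, not measured Falcon figures: the blitter pays a fixed register-setup cost per operation, so short spans stay cheaper on the CPU even when the blitter's per-pixel rate is better.

```c
#include <assert.h>

/* Returns nonzero when the blitter beats a plain CPU fill for this span.
   With these (assumed) numbers, break-even is at 100/(6-2) = 25 pixels. */
static int blitter_wins(int pixels)
{
    const int setup_cycles   = 100; /* assumed per-op blitter setup cost */
    const int blit_per_pixel = 2;   /* assumed blitter fill rate         */
    const int cpu_per_pixel  = 6;   /* assumed CPU fill rate             */
    return setup_cycles + blit_per_pixel * pixels < cpu_per_pixel * pixels;
}
```

This is why a scene dense with tall, thin surfaces can make the blitter path go slightly negative overall.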
For texturing, the CPU and DSP need to be coupled, with the CPU performing the system RAM pixel transfers from texture to framebuffer, and the DSP generating texture uvs concurrently, slightly ahead of each plot. This will certainly be slower than filling flat colour, but the time is constant. The resolution will most probably be dropped from 320x200/160 down to 160x120 - chunky columns - with the spanbuffer rotated 90° to minimize edge crossings in the scene and save a bit more time.
If you used a constant texture scale factor per polygon (like PS1), using the blitter is more practical, but then you realistically have to spend more CPU time doing polygon subdivision.
Yes, that's right. In this case it will be correcting for each pixel (or can do - we'll see if distributing the correction across N pixels helps too), so there won't be a need to generate additional edges. However, I had some fun with subdivision in the past, on the same problem.
Back in '97 I attempted a Quake (1) port which tried to adaptively subdivide the spans using the DSP, by testing the z-correction error at the span midpoint and splitting the span length. This worked, but the result was visibly very wobbly and nasty to watch, because the eye was not expecting the correction error to move around, even when it was kept fairly small. I think in the end, to stop it wobbling, it was more efficient to just leave it as it was, with fixed intervals for the divides.
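That midpoint test can be sketched like this (floating point for clarity - the original ran in DSP fixed point, and the error threshold and names here are made up): 1/z and u/z interpolate linearly across a span, the affine lerp of u does not, so compare the two at the midpoint and split where they disagree.

```c
#include <assert.h>
#include <math.h>

/* Returns the number of affine pieces the span [u0,z0]..[u1,z1] splits into
   when the lerped u at the midpoint is allowed to err by at most max_err. */
static int subdivide(double u0, double z0, double u1, double z1,
                     int len, double max_err)
{
    if (len <= 2) return 1;
    double izm = 0.5 * (1.0 / z0 + 1.0 / z1);  /* 1/z lerps linearly      */
    double uzm = 0.5 * (u0 / z0 + u1 / z1);    /* so does u/z             */
    double u_true = uzm / izm;                 /* perspective-correct u   */
    double u_lerp = 0.5 * (u0 + u1);           /* affine estimate         */
    if (fabs(u_true - u_lerp) <= max_err)
        return 1;                              /* one affine piece is ok  */
    double zm = 1.0 / izm;                     /* split at the midpoint   */
    return subdivide(u0, z0, u_true, zm, len / 2, max_err)
         + subdivide(u_true, zm, u1, z1, len - len / 2, max_err);
}
```

With constant z the midpoint error is zero and the span stays whole; a strongly foreshortened span splits repeatedly - and because the split points move frame to frame, so does the residual error, which is the wobble described above.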
Thanks for the questions on the project!