Documentation questions GPU

a31chris · Post by **a31chris** » Sun Feb 23, 2014 6:47 pm

AveryBlueMonkey wrote: Hi everyone, some beginner's questions on the GPU below, all from the TechRef v8. Greatly appreciate any help or clarification that anyone can provide here!

(1) On page 35 there's a statement regarding the register score boarding:
"WARNING - No score-board protection applies to writes. Therefore, if two instructions both write to the same
register and the first one completes after the second, the data will be written out of sequence. If they both write
at the same time, then the results are unpredictable. This only applies where the second instruction does not
read the register."

Is the only time when this scenario can happen when there are two LOADs to the same register from memory without any intervening commands that would read that register (and trigger wait states), such as something like:

0: LOAD [external memory] r10
1: ... stuff not reading r10 ...
2: LOAD [internal memory] r10

With 2 completing before 0? I was struggling to think of any other command sequence that could trigger this event.

(2) On page 38 there's an example ISR that clears the interrupt mask:

0: int_serv:
1: movei GPU_FLAGS,r30 ; point R30 at flags register
2: load (r30),r29 ; get flags
3: bclr 3,r29 ; clear IMASK
4: bset 11,r29 ; and interrupt 2 latch
5: load (r31),r28 ; get last instruction address
6: addq 2,r28 ; point at next to be executed
7: addq 4,r31 ; updating the stack pointer
8: jump (r28) ; and return
9: store r29,(r30) ; restore flags

Regarding line 7 above, if r31 is meant to hold the address of the last instruction that was executed before the interrupt occurred, why is:
(i) the interrupt service routine altering it at all?
(ii) and why is it altering it by adding 4 specifically?

(3) On the systolic matrix multiply instruction: assuming you have an appropriate calculation, is there any performance benefit to MMULT over manually writing the IMULTN ... IMACN ... RESMAC sequence yourself? From my reading it appeared to behave like a C-style macro that would function identically either way.

(4) Is there any additional performance overhead when moving data between registers in different banks when compared to within the same bank, i.e. MOVEFA vs MOVE?

(5) Excluding the interrupt service do both register banks behave identically in terms of register usage?

(6) And time for the painfully basic question that I'll kick myself later for not asking: when you want to schedule work on another processor, say the M68K or Blitter, is this done by sending it an interrupt? And if you wanted to wait for completion (say, the Blitter copying a new batch of program code into the GPU internal memory), would you clear the GPUGO bit and wait for that processor to set it again, with no requirement for it to raise an interrupt on the GPU?

As I said, if anyone's able to help on these thank you very much.

a31chris · Post by **a31chris** » Sun Feb 23, 2014 6:48 pm

Tursi wrote: It's been a long time since I was deep in these particular questions... but to my knowledge:

1) You are correct. It's possible to write code that does that. The warning just means that, because of pipelining, the writes may complete in an unpredictable order, so you can't guarantee that the second write you issue is the one that lands last. It's not how most code would be written (it'd be rare to write to a register repeatedly without intervening reads AND care about the order of those writes).
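The ordering hazard described in point 1 can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the function name `final_register_value`, the cycle counts and the latencies are all made up for the example and are not taken from the TechRef.

```python
def final_register_value(writes):
    """Return the value left in a register after unprotected writes.

    writes: list of (issue_cycle, latency, value) tuples, one per instruction
            that writes the register. All numbers are hypothetical.
    The write that completes last wins. If two or more writes complete on the
    same final cycle, None is returned to mean "unpredictable".
    """
    completions = [(issue + latency, value) for issue, latency, value in writes]
    last_cycle = max(cycle for cycle, _ in completions)
    winners = [value for cycle, value in completions if cycle == last_cycle]
    return winners[0] if len(winners) == 1 else None


# Slow external load issued first, fast internal load issued second:
# the first load completes last, so its (stale) value ends up in the register.
slow_then_fast = final_register_value([(0, 10, "external"), (2, 1, "internal")])

# Same instructions, but the first load happens to be quick: order is preserved.
in_order = final_register_value([(0, 2, "external"), (2, 1, "internal")])

# Both writes land on the same cycle: the result is unpredictable.
tie = final_register_value([(0, 3, "external"), (2, 1, "internal")])

print(slow_then_fast, in_order, tie)
```

The model only captures "last completion wins"; it says nothing about real Jaguar bus timings, which would need the actual timing charts.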

2) Unless I'm reading this wrong, r31 is a stack pointer, which means it points to a list of addresses rather than holding the return address itself. That's why it's altered when data is pulled off the stack. If you read the section on interrupts (p.38) you will see that the value pushed onto the stack is the address of the last instruction executed. This is a 32-bit value (so, 4 bytes). The return address is the value pulled from the stack plus two, thus:

5: load (r31),r28 ; pull the last executed address from the stack
6: addq 2,r28 ; increment by 2 to get the return address into r28
7: addq 4,r31 ; update the stack pointer because we pulled 4 bytes from it
8: jump (r28) ; and return to the address in r28
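The same pop-and-return arithmetic can be modelled in a few lines of Python. This is a minimal sketch, not anything from the TechRef: the `stack` dictionary, the addresses and the function name `isr_return` are invented for the example, and it models nothing beyond the pointer arithmetic of lines 5 to 8.

```python
def isr_return(memory, r31):
    """Model lines 5-8 of the ISR: pop the saved address, compute the return PC.

    memory: dict mapping an address to the 32-bit word stored there (hypothetical).
    r31:    stack pointer, pointing at the most recently pushed word.
    Returns (return_pc, new_r31).
    """
    r28 = memory[r31]   # load (r31),r28 : address of the last instruction executed
    r28 += 2            # addq 2,r28     : step to the next instruction, as in the listing
    r31 += 4            # addq 4,r31     : one 32-bit (4-byte) entry has been popped
    return r28, r31     # jump (r28)     : resume at r28 with the stack restored


# Hypothetical addresses, chosen only for illustration.
stack = {0x00F03FFC: 0x00F03100}
pc, sp = isr_return(stack, 0x00F03FFC)
print(hex(pc), hex(sp))
```

The two constants are unrelated: the 2 steps past the interrupted instruction, while the 4 is the size of the stack entry that was just read.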

3) I can't tell you anything about the multiply functions, myself.

4) No additional overhead that I know of, but I don't think we have the timing charts. That said, the RISCs are supposed to execute on a fixed schedule with very few exceptions, and so far as I know the register move instructions all fit this. (That is, with few exceptions all opcodes execute in four clocks, to match the four pipeline stages.)

5) Yes, so far as I have noticed.

6) The question is asked a little confusingly to me, so I'll just be general. First, the blitter is not a general purpose processor, so a GP processor must set its control bits to make it do work; there's no way to trigger it via an interrupt or a single signal (more than once). So your question really comes down to the 68000, DSP and GPU. In that case, it's really up to you. There seems to be minimal performance difference between the RISCs running in a tight loop monitoring dedicated memory addresses for commands and the RISCs shutting down and being turned on externally. IIRC, there's more overhead triggering the RISC via interrupt, but it has been a while. It's all pretty close. The 68000, if you put it to sleep as many suggest, is woken up by an interrupt (and usually handles the vertical blank interrupt).
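The "tight loop monitoring dedicated memory addresses" approach can be sketched as a mailbox that one side writes and the other side polls. This Python model is purely illustrative: the `mailbox` layout, the `IDLE` value, the command number and `poll_mailbox` are all hypothetical, and on real hardware the two sides would be separate processors sharing memory rather than one script.

```python
IDLE = 0  # hypothetical "no command pending" value


def poll_mailbox(mailbox, handlers):
    """One pass of the polling loop: run the pending command, if there is one.

    Returns True if a command was executed, False if the mailbox was idle.
    """
    command = mailbox["command"]
    if command == IDLE:
        return False
    mailbox["result"] = handlers[command](mailbox["argument"])
    mailbox["command"] = IDLE  # writing IDLE back signals completion to the poster
    return True


mailbox = {"command": IDLE, "argument": 0, "result": None}
handlers = {1: lambda x: x * 2}  # command 1 is a made-up "double the argument" job

# Posting side (e.g. the 68000): write the argument first, then the command word.
mailbox["argument"] = 21
mailbox["command"] = 1

# Worker side (e.g. a RISC): keep polling until nothing is pending.
while poll_mailbox(mailbox, handlers):
    pass

print(mailbox["result"])
```

Writing the command word last matters in this kind of scheme, since the worker treats a non-idle command as meaning the argument is already valid.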

a31chris · Post by **a31chris** » Sun Feb 23, 2014 6:50 pm

kenr wrote:
averybluemonkey wrote: (1) On page 35 there's a statement regarding the register score boarding:
"WARNING - No score-board protection applies to writes. Therefore, if two instructions both write to the same
register and the first one completes after the second, the data will be written out of sequence. If they both write
at the same time, then the results are unpredictable. This only applies where the second instruction does not
read the register."

Is the only time when this scenario can happen when there are two LOADs to the same register from memory without any intervening commands that would read that register (and trigger wait states), such as something like:

0: LOAD [external memory] r10
1: ... stuff not reading r10 ...
2: LOAD [internal memory] r10

With 2 completing before 0? I was struggling to think of any other command sequence that could trigger this event.

I'm very rusty on this, and I haven't had a chance to refresh my memory on any of the timing or other details, but I'm pretty sure the problem scenario is more like:
LOAD [external memory] r10
ADD r3,r10
....
....
DO_SOMETHING with r10; what's there?

There's an ambiguity about the contents of r10, since (if I remember the timings), the load and the add were trying to write to it simultaneously. In your scenario, the semantics are well-defined, but the first load is of a dead value.

Am I making sense?

- ken

3DO ZONE Forums
