Alternative way of optimizing S1/S2's DisplaySprite?

ADudeCalledLeo · May 24, 2019

So let's cut right to the chase.
Here's how (as I understand it) the Sonic games handle displaying sprites: there's a big table of sprites to be processed in RAM. This table is $400 bytes in total. Each $80 bytes is a separate "layer", with the first layer ($0-$7F) being in front of everything and the last layer ($380-$3FF) being behind everything.
Every object in the game has a "priority" SST, which tells the game which "layer" it should go to.
The DisplaySprite subroutine is what handles sending sprites to the table.
Sonic 1 and 2 have the "priority" SST as a byte, and to transform it into a "layer address", they do this in DisplaySprite:
Code:
    lea    (Sprite_Table_Input).w,a1
    move.w    priority(a0),d0
    lsr.w    #1,d0
    andi.w    #$380,d0
    adda.w    d0,a1
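To see why the shift and mask produce a layer offset, here is a rough Python model of those instructions. It is only a sketch: `SPRITE_TABLE_INPUT` is an assumed base address (check your disassembly), and `next_byte` stands in for whatever byte follows priority in the SST, which the word read also picks up.

```python
# Model of the S1/S2 DisplaySprite address calculation (a sketch, not the real routine).
SPRITE_TABLE_INPUT = 0xAC00  # ASSUMPTION: low word of the RAM address; adjust for your disassembly

def layer_address_s2(priority, next_byte=0):
    """Mimic move.w priority(a0),d0 / lsr.w #1,d0 / andi.w #$380,d0 / adda.w d0,a1."""
    # The 68000 is big-endian, so the priority byte lands in the HIGH byte of d0.
    d0 = ((priority & 0xFF) << 8) | (next_byte & 0xFF)
    d0 >>= 1       # lsr.w #1: priority*$100 becomes priority*$80
    d0 &= 0x380    # andi.w #$380: drop the stray bits from next_byte, keep layers 0-7
    return SPRITE_TABLE_INPUT + d0

for p in range(8):
    print(f"priority {p} -> ${layer_address_s2(p):04X}")
```

The mask is what makes the word read safe: whatever the neighbouring byte holds, only the three priority bits survive.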
Sonic 3 and Sonic & Knuckles, on the other hand, have the "priority" SST as a word, and in DisplaySprite they simply do this:
Code:
    lea    (Sprite_Table_Input).w,a1
    adda.w    priority(a0),a1
So obviously, S3K's way of doing it is more optimized than the S1/S2 method. And there is a guide on how to port the S3K method to S2 (thanks redhotsonic).
However, the guide isn't very... intuitive, and freeing up an SST shared with (almost) all objects is somewhat difficult, and you might want to use that SST for something else anyway. So, why not employ another way to speed up this subroutine?
Here's my suggestion:
Code:
DisplaySprite:
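   moveq   #0,d0    ; clear d0 so the upper bits of the index are zero
   move.b   priority(a0),d0    ; d0 = priority (0-7)
   andi.b   #7,d0    ; safety measure. this shouldn't be needed, so remove this if you want a bit more speed
   add.w   d0,d0    ; table entries are words, so index = priority*2
   movea.w   Priority2InputAddrTable(pc,d0.w),a1    ; a1 = layer address (movea.w sign-extends the word; the table must sit within the 8-bit pc displacement)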
; (snip)
; ---------------------------------------------------------------------------
Priority2InputAddrTable:
   dc.w   Sprite_Table_Input
   dc.w   Sprite_Table_Input+$80
   dc.w   Sprite_Table_Input+$100
   dc.w   Sprite_Table_Input+$180
   dc.w   Sprite_Table_Input+$200
   dc.w   Sprite_Table_Input+$280
   dc.w   Sprite_Table_Input+$300
   dc.w   Sprite_Table_Input+$380
Basically, instead of loading the sprite table input address into a1 and then calculating an offset and adding it (S3K skips the calculation because the offset is already stored), my version cuts out the middleman and uses the priority as an index into a table of layer addresses. That makes it (possibly) faster, at the expense of a very tiny amount of ROM (16 bytes for the table).
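As a sanity check that the table lookup lands on the same layer as the shift-and-mask version, here is a small Python sketch comparing the two for every possible priority byte. `SPRITE_TABLE_INPUT` is an assumed value and the function names are made up for illustration.

```python
SPRITE_TABLE_INPUT = 0xAC00  # ASSUMPTION: adjust for your disassembly

# Same contents as Priority2InputAddrTable: one address per layer.
PRIORITY_TO_INPUT_ADDR = [SPRITE_TABLE_INPUT + layer * 0x80 for layer in range(8)]

def layer_address_table(priority):
    """Table version: moveq/move.b, andi.b #7, then an indexed word read."""
    d0 = priority & 0xFF
    d0 &= 7  # andi.b #7,d0 (the optional safety mask)
    # add.w d0,d0 only scales the index to bytes; a Python list index does that for us.
    return PRIORITY_TO_INPUT_ADDR[d0]

def layer_address_shift(priority):
    """Original version: word read (low byte ignored here), lsr.w #1, andi.w #$380."""
    return SPRITE_TABLE_INPUT + ((((priority & 0xFF) << 8) >> 1) & 0x380)

mismatches = [p for p in range(256) if layer_address_table(p) != layer_address_shift(p)]
print("mismatches:", mismatches)
```

Note that this only holds with the `andi.b` in place. Without it, a priority above 7 would index past the end of the table, where the original code would still mask it down to a valid layer.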

...yeah, if you haven't noticed, I'm not actually sure if this is any faster. Can someone let me know, please?
also why does this forum not have syntax highlighting for 68kASM? can xenforo just not do that?

AURORA☆FIELDS · May 24, 2019

I doubt it's much faster than S3K. What I do personally is use the S3K method but store the address directly in priority(a0). This shortens the code to just loading from priority(a0) to a1, saving 16-20 cycles (I forget exactly). It's a really easy change for S3K hacks and gives a nice speed boost.
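A rough sketch of what that means for the value each object stores, in Python. It assumes the sprite table input lives at $FFFFAC00 (an assumption, check your disassembly) and the helper names are hypothetical. The idea is that each S3K-style offset ($0, $80, ... $380) is replaced by the low word of the final layer address, which a word-sized load into an address register sign-extends back to the full RAM address.

```python
SPRITE_TABLE_INPUT = 0xFFFFAC00  # ASSUMPTION: full RAM address of the sprite table input

def sign_extend_word(w):
    """What movea.w does to its source: extend a 16-bit value to 32 bits."""
    w &= 0xFFFF
    return w | 0xFFFF0000 if w & 0x8000 else w

def s3k_offset_to_stored_word(offset):
    """Convert an S3K-style priority offset into the word kept in priority(a0)."""
    return (SPRITE_TABLE_INPUT + offset) & 0xFFFF

for offset in range(0, 0x400, 0x80):
    stored = s3k_offset_to_stored_word(offset)
    # Loading the stored word must give back the real layer address.
    assert sign_extend_word(stored) == SPRITE_TABLE_INPUT + offset
    print(f"offset ${offset:03X} -> stored word ${stored:04X}")
```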

ADudeCalledLeo · May 24, 2019

Natsumi said:

I doubt it's much faster than S3K.

Hey, I never implied that!

I'll make a mental note of your optimization in case I ever want to go insane hack S3K.

AURORA☆FIELDS · May 24, 2019

For the record, here are the cycle timings for each method:
Code:
    ; S1/S2 original
    lea    (sprite_table_input).w,a1    ; 4 bytes 8(2/0)
    move.w    priority(a0),d0    ; 4 bytes 12(3/0)
    lsr.w    #1,d0    ; 2 bytes 6(1/0) + 2n(0/0) where n is shift or rotate count (n = 1 here, so 8)
    andi.w    #$380,d0    ; 4 bytes 8(2/0)
    adda.w    d0,a1    ; 2 bytes 8(1/0)
Code:
    ; table lookup
    moveq    #0,d0    ; 2 bytes 4(1/0)
    move.b    priority(a0),d0    ; 4 bytes 12(3/0)
    andi.b    #7,d0    ; 4 bytes 8(2/0)
    add.w    d0,d0    ; 2 bytes 4(1/0)
    movea.w    Priority2InputAddrTable(pc,d0.w),a1    ; 4 bytes 14(3/0)
Code:
    ; S3K
    lea    (sprite_table_input).w,a1    ; 4 bytes 8(2/0)
    adda.w    priority(a0),a1    ; 4 bytes 16(3/0)
Code:
    ; address stored directly in priority(a0)
    movea.w    priority(a0),a1    ; 4 bytes 12(3/0)
So, 44 cycles for the original, 34 (or 42 if you include the andi instruction!) for the new version. Indeed, it is faster, and actually I am surprised it is that much faster. Pretty neat optimization trick. For the S3K one and my method, the difference is 24 vs 12, or half the cycles needed!
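For anyone who wants to re-add the numbers, here is a quick Python tally of the per-instruction cycle counts listed above. The dictionary keys are just labels made up for this sketch.

```python
# Cycle counts copied from the listings above, one entry per instruction.
CYCLES = {
    "S1/S2 original":           [8, 12, 6 + 2 * 1, 8, 8],  # lsr.w #1 is 6 + 2n with n = 1
    "table lookup (with andi)": [4, 12, 8, 4, 14],
    "table lookup (no andi)":   [4, 12, 4, 14],
    "S3K":                      [8, 16],
    "address in priority":      [12],
}

totals = {name: sum(counts) for name, counts in CYCLES.items()}

for name, total in totals.items():
    print(f"{name}: {total} cycles")
```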

ADudeCalledLeo · May 24, 2019

Natsumi said:

So, 44 cycles for the original, 34 (or 42 if you include the andi instruction!) for the new version.

...wow. I was only expecting it to be like 1 or 2 cycles faster. Shows you how much I know about working with ASM, I guess.
