Alternative way of optimizing S1/S2's DisplaySprite?

Discussion in 'Discussion & Q&A' started by ADudeCalledLeo, May 24, 2019.

  1. ADudeCalledLeo

    ADudeCalledLeo Newcomer Member

    Joined:
    Oct 21, 2017
    Messages:
    11
    Location:
    Null Space
    So let's cut right to the chase.
    Here's how (as I understand it) the Sonic games handle displaying sprites: there's a big table in RAM of sprites waiting to be processed. This table is $400 bytes in total. Each $80-byte block is a separate "layer", with the first layer ($0-$7F) being in front of everything and the last layer ($380-$3FF) being behind everything.
    Every object in the game has a "priority" SST, which tells the game which "layer" it should go to.
    The DisplaySprite subroutine is what handles sending sprites to the table.
    Sonic 1 and 2 have the "priority" SST as a byte, and to transform it into a "layer address", they do this in DisplaySprite:
    Code:
        lea    (Sprite_Table_Input).w,a1
        move.w    priority(a0),d0
        lsr.w    #1,d0
        andi.w    #$380,d0
        adda.w    d0,a1
    Sonic 3 and Sonic & Knuckles, on the other hand, have the "priority" SST as a word, and in DisplaySprite they simply do this:
    Code:
        lea    (Sprite_Table_Input).w,a1
        adda.w    priority(a0),a1
    So obviously, S3K's way of doing it is more optimized than the S1/S2 method. And there is a guide on how to port the S3K method to S2 (thanks redhotsonic).
    However, the guide isn't very... intuitive, and freeing up an SST shared with (almost) all objects is somewhat difficult, and you might want to use that SST for something else anyway. So, why not employ another way to speed up this subroutine?
    Here's my suggestion:
    Code:
    DisplaySprite:
       moveq   #0,d0
       move.b   priority(a0),d0
       andi.b   #7,d0    ; safety measure. this shouldn't be needed, so remove this if you want a bit more speed
       add.w   d0,d0
       movea.w   Priority2InputAddrTable(pc,d0.w),a1
    ; (snip)
    ; ---------------------------------------------------------------------------
    Priority2InputAddrTable:
       dc.w   Sprite_Table_Input
       dc.w   Sprite_Table_Input+$80
       dc.w   Sprite_Table_Input+$100
       dc.w   Sprite_Table_Input+$180
       dc.w   Sprite_Table_Input+$200
       dc.w   Sprite_Table_Input+$280
       dc.w   Sprite_Table_Input+$300
       dc.w   Sprite_Table_Input+$380
    Basically, instead of loading the sprite table input address into a1 and then calculating (except in S3K) and adding an offset to it, my version cuts out the middleman and uses the priority as an index into a table of layer addresses. That should (possibly) make it faster, at the expense of a tiny amount of ROM (16 bytes for the table).

    ...yeah, if you haven't noticed, I'm not actually sure if this is any faster. Can someone let me know, please?
    also why does this forum not have syntax highlighting for 68kASM? can xenforo just not do that?
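As a sanity check on the idea (its correctness, not its speed), here is a small Python model of both address calculations. It is only a sketch: the value used for Sprite_Table_Input is a placeholder, the function names are made up for illustration, and the byte that follows priority in the SST is assumed to be allowed to hold anything.

```python
# Model of the two DisplaySprite address calculations (illustration only).
SPRITE_TABLE_INPUT = 0xAC00  # placeholder low word; take the real value from your disassembly


def layer_addr_shift_mask(priority, next_byte=0):
    """S1/S2 stock code: move.w / lsr.w #1 / andi.w #$380 / adda.w."""
    d0 = ((priority & 0xFF) << 8) | (next_byte & 0xFF)  # move.w reads priority as the high byte
    d0 >>= 1                                            # lsr.w #1,d0
    d0 &= 0x380                                         # andi.w #$380,d0
    return SPRITE_TABLE_INPUT + d0                      # adda.w d0,a1


# Priority2InputAddrTable: one word per layer, $80 bytes apart
PRIORITY_TABLE = [SPRITE_TABLE_INPUT + 0x80 * layer for layer in range(8)]


def layer_addr_table(priority):
    """Proposed code: moveq / move.b / andi.b #7 / add.w / movea.w table(pc,d0.w)."""
    d0 = priority & 0xFF  # moveq #0,d0 + move.b priority(a0),d0
    d0 &= 7               # andi.b #7,d0 (the optional safety mask)
    d0 += d0              # add.w d0,d0: layer number -> byte offset into a word table
    return PRIORITY_TABLE[d0 // 2]


if __name__ == "__main__":
    for priority in range(8):
        assert layer_addr_shift_mask(priority, 0xFF) == layer_addr_table(priority)
    print("both methods agree for priorities 0-7")
```

One thing the model makes visible: the stock code masks out-of-range priorities implicitly through andi.w #$380, so dropping the andi.b is only safe if no object ever stores a priority above 7. An unmasked index would read past the eighth table entry.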
     
  2. Natsumi

    Natsumi Phoenix egg Member

    Joined:
    Oct 7, 2011
    Messages:
    695
    Location:
    Long and dangerous river
    I doubt it's much faster than S3K. What I do personally is use the S3K method but store the layer address directly in priority(a0). This shortens the code to just loading from priority(a0) into a1, saving 16-20 cycles (I forget exactly). It's a really easy change for S3K hacks and gives a nice speed boost.
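One way to read the suggestion above, sketched as a Python model rather than real 68k code: each object's priority word would hold the finished layer address instead of an offset, so the old offsets ($0, $80, ... $380) each need the table base added once, wherever the object sets its priority. The base value and the helper names here are assumptions for illustration, not taken from any disassembly.

```python
# Model of "offset in priority" (stock S3K) vs "address in priority" (illustration only).
SPRITE_TABLE_INPUT_W = 0xAC00  # placeholder low word of the table's RAM address


def sign_extend_word(value):
    """movea.w, adda.w and (xxx).w addressing all sign-extend a 16-bit value to 32 bits."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if (value & 0x8000) else value


def s3k_stock(priority_word):
    a1 = sign_extend_word(SPRITE_TABLE_INPUT_W)                # lea (Sprite_Table_Input).w,a1
    return (a1 + sign_extend_word(priority_word)) & 0xFFFFFFFF  # adda.w priority(a0),a1


def address_in_priority(priority_word):
    return sign_extend_word(priority_word)                     # movea.w priority(a0),a1


def convert_priority(old_offset):
    """Hypothetical helper: the value an object would store instead of the old offset."""
    return (SPRITE_TABLE_INPUT_W + old_offset) & 0xFFFF


if __name__ == "__main__":
    for old_offset in range(0, 0x400, 0x80):
        new_value = convert_priority(old_offset)
        assert s3k_stock(old_offset) == address_in_priority(new_value)
        print(f"${old_offset:03X} -> ${new_value:04X}")
```

The trade-off in this reading is that the addition moves out of DisplaySprite and into the constants objects write to priority, so any code that compares or does arithmetic on priority values would need the same adjustment.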
     
  3. ADudeCalledLeo

    Hey, I never implied that! :p

    I'll make a mental note of your optimization in case I ever want to go insane... er, hack S3K.
     
  4. Natsumi

    For the record, here are the cycle timings for each method:

    Code:
        lea    (sprite_table_input).w,a1        ; 4 bytes 8(2/0)
        move.w    priority(a0),d0            ; 4 bytes 12(3/0)
        lsr.w    #1,d0                ; 2 bytes 6(1/0) + 2n(0/0) where n is the shift or rotate count, so 8(1/0) here
        andi.w    #$380,d0                ; 4 bytes 8(2/0)
        adda.w    d0,a1                ; 2 bytes 8(1/0)
    Code:
        moveq    #0,d0                    ; 2 bytes 4(1/0)
        move.b    priority(a0),d0                ; 4 bytes 12(3/0)
        andi.b    #7,d0                    ; 4 bytes 8(2/0)
        add.w    d0,d0                    ; 2 bytes 4(1/0)
        movea.w    Priority2InputAddrTable(pc,d0.w),a1        ; 4 bytes 14(3/0) 
    Code:
        lea    (sprite_table_input).w,a1        ; 4 bytes 8(2/0)
        adda.w    priority(a0),a1            ; 4 bytes 16(3/0) 
    Code:
        movea.w    priority(a0),a1        ; 4 bytes 12(3/0) 
    So, 44 cycles for the original, and 34 (or 42 if you include the andi instruction!) for the new version. Indeed, it is faster, and actually I am surprised it is that much faster. Pretty neat optimization trick. For the S3K one and my method, the difference is 24 vs 12, or exactly half the cycles needed!
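For anyone who wants to re-check the arithmetic, here is a short Python tally that simply adds up the per-instruction cycle figures from the four listings above. The method labels are made up for this sketch; the numbers are the ones quoted in the code comments, with lsr.w #1 counted as 6 + 2*1.

```python
# Sum the per-instruction cycle counts quoted in the listings above.
METHODS = {
    "S1/S2 stock":                [8, 12, 6 + 2 * 1, 8, 8],  # lea, move.w, lsr.w #1, andi.w, adda.w
    "table lookup (with andi.b)": [4, 12, 8, 4, 14],         # moveq, move.b, andi.b, add.w, movea.w
    "table lookup (no andi.b)":   [4, 12, 4, 14],            # same, minus the safety mask
    "S3K stock":                  [8, 16],                   # lea, adda.w
    "address in priority":        [12],                      # movea.w
}


def total_cycles(name):
    """Return the total cycle count for one of the methods above."""
    return sum(METHODS[name])


if __name__ == "__main__":
    for name in METHODS:
        print(f"{name}: {total_cycles(name)} cycles")
    # prints 44, 42, 34, 24 and 12 cycles respectively
```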
     
  5. ADudeCalledLeo

    ...wow. I was only expecting it to be like 1 or 2 cycles faster. Shows you how much I know about working with ASM, I guess.