Optimizing collapsing platforms (a lot)

Discussion in 'Tutorials Archive' started by SpirituInsanum, Mar 8, 2014.

Thread Status:
Not open for further replies.
  1. SpirituInsanum

    SpirituInsanum Well-Known Member Member

    Joined:
    Feb 11, 2010
    Messages:
    642
    Of the incredibly slow process of creating many objects at once

    Raise your hand if you hate it when the game slows down when Sonic walks on a collapsing platform.

    The others can leave.

    Creating many objects is just painfully slow, but there's a reason for it, and there's a way to make it much faster as well.

    I take the example of the collapsing ledges and floors because it's a single fix for two objects, but also because it's probably one of the most obvious slowdown in the game: you actually see what makes the game slow down.

    First, analysing (yeah, big word) the origin of the slowdown.

    When you read from Obj1A_Collapse to loc_84F2, there isn't much to slow down the game at first sight. Some branches, a loop, nothing exceptional. But if you look more carefully, you'll see a loop inside a branch inside the loop, and here, you should say... how many times will it loop?

    Once or twice wouldn't really be a problem, but we're talking about something entirely different here. The original loop goes from $7 (floor) to $18 (ledge), and the loop inside the branch 0 to $5F.

    Now, imagine your first empty slot in object memory is for the 40th object, on the first main loop, the second loop is taken 39 times, on the second, 40 times, on the third, 41 times, and so on. Do this 24 ($18) times and all you have is a huge waste of processing time. The next free object won't be in a slot that was checked before, testing them all 24 times is just stupid (try to check the same thing 24 times in a row IRL if you don't believe me).

    This is what this routine looks like:


    Obj1A_Collapse:
    move.b #0,$3A(a0)

    loc_847A:
    lea (Obj53_Data1).l,a4
    moveq #$18,d1
    addq.b #2,$1A(a0)

    loc_8486:
    moveq #0,d0
    move.b $1A(a0),d0
    add.w d0,d0
    movea.l 4(a0),a3
    adda.w (a3,d0.w),a3
    addq.w #1,a3
    bset #5,1(a0)
    move.b 0(a0),d4
    move.b 1(a0),d5
    movea.l a0,a1
    bra.s loc_84B2

    loc_84AA:
    bsr.w SingleObjLoad ; <<< does this too many times
    bne.s loc_84F2
    addq.w #5,a3

    loc_84B2:
    move.b #6,$24(a1)
    move.b d4,0(a1)
    move.l a3,4(a1)
    move.b d5,1(a1)
    move.w 8(a0),8(a1)
    move.w $C(a0),$C(a1)
    move.w 2(a0),2(a1)
    move.b $18(a0),$18(a1)
    move.b $19(a0),$19(a1)
    move.b (a4)+,$38(a1)
    cmpa.l a0,a1
    bcc.s loc_84EE
    bsr.w DisplaySprite2

    loc_84EE:
    dbf d1,loc_84AA



    And the one done too many times:


    SingleObjLoad:
    lea ($FFFFD800).w,a1
    move.w #$5F,d0

    loc_DA94:
    tst.b (a1)
    beq.s locret_DAA0
    lea $40(a1),a1
    dbf d0,loc_DA94 ; it can repeat this loop up to $5F times every 24 times if needed!

    locret_DAA0:
    rts


    The content of the loop takes approximately 34 processor cycles.

    Best case scenario? That small loop will be done 0+1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+16+17+18+19+20+21+22+23 times. Yup, I wrote it to show you how stupid it can be. 276 times, over 9k processor cycles. Or only 28 times for collapsing floors, we'll ignore this case, it can't happen in game (best can't happen, period).

    Worst case scenario? I won't check the details, but I'd say it's something like that small loop will be done 71+72+73+74+75+76+77+78+79+80+81+82+83+84+85+86+87+88+89+90+91+92+93+94+95 times! And here "stupid" sounds like the understatement of the day. If you're wondering, that's 2075 times. Don't count the processor cycles, it's too sad. But slowing down? Definitely.

    Anyway, it really doesn't need to start checking from the beginning each time, so we'll optimize this a bit.

    We'll also get rid of the branch in the process, because it's a waste of processing time as well and it's easier this way (although the code is a little longer).

    See the comments in the code for explanations:


    Obj1A_Collapse:
    move.b #0,$3A(a0)

    loc_847A:
    lea (Obj53_Data1).l,a4
    moveq #$18,d1
    addq.b #2,$1A(a0)

    loc_8486:
    moveq #0,d0
    move.b $1A(a0),d0
    add.w d0,d0
    movea.l 4(a0),a3
    adda.w (a3,d0.w),a3
    addq.w #1,a3
    bset #5,1(a0)
    move.b 0(a0),d4
    move.b 1(a0),d5
    movea.l a0,a1
    ; bra.s loc_84B2 ; We have to remove this otherwise a1's value won't be right,
    ; but since it's what creates the first object over the source object, we also have to create it now...
    ; First object's creation begins here:
    move.b #6,$24(a1)
    move.b d4,0(a1)
    move.l a3,4(a1)
    move.b d5,1(a1)
    move.w 8(a0),8(a1)
    move.w $C(a0),$C(a1)
    move.w 2(a0),2(a1)
    move.b $18(a0),$18(a1)
    move.b $19(a0),$19(a1)
    move.b (a4)+,$38(a1)
    ; and ends here, it's a simple copy/paste from loc_84B2.
    ; Now since we created one object already, we have to decrease the counter
    subq.w #1,d1
    ; We don't have to check whether it's the last one or not, it can't be unless there's not enough free ram to create more (and that's checked later).
    ; Here we begin what's replacing SingleObjLoad, in order to avoid resetting its d0 every time an object is created.
    lea ($FFFFD800).w,a1
    move.w #$5F,d0

    loc_84AA:
    ; bsr.w SingleObjLoad ; We remove this, it's the routine we want to avoid
    ; So here goes what was originally happening in SingleObjLoad, excepted now d0 won't be reset every time an object has to be created.
    ; We'll just copy/paste the content of loc_DA94 and correct the branches.
    @loop:
    tst.b (a1)
    beq.s @cont ; Let's correct the branches. Here we can also skip the bne that was originally after bsr.w SingleObjLoad because we already know there's a free object slot in memory.
    lea $40(a1),a1
    dbf d0,@loop ; Branch correction again.
    bne.s loc_84F2 ; We're moving this line here.
    @cont:
    ; And that's it, copy/paste complete.
    addq.w #5,a3

    loc_84B2:
    move.b #6,$24(a1)
    move.b d4,0(a1)
    move.l a3,4(a1)
    move.b d5,1(a1)
    move.w 8(a0),8(a1)
    move.w $C(a0),$C(a1)
    move.w 2(a0),2(a1)
    move.b $18(a0),$18(a1)
    move.b $19(a0),$19(a1)
    move.b (a4)+,$38(a1)
    ; cmpa.l a0,a1 ; Finally, this isn't necessary anymore, its only purpose was to skip DisplaySprite2 on the first object
    ; bcc.s loc_84EE
    bsr.w DisplaySprite2

    loc_84EE:
    dbf d1,loc_84AA

    And so now, instead of 276 to 2k times, the small loop won't have to be done more than 96 times.

    Of course there are other objects you can optimize in a similar way, but you'll have to figure that out yourself.

    Have fun.
     
  2. Psycho RFG

    Psycho RFG Well-Known Member Member

    Joined:
    Feb 22, 2011
    Messages:
    234
    Nice optimization. This kind of stuff is always welcome.
     
  3. MarkeyJester

    MarkeyJester ♡ ! Member

    Joined:
    Jun 27, 2009
    Messages:
    2,867
    VERY, very exciting there Spiritu! Some good thinking, nice work! I think you've proven yourself to be rather good and resourceful at this.
     
  4. SpirituInsanum

    SpirituInsanum Well-Known Member Member

    Joined:
    Feb 11, 2010
    Messages:
    642
    Yeah, well... I think that's something most people knew, few people cared enough to fix and nobody bothered talking about.  ^^'

    A few more details.

    I fixed quite a few objects this way and as far as I can tell, the most interesting ones in S1 are:

    - lost ring ($37, again, fps's arch-enemy),

    - swinging platforms ($15),

    - rings ($25, when they create lines of rings)

    Slightly less important but providing a nice boost when things are getting slow:

    - bridges ($11),

    - platforms on conveyor belts ($60 and $6F, but they're still slow)

    - helix of spikes ($17, I wasn't expecting that one at all, thanks to psychoRFG),

    - spiked balls ($57, especially near the end of Lz3)

    - caterkiller ($78)

    - orbinaut ($60)

    All objects aren't to be fixed exactly the same way. In this example the source object is used to create the first object, but others will create objects without using the source object, and some will create objects only after the source object in memory (those are using singleobjload2).
     
  5. Crash

    Crash Well-Known Member Member

    Joined:
    Jul 15, 2010
    Messages:
    302
    Location:
    Australia
    As good as this optimization is, I'm not sure that it would make much difference in-game. Even in your worst case scenario, I don't think it would cause any more than one single frame of delay, and I'm not so sure you'd actually notice it. The real issue with slowdown is the large number of objects that have to be updated every frame after this code is run. You might see better performance from porting the Sonic 2 child sprites system and rewriting all those multi-object objects to work with that, that would dramatically reduce the total objects used. That would be a LOT of work though :p
     
  6. SpirituInsanum

    SpirituInsanum Well-Known Member Member

    Joined:
    Feb 11, 2010
    Messages:
    642
    First of all, OOz's collapsing ledge, that's object 1A too in s2, loc_10B68... no child sprites. :eek:nthequiet:

    Second, two words: scattered rings. :hang:

    Do scattered rings with child sprites. I dare you, do it! Hahaha! :crazy: 

    Also, it's 32 objects in that case, not 24. I pushed it to 33 in my hack when many people actually reduce the number of scattered rings to save a few frames, and no slowdown (unless I lose 33 rings twice in a row which can happen and is a lot of fun).

    Child sprites are great until the objects created have to interact with the player or are all very different from each others.

    It really does make a huge difference in game. Of course there are even more optimized ways in some of the cases I cited in my 2nd post. For example, ring rows replaced with single rings. But they all require much bigger changes than the simple fix I'm showing here, so in the meantime it's a good and easy way to get rid of some slowdowns.

    Finally, correct me if I'm wrong, but if I remember correctly, the MD's M68k performs approximately 128k cycles per frame (7.67MHz divided by 60), so 9k to 70k cycles saved means 7% to 54% of the processing power saved on that frame, that's probably enough to process the new objects.

    So I think it's worth 20 lines of changes. ;)

    (And never believe people won't notice slowdowns, they do, and they bitch a lot about it.)
     
  7. LazloPsylus

    LazloPsylus The Railgun Member

    Joined:
    Nov 25, 2009
    Messages:
    Location:
    Academy City
    Speaking of things that could be optimized far better, ring loss scatter is an absolute cycle sinkhole. Try swapping the real-time calculation code out for a pre-calculated scatter. Probably isn't quite as pretty on the scatter, but there will be a noticeable reduction in slowdown on ring loss. Was enough to eliminate the slowdown on large ring loss in just about every scenario when a friend and I had it implemented a few years ago in Sonic 1 (from what I recall, there was still slight slowdown in water, but it was still much better than it originally was; it's been awhile, though, so I may be recalling that segment of playtesting incorrectly).

    Anyway, just a passing suggestion if you want something else to optimize. Your work here is a nice way of tightening up some inefficient code and cutting cycle costs, and I congratulate you on your attempt on educating people on how to eliminate some of that annoying slowdown. Keep at it and see what you can learn. There's always plenty more in that mess of an engine to tighten up and optimize if you're ever bored and want a challenge...
     
  8. SpirituInsanum

    SpirituInsanum Well-Known Member Member

    Joined:
    Feb 11, 2010
    Messages:
    642
    You're kidding, right? o_O

    http://sonicresearch.org/articles.php/_/guides/sonic-2/speed-up-ring-loss-processing-r33

    Done that, long ago.

    Also, best/worst case scenarios for scattered rings are approximately 17k cycles and 90k cycles wasted (minus 1.1k/4.4k for best/worst case with this optimization applied to rings). So there was still a chance to see slowdowns, although definitely not as much as when it was calculating the speed for every ring. ;)
     
    Last edited by a moderator: Mar 9, 2014
  9. LazloPsylus

    LazloPsylus The Railgun Member

    Joined:
    Nov 25, 2009
    Messages:
    Location:
    Academy City
    Ah, figured someone had to have done it at some point. Ah well, goes to show how little I'm around in this community anymore. Apologies for bringing it up.
     
    Last edited by a moderator: Mar 10, 2014
  10. Crash

    Crash Well-Known Member Member

    Joined:
    Jul 15, 2010
    Messages:
    302
    Location:
    Australia
    Sonic 3, even with all it's optimizations, doesn't do it either. Bridges and swinging platforms do it though, even in Sonic 2!

    Yeah, that wouldn't really work :)

    Yeah, minus all the cycles used for the vblank routines every frame, that 70,000 cycles is almost a full frame's worth of processing time. The thing is though, to get to that worst case scenario, there need to be around 70 objects already active. The game will be lagging so much already that 1 more frame's going to be like a drop in the ocean.

    Don't get me wrong, this is a good optimization, and it's probably the way it should have been programmed in the first place. I'm just pointing out that as a routine that would at the very most would be run once per second, it's not going to reduce slowdown like stuff that gets run every frame.
     
  11. SpirituInsanum

    SpirituInsanum Well-Known Member Member

    Joined:
    Feb 11, 2010
    Messages:
    642
    @Evanescence: No need to apologize, it's just that it's been on the front page of this site for almost two years, with my nick written on it, so I wasn't sure you were serious or not. ^^'

    @Crash: Swings only use it for the links though IIRC, the platforms themselves are separate objects.

    You're focusing too much on the example I gave. There's a better thing to optimize this way than this object or what's in the small list in my second post. I just think people should figure it out themselves.

    Despite the topic's title, the purpose of this small tutorial isn't to provide a fix for one object, that's only to catch people's attention, but to point at the lag created by SingleObjLoad when it's used in loops.

    I never said that single example fix would make the entire game smoother at every moment. This one only removes some lag where it's easily noticed (making another tutorial to optimize scattered rings again would have been awkward), so people can see the results easily and look for other things to optimize this way.
     
  12. nineko

    nineko I am the Holy Cat Member

    Joined:
    Mar 24, 2008
    Messages:
    1,902
    Location:
    italy
    You guys sure are weird. SpirituInsanum comes up with a brilliant example of optimisation (well received even by MarkeyJester himself, definitely not a random noob, but quite possibly one of the most knowledgeable people around here), and instead of being happy you start pointing out other things to optimise. I'm not surprised if some people decide to keep interesting things private. Okay, maybe there are other things which slow down the game more than this, but is that a good reason not to appreciate this tutorial? Don't make it sound like you're changing the cigarette lighter of your car when you actually wanted to refill the tank with gasoline.
     
  13. Crash

    Crash Well-Known Member Member

    Joined:
    Jul 15, 2010
    Messages:
    302
    Location:
    Australia
    Sorry if I'm coming across as being rude or dismissive, it's not my intention. I was just hoping to start a bit of discussion about the biggest areas left to target for reducing slowdown, but this was probably the wrong thread to do it in and the wrong way to phrase it. It certainly doesn't hurt to have all routines optimized, it's just I think the time and effort to do so could be better spent in other areas. (I could be completely wrong with my suggestion too, I haven't looked that hard into it)

    Also, when people post stuff like this it's a good way to get others thinking about ways to optimize their own stuff, it's a good thing to learn from!
     
  14. nineko

    nineko I am the Holy Cat Member

    Joined:
    Mar 24, 2008
    Messages:
    1,902
    Location:
    italy
    It can be my fault as well, sometimes I miss the point, I gave more focus to the "this optimisation isn't very important" part while not seeing the "also" in your "let's also talk about other things". My apologies, I'm known for overreacting from time to time.
     
Thread Status:
Not open for further replies.