Optimized KosDec and NemDec, considerably faster decompression

Discussion in 'Tutorials Archive' started by vladikcomper, Nov 4, 2013.

Thread Status:
Not open for further replies.
  1. vladikcomper

    vladikcomper Well-Known Member Member

    Joined:
    Dec 2, 2009
    Messages:
    415
    A much faster Kosinski decompressor for the 68k

    Kosinski is a variant of LZSS compression algorithm, used in various Sega Genesis games. Being much faster than Nemesis algorithm in decompression, this format also provides a fairly better compression ratio on data with frequently repeated sequences of bytes. The decompression code for the 68k found in many Genesis games seems identical. Its decompression routine is known as 'KosDec' in most of Sonic diassemblies. Although written in quite a wise and accurate manner, it still suffers from some optimization errors, both on logic and instructions side of things.

    I started looking into the ways of reducing decompression times after once I attempted decompressing big blocks of data on the fly. The process was time-critical and the original decompressor was doing pretty well, but not as well as I wanted it to be: it was kind of balancing on the edge -- I needed just a tiny bit faster processing to get rid of unwanted slowdown. I analyzed the whole decompressor's code back then in order to optimize it and get the best out of it. I figured few tricks which allowed me about 5% faster decompression which was enough back at the time.

    Later I did realize that the decompression routine can be optimized even further, but more fundamental changes should be made to it. I then decided to create my own implementation of Kosinski decompressor by applying all the optimization tricks I planned and more. It was doing the same thing, but in a different way, using a different, more efficient code logic.

    The result overcame all my expectations, the very first revision of my own Kosinski decompressor was overall 40% faster than the original one. Later on, MarkeyJester and flamewing have suggested a few more ideas, which helped to optimize it even further. Initially, I didn't plan to use some extreme optimization tricks like look-up tables (LUTs) in order to simplify calculations, but flamewing's test proved to give it a good performance increase that I didn't expect, which quite makes up for a larger decompression code.

    Now, let me present you, an extremely fast Kosinski decompressor!

    http://pastebin.com/EtYf1vz6

    The performance increase compared to the original decompressor is unreal. Believe or not, it works more than 1.5 times faster. I've set several tests to measure new compressor's effectiveness. Here's a test involving decompressing the original EHZ art from Sonic 2:

    [​IMG][​IMG]

    Using the new decompressor will not only greatly reduce loading times on any Kosinski-compressed art, but also provide you more freedom with "on the fly" decompression, since you will be able to decompress nearly twice as much data within one TV-frame.

    An optimized Nemesis decompressor for the 68k

    Unlike Kosinski, Nemesis is an entropy-based compression method. Due to this method's nature, characters in the compressed stream have variable size in bits, making decompression process way more time-consuming: it is required to read compressed stream bit by bit and analyze it to decode characters. Although compression was rather high, it took about 10 times slower than Kosinski to decompress. The latter also provided quite a close compression ratio, which was even higher for many files with frequently repeated byte sequences. That was the reason Sonic Team started slowly migrating from this compression format starting from Sonic 2. Most of the compressed art in Sonic 3K was in Kosinski format already.

    I attempted to optimize the Nemesis decompression routine, known as 'NemDec' shortly after my success with the Kosinski decoder. However, even given the fact that Nemesis decompression was terribly slow, their original code was surprisingly well-designed. The algorithm was quite complex and used nearly all the registers processor provided, leaving me almost no gaps for usual optimization tricks.

    I've managed to pull just a few optimization tricks, but this didn't make a considerable improvement though. Anyways, a little optimization over the original version was reached, which, considering how long the decompression takes, helps it to save up to 4 frames on decompressing a single level art file in Sonic 1.

    http://pastebin.com/cAJSJ0np

    The performance increase is about 5% as seen below. This test program decompressed the original GHZ art from Sonic 1. The optimized version works 3 frames faster:

    [​IMG][​IMG]

    * * *

    Also, stay tuned! More on the Kosinski and Nemesis compression shall come soon.
     
    Nat The Porcupine, KCEXE and Clownacy like this.
  2. Flamewing

    Flamewing Elite Hacker Member

    Joined:
    Aug 28, 2011
    Messages:
    37
    Location:
    France
    Part two is by me, so here it goes!

    Improved compressors and decompressors
    When we were working on the above decompressor, vladikcomper mentioned that he thought that KENS' LZSS encoder did not result in optimal sizes. He felt that if you broke some runs earlier, you could maybe gain better compression. I went with that idea, and did some research. As in reading scholarly computer science papers. It did not take long to see that this was true, and there were methods to make optimum compression. They were behind a paywall; and when I did get to them, they were a rather simple report on a recent conference. So I kept digging more and more. I found more articles, one of which mapped LZSS compression to graph theory. If you are interested in knowing more, I added lots of comments explaining what is going on in the source code (see below).

    So I set about to writing a perfect compression Kosinski implementation. But rather than do just that, I made a framework for perfect compression of LZSS, which I proceeded to apply to Kosinski, Saxman and Comper (an LZSS-based format created by vladikcomper that has decompression speed as the primary goal).

    Some folks know that I have also written a much improved Nemesis encoder too; FraGag made a version in C# (which has some undiagnosed issues) for MainMemory. This compressor has vastly improved compression over old KENS, but it sadly does not have perfect compression -- I have yet to devise a non-brute force way to get past the last remaining barrier to perfect compression. The brute-force solution is not acceptable here because it takes FAR too long.

    Anyway: you can get the Windows binaries here; you can get the source code along with that of the S2-SSEdit (here); the tools compile and install by default. These command line tools use a KENS-like interface. I will eventually switch to a KENSSharp-like interface, but for now:

    Code:
    To compress:   <tool> [options] input output
    To decompress: <tool> [options] -x <input> <output>
    To recompress: <tool> [options] -c <input> <output>
    
     
    Last edited by a moderator: Nov 15, 2013
    Nat The Porcupine and Clownacy like this.
  3. MainMemory

    MainMemory Well-Known Member Member

    Joined:
    Mar 29, 2011
    Messages:
    922
  4. OrdosAlpha

    OrdosAlpha RIGHT! Naebody move! Root Admin

    Joined:
    Aug 5, 2007
    Messages:
    1,793
    Location:
    Glasgow, Scotland
    Any plans to also bring improvements to the Enigma and Saxman formats?
     
  5. Flamewing

    Flamewing Elite Hacker Member

    Joined:
    Aug 28, 2011
    Messages:
    37
    Location:
    France
    At least as far as the encoder, it is already there — for the 68k decoder, it is an interesting idea. But to be fair, it would probably more useful to make a z80 decoder for Saxman — the format is most used to decompress music. Or maybe just write a good z80 Kosinski decompressor — I think it would likely be faster and give better compression.

    As fast as Enigma goes, the encoder I give is slightly modified from KENS, giving marginally better encoding; so marginal I didn't even bother mentioning. I do have some ideas for trying to improve, though; I need to check if they will pan out. I have no idea how easy or hard it will be to improve thendecoder, though — I never looked into it.
     
  6. vladikcomper

    vladikcomper Well-Known Member Member

    Joined:
    Dec 2, 2009
    Messages:
    415
    I should've posted this earlier along with flamewing's first post in this topic, but unfortunately real life events got me busy. As you might notice, the newly released compressors by flamewing also included a new compression format, Comper. So, here it comes...

    Comper Compression

    Comper is a new compression format, developed by me while I was working on a small proof-of-concept a while ago. My goal was to provide the best compression ratio possible along with faster decompression speeds that no other algorithms offered before. This way, large amounts of data can be loaded "on the fly" within one frame, giving no slowdown. Therefore, the compressed data can be operated as easily as uncompressed.

    Here's a little demonstration to show this format's potential. I've compressed the original GHZ art from Sonic 1 with Nemesis and Comper algorithms and compared both compression effectiveness and decompression times. The recently released flamewing's compressors were used to guarantee the best compression ratio for Nemesis so far.
     


    Uncompressed art: 25.9 KB
    Nemesis-compressed art: 10.3 KB
    Comper-compressed art: 12.6 KB


    [​IMG][​IMG]

    Resulted in just 2.3 KB larger file, Comper-compressed art was decompressed 10 times faster than Nemesis, which further prooves that Nemesis has a terrible compression/speed ratio. In some rare cases Comper may even overcome Nemesis's compression ratio, just like Kosinski. But in most of the cases, obviously, Nemesis has more efficient compression due to algorithm complexity and times it takes to decompress. While Comper compression had decompression speed as its primary goal, however, I was surprised how close it could get to complex and "heavy" compressions like Nemesis, like in my particular example with GHZ art.

    The decompression speed Comper has is so high that you may even be able to decompress small tilesets in-game on the fly, as if it was uncompressed art. The decompressor's code is also small and compact just like the format itself -- it's pretty simple, but effective.

    Decompressor's code (ASM68K version): http://pastebin.com/aMrfKTbz
    Decompressor's code (AS version): http://pastebin.com/fVp8d5Fr
     
    AkumaYin and Clownacy like this.
  7. RocketRobz

    RocketRobz Coolest of TWL, and Sonic fan Member

    Joined:
    Aug 20, 2009
    Messages:
    80
    I got the Kosinski Decompression code to work on Sonic 3 & Knuckles (HG disassembly).

    Now the Title Screen Sega/Sonic frames load faster!

    But on the Data Select Screen, the screen is glitched up, probably the art uses KosM.

    When loading a zone, the game crashes and restarts.

    Do you plan on optimization for Kosinski Module?

    Also, I have fixed the routine NemDec for AS. : http://pastebin.com/m7qm2CDK

    and CompDec (Works for both ASM68k and AS). : http://pastebin.com/22rgSm9s
     
    Last edited by a moderator: Nov 11, 2013
  8. Flamewing

    Flamewing Elite Hacker Member

    Joined:
    Aug 28, 2011
    Messages:
    37
    Location:
    France
    KosM uses the same encoding and decoding techniques as normal Kosinski, so it is improved by default for the encoder. The in-game decoder needs an additional copy of the code (with appropriately modified labels) inside the Process_Kos_Queue routine; you may want to not include the lookup table again and instead reuse the one from the decoder you already put in place.

    Also, I confirmed that there is quite a lot of improvements possible for the Enigma encoder in terms of compression ratio; I just need to find an efficient way to handle the copy word and incrementing word, as the rest can be dealt with pretty much like what I used for the LZSS formats.
     
  9. Flamewing

    Flamewing Elite Hacker Member

    Joined:
    Aug 28, 2011
    Messages:
    37
    Location:
    France
    So, I was running the tools to have some statistics to show and I discovered that the KosM encoder was far from optimal as it was not properly managing the internal padding of each module. I also did some things to improve the Nemesis encoder a bit by using some new heuristics. And finally, I found out a case (that does not occur in practice) were the Saxman could be improved: if it started with a sequence of zeroes. I updated the download link in the previous post; but here it is, for convenience.

    Anyway, stats: I reencoded all Sonic 1 Nemesis art, 256x256 blocks and its sole Kosinski art file; for Sonic 2, I reencoded all Nemesis art, all Kosinski art, all 16x16 and 128x128 blocks and all Saxman songs; and for S&K, I reencoded all Kosinski art, all Nemesis art and all moduled Kosinski art. I did this for both original KENS and the latest versions of my tools (in the download linked to above). The results:


                Uncomp Orig     KENS     Mine   Orig ratio KENS ratio My ratio KENS/Orig Mine/Orig Mine/KENS
    S1 256x256 335872 B   70896 B   70511 B   66300 B   21.11%     20.99%    19.74%   99.46%    93.52%    94.03%
    S1 Nemesis 361344 B  168306 B  166251 B  164198 B   46.58%     46.01%    45.44%   98.78%    97.56%    98.77%
    S1 Kos     4096 B    1424 B    1419 B    1366 B   34.77%     34.64%    33.35%   99.65%    95.93%    96.26%
    S1 Total    701312 B  240626 B  238181 B  231864 B   34.31%     33.96%    33.06%   98.98%    96.36%    97.35%
    S2 Saxman    29430 B   21935 B   21933 B   21819 B   74.53%     74.53%    74.14%   99.99%    99.47%    99.48%
    S2 16x16    42128 B   30688 B   30630 B   30342 B   72.84%     72.71%    72.02%   99.81%    98.87%    99.06%
    S2 128x128 262144 B   85104 B   84599 B   81098 B   32.46%     32.27%    30.94%   99.41%    95.29%    95.86%
    S2 Nemesis 329216 B  158758 B  157823 B  154990 B   48.22%     47.94%    47.08%   99.41%    97.63%    98.20%
    S2 Kos   238164 B  104816 B  104515 B  100446 B   44.01%     43.88%    42.18%   99.71%    95.83%    96.11%
    S2 Total    901082 B  401301 B  399500 B  388695 B   44.54%     44.34%    43.14%   99.55%    96.86%    97.30%
    S&K Nem   264000 B  113268 B  108641 B  107274 B   42.90%     41.15%    40.63%   95.91%    94.71%    98.74%
    S&K KosM   444544 B  205564 B  204595 B  197198 B   46.24%     46.02%    44.36%   99.53%    95.93%    96.38%
    S&K Kos 304096 B  126736 B  126350 B  120586 B   41.68%     41.55%    39.65%   99.70%    95.15%    95.44%
    S&K Total  1012640 B  445568 B 439586 B  425058 B   44.00%    43.41%    41.98%   98.66%    95.41%    96.70%

    All told, you can save 8762 B in S1, 12606 B in S2 and 20510 B in S3&K.

    Edit: fixed table formatting; also, sorry for double post (just noticed I did it).
     
    Last edited by a moderator: Nov 15, 2013
  10. nineko

    nineko I am the Holy Cat Member

    Joined:
    Mar 24, 2008
    Messages:
    1,902
    Location:
    italy
    I think you lost a digit when you formatted the table:

    That said, awesome work. Who cares about a double post if it carries interesting informations :)
     
  11. Flamewing

    Flamewing Elite Hacker Member

    Joined:
    Aug 28, 2011
    Messages:
    37
    Location:
    France
    Anyways, I found out a small but in the KosM encoder that prevented it from achieving its full potential -- it wasn't accounting for the 3-byte end-of-compression marker when accounting for the padding between modules. Now that it does, it is as good as it gets.

    I updated both posts with the new links and the table with the new results.

    Edit: And I also added back the missing digit, thanks for pointing it out.
     
    Last edited by a moderator: Nov 15, 2013
  12. DiscoTheBat

    DiscoTheBat Newcomer Exiled

    Joined:
    Jan 17, 2013
    Messages:
    15
    Indeed, it sure makes stuff faster, as I'm working with some transition effects and the extra speed may help my project.

    But as Bobesh8 has mentioned it, the plain Kos on S3K seems to crash/not function properly, albeit with some tweaks, the KosM seems to work perfectly with the code, I dunno what could possible cause this but for now, I'm just using the plain KosDec while using your modified KosDec on the KosM engine.
     
  13. DiscoTheBat

    DiscoTheBat Newcomer Exiled

    Joined:
    Jan 17, 2013
    Messages:
    15
    Sorry for the double post but I managed to find a workaround for people who want to hack S3K and can't use this code, while I was researching for why the code didn't worked right on S3K, I came up with a fix for the Normal Kos system which involves the change of some lines and the addition of two extra more, while this increases the cycle count a bit, it may not be enough to make a bad result for this improvement, in anyway, if anyone wants the code, here's: http://pastebin.com/ADWWVJph

    Also, Trox did helped me with the KosM stuff, just go to Set_Kos_Bookmark: and change from $42 to $40 and place the normal code as is, not my modified one, by doing that, you can have a fully working Kos_Decomp improved on your hack. Enjoy!
     
  14. DemonFox

    DemonFox Newcomer Member

    Joined:
    Aug 15, 2014
    Messages:
    10
    Dont know if this is just an issue that I've had that is some weird bug with Kega Fusion, Windows 7 64bit, and ReadySonic (that does not seem to show up with other emulators), but when you add vladikcomper's optimised Nemesis decompression subroutine to Sonic 1 - The Sega Logo loads the incorrect colours (and randomly works at other times) i.e.

    [​IMG]

    I fixed the issue by moving the ClearScreen command up the chain during Sega Screen (just though I'd share the fix in case any one else saw this bug),

    P.S I have no idea why the error is caused or why this seems to fix it :s


    ; ---------------------------------------------------------------------------
    ; Sega screen
    ; ---------------------------------------------------------------------------

    GM_Sega:                ; XREF: GameModeArray
            move.b    #$E4,d0
            bsr.w    PlaySound_Special ; stop music
            bsr.w    ClearPLC
            bsr.w    PaletteFadeOut
            bsr.w    ClearScreen ;move this here to fix the issue
            lea    ($C00004).l,a6
            move.w    #$8004,(a6)    ; use 8-colour mode
            move.w    #$8200+(vram_fg>>10),(a6) ; set foreground nametable address
            move.w    #$8400+(vram_bg>>13),(a6) ; set background nametable address
            move.w    #$8700,(a6)    ; set background colour (palette entry 0)
            move.w    #$8B00,(a6)    ; full-screen vertical scrolling
            clr.b    (f_wtr_state).w
            move    #$2700,sr
            move.w    (v_vdp_buffer1).w,d0
            andi.b    #$BF,d0
            move.w    d0,($C00004).l
            bsr.w    ClearScreen ;remove from here
    I added this so anybody who gets this error wont bug vladikcomper for a fix, like I did (Sorry again), that is an emulator bug, not an error with his optimsed decompressors

     
    Last edited by a moderator: Sep 22, 2014
    KCEXE likes this.
Thread Status:
Not open for further replies.