Accurate Kosinski compressor

Discussion in 'Showroom' started by Clownacy, Oct 16, 2021.

  1. Clownacy

    (Apparently I can't make threads in the Staff Projects forum anymore... :( )

    Right now we have many Kosinski compressors, such as the ones in KENS, KENSSharp, mdcomp (a.k.a. KENSC), and clownlzss, but none of them are actually capable of compressing files in exactly the same way as Sega's original Kosinski compressor. This means that if you decompressed all of the Kosinski files in a Sonic disassembly, recompressed them, and then built that disassembly, the resulting ROM would not be the same as the original.

    Back in 2018 I started working on a compressor that would remedy this, and produce files identical to those of Sega's original compressor. While I did get pretty close, with my compressor being able to recompress most files accurately, there were still a few that differed. In the last few days, however, I've finally cracked it: the last inaccuracy is addressed, and I now have a compressor that accurately reproduces every single Kosinski-compressed file in the entire Sonic trilogy.

    This might seem like an odd niche, but it does have its uses. For example, the Sonic disassemblies have to compress their assembled Z80 sound driver code, and if the compressor they use is inaccurate, then the built ROM will be inaccurate. Until not too long ago, we'd been making do with a hacked-up version of KENS that would accurately compress that specific code, but not anything else.

    Another use for something like this is that it would allow for the disassemblies to store all of their assets completely uncompressed, making it simpler for level editors and the like to support them. The disassemblies could then compress the files as part of their build process, similar to what some of the Pokemon disassemblies do.

    Part of the reason that accurate Kosinski compression has been so hard to achieve is that Sega's original compressor was riddled with bugs. For example, while the Kosinski format itself allows for up to 0x100 bytes to be compressed at once, Sega's compressor would only ever compress up to 0xFD bytes. Likewise, while compressed data had a 'range' of 0x2000 bytes, Sega's compressor would only ever reach 0x1F03 at most.
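
    To make those two limits concrete, here is a minimal C sketch of the kind of check an accurate compressor would need when choosing a match. The macro and function names are hypothetical and are not taken from my compressor's source; only the four numbers come from the paragraph above.

```c
#include <stddef.h>

/* What the Kosinski format itself allows (figures from the post). */
#define FORMAT_MAX_MATCH_LENGTH   0x100
#define FORMAT_MAX_MATCH_DISTANCE 0x2000

/* What Sega's original compressor actually reaches (figures from the post). */
#define SEGA_MAX_MATCH_LENGTH   0xFD
#define SEGA_MAX_MATCH_DISTANCE 0x1F03

/* Hypothetical helper: non-zero if the format could encode this match. */
static int match_within_format_limits(size_t distance, size_t length)
{
    return distance >= 1
        && distance <= FORMAT_MAX_MATCH_DISTANCE
        && length <= FORMAT_MAX_MATCH_LENGTH;
}

/* Hypothetical helper: non-zero if the match also stays inside the
   tighter bounds that Sega's compressor never exceeds. */
static int match_within_sega_limits(size_t distance, size_t length)
{
    return distance >= 1
        && distance <= SEGA_MAX_MATCH_DISTANCE
        && length <= SEGA_MAX_MATCH_LENGTH;
}
```

    A match of 0x100 bytes is perfectly valid Kosinski data, so a decompressor has to accept it, but an accurate compressor must never produce one.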

    The final bug is perhaps the most obscure and complicated: the final block of compressed data at the end of a file would often be unusual, in the sense that it wouldn't reference data in the same pattern that the other blocks do. It turns out that the reason for this is that the original compressor would accidentally (or, rather, deliberately, as part of a bizarre optimisation) read past the end of the file when scanning for data to compress. Because of this, it would read garbage data beyond the end of the file, and only match the to-be-compressed data with data earlier in the file that was followed by a pattern of bytes perfectly matching the garbage data. This might seem like an impossible behaviour to recreate, as the garbage data would have been random bytes in RAM located after the file buffer, which cannot be recreated. However, it turns out that the original compressor actually streamed its file data into a ring buffer and, when the end of the file was reached, simply stopped reading new data into it. This means that the garbage data which the compressor ends up reading is just data from 0x2000 bytes earlier in the file (or just a series of 00 bytes, if the file was less than 0x2000 bytes long). This is the inaccuracy that took me over three years to figure out.
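
    The over-read can be modelled without an actual ring buffer. The sketch below is a simplification, not the code from my compressor: the function name is hypothetical, and it assumes that byte N of the file lands in slot N modulo 0x2000 and that the buffer starts out zero-filled.

```c
#include <stddef.h>

#define RING_SIZE 0x2000 /* Ring buffer size, per the post. */

/* Hypothetical helper: the byte a scan would see at logical position `pos`.
   Only over-reads of less than RING_SIZE bytes past the end are modelled;
   anything further simply yields 0 here. */
static unsigned char byte_seen_by_compressor(const unsigned char *file,
                                             size_t file_size, size_t pos)
{
    /* Ordinary in-bounds read. */
    if (pos < file_size)
        return file[pos];

    /* Past the end: no new data was streamed in, so the slot still holds
       the byte written one full lap (RING_SIZE bytes) earlier. */
    if (pos >= RING_SIZE && pos - RING_SIZE < file_size)
        return file[pos - RING_SIZE];

    /* The slot was never written: assumed to still be its initial 00. */
    return 0;
}
```

    A match finder built on a lookup like this compares candidate matches against the stale bytes instead of stopping at the end of the file, which is how the odd final blocks come about.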

    With this bug correctly emulated, my compressor now passes a test suite of every single bit of Kosinski-compressed data that I could find in Sonic 1, Sonic 2, Sonic 3, and Sonic & Knuckles. I'm hoping to expand this test suite with more 'official' Kosinski files in the future to catch any potential remaining inaccuracies.

    By the way, in case anyone reading this doesn't know, we already have an accurate Saxman compressor, as we've actually managed to find the original Saxman compressor's source code. We haven't had that kind of luck with Kosinski, however, which is why an accurate recreation is necessary. Fun fact: it was actually because of this source code that I was able to figure out the final Kosinski bug... because the Saxman compressor does the exact same thing.

    Anyway, you can find my compressor on GitHub. To use it, you can either use the command line, or just drag-and-drop files onto it. The compressed file will be called 'out.kos' for Kosinski, and 'out.kosm' for Moduled Kosinski.

    For those interested in the code, it's written in C99 (but also compiles as valid C++11), and released under the zlib licence, so it should be fairly simple to incorporate into other projects. I hope to convert it to C89 at some point, for extra portability.
    Last edited: Oct 19, 2021