[Another cross-post from the blog. I figured that it is relevant here, being about creating a compressor for a compression format used by the Sonic games.]

Long ago, I made a compressor for Kosinski, which is a compression format seemingly created by Sega and used in various Mega Drive games, including Sonic the Hedgehog. There are two other compression formats used by those games as well: Enigma and Nemesis. Having recently become unhappy with the portability and code quality of Flamewing's mdcomp compression library, I decided that I would make my own Nemesis compressor. If only I knew what I was getting myself into...

I began by reading the trusty Sega Retro page on the format, which provided a detailed specification of the format and an overview of the algorithms used to encode data in it. With that information alone, I was able to produce a working decompressor. It was interesting to see just how simple Nemesis was: it is a nybble-based run-length encoding scheme with each run placed into a kind of dictionary that substitutes it with a variable-length code. The most common runs get the shortest codes, and the rarest runs get the longest codes. In effect, the data has two layers of compression applied to it. The process of substituting data with variable-length codes is commonly known as 'Huffman encoding', though this is a misnomer, as I will explain later.

With the decompressor done, it was time to move on to making the compressor. I had heard from Vladikcomper that the Nemesis data in the Sonic games was encoded with "Shannon coding", so I figured that I would begin with making a compressor which used that. Upon looking into it, I learned that Shannon coding was the first of its kind, dating all the way back to the 1940s. It derives the variable-length codes from the frequency of each symbol (nybble run) within the data. It was pretty easy to implement, but it soon became obvious that this was not the algorithm that Sega used in their Nemesis compressor: the lengths of the variable-length codes differed unpredictably, and, rather than being directly derived from the symbols' frequencies, the codes were clearly generated in a structured pattern.

Undeterred, I continued working on my compressor using Shannon coding. In the end, I had it working and successfully producing valid Nemesis-encoded data. Unfortunately, the use of Shannon coding had proven to be especially unwise, as the data produced by my compressor was considerably larger than the data produced by Sega's compressor.

Whilst reading about Shannon coding, I learned of a similar algorithm known as Fano coding, which is frequently confused with Shannon coding. This would explain why Vladikcomper claimed that Shannon coding was used by Sega's Nemesis compressor. So, I began modifying my compressor to use Fano coding instead. Fano coding works completely differently to Shannon coding, producing its variable-length codes by repeatedly performing a binary split on a list of the symbols sorted by their frequency within the data. A single recursive function was all that was needed to implement this coding scheme. This matched the behaviour of Sega's compressor (though unfortunately some sorting quirks keep the output from being binary-exact), meaning that my compressor was able to match its compression ratio, being neither more efficient nor less efficient!

While it was nice to have a Nemesis compressor with a decent compression ratio, I wanted it to be able to compress data better than Sega's compressor. After all, that is what I did with my Kosinski compressor all those years ago.
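As an aside, the recursive binary split at the heart of Fano coding is simple enough to sketch in a few lines of C. This is purely illustrative - the names are made up, and the Sega-specific quirks that come up later in these posts are skipped - so it is not the exact code from my compressor:

Code:
/* One symbol (nybble run), with its code built up one bit at a time,
   most-significant bit first. */
typedef struct Symbol
{
    unsigned long frequency;
    unsigned int code;
    unsigned int code_length;
} Symbol;

/* Recursively split a frequency-sorted (most common first) list of symbols
   into two halves of roughly equal total frequency, appending a 0 bit to
   the codes of the first half and a 1 bit to the codes of the second. */
static void FanoSplit(Symbol* const symbols, const unsigned int total_symbols)
{
    unsigned long grand_total, running_total;
    unsigned int split, i;

    if (total_symbols < 2)
        return;

    grand_total = 0;
    for (i = 0; i < total_symbols; ++i)
        grand_total += symbols[i].frequency;

    /* Split where the running total of frequencies reaches half of the
       grand total. */
    split = 0;
    running_total = 0;
    while (split < total_symbols - 1 && running_total * 2 < grand_total)
        running_total += symbols[split++].frequency;

    /* Append this level's bit to every symbol's code... */
    for (i = 0; i < total_symbols; ++i)
    {
        symbols[i].code = (symbols[i].code << 1) | (i < split ? 0 : 1);
        ++symbols[i].code_length;
    }

    /* ...and split each half again. */
    FanoSplit(symbols, split);
    FanoSplit(symbols + split, total_symbols - split);
}

A symbol stops collecting bits as soon as it is alone in its half, which is how the most common symbols end up with the shortest codes.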
To do that, I knew that there was only one solution: Huffman coding. Huffman coding is the perfection of what Shannon coding pioneered: while Shannon and Fano generate codes that are 'good', they are not perfect; in contrast, Huffman always produces codes that are optimal. The trade-off is that Huffman is a lot more complex to implement, requiring the construction and parsing of an entire binary tree using priority queues. While trivial to do with C++'s standard library, my compressor is written in ANSI C (C89), so I had to get very creative in order to implement this as easily, but also as efficiently, as possible. In the end, I was able to implement it using a single statically-allocated array.

However, something was wrong: the generated data was larger than the output of Fano coding. How could Huffman, an optimal code generator, be less efficient than Fano?

The first shortcoming that I noticed was with a quirk of the Nemesis format: there is a specific code that is reserved - 111111. Variable-length codes are not allowed to use this code nor a prefix of it (1, 11, 111, etc.). With Fano coding, codes that conflicted with the reserved code were rare, whereas with Shannon and Huffman they were common. Conflicting codes had to either be rejected (causing all instances of the symbol they represent to instead be inlined into the bit stream) or moved to another, longer, code.

The second shortcoming was that codes are limited in length: Nemesis only supports codes of up to 8 bits, while Huffman coding was generating its optimal codes on the assumption that they could be any length. The fact that these overly-long codes had to be rejected meant that the output of Huffman coding was no longer optimal! Granted, Shannon coding and Fano coding are susceptible to this as well, but somehow it was just worse with Huffman. However, Huffman has a solution to this: there exists a variant of Huffman coding that is specifically intended for codes with a limited maximum length, producing codes that are optimal within this constraint. Performing Huffman coding in this manner requires what is known as the package-merge algorithm, which can be implemented as a slightly modified version of the usual Huffman algorithm with an extra queue. Rather than produce a whole tree, this algorithm produces a series of them, and, rather than produce the codes directly, parsing these trees produces the codes' lengths. A simple algorithm can then be used to generate 'canonical Huffman' codes from these lengths.

Unbelievably, despite implementing this algorithm, the generated data was still too large. I was able to find one file that compressed slightly better with my compressor than with Sega's, but the vast majority compressed to a considerably larger size. Examining the data, I was able to discern a pattern: while Huffman coding allowed many more symbols to be assigned a code, the most common symbols would be assigned longer codes than they would be with Fano coding. The space saved by assigning codes to more symbols was counteracted by the most common symbols having longer codes. This was unavoidable, as Huffman is a 'greedy' algorithm, the same kind of algorithm that keeps standard LZSS compressors from being optimal: Huffman is only ideal when you do not consider that encoding fewer symbols may be better overall. I considered hackishly adding a heuristic, but could not stand the thought of it.
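As an aside, the 'simple algorithm' for turning code lengths into canonical codes is neat enough to be worth sketching. This is the textbook, DEFLATE-style version of it, ignoring the reserved 111111 prefix for clarity; the names are made up for this example rather than taken from my compressor:

Code:
#define MAX_CODE_LENGTH 8 /* Nemesis codes are at most 8 bits long. */

/* 'lengths[n]' is the code length chosen for symbol 'n' (0 meaning 'no
   code'); 'codes[n]' receives the canonical code for that symbol. */
static void LengthsToCanonicalCodes(const unsigned char* const lengths, unsigned int* const codes, const unsigned int total_symbols)
{
    unsigned int length_count[MAX_CODE_LENGTH + 1] = {0};
    unsigned int next_code[MAX_CODE_LENGTH + 1];
    unsigned int code, length, symbol;

    /* Count how many symbols were given each length. */
    for (symbol = 0; symbol < total_symbols; ++symbol)
        ++length_count[lengths[symbol]];

    /* Compute the first code of each length: it is the first code of the
       previous length, advanced past that length's codes and then doubled. */
    code = 0;
    length_count[0] = 0;
    for (length = 1; length <= MAX_CODE_LENGTH; ++length)
    {
        code = (code + length_count[length - 1]) << 1;
        next_code[length] = code;
    }

    /* Hand out codes in symbol order: symbols of the same length simply
       receive consecutive values. */
    for (symbol = 0; symbol < total_symbols; ++symbol)
        if (lengths[symbol] != 0)
            codes[symbol] = next_code[lengths[symbol]]++;
}

Because symbols of the same length simply receive consecutive code values, the lengths alone are enough to reconstruct every code.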
Then I considered brute-forcing: I could repeat the tree-generation process over and over again, using fewer and fewer symbols until there was only one left, all the while keeping track of which iteration would require the fewest bits to output (a sketch of this loop is included at the end of this post). This proved to work incredibly well in practice: the brute-forcing does not scale with the size of the input data, but rather with the number of symbols and the maximum code length, keeping the performance impact low; but, more importantly, this allowed Huffman to finally outdo Fano! It is not a huge amount - just a dozen bytes or so for small files, or a few hundred bytes for larger ones - but that is about on par with how my perfect Kosinski compressor compares to standard Kosinski compressors!

It has been a wild ride, but I finally have a Nemesis compressor that outperforms the one used by Sega. It still produces larger data than Flamewing's compressor, but that is to be expected considering the crazy black magic which that thing does. I am glad to finally see the completion of this project: I have been working on it non-stop for three days. So many headaches and so many hours spent comprehending obscure, poorly-documented algorithms. If only I were good at reading academic papers.

With this and clownlzss, I hope to replace all of Flamewing's compressors in ClownMapEd. Then I can stop getting a bunch of compiler warnings about the compressors treating the return value of 'std::istream::tellg()' as an integer. Ew.

The source code to my Nemesis compressor and decompressor can be found here: https://github.com/Clownacy/clownnemesis
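As promised, here is a sketch of the brute-force loop described above. The helper function is hypothetical - it stands in for 'generate length-limited codes for the N most common nybble runs, inline the rest, and measure the result' - and the real compressor differs in its details:

Code:
/* Hypothetical helper: how big would the output be if only the given
   number of (most common) nybble runs were assigned codes? */
unsigned long ComputeEncodedSizeInBits(unsigned int symbols_to_encode);

/* Try every possible number of coded symbols and keep whichever one
   produces the smallest output. */
static unsigned int FindBestSymbolCount(const unsigned int total_symbols)
{
    unsigned long best_size_in_bits = (unsigned long)-1;
    unsigned int best_symbol_count = 1;
    unsigned int symbol_count;

    for (symbol_count = total_symbols; symbol_count >= 1; --symbol_count)
    {
        const unsigned long size_in_bits = ComputeEncodedSizeInBits(symbol_count);

        if (size_in_bits < best_size_in_bits)
        {
            best_size_in_bits = size_in_bits;
            best_symbol_count = symbol_count;
        }
    }

    return best_symbol_count;
}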
(Cross-post) Making my Nemesis Compressor Accurate

While it was nice to create a compressor that surpassed Sega's own, what I wanted more was to create a compressor that perfectly mimicked it.

When making a disassembly or decompilation of a game, assets should ideally be stored in a user-friendly format; the assets should be easy to view and easy to modify. However, often assets are left in their original compressed form. The reason for this is that the disassembly/decompilation is aiming to produce an exact copy of the game, down to the binary, and a recompressed file would not be binary-identical to the original. By not decompressing the assets, the need for recompression is bypassed, avoiding this problem.

There are usually multiple ways to encode data in a given compression format. Some ways are better than others, such as by being faster to compress, faster to decompress, using the least RAM, or producing the smallest compressed data. Other ways aren't necessarily better, but are simply the result of the inner workings of the compressor, such as the order in which it does tasks, the sorting algorithm it uses, and even things as minute as whether it uses '<' or '<=' to compare two integers. Because of this, it is unlikely for two compressors to produce the exact same compressed data.

When I was creating my Nemesis compressor, I was specifically aiming to mimic the behaviour, and thus the output, of Sega's compressor. I noticed early on that the code table was arranged first by the length of the matching nybble run and then by the value of the nybble itself, so I replicated that behaviour in my compressor. Later, I figured out that the codes were generated through Fano coding, so I implemented that too. But, even with the compressor complete, with both of these behaviours accounted for, the compressed output differed. Not long afterwards, I noticed that nybble runs which occurred fewer than 3 times were never assigned a code, and mimicking this brought the data closer, but still without matching exactly.

At this point, the differences in the data were quite subtle: the codes were the same, the size of the data was similar, the inlined data was the same, but the assignment of codes to nybble runs was different. The original compressor appeared to perform an optimisation that mine did not: it was somehow deviating from the usual Fano coding to assign the shorter codes to the most common nybble runs. The nature of Fano coding makes the occasional assignment of shorter codes to less common nybble runs unavoidable, so I mistook this to mean that Sega's compressor was not using Fano coding, but instead something more efficient like Huffman coding.

After implementing Huffman coding, I found that it was far less accurate than Fano coding ever was, so I switched focus to making Fano more optimal. Eventually, I discovered that Sega's compressor was not deviating from standard Fano coding, but rather performing some post-processing: in order to ensure that the most common nybble runs were assigned the shortest codes, it used a sorting algorithm to reassign the codes after Fano coding had already assigned them. I also found that codes were assigned to nybble runs in a different order from the one in which the code table itself is arranged; fixing this simply required swapping the order of two nested for-loops (sketched below). But, even with both of these implemented, the compressed data was still not exactly the same.
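To illustrate what 'assignment order' means here: the assignment step amounts to a pair of nested loops handing the generated codes out, one after another, to the eligible nybble runs. All of the names in this sketch are hypothetical - it is not taken from my compressor or Sega's - but it shows why the loop order matters:

Code:
/* Hypothetical helpers, standing in for the real bookkeeping. */
int RunIsEligibleForACode(unsigned int nybble, unsigned int run_length);
void AssignCodeToRun(unsigned int nybble, unsigned int run_length, unsigned int code_index);

static void AssignCodesToRuns(void)
{
    unsigned int run_length, nybble, code_index = 0;

    /* Swapping which of these two loops is the outer one changes the order
       in which the runs receive their codes. */
    for (run_length = 1; run_length <= 8; ++run_length)
        for (nybble = 0; nybble < 0x10; ++nybble)
            if (RunIsEligibleForACode(nybble, run_length))
                AssignCodeToRun(nybble, run_length, code_index++);
}

Which of the two possible orders Sega's compressor uses is exactly the kind of detail that only matters when you are trying to be byte-for-byte identical.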
While the shortest codes correctly mapped to the most common nybble runs, codes of the same size would be assigned to different nybble runs. The fact that the differences in ordering only applied to codes of the same size was enough of a clue for me to tell that this was a quirk of the compressor's sorting algorithm.

Sorting algorithms can be divided into two groups: stable and unstable. Stable sorting algorithms sort the data without changing the relative order of values that compare equal, whereas unstable algorithms make no such guarantee. My compressor was using a stable sorting algorithm, and so all codes with the same length were in the same order both before and after sorting. For the output of Sega's compressor to have the codes in a different order meant that it was using an unstable algorithm. But how was I supposed to know which one it was? Every unstable sorting algorithm affects equal values in its own way, so I had to make my compressor use the exact same one that Sega's compressor used. I could have figured it out through trial and error, implementing every sorting algorithm and checking which one produced the correct output, but I did not want to do that, as sorting algorithms can be very complex and hard to write. Instead, I figured that I would simply examine the data and figure out what sorting algorithm could have created it.

Here are some codes generated by Sega's compressor:

Code:
Code 1110110 encodes nybble A of length 2 and was seen 19 times.
Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.
Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.
Code 1111000 encodes nybble F of length 3 and was seen 19 times.
Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.
Code 11110011 encodes nybble D of length 2 and was seen 16 times.
Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

And here are those same codes generated by my compressor:

Code:
Code 1110110 encodes nybble A of length 2 and was seen 19 times.
Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.
Code 11101111 encodes nybble 7 of length 3 and was seen 17 times.
Code 1111000 encodes nybble F of length 3 and was seen 19 times.
Code 11110010 encodes nybble D of length 2 and was seen 16 times.
Code 11110011 encodes nybble 8 of length 3 and was seen 16 times.
Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

As expected, the shortest codes are assigned to the most common nybble runs, but the ordering of the 8-bit codes differs: nybble runs 7-3, 8-3, and D-2 are in the wrong order. Across a variety of compressed data, I noticed a pattern: the ordering of equal nybble runs would only ever differ if those nybble runs were sandwiched between codes of a shorter length. This suggested that the reordering was a side-effect of moving a nybble run in order to sort it.

This is what the codes are before they are sorted:

Code:
Code 1110110 encodes nybble A of length 2 and was seen 19 times.
Code 11101110 encodes nybble F of length 3 and was seen 19 times.
Code 11101111 encodes nybble 8 of length 5 and was seen 19 times.
Code 1111000 encodes nybble 0 of length 6 and was seen 18 times.
Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.
Code 11110011 encodes nybble D of length 2 and was seen 16 times.
Code 1111010 encodes nybble 8 of length 3 and was seen 16 times.

Sorting is needed so that the nybble runs F-3 and 8-5 are assigned to codes 1111000 and 1111010, respectively.
In my compressor, this is done without disturbing the other codes, whereas in Sega's compressor it is not. In a moment of serendipity, I tried sorting the data by hand by swapping each nybble run with the one at its destination. I swapped F-3 with 0-6, and 8-5 with 8-3, resulting in this:

Code:
Code 1110110 encodes nybble A of length 2 and was seen 19 times.
Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.
Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.
Code 1111000 encodes nybble F of length 3 and was seen 19 times.
Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.
Code 11110011 encodes nybble D of length 2 and was seen 16 times.
Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

This exactly matched the Sega-compressed data! I looked into various unstable sorting algorithms, searching for one that worked by swapping in this way, and I found selection sort (a sketch of it is included at the end of this post). After making my compressor sort its codes using this algorithm, I was delighted to see that the nybble runs were in the same order! I had finally cracked it!

Recompressing other data exposed a few bugs which were affecting accuracy, as well as some edge-cases that the two compressors handled differently. For instance, when deciding whether to use XOR mode or not, Sega's compressor only compares the number of bytes used, rather than bits; when XOR mode and non-XOR mode use the exact same number of bytes, non-XOR mode is preferred. Additionally, when performing its binary chop, the Fano code generator does not round when dividing by 2. With these issues addressed, my compressor was able to perfectly recompress every bit of Nemesis-compressed data in the first two Sonic the Hedgehog games!

And that brings us to now. My compressor is able to both mimic Sega's compressor and surpass it. I've polished up the code and released a Windows executable on GitHub. As it is written in ANSI C (C89), I was able to compile it with Visual Studio 6, so it should run even on ancient versions of Windows. Hooray for portability.

A use-case for this compressor is Hivebrain's Sonic 1 disassembly. Years ago, my accurate Kosinski compressor was integrated into it, allowing some of the game's assets to be stored uncompressed. With my Nemesis compressor, this can finally be extended to the majority of the game's assets! With this compressor, my accurate Kosinski compressor, and the rediscovered original Saxman compressor, 3 out of 4 of the Sonic compression formats have accurate compressors! All that remains now is Enigma.
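As promised, here is a textbook sketch of selection sort, whose swap-into-place behaviour is what reorders equally-ranked entries in the way shown above. The data structure and sort key are illustrative; the exact details of Sega's compressor (such as which field is compared, and whether it uses '<' or '<='; are not guaranteed to match:

Code:
/* Illustrative only: a code and its length, as they might sit in an array
   alongside the frequency-sorted list of nybble runs. */
typedef struct Code
{
    unsigned int bits;
    unsigned int length;
} Code;

/* Textbook selection sort: repeatedly swap the best remaining element into
   place. */
static void SelectionSort(Code* const codes, const unsigned int total_codes)
{
    unsigned int i, j;

    for (i = 0; i < total_codes; ++i)
    {
        unsigned int shortest = i;

        /* Find the shortest code that has not been placed yet... */
        for (j = i + 1; j < total_codes; ++j)
            if (codes[j].length < codes[shortest].length)
                shortest = j;

        /* ...and swap it into place. Whatever previously occupied this slot
           is flung to wherever the shortest code came from - the same
           hand-swapping described above, and the reason that entries which
           compare as equal end up reordered. */
        if (shortest != i)
        {
            const Code temporary = codes[i];
            codes[i] = codes[shortest];
            codes[shortest] = temporary;
        }
    }
}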