I found the Saxman compressor source code

Discussion in 'Discussion & Q&A' started by Clownacy, Apr 10, 2020.

  1. Clownacy

    Clownacy Retired Staff lolololo Member

    Joined:
    Aug 15, 2014
    Messages:
    1,015
    I cannot believe I'm writing this. On an absolute whim, I found the source code to Sega's Saxman compressor, used for Sonic 2's sound engine and music files.

    Saxman is not completely custom: it's based on some public-domain Japanese LZSS code from 1989, by Haruhiko Okumura. The only modification, to my knowledge, is the addition of Saxman's unique ability to compress runs of 00 bytes - this is a mere eight-line edit.

    You can find the code here, in a file called "LZSS.C":
    https://web.archive.org/web/1999020...edu/pub/simtelnet/msdos/arcutils/lz_comp2.zip

    Here's the modification to make it produce zero-run matches:

    Code:
    void InsertNode(int r)
       /* Inserts string of length F, text_buf[r..r+F-1], into one of the
          trees (text_buf[r]'th tree) and returns the longest-match position
          and length via the global variables match_position and match_length.
          If match_length = F, then removes the old node in favor of the new
          one, because the old one will be deleted sooner.
          Note r plays double role, as tree node and position in buffer. */
    {
       int  i, p, cmp;
       unsigned char  *key;
       cmp = 1;  key = &text_buf[r];  p = N + 1 + key[0];
       rson[r] = lson[r] = NIL;  match_length = 0;
       for ( ; ; ) {
           if (cmp >= 0) {
               if (rson[p] != NIL) p = rson[p];
               else {  rson[p] = r;  dad[r] = p;  return;  }
           } else {
               if (lson[p] != NIL) p = lson[p];
               else {  lson[p] = r;  dad[r] = p;  return;  }
           }
           /* NEW CODE START */
           if (ftell(infile) < 0xFFF - THRESHOLD) {
               for (i = 0; i < F; i++)
                   if ((cmp = key[i] - 0) != 0)  break;
               if (i > match_length) {
                   match_position = 0xFFF - F - THRESHOLD;
                   if ((match_length = i) >= F)  break;
               }
           }
           /* NEW CODE END */
           for (i = 1; i < F; i++)
               if ((cmp = key[i] - text_buf[p + i]) != 0)  break;
           if (i > match_length) {
               match_position = p;
               if ((match_length = i) >= F)  break;
           }
       }
       dad[r] = dad[p];  lson[r] = lson[p];  rson[r] = rson[p];
       dad[lson[p]] = r;  dad[rson[p]] = r;
       if (rson[dad[p]] == p) rson[dad[p]] = r;
       else                   lson[dad[p]] = r;
       dad[p] = NIL;  /* remove p */
    }

    I still can't believe this... I never thought I'd see this code in my life.

    Just this morning, I started writing my own Saxman compressor, in the hopes of making it able to produce files identical to the original (so far, every community-made Saxman compressor is inaccurate), but I hit a brick wall when I realised the original files have no pattern to how they select dictionary-matches - sometimes they use the earliest data in the file, sometimes they use the latest, and sometimes they use data in the middle, even when there's perfectly good data before and after.

    Having been stumped by this issue, I decided to check-out an old 1989 compressor I read about on Wikipedia, to see how devs would compress LZSS files back-in-the-day.

    To my absolute amazement, I saw the compression format shared numerous similarities with Saxman - heck, even the decompressor shared similarities with the 68k/Z80 ASM decompressors used in Sonic 2: just look at how they detect when the descriptor field is empty!

    LZSS.C:
    Code:
           if (((flags >>= 1) & 256) == 0) {
               if ((c = getc(infile)) == EOF) break;
               flags = c | 0xff00;       /* uses higher byte cleverly */
           }                           /* to count eight */

    Sonic 2's 68k decompressor:
    Code:
        lsr.w    #1,d6    ; Next descriptor bit
        btst    #8,d6    ; Check if we've run out of bits
        bne.s    +    ; (lsr 'shifts in' 0's)
        jsr    SaxDec_GetByte(pc)
        move.b    d0,d6
        ori.w    #$FF00,d6    ; These set bits will disappear from the high byte as the register is shifted
    +

    Sonic 2's Z80 decompressor:
    Code:
        srl    c            ; c >> 1 (active control byte)
        srl    b            ; b >> 1 (just a mask that lets us know when we need to reload)
        bit    0,b            ; test next bit of 'b'
        jr    nz,+            ; if it's set, we still have bits left in 'c'; jump to '+'
        ; If you get here, we're out of bits in 'c'!
        call    zDecEndOrGetByte    ; get next byte -> 'a'
        ld    c,a            ; a -> 'c'
        ld    b,0FFh            ; b = FFh (8 new bits in 'c')
    +

    And, yes, it produces identical files.

    You know... for years, I've wanted to know the real names of the various compression formats used in the Mega Drive Sonic games. Time after time we've come so close, be it through leaked function names in Sonic 2's Nick Arcade prototype, leftover compilation data in Sonic & Knuckles Collection, or a complete symbol-list in Sonic CD PS2. Each time, the names of the decompression functions were conveniently absent. At least now, we finally have some closure on Saxman - it doesn't have a name! It's just 'LZSS.C'...

    EDIT (2020/10/07):
    It turns out there's a much simpler way to make LZSS.C produce Saxman files: just change both uses of the ASCII space character to 0. I've updated Sega Retro's Saxman page with instructions.
     
    Last edited: Oct 7, 2020
    ProjectFM, Monitor, KCEXE and 15 others like this.
  2. LazloPsylus

    LazloPsylus The Railgun Member

    Joined:
    Nov 25, 2009
    Messages:
    Location:
    Academy City
    Confirms Sega programmers knew LZ-based compression is awesome even back in the nineties.

    Seriously, though, another mystery solved. Nice job and congratulations on nabbing it.
     
  3. ValleyBell

    ValleyBell Well-Known Member Member

    Joined:
    Dec 23, 2011
    Messages:
    166
    Oh, so they used an LZSS modifcation as well. Great find!

    LZSS was very common at that time.
    • MegaDrive Sorcerian uses LZSS as well for compressing its sound driver and SMPS music.
      The 4 KB dictionary buffer is initialized with 00s.
      This is a common modification from the original implementation, which initializes it with spaces.
    • Games by Wolfteam on the MegaDrive and X68000 use LZSS as well. (Arcus Odyssey's sound driver on the MD, almost all data files in X68000 Wolfteam games.)
      It's main modification is to initialize the dictionary with various different byte patterns.
    I knew about that LZSS implementation before. There is a copy of the archive on ROMhacking.net: http://www.romhacking.net/documents/625/
     
  4. Nat The Porcupine

    Nat The Porcupine Point & Click Funny Man Member

    Joined:
    Jun 23, 2017
    Messages:
    71
    Location:
    Pennsyltucky
    I really hate to be the bearer of bad news but unfortunately, at least in the case of Sonic 2's sound driver, this does not appear to be the case. I decided to test this out in the latest version of the S2 disasm, since you committed this change to it, built a Rev01 ROM, extracted the compressed driver from it & compared it to one I extracted from a stock Rev01 rom myself. The length in bytes is not identical for the compressed drivers:

    Compressed size of the vanilla driver: 0xF64 bytes
    Compressed length of the built driver: 0xF4A bytes (1A bytes smaller)

    Even when taking into account the mysterious "4E" byte that appears at the end of the data on the stock compressed driver data (which you did acknowledge in the repo commit), the difference in length is fairly significant; it compresses slightly better. I'm no expert on compression methods, so I'm not really able to explain what exactly any binary differences between the two compressed drivers mean or what about this implementation could have caused them, but it does appear a bit more tweaking will be necessary before this can be labeled a truly era-accurate compressor. This is very very close though & it's good to know that we may yet be able to build a fully bit-perfect ROM. Keep up the good work, man.
     
    ProjectFM likes this.
  5. Clownacy

    Clownacy Retired Staff lolololo Member

    Joined:
    Aug 15, 2014
    Messages:
    1,015
    Did you run build.bat on its own? You need to pass '-a' to it to make it use the authentic compressor. Without it, it uses my clownlzss library, which produces smaller files by using the same technique as KENSC.

    chkbitperfect.bat uses '-a' and, when I tested it, all three produced ROMs were bit-perfect.
     
    Last edited: Oct 8, 2020
    ProjectFM likes this.
  6. Nat The Porcupine

    Nat The Porcupine Point & Click Funny Man Member

    Joined:
    Jun 23, 2017
    Messages:
    71
    Location:
    Pennsyltucky
    Ah, nevermind then, that was my bad. False alarm.