Table-of-Contents
Lempel–Ziv–Storer–Szymanski (LZSS) is a Dictionary-type lossless data compression algorithm that was created in 1982. For more information, see the Wikipedia article on LZSS or the 1989 section in this README.
Complete rewrite of the original 1989 LZSS.c public domain code written by Haruhiko Okumura. Rebuilt from the ground up for C23 with zero-copy I/O, cross-platform native compilation, and 7th Guest / retro game format compatibility.
-
Zero-Copy Memory-Mapped I/O
- Input files are memory-mapped directly (
mmapon POSIX,MapViewOfFileon Windows) - Eliminates per-byte
fgetc/fputcoverhead entirely - Sequential access hints (
MADV_SEQUENTIAL) for optimal kernel prefetching
- Input files are memory-mapped directly (
-
Unified Cross-Platform Design
- Single source file compiles natively on Linux and Windows
- Platform-specific code isolated via
#ifdef _WIN32blocks - No runtime dependencies or emulation layers
-
Original Parameters Preserved
N = 4096(ring buffer size)F = 18(maximum match length)THRESHOLD = 2(minimum match for encoding)- Fully compatible with 7th Guest, 11th Hour, and other retro game LZSS formats
-
Binary Search Tree Matching
- O(log N) average-case match finding using 256 separate BSTs (one per first byte)
- Original Okumura algorithm preserved for correctness and compatibility
-
Optimized Memory Layout
- Compact
Statestruct (~45KB) contains ring buffer and all tree arrays - Single allocation per encode/decode operation
- Output buffer with geometric growth (1.5x) to minimize reallocations
- Compact
-
Progress Reporting
- Real-time encoding progress displayed every 1MB
- Compression ratio reported on completion
-
C23 Compliance
_Static_assertfor compile-time parameter validation- Modern enum declarations with explicit types
- Clean separation of concerns
-
Minimal Footprint
- ~220 lines of code (down from 400+ in previous versions)
- No external dependencies beyond libc
- Single source file, no headers required
-
Binary-Safe
- Handles all 256 byte values correctly
- No ASCII assumptions anywhere in the codebase
-
Cross-Compilation Support
- Native Windows EXE from Linux using
clang-cl+ Windows SDK viaxwin - No Wine, no MSYS2, no emulation—produces genuine PE/COFF executables
- MSVC ABI compatible (static CRT linking)
- Native Windows EXE from Linux using
-
Visual Build Output
- Color-coded terminal output with emoji status indicators
- Build timing and binary size reporting
- One command:
./build.sh linuxor./build.sh windows
Cross-platform build system using Clang. Produces native binaries for both Linux and Windows from a single Linux host.
# Show help
./build.sh
# Build for Linux (native ELF)
./build.sh linux
# Build for Windows (native PE/COFF, no emulation)
./build.sh windows
# Build both platforms
./build.sh all
# Clean build artifacts
./build.sh cleanRequirements:
clang(C23 support)clang-clandlld-link(for Windows cross-compilation)- Windows SDK via
xwin(auto-installed on first Windows build)
Output:
./lzss— Linux x64 ELF binary./lzss.exe— Windows x64 PE binary (runs natively on Windows, no Wine needed)
The original code is written in C and is compatible with C89 in the context of Linux. Only the information pertaining to LZSS from Haruhiko Okumura's 1989 COMPRESS.TXT file has been surfaced in this README.
This scheme is initiated by Ziv and Lempel [1]. A slightly modified version is described by Storer and Szymanski [2]. An implementation using a binary tree is proposed by Bell [3]. The algorithm is quite simple: Keep a ring buffer, which initially contains "space" characters only. Read several letters from the file to the buffer. Then search the buffer for the longest string that matches the letters just read, and send its length and position in the buffer.
If the buffer size is 4096 bytes, the position can be encoded in 12 bits. If we represent the match length in four bits, the <position, length> pair is two bytes long. If the longest match is no more than two characters, then we send just one character without encoding, and restart the process with the next letter. We must send one extra bit each time to tell the decoder whether we are sending a <position, length> pair or an unencoded character.
The accompanying file LZSS.C is a version of this algorithm. This implementation uses multiple binary trees to speed up the search for the longest match. All the programs in this article are written in draft-proposed ANSI C. I tested them with Turbo C 2.0.
- [1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).
- [2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951 (1982).
- [3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).
LZHUF.C is Copyrighted, the rest is Public Domain.
- Complete rewrite of LZSS implementation in C23
- Zero-copy memory-mapped I/O (
mmap/MapViewOfFile) - Binary search tree matching (O(log N) average case)
- Cross-platform build system with native Windows EXE output via
clang-cl+xwin - Reduced codebase from 400+ lines to ~220 lines
- Fixed critical bug:
tree_initwas usingmemsetwithNIL & 0xFF(=0) instead of proper 16-bit NIL initialization - Fixed pointer type mismatch in
insert()function - Added real-time progress indicator during encoding
- Preserved original 1989 Okumura parameters (N=4096, F=18, THRESHOLD=2) for 7th Guest compatibility
- Removed legacy C++17 version and Visual Studio solution
C23version created, compiled with no warnings usingclang -std=c23 -Wall -O3 -g lzss.c -o lzss.exe, but there are definitely issues to debug/troubleshoot.- Documentation updates.
- Visual Studio 2022 C++ Solution and refactored
lzss.cppusing latest AI models was added! (legacy, DO NOT USE, included & archived for future analysis)
- Used
lldbto figure out why the new C++ version couldn't break out of thefor(;;)loop inInsertNode. Turns out that thervalue was one byte larger than theC89version in a side-by-side comparison. - Somehow, even using co-pilot it missed the
InitTreecall inside theEncodefunction. This was causing thervalue to be off by one byte. - Documentation updates.
- Added my own style of commenting and layout to the code
- Added the
mainfunction / Entry Point to the code - Added a
helpTextfunction to display the help text to the command-line - Added the ASCII Art logo to the code
- Added error checking in the
mainfunction for the input and output files for the Stand-alone solution.
- Moved from
C89toC++17 - Changed
definetoconstwhere applicable - Changed
FILEtofstreamwhere applicable - Changed most variable names to camelCase and full english words
- Replaced the various
printfcalls with a mix ofstd::cout,std::fixed, andstd::precisionas needed.