SIMD integer compression · C99

Decode billions of integers per second.

MaskedVByte is a fast, vectorized decoder for VByte-compressed 32‑bit integers, with optional differential (delta) coding. It turns the venerable byte-oriented varint format into a SIMD-accelerated data path.

6.7 GB/s
Plain decode throughput
2.2B/s
Delta integers per second
~2×
Faster than scalar VByte
1 file
Drop-in C library

Measured on a single core decoding 16M integers; see Performance for details.

Why MaskedVByte?

The classic VByte (varint) format is compact and ubiquitous — MaskedVByte makes it fast to read back.

Vectorized decoding

Uses SSE4.1 to decode many integers at once with a mask-driven shuffle, instead of branching byte by byte.

📦

Standard VByte format

Reads ordinary continuation-bit varints. Compatible with the format used across search engines and databases.

📈

Differential coding

Built-in delta variants for sorted sequences and small gaps — fewer bytes per integer and even faster decoding.

🔎

Random access

select and search helpers let you jump into and scan delta-coded streams without full decompression.

🪶

Tiny & portable

Plain C99 with a clean header API. No dependencies. Builds with make or CMake; vendor it or install it.

🔬

Research-backed

The algorithm is described in a peer-reviewed paper and shipped in production search systems such as Lucene forks.

How it works

In the VByte format, each 32-bit integer is stored in one to five bytes. Each byte carries seven payload bits, least-significant group first, and its high bit (the continuation bit) says whether the integer continues into the next byte. A scalar decoder walks the stream one byte at a time, branching on the continuation bit of every byte.

MaskedVByte instead:

  • Loads a 16‑byte block and extracts all continuation bits into a mask in one move.
  • Uses the mask to look up a shuffle pattern that gathers the right bytes for several integers at once.
  • Reconstructs the integers in parallel SIMD lanes, then advances past the bytes it consumed.

The result is a decoder whose cost is driven by data width rather than per-byte branching — turning unpredictable branches into predictable vector work.
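The mask-extraction step can be illustrated without intrinsics. The sketch below is a portable, scalar stand-in for what a byte-mask instruction such as `_mm_movemask_epi8` produces in a single operation on x86. It is not the library's code, and the helper names `continuation_mask16` and `count_complete_ints` are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Collect the high (continuation) bit of each of 16 bytes into a 16-bit mask.
   Bit i of the result is the continuation bit of block[i].
   Hypothetical helper: a scalar emulation of a SIMD byte-mask instruction. */
static uint16_t continuation_mask16(const uint8_t block[16]) {
  assert(block != NULL);
  uint16_t mask = 0;
  for (int i = 0; i < 16; ++i) {
    mask |= (uint16_t)((block[i] >> 7) << i);
  }
  return mask;
}

/* Count the integers that end inside the block: every byte whose
   continuation bit is clear terminates exactly one integer. */
static int count_complete_ints(uint16_t mask) {
  int count = 0;
  for (int i = 0; i < 16; ++i) {
    if (((mask >> i) & 1u) == 0) {
      ++count;
    }
  }
  return count;
}
```

For the byte stream 0x78 0xE8 0x07 0x03 0xF0 0xA2 0x04 (the integers 120, 1000, 3, 70000) padded with zero bytes to 16, the mask is 0x0032: only bytes 1, 4, and 5 have their continuation bit set. A vectorized decoder can use such a mask as a table index to pick a shuffle pattern, rather than testing each byte in turn.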

vbyte_stream
// 4 integers → variable-length bytes
// high bit = "continues"

  120  → [0x78]
 1000  → [0xE8 0x07]
    3  → [0x03]
70000  → [0xF0 0xA2 0x04]

/* scalar: branch on every byte
   masked: read the mask once,
           shuffle, decode in lanes  */
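
For reference, the scalar baseline that the comment above describes might look like the following. This is a minimal sketch of a byte-at-a-time VByte codec, written for illustration; the function names `scalar_vbyte_put` and `scalar_vbyte_get` are hypothetical and are not part of the MaskedVByte API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one value: 7 payload bits per byte, low bits first, with the
   high bit set on every byte except the last. Returns bytes written (1..5). */
static size_t scalar_vbyte_put(uint32_t value, uint8_t *out) {
  assert(out != NULL);
  size_t n = 0;
  while (value >= 0x80u) {
    out[n++] = (uint8_t)((value & 0x7Fu) | 0x80u);
    value >>= 7;
  }
  out[n++] = (uint8_t)value;
  return n;
}

/* Decode one value, branching on each byte's continuation bit.
   Reads at most 5 bytes; returns the number of bytes consumed. */
static size_t scalar_vbyte_get(const uint8_t *in, uint32_t *value) {
  assert(in != NULL && value != NULL);
  uint32_t v = 0;
  size_t n = 0;
  for (unsigned i = 0; i < 5; ++i) {
    uint8_t byte = in[n++];
    v |= (uint32_t)(byte & 0x7Fu) << (7u * i);
    if ((byte & 0x80u) == 0) {
      break; /* this is the per-byte branch a SIMD decoder avoids */
    }
  }
  *value = v;
  return n;
}
```

Encoding 1000 with this sketch yields the two bytes 0xE8 0x07 and encoding 70000 yields 0xF0 0xA2 0x04, matching the stream shown above.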

Quick start

Clone, build, and decode in a few lines. The primary target is an x86‑64 CPU with SSE4.1; an ARM/NEON shim is also included.

build.sh
# clone
git clone https://github.com/fast-pack/MaskedVByte
cd MaskedVByte

# build the library + tests
make
./unit          # quick correctness test

# or with CMake
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build

example.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include "varintencode.h"
#include "varintdecode.h"

int main(void) {
  int N = 5000;
  uint32_t *in     = malloc(N * sizeof(uint32_t));
  // VByte needs at most 5 bytes per 32-bit integer
  uint8_t  *comp   = malloc(N * 5);
  uint32_t *recov  = malloc(N * sizeof(uint32_t));
  if (!in || !comp || !recov) return 1;

  for (int k = 0; k < N; ++k) in[k] = 120;

  // encode with classic VByte; n is the compressed size in bytes
  size_t n = vbyte_encode(in, N, comp);
  // ... decode fast with MaskedVByte
  masked_vbyte_decode(comp, recov, N);

  printf("Compressed %d ints to %zu bytes\n", N, n);

  free(in);
  free(comp);
  free(recov);
  return 0;
}
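
When sizing the compressed buffer, keep in mind that a 32-bit value can take up to five VByte bytes, so `5 * N` bytes is a safe upper bound. If you would rather compute the exact size before encoding, a small helper along these lines is enough. The names `vbyte_len` and `vbyte_total_len` are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one 32-bit value in VByte (1..5):
   one byte per started group of 7 significant bits. */
static size_t vbyte_len(uint32_t value) {
  size_t n = 1;
  while (value >= 0x80u) {
    value >>= 7;
    ++n;
  }
  return n;
}

/* Exact compressed size of an array; never exceeds 5 * count. */
static size_t vbyte_total_len(const uint32_t *in, size_t count) {
  assert(in != NULL || count == 0);
  size_t total = 0;
  for (size_t i = 0; i < count; ++i) {
    total += vbyte_len(in[i]);
  }
  return total;
}
```

For an input of 5000 copies of the value 120, each value fits in one byte, so the exact size is 5000 bytes.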

Performance

Throughput from the bundled benchmark decoding 16,777,216 integers, 5 repeats, single core.

./benchmark
MaskedVByte benchmark: 16777216 integers, 5 repeats

plain decode : 1384.94 mis/s   (6.751 GB/s, 4.87 bytes/int)
delta decode : 2221.56 mis/s   (3.341 GB/s, 1.50 bytes/int)
select_delta : validated (100000 random slots)
search_delta : validated (100000 random keys)

All results validated. Code looks good.

Figures above are from one representative run on Apple Silicon via the NEON shim; absolute numbers vary by CPU and data. On x86‑64 with SSE4.1, the original studies report MaskedVByte decoding at roughly twice the speed of an optimized scalar VByte decoder.

API at a glance

The full surface is two headers in include/. Here are the functions you will reach for most.

Function                                                    What it does

Encoding
vbyte_encode(in, length, bout)                              Encode an array of integers with classic VByte.
vbyte_encode_delta(in, length, bout, prev)                  Delta-encode a sorted array starting from prev.

Decoding
masked_vbyte_decode(in, out, length)                        Vectorized decode of length integers.
masked_vbyte_decode_delta(in, out, length, prev)            Vectorized decode of a delta-coded stream.
masked_vbyte_decode_fromcompressedsize(in, out, inputsize)  Decode exactly inputsize compressed bytes.
masked_vbyte_decode_fromcompressedsize_delta(...)           Same, for a delta-coded stream.

Random access (delta)
masked_vbyte_select_delta(in, length, prev, slot)           Return the value at a given position.
masked_vbyte_search_delta(in, length, prev, key, presult)   Find the first value ≥ key.

Prefer the delta variants when your data is sorted or has small gaps — fewer bytes and faster decoding.
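
To make the delta idea concrete, here is a minimal scalar sketch of the transform itself, operating on plain `uint32_t` arrays rather than on compressed bytes. It is an illustration under stated assumptions, not the library's implementation: the names `delta_encode`, `delta_decode`, and `delta_search` are hypothetical, and the input is assumed to be sorted in ascending order with the first value at least `prev`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace each value by its gap from the previous one, starting from prev.
   Assumes ascending input, so every gap is non-negative. */
static void delta_encode(const uint32_t *in, size_t count, uint32_t prev,
                         uint32_t *gaps) {
  for (size_t i = 0; i < count; ++i) {
    assert(in[i] >= prev);
    gaps[i] = in[i] - prev;
    prev = in[i];
  }
}

/* A running (prefix) sum over the gaps restores the original values. */
static void delta_decode(const uint32_t *gaps, size_t count, uint32_t prev,
                         uint32_t *out) {
  for (size_t i = 0; i < count; ++i) {
    prev += gaps[i];
    out[i] = prev;
  }
}

/* Scan the gaps while keeping a running sum; return the index of the first
   value >= key and store that value in *result, or return count if no such
   value exists. No output array is materialized. */
static size_t delta_search(const uint32_t *gaps, size_t count, uint32_t prev,
                           uint32_t key, uint32_t *result) {
  for (size_t i = 0; i < count; ++i) {
    prev += gaps[i];
    if (prev >= key) {
      *result = prev;
      return i;
    }
  }
  return count;
}
```

The space saving comes from the gaps being small. The sorted values 100000, 100003, 100010 each need three VByte bytes (nine in total), whereas their gaps from `prev = 0` are 100000, 3, 7 and need 3 + 1 + 1 = 5 bytes.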

Use it from your project

After cmake --install, link against the exported target — or vendor the repository and add it as a subdirectory. Either way you get maskedvbyte::maskedvbyte.

The library is released under the permissive Apache 2.0 license, so it drops cleanly into both open-source and commercial projects.

CMakeLists.txt
# installed package
find_package(maskedvbyte CONFIG REQUIRED)
target_link_libraries(your_target
  PRIVATE maskedvbyte::maskedvbyte)

# or vendored as a subdirectory
add_subdirectory(path/to/MaskedVByte)
target_link_libraries(your_target
  PRIVATE maskedvbyte::maskedvbyte)

Citing this work

If MaskedVByte helps your research, please cite the papers behind it.

Jeff Plaisance, Nathan Kurz, Daniel Lemire. Vectorized VByte Decoding. International Symposium on Web Algorithms (iSWAG), 2015. arXiv:1503.07387
Daniel Lemire, Nathan Kurz, Christoph Rupp. Stream VByte: Faster Byte-Oriented Integer Compression. Information Processing Letters 130, February 2018, pp. 1–6. arXiv:1709.08990

Related libraries

Part of a broader family of high-performance integer compression tools.