K
kompress-ultra for Headroom
Headroom RFC / Proposal

Asymmetric Loss Modulation for Context Compression

Integrating learned context-pruning into Headroom. Achieve ~78% token savings and ~75% latency reduction while maintaining a near-perfect 0.993 exact-keep rate on critical reasoning tokens.

Interactive Kompress-Ultra Playground

See how the 4-role pipeline compresses chat history while preserving the critical-syntactic safety floor ($T_{\text{crit}}$).

Original: 0 tokens
Compressed: 0 tokens
0% Saved
1. Pruner & Safety Floor
2. Rewriter Output (Ultra Mode)

The Voting Ensemble Paradox

A multi-checkpoint voting ensemble is meant to be conservative, but under asymmetric training floors, the intuition inverts. Weak checkpoints veto correct keeps on their weakest strata, causing a stratum-wise Pareto collapse.

Interactive Paradox Simulator
Voter 1 (v3 - Noisy Floor) Weak: Paths
Recall (Identifiers)92%
Recall (File Paths)68%
Voter 2 (v5 - Domain-specific) Weak: Identifiers
Recall (Identifiers)70%
Recall (File Paths)95%
Ensemble Result Paradox Collapse
Ensemble Recall (Identifiers)70%
Ensemble Recall (File Paths)68%
Under AND voting, the ensemble's recall collapses to the weakest voter on each stratum (70% for Identifiers, 68% for File Paths), Pareto-dominated by any single strong model.

Theoretical Core

Learned context pruning improves long-context agent efficiency but introduces the Voting Ensemble Paradox. Under unanimity-to-keep (AND) voting ($k=1$ drop-if-any), the ensemble eviction indicator equals the pointwise maximum of the individual voter indicators:

Iens(x) = ⋁i=1..N Ii(x) = Ii*k(x)

This yields a stratum-wise Pareto collapse where the ensemble's recall equals that of the weakest voter on each stratum. As a corrective, `kompress-ultra` employs three core mechanisms:

  • Mechanism A (Asymmetric Loss Modulation): Adds a $3.0\times$ weighted cross-entropy penalty on critical-syntactic tokens ($T_{\text{crit}}$) during fine-tuning, concentrating gradients on the weakest strata.
  • Mechanism B (Post-Inference Regex Override): A surgical safety net applied after model scoring to force-keep critical tokens (paths, hex addresses, identifiers).
  • Mechanism C (Self-Labeling Loop): Closes the training loop by using $A+B$ as an oracle to relabel the training data, internalizing the safety net directly into the model weights.

Model Architecture

Dual-Head ModernBERT

`kompress-v8` uses a 149M-parameter ModernBERT backbone with LoRA fine-tuning applied to the last 4 attention layers. Two task heads share the encoder:

  • Token Classifier Head: Produces per-token eviction logits.
  • Span-CNN Head: Scores span-level coherence to prevent evictions from fragmenting syntactic units.

An Asymmetric Modulation Gate scales the token logits to suppress eviction in high-coherence spans:

Ïi(x) = σ(logitstok(x) - γ g(logitsspan(x)))
ModernBERT Encoder Token Head h_tok Span-CNN h_span σ Gate g

Empirical Benchmarks

Evaluated on the Heretic adversarial benchmark, `kompress-v8` dominates traditional prompt compression models on exact-keep rates of critical syntactic tokens.

Method Exact Keep % ($T_{\text{crit}}$) Keep Rate (Tokens) Avg. Latency
kompress-v8 (Ours, Production) 0.993 0.936 97.0 ms
kompress-v8 (Ours, `v4` SSL) 0.967 0.823
Random Eviction (Floor) 0.910 0.835 0.0 ms
LLMLingua-2 0.867 1.550 238.9 ms
TextRank (Extractive) 0.599 0.543 23.1 ms

Headroom Integration Proposal

We propose integrating `kompress-ultra` directly into Headroom (referencing Headroom PR #1419) as a core context-management middleware:

1. Middleware Chain Integration

Intercept outgoing LLM payload payloads in Headroom and run token-level classification via a local ONNX runtime of `kompress-v8`.

2. Configurable Safety Floors

Provide pre-configured regex patterns matching $T_{\text{crit}}$ class tokens to ensure 100% survival rates on critical system outputs.

3. Passive Memory Offloading

Seamlessly write evicted tokens to Headroom's memory spine (e.g. SQLite/Milvus) for semantic recall in future turns.