K
kompress-ultra for Headroom
Headroom RFC / Proposal

Asymmetric Loss Modulation for Context Compression

Integrating learned context-pruning into Headroom. Achieve ~78% token savings and ~75% latency reduction while maintaining a near-perfect 0.993 exact-keep rate on critical reasoning tokens.

Token Savings 78%[Paper p.14]
Latency Delta -75%[Paper p.14]
Exact Keep Rate 0.993[Paper p.16]

Interactive Kompress-Ultra Playground

Type or select a preset below. Watch the pruner and rewriter compress your prompt in real-time as you type.

Original: 0 tokens
Compressed: 0 tokens
0% Saved
Real-time (97ms)
V
Verbose Input (Prior Turns + Context)
Optimized Prompt (Passed to LLM)
K

The Voting Ensemble Paradox

ELI5: The Veto Committee

Imagine a committee of three experts deciding which words to keep in a document to save space. To be extremely conservative, the rule is: "If even one expert votes to delete a word, we delete it."

Each expert is smart, but has one blind spot where they always vote to delete. Because of the veto rule, every single critical item gets deleted because the expert who doesn't understand it vetoes it. The group becomes worse than any single expert on their own!

Interactive Veto Simulator
Expert 1 (v3) Blind to: Paths
Keep
Expert 2 (v5) Blind to: Commands
Keep
Expert 3 (v6) Blind to: IPs
Keep
Committee Decision Evicted! (Item is Lost)
Evicted
Under the AND veto rule, Expert 1 doesn't understand file paths and votes to delete it. Even though the other two experts voted to keep it, the item is evicted. The ensemble's recall collapses.[Paper p.12]

Theoretical Core

Learned context pruning improves long-context agent efficiency but introduces the **Voting Ensemble Paradox**. Under unanimity-to-keep (AND) voting ($k=1$ drop-if-any), the ensemble eviction indicator equals the pointwise maximum of the individual voter indicators:[Paper p.6]

$$I_{\text{ens}}(x) = \bigvee_{i=1}^N I_i(x) = I_{i^*_k}(x)$$
1. Group Decision: I_ens(x) The final output of the ensemble. Represents whether the collective group decides to evict or keep token $x$.
2. Veto Vote: \bigvee Logical OR operator. If even a single expert votes to evict ($\text{indicator} = 1$), the entire group evicts.
3. Weakest Link: I_i*_k(x) The weakest expert's decision. The entire group's performance is dragged down to the level of the least capable member on that topic.

This yields a stratum-wise Pareto collapse where the ensemble's recall equals that of the weakest voter on each stratum. As a corrective, `kompress-ultra` employs three core mechanisms:

The 4-Role Pipeline Architecture Raw History Verbose 1. Pruner Safety Floor 2. Rewriter Squeeze text 3. Circulator Milvus DB 4. Composer Synthesize Dense Context Compressed

Vaked Capability & Context (vakedc)

A decentralized routing and verification matrix: vaked-base defines node capacities, vaked orchestrates active routing, and vakedc signs context proofs.

Vaked Router Active Context T_crit Floor Passive Memory Milvus DB Capabilities CUDA / CPU Execution Bun / Nu

Model Architecture

Dual-Head ModernBERT

`kompress-v8` uses a 149M-parameter ModernBERT backbone with LoRA fine-tuning applied to the last 4 attention layers. Two task heads share the encoder:

An Asymmetric Modulation Gate scales the token logits to suppress eviction in high-coherence spans:

$$\tilde{I}_i(x) = \sigma\left(\text{logits}_{\text{tok}}(x) - \gamma g(\text{logits}_{\text{span}}(x))\right)$$
ModernBERT Encoder Token Head h_tok Span-CNN h_span σ Gate g

Empirical Benchmarks

Evaluated on the Heretic adversarial benchmark, kompress-v8 dominates traditional prompt compression models.

Method
Exact Keep % ($T_{\text{crit}}$) Percentage of critical syntactic tokens (paths, errors, code) successfully preserved after pruning.
Keep Rate (Tokens) The ratio of output tokens divided by input tokens. Lower means more compression.
Avg. Latency Average processing time in milliseconds for the context pruner to run.
kompress-v8 (Ours, Production) 0.993[Paper p.16] 0.936 97.0 ms
kompress-v8 (Ours, `v4` SSL) 0.967[Paper p.16] 0.823 Offline Checkpoint: Not evaluated for active inference latency.
Random Eviction (Floor) 0.910[Paper p.16] 0.835 0.0 ms
LLMLingua-2 0.867[Paper p.16] 1.550 Context Expansion: Kept 155% of original tokens (caused context bloat). 238.9 ms
TextRank (Extractive) 0.599[Paper p.16] 0.543 23.1 ms

Headroom Integration Proposal

We propose integrating `kompress-ultra` directly into Headroom (referencing Headroom PR #1419) as a core context-management middleware:

1. Middleware Chain Integration

Intercept outgoing LLM payload payloads in Headroom and run token-level classification via a local ONNX runtime of `kompress-v8`.

2. Configurable Safety Floors

Provide pre-configured regex patterns matching $T_{\text{crit}}$ class tokens to ensure 100% survival rates on critical system outputs (originally reviewed in headroom PR #1400).

3. Passive Memory Offloading

Seamlessly write evicted tokens to Headroom's memory spine (e.g. SQLite/Milvus) for semantic recall in future turns.

Reviews & Feedback

Submit a review of this proposal. Reviews are cryptographically signed by your browser and submitted via a **GitHub Pull Request**, guaranteeing they are **provably immutable** (the author cannot modify them without breaking the signature).

Submit a Review
Verified Reviews
Loading verified reviews...

Academic Telemetry & Verification

This site is dedicated strictly to academic research. There are no tracking scripts, Google Ads, or third-party cookies. The connection is proxied and secured solely through Cloudflare.

Cryptographic Attestation
"I, Peter Lodri, the owner of <this>, in terms of cryptographic easter eggs andOr hashes (eg.: genesis hash - vaked.dev) trust Cloudflare. 2026-06-29 --- and I understand that I'm kinda `forced` to choose yet I do this by my own will, noboy is forcing me or giving me any monetary whatever." — peter
TLS Root Certificate Verifier
Common Name (CN)proposal.vaked.dev
IssuerGoogle Trust Services (WE1)
ProtocolTLSv1.3
Cipher SuiteAEAD-CHACHA20-POLY1305-SHA256
Root Cert AuthorityGTS Root R1 (Verified)
Ralph-Loop Telemetry (Academic Dogfeeding)
L1 Loop Active
Slices Processed 14,820
Active Sandboxes Bun / Nushell
Inference Latency 97.4 ms
Token Savings Rate 78.5%

Glossary

Context Raw conversation history, code, and tool outputs fed into an LLM.
Compression Reducing prompt length to save tokens and speed up inference.
learned context-pruning Using a machine learning model to select which tokens to keep.
token savings The percentage of prompt tokens removed by the pruner.
latency Time taken for the LLM to generate a response.
critical-syntactic safety floor Tokens (like code, errors) that must never be pruned.
chat history The back-and-forth log of messages in an agent session.
Voting Ensemble Paradox When voting makes an ensemble worse than its individual models.
stratum-wise Pareto collapse Recall collapsing to the weakest model's level on each topic.
training floors Baselines of performance guaranteed during model training.
multi-checkpoint voting ensemble Grouping multiple model checkpoints to make decisions.
unanimity-to-keep (AND) Keeping a token only if every single model votes to keep it.

Ecosystem & Related Work

This research is part of a broader ecosystem. All source code, dataset distributions, and experiment logs are open-source and publicly available for replication: