Longer contexts are easier to compress (not harder)
dev.to
The common assumption: longer sequences are harder to compress because there's more information to retain. Our experiments show the opposite.

The data: same model (Mistral-7B), same compression method, same eviction rates. Only the prefix length changes:

| Prefix length | 35% eviction | 60% eviction | 80% eviction |
| --- | --- | --- | --- |
| 500 tokens | +0.90% PPL | +4.5% PPL | +6.6% PPL |
| 1,600 tokens | +0.14% PPL | +0.82% PPL | +2.1% PPL |
| 3,500 tokens | +0.43% PPL | +1.3% PPL | +2.6% PPL |

At 1,600 tokens, 60% eviction gives
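This excerpt doesn't name the compression method, so as a generic illustration only: one common family is score-based KV-cache eviction, which drops the lowest-scoring cached tokens while protecting a recent window. A hypothetical sketch (`evict_kv`, the scores, and `keep_recent` are my inventions, not the post's method):

```python
from typing import List


def evict_kv(scores: List[float], eviction_rate: float,
             keep_recent: int = 4) -> List[int]:
    """Return the cache positions to KEEP after evicting a fraction
    of entries. `scores` is one hypothetical importance value per
    cached token (e.g. cumulative attention received); the most
    recent `keep_recent` positions are never evicted."""
    n = len(scores)
    n_evict = int(eviction_rate * n)
    # Only positions outside the protected recent window are candidates.
    candidates = list(range(max(0, n - keep_recent)))
    n_evict = min(n_evict, len(candidates))
    # Evict the lowest-scoring candidates.
    evicted = set(sorted(candidates, key=lambda i: scores[i])[:n_evict])
    return [i for i in range(n) if i not in evicted]


# 8 cached tokens, 50% eviction, last 2 positions protected.
print(evict_kv([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],
               eviction_rate=0.5, keep_recent=2))
```

The "eviction rate" column in the table corresponds to `eviction_rate` here: the fraction of KV entries removed, regardless of prefix length.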
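The "+x% PPL" cells are relative perplexity increases of the compressed run over the full-cache run. Since perplexity is `exp(mean NLL)`, the metric reduces to the exponential of the loss gap; a minimal sketch (the helper name and example losses are mine):

```python
import math


def ppl_increase_pct(nll_full: float, nll_evicted: float) -> float:
    """Relative perplexity increase (%) of the evicted-cache run over
    the full-cache baseline, given mean per-token NLL for each run.
    PPL = exp(mean NLL), so the ratio of the two PPLs is
    exp(nll_evicted - nll_full)."""
    return (math.exp(nll_evicted - nll_full) - 1.0) * 100.0


# A small NLL gap of 0.008 nats corresponds to roughly a 0.8% PPL increase.
print(round(ppl_increase_pct(2.000, 2.008), 2))
```

Because the metric is a ratio, it is comparable across the three prefix lengths even though their absolute perplexities differ.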