Longer contexts are easier to compress (not harder)

Source: dev.to (python)

The common assumption: longer sequences are harder to compress because there is more information to retain. Our experiments show the opposite.

The data: same model (Mistral-7B), same compression method, same eviction rate. Only the prefix length changes:

| Prefix length | 35% eviction | 60% eviction | 80% eviction |
|---|---|---|---|
| 500 tokens | +0.90% PPL | +4.5% PPL | +6.6% PPL |
| 1,600 tokens | +0.14% PPL | +0.82% PPL | +2.1% PPL |
| 3,500 tokens | +0.43% PPL | +1.3% PPL | +2.6% PPL |

At 1,600 tokens, 60% eviction gives only a +0.82% perplexity increase, less than the +0.90% that a much gentler 35% eviction costs at 500 tokens.
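One intuition behind this result: eviction rates are relative, so at the same rate a longer prefix still keeps far more tokens in absolute terms, leaving the model more surviving context to attend to. A minimal sketch of that bookkeeping (the `tokens_kept` helper is hypothetical, not from the experiment's code):

```python
def tokens_kept(prefix_len: int, eviction_rate: float) -> int:
    """Number of KV-cache entries that survive eviction at a given rate."""
    evicted = int(prefix_len * eviction_rate)
    return prefix_len - evicted

# Same 60% eviction rate, three prefix lengths from the table above:
for prefix in (500, 1_600, 3_500):
    print(f"{prefix:>5} tokens -> {tokens_kept(prefix, 0.60):>5} kept")
# 500 tokens keep 200 entries; 3,500 tokens keep 1,400.
```

This alone does not explain the full effect (redundancy in longer contexts likely matters too), but it shows why a fixed percentage is not a fixed information budget.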
