[R] Physical Token Dropping (PTD)

I am releasing a proof-of-concept for Physical Token Dropping (PTD). Instead of applying a zero-mask to an N × N attention matrix, PTD physically extracts the top-K tokens to compute attention and FFN on a strictly smaller tensor, genuinely reducing FLOPs and VRAM.

Core Architecture:

- Multi-Query Router: A low-rank projection scores token importance.
- Block-Shared Routing: A single routing decision is shared across a block of layers to amortize overhead.
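
To make the gather/compute/scatter idea concrete, here is a rough NumPy sketch. It is not the released code. The names `route_top_k`, `attention`, `ptd_block`, `w_down`, and `w_score` are all hypothetical, the attention is a weight-free single-head toy, the FFN is left out, and letting dropped tokens pass through unchanged is my assumption about what happens to them. The router here is just a rank-r projection followed by a score vector, which may be simpler than the actual Multi-Query Router.

```python
import numpy as np

def route_top_k(x, w_down, w_score, k):
    """Score tokens with a low-rank projection; return top-k indices in token order."""
    # x: (n_tokens, d_model), w_down: (d_model, r), w_score: (r,)
    scores = (x @ w_down) @ w_score          # one importance score per token
    top = np.argpartition(scores, -k)[-k:]   # indices of the k highest scores (unordered)
    return np.sort(top)                      # keep the original token order

def attention(x):
    """Toy single-head self-attention without learned weights (illustration only)."""
    logits = x @ x.T / np.sqrt(x.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def ptd_block(x, w_down, w_score, k, n_layers):
    """Route once, then run every layer of the block on the gathered tokens."""
    idx = route_top_k(x, w_down, w_score, k)  # single routing decision for the block
    small = x[idx]                            # physical gather: (k, d_model)
    for _ in range(n_layers):
        small = small + attention(small)      # score matrix is k x k, not n x n
    out = x.copy()
    out[idx] = small                          # scatter back (assumption: dropped tokens unchanged)
    return out, idx

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))              # 16 tokens, d_model = 8
w_down = rng.standard_normal((8, 2))          # rank-2 router projection
w_score = rng.standard_normal(2)
out, idx = ptd_block(x, w_down, w_score, k=4, n_layers=2)
print(out.shape, idx.shape)
```

With K = N/4 as in this toy run, the attention score matrix has K × K = N²/16 entries instead of N², and the router runs once for the whole block rather than once per layer.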