torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

I've been working on the consumer-multi-GPU PCIe bottleneck - Nvidia removed NVLink from the 4090/5090, and splitting a 70B model across two consumer cards drops you to ~30 GB/s over PCIe peer-to-peer. Spent the last few months building a Python library that uses the GPU's otherwise-idle NVENC/NVDEC silicon to compress activations and KV cache on the fly, then ships the small bitstream across the same wire. Repo: (Apache 2.0) Prior art (this isn't novel as an idea) LLM.265 - "Video Codecs are Secretly Tensor Codecs" (late 2025.