ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

arXiv:2604.14612v1 Announce Type: new Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It pairs fast, approximate decoding by a compact version of the model, which serves as the draft model, with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation.
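
To make the draft-then-verify loop concrete, the following is a minimal toy sketch of greedy self-speculative decoding. It is not the method proposed in this paper: the "model" is a stack of small integer functions, the skipped-layer set is fixed by hand rather than learned or chosen by confidence, and all names (`make_layers`, `next_token`, `self_speculative_decode`, `skip`, `k`) are hypothetical and introduced only for illustration. The sketch shows one property stated above: because every drafted token is checked by the full model, the output matches plain greedy decoding with the full model.

```python
from typing import Callable, FrozenSet, List, Sequence

VOCAB = 50    # toy vocabulary size (assumption for illustration)
MOD = 1009    # toy hidden-state range (assumption for illustration)

Layer = Callable[[int], int]


def make_layers(n: int) -> List[Layer]:
    # Each toy "layer" is a deterministic integer map standing in for a
    # transformer block; real layers operate on tensors.
    return [lambda h, i=i: (h * (2 * i + 3) + i) % MOD for i in range(n)]


def next_token(layers: Sequence[Layer], context: Sequence[int],
               skip: FrozenSet[int] = frozenset()) -> int:
    # Embed the context into a single integer "hidden state".
    h = sum((pos + 1) * tok for pos, tok in enumerate(context)) % MOD
    for i, layer in enumerate(layers):
        if i in skip:      # draft model: bypass this layer entirely
            continue
        h = layer(h)
    return h % VOCAB       # greedy "argmax" over the toy vocabulary


def greedy_decode(layers: Sequence[Layer], prompt: Sequence[int],
                  n_new: int) -> List[int]:
    # Baseline: one full-model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(layers, out))
    return out


def self_speculative_decode(layers: Sequence[Layer], prompt: Sequence[int],
                            n_new: int, skip: FrozenSet[int],
                            k: int = 4) -> List[int]:
    out = list(prompt)
    target_len = len(prompt) + n_new
    while len(out) < target_len:
        # 1) Draft k tokens with the layer-skipping subnetwork.
        draft: List[int] = []
        for _ in range(k):
            draft.append(next_token(layers, out + draft, skip))

        # 2) Verify with the full model. A real system scores all draft
        #    positions in one parallel forward pass; this loop is sequential
        #    only to keep the sketch short.
        accepted: List[int] = []
        for tok in draft:
            target_tok = next_token(layers, out + accepted)
            accepted.append(target_tok)  # keep the full model's token
            if target_tok != tok:        # first mismatch: discard the rest
                break
        else:
            # Every draft token matched, so the full model adds a bonus token.
            accepted.append(next_token(layers, out + accepted))
        out.extend(accepted)
    return out[:target_len]
```

In this toy setting, calling `self_speculative_decode(make_layers(8), [1, 2, 3], 20, skip=frozenset({2, 5}))` returns the same sequence as `greedy_decode(make_layers(8), [1, 2, 3], 20)`. The choice of `skip` affects only how many draft tokens are accepted per round, and therefore the speedup, not the output.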