SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

ArXi:2604.18610v1 Announce Type: cross Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead.