DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

arXiv cs.LG

arXiv:2604.14552v1 Announce Type: cross

Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to its strong performance per watt and mature software. Its successor, the NVIDIA L4 GPU, [...] Results show that reduced precision significantly improves performance, with INT8 achieving up to a 58x throughput improvement over CPU baselines.