AI RESEARCH
DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
arXiv cs.LG
arXiv:2604.14552v1 Announce Type: cross
Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to its strong performance per watt and mature software. Its successor, the NVIDIA L4 GPU, [...] Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines.