GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

arXiv:2603.28708v1

This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models, built on NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to a 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63% reduction in memory usage.
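
The reported metrics (speedup over a CPU baseline, per-configuration latency, and memory reduction) can be illustrated with a minimal benchmarking sketch. This is not the authors' harness: the power-of-two grids are an assumption (the abstract gives only the endpoints 1 to 32 and 32 to 512), and `make_runner` is a hypothetical hook standing in for whatever builds and invokes the CPU or TensorRT model.

```python
import statistics
import time
from itertools import product

# Assumed grids: the abstract states only the ranges, not the exact points.
BATCH_SIZES = [1, 2, 4, 8, 16, 32]
SEQ_LENGTHS = [32, 64, 128, 256, 512]


def median_latency_ms(run_once, warmup=3, iters=10):
    """Median wall-clock latency of run_once() in milliseconds.

    For GPU work, run_once must block until the device has finished
    (i.e. synchronize internally); otherwise only launch time is measured.
    """
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def memory_reduction_pct(baseline_bytes, optimized_bytes):
    """Percentage of baseline memory saved by the optimized pipeline."""
    return 100.0 * (baseline_bytes - optimized_bytes) / baseline_bytes


def sweep(make_runner):
    """Time every (batch, seq_len) pair.

    make_runner(batch, seq_len) is a hypothetical factory returning a
    zero-argument callable that performs one inference pass.
    """
    results = {}
    for batch, seq_len in product(BATCH_SIZES, SEQ_LENGTHS):
        results[(batch, seq_len)] = median_latency_ms(make_runner(batch, seq_len))
    return results
```

Running `sweep` once with a CPU runner and once with an optimized runner, then applying `speedup` per configuration, would yield a table of the kind summarized above. The median is used here because it is less sensitive to occasional scheduling outliers than the mean; the paper may well use a different statistic.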