Cross-Platform Performance Evaluation of Matrix Multiplication: Insights from MKL, cuBLAS, and SYCL
Abstract
Matrix multiplication is a fundamental operation in deep neural network training and scientific computing, optimized through libraries such as Intel MKL and NVIDIA cuBLAS. MKL accelerates CPU execution through multithreading and AVX-based vectorization, improving memory bandwidth utilization and computational throughput. Conversely, cuBLAS exploits CUDA’s massive parallelism, employing thousands of GPU cores and Tensor Cores to accelerate matrix computations, although Tensor Core usage introduces numerical precision loss. SYCL extends heterogeneous computing capabilities, enabling efficient workload distribution across CPUs and GPUs. This study analyzes execution time, computational efficiency, and power consumption, using PAPI and PERF to evaluate third- and fourth-generation Intel CPUs and selected NVIDIA GPUs. Results indicate that MKL delivers high CPU performance, while SYCL offers an alternative approach with distinct efficiency characteristics. GPU-based benchmarks show that cuBLAS with Tensor Cores achieves maximum throughput at the cost of precision, whereas cuBLAS without Tensor Cores preserves accuracy with minimal performance trade-offs. These differences underscore the importance of optimization strategies in artificial intelligence and scientific computing, where scaling models and simulations demands efficient, high-performance, and sustainable computation.
Keywords
Matrix Multiplication; Performance Evaluation; Power Consumption; CUDA; MKL; SYCL