Harshit Kumar

CUDA

GPU programming and CUDA for high-performance computing, covering parallel programming, memory hierarchies, and the optimization of compute-heavy workloads such as matrix multiplication.

2 posts

  • Matrix Multiplication in CUDA
    CUDA

    Implementing matrix multiplication in CUDA, from a naive CPU baseline to GPU-accelerated versions that use shared-memory tiling, with deep learning workloads in mind.

    Jun 07, 2024 · 17 min read
  • Mixed Precision and Quantization: Accelerating Deep Learning Training and Inference
    Deep Learning

    Comprehensive guide to mixed precision training (FP16/FP32) and INT8 quantization, covering GPU architecture, Tensor Cores, loss scaling, AMP, PTQ, QAT, and layer fusion with practical code examples.

    May 22, 2022 · 28 min read
Harshit Kumar 2026 About this site Creative Commons License