Work

AI Research

Deploying Large Language Models on Kinara's Edge AI Processor: Novel Quantization Techniques and Compiler Optimization


Enhancing CLIP Model Performance through Transformer Block Analysis and Optimization

In this study, I focused on optimizing OpenAI's CLIP model by improving its Transformer blocks, specifically the Key-Query-Value (KQV) projection layers, through quantization observer analysis. The goal was to improve accuracy with minimal performance impact. I also investigated the impact of quantization errors in the mean, variance, and inverse square root computations of layer normalization within the Transformer block, and proposed corrective measures. Together with Durga and Titash, I analyzed systematic outliers in hidden-layer features and developed a new quantization computation to mitigate the errors these outliers introduce, as illustrated in the figure below.

CLIP outliers
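
To make the outlier effect concrete, the following is a minimal, hypothetical sketch of one standard mitigation in this family: clipping the quantization range at a high percentile instead of using the raw min/max, so that a handful of outlier channels no longer dictate the scale for everything else. The tensor shapes, percentile values, and the helper quantize_dequantize are illustrative assumptions, not the exact computation we developed.

```python
import torch

def quantize_dequantize(x, lo, hi, n_bits=8):
    """Affine-quantize x to n_bits over the range [lo, hi], then dequantize."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = torch.round(qmin - lo / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

torch.manual_seed(0)
x = torch.randn(512, 768)      # stand-in for hidden-layer features
x[:, :4] *= 40.0               # a few channels carry systematic outliers

# Raw min/max range: the outlier channels inflate the scale for every channel.
dq_minmax = quantize_dequantize(x, x.min(), x.max())

# Range clipped at the 0.1% / 99.9% percentiles: the rare outliers saturate,
# but all other channels get a much finer quantization step.
lo, hi = torch.quantile(x, torch.tensor([0.001, 0.999]))
dq_clipped = quantize_dequantize(x, lo, hi)

inliers = slice(4, None)       # measure error on the well-behaved channels
print("inlier MSE, min-max range:", torch.mean((x[:, inliers] - dq_minmax[:, inliers]) ** 2).item())
print("inlier MSE, clipped range:", torch.mean((x[:, inliers] - dq_clipped[:, inliers]) ** 2).item())
```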


Advancements in Inverse Square Root Approximation for Neural Network Normalization Layers

Together with Siva, I developed a function approximation for the inverse square root that is tailored to efficient ASIC processors. The algorithm targets normalization layers in neural networks and supports both powers-of-two quantization and scale-with-zero-point quantization. Compared to existing techniques, the approximation is roughly 2x faster and shows a 30% improvement in accuracy, as measured by Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR), and Mean Absolute Error (MAE).
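
For context, the sketch below shows the general family of approach this work belongs to: an exponent-based initial guess for 1/sqrt(x) refined with a couple of Newton-Raphson iterations, used as the layer-norm denominator. It relies on the well-known public bit-level seed and is purely illustrative; it is not the algorithm described above, and the iteration count and helper names are assumptions.

```python
import numpy as np

def approx_rsqrt(x: np.ndarray, iterations: int = 2) -> np.ndarray:
    """Approximate 1/sqrt(x) with a bit-level initial guess plus Newton steps."""
    x = x.astype(np.float32)
    i = x.view(np.int32)
    i = np.int32(0x5F3759DF) - (i >> 1)          # cheap exponent-halving seed
    y = i.view(np.float32)
    for _ in range(iterations):                   # Newton-Raphson refinement
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Layer norm whose 1/sqrt(var + eps) uses the approximation above."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) * approx_rsqrt(var + eps)

x = np.random.randn(4, 768).astype(np.float32)
ref = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)
print("max abs error vs exact layer norm:", np.abs(layer_norm(x) - ref).max())
```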

Results

Metric      | Old Methods | New Method | % Improvement
MSE         | 0.000404    | 0.000038   | ~90.56%
SNR (dB)    | 61.968707   | 72.234044  | ~16.52%
MAE         | 0.012692    | 0.002127   | ~83.22%
MSE %       | 0.000025%   | 0.000002%  | ~90.62%
Max Error   | 0.159268    | 0.066406   | ~58.36%
Min Error   | 0.000000    | 0.000021   | N/A
Avg Error   | 0.012692    | 0.002127   | ~83.22%

The diagram below shows the old methods in red and the new method in blue.

Layer norm analysis

The new algorithm is particularly relevant to transformer blocks, which are widely used in LLMs, diffusion models, and image-generation models such as Stable Diffusion. The improved precision and computational efficiency help speed up inference in language models and support the generation of high-quality images in diffusion models.

Impact of Observers on Rounding Techniques during Quantization Aware Training (QAT)

In this study, I investigated how different observers influence the convergence behavior of quantization-aware training (QAT) when paired with various rounding techniques. The observers considered were min_max_observer and moving_average_min_max_observer, and the rounding techniques analyzed were rne_c/c++, rne_python, and rai. The evaluation was run over multiple epochs to understand the convergence dynamics.
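
For reference, the sketch below spells out the two ingredients being compared in plain PyTorch: the rounding rules (RNE as round-half-to-even, and RAI interpreted here as round-half-away-from-zero, which may differ from the in-house definition) and simplified stand-ins for the two observers. Class names, the momentum value, and the toy data are assumptions rather than the production implementations.

```python
import torch

def round_rne(x: torch.Tensor) -> torch.Tensor:
    """Round to nearest, ties to even (torch.round implements this)."""
    return torch.round(x)

def round_rai(x: torch.Tensor) -> torch.Tensor:
    """Round to nearest, ties away from zero."""
    return torch.sign(x) * torch.floor(torch.abs(x) + 0.5)

class MinMaxObserver:
    """Tracks the running min/max of everything it has seen."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")
    def update(self, x):
        self.lo = min(self.lo, x.min().item())
        self.hi = max(self.hi, x.max().item())

class MovingAverageMinMaxObserver:
    """Exponential moving average of per-batch min/max (smoother ranges)."""
    def __init__(self, momentum: float = 0.01):
        self.momentum, self.lo, self.hi = momentum, None, None
    def update(self, x):
        lo, hi = x.min().item(), x.max().item()
        if self.lo is None:
            self.lo, self.hi = lo, hi
        else:
            self.lo += self.momentum * (lo - self.lo)
            self.hi += self.momentum * (hi - self.hi)

# Tie cases are where RNE and RAI disagree:
t = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])
print(round_rne(t))   # tensor([ 0.,  2.,  2., -0., -2.])
print(round_rai(t))   # tensor([ 1.,  2.,  3., -1., -2.])

# The observers differ in how they track activation ranges across batches:
obs_a, obs_b = MinMaxObserver(), MovingAverageMinMaxObserver()
for _ in range(100):
    batch = torch.randn(1024)
    obs_a.update(batch)
    obs_b.update(batch)
print(f"min-max observer range       : ({obs_a.lo:.3f}, {obs_a.hi:.3f})")
print(f"moving-average observer range: ({obs_b.lo:.3f}, {obs_b.hi:.3f})")
```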

Results

Observer            | Epoch | RNE c/c++ | RNE python | RAI
Min Max Observer    | 0     | 0.735     | 0.72074    | 0.72072
Min Max Observer    | 1     | 0.73568   | 0.72072    | 0.72068
Min Max Observer    | 2     | 0.73626   | 0.71882    | 0.7204
MA Min Max Observer | 0     | 0.7375    | 0.74626    | 0.74622
MA Min Max Observer | 1     | 0.73902   | 0.74656    | 0.74686
MA Min Max Observer | 2     | 0.7376    | 0.74738    | 0.74768

QAT Rounding


Comparative Analysis of Rounding Techniques in Post Training Quantization for ResNet50 Model

In this study, I systematically explored the impact of different rounding techniques in post-training quantization, focusing on the ResNet50 model. The evaluated rounding methods were Rounding Away from Infinity (RAI), Round to Nearest Even (RNE), and an Ada-round Simulator.
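
A hypothetical harness for this kind of comparison is sketched below: it symmetrically quantizes each ResNet50 convolution weight tensor to int8 under a chosen rounding rule and reports the reconstruction error. It illustrates the experimental setup only; the accuracies in the table come from evaluating the full 2500-image set, and the Ada-round Simulator is omitted here because AdaRound learns per-weight rounding decisions rather than applying a fixed rule.

```python
import torch
from torchvision.models import resnet50

def quant_dequant_weight(w: torch.Tensor, round_fn, n_bits: int = 8):
    """Symmetric per-tensor weight quantization with a pluggable rounding rule."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(round_fn(w / scale), -qmax - 1, qmax)
    return q * scale

round_rne = torch.round                                                # ties to even
round_rai = lambda x: torch.sign(x) * torch.floor(torch.abs(x) + 0.5)  # ties away from zero

model = resnet50(weights=None)  # pretrained weights would be loaded for the real evaluation
for name, rule in [("RNE", round_rne), ("RAI", round_rai)]:
    err = sum(
        torch.mean((w - quant_dequant_weight(w, rule)) ** 2).item()
        for w in model.parameters() if w.dim() == 4        # conv weights only
    )
    print(f"{name}: summed per-layer weight MSE = {err:.6f}")
```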

Results

Configuration          | Model    | Accuracy (2500-image set)
Original PyTorch Model | ResNet50 | 76.84%
RAI                    | ResNet50 | 74.00%
RNE                    | ResNet50 | 76.16%
Ada-round Simulator    | ResNet50 | 76.84%


Performance Enhancement through Activation Analysis and Precision Optimization in YOLOv5 Models

In this study, I focused on improving YOLOv5 accuracy through activation-distribution analysis. Using Post-Training Quantization (PTQ) without Quantization-Aware Training (QAT), the quantized model gained precision by selectively offsetting activation functions to claim additional bits of precision. Subsequent adjustments to the mathematical operations downstream in the network compensated for the changes introduced in the activation layers.

Swish with and without offset
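
The sketch below illustrates the offset idea under simplifying assumptions (a single convolution following the activation, signed quantization with a power-of-two scale): shift the Swish/SiLU output so its range is better centred before quantization, then fold the shift back in as a bias-style correction so the network's overall math is unchanged. It is an illustration of the principle, not the actual YOLOv5 modification.

```python
import torch
import torch.nn.functional as F

def po2_quant_dequant(x, n_bits=8):
    """Symmetric quantization with a power-of-two scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = 2.0 ** torch.ceil(torch.log2(x.abs().max() / qmax))
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
x = torch.randn(1, 16, 32, 32)
conv = torch.nn.Conv2d(16, 16, 3, padding=1)

act = F.silu(x)                               # lopsided range, roughly [-0.28, +max]
offset = (0.5 * (act.max() + act.min())).item()

ref = conv(act)                               # float reference path

# Quantize the *offset* activation (better-centred range -> finer PO2 scale),
# then add back conv applied to the constant offset, i.e. a bias-style correction.
q_act = po2_quant_dequant(act - offset)
compensation = F.conv2d(torch.full_like(act, offset), conv.weight, conv.bias, padding=1)
out = F.conv2d(q_act, conv.weight, None, padding=1) + compensation

out_plain = conv(po2_quant_dequant(act))      # quantize without the offset

print("MSE with offset   :", torch.mean((ref - out) ** 2).item())
print("MSE without offset:", torch.mean((ref - out_plain) ** 2).item())
```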

Results

Model Configuration                  | Average Precision (AP) @[IoU=0.50:0.95]
Original float model                 | 0.532
Quantized model with offset (PO2)    | 0.516
Quantized model without offset (PO2) | 0.478

Math Design behind Precision-Preserving Kernels for Complex Operations such as ROIAlign and Bilinear Interpolation on ASIC

In this study, I introduced efficient int8 kernel designs and mathematical optimizations for the ROIAlign function. Custom ROIAlign kernels ensure accurate region-based feature extraction in computer vision models. I also addressed the challenges that quantization errors pose for bilinear interpolation, emphasizing the importance of preserving precision.

Challenges in Quantizing Bilinear Interpolation:

Error in Bilinear Interpolation

Small quantization errors can cause significant shifts in bounding boxes. In the figure above, the red dot marks the actual FP32 sampling point, which falls on the pixel inside the red box; quantization error shifts the point to the position of the blue dot, so the pixel inside the blue rectangle is selected instead. This shift caused large errors in the bounding boxes and ultimately incorrect object detections in YOLO models. I resolved this by devising a quantization strategy that balances computational efficiency and precision.
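
The sketch below is a hypothetical illustration of this coordinate-precision issue: bilinear sampling with the sub-pixel coordinate held in fixed point. With too few fractional bits the effective sample point moves, which is exactly the red-dot to blue-dot shift in the figure; with enough fractional bits the FP32 behaviour is preserved. The bit widths and image size are illustrative only.

```python
import numpy as np

def bilinear_sample(img: np.ndarray, x: float, y: float, frac_bits: int) -> float:
    """Sample img at (x, y) with coordinates quantized to fixed point with frac_bits."""
    one = 1 << frac_bits
    xq, yq = int(round(x * one)), int(round(y * one))        # fixed-point coordinates
    x0, y0 = xq >> frac_bits, yq >> frac_bits                # integer parts
    fx, fy = xq - (x0 << frac_bits), yq - (y0 << frac_bits)  # fractional parts
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    # Weighted sum of the 4 neighbours, all in integer arithmetic.
    acc = (img[y0, x0] * (one - fx) * (one - fy) +
           img[y0, x1] * fx * (one - fy) +
           img[y1, x0] * (one - fx) * fy +
           img[y1, x1] * fx * fy)
    return acc / float(one * one)

rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(16, 16)).astype(np.int64)
x, y = 7.43, 3.87                                            # the FP32 sampling point

exact = bilinear_sample(img.astype(float), x, y, frac_bits=20)
coarse = bilinear_sample(img, x, y, frac_bits=2)             # only 4 sub-pixel steps
fine = bilinear_sample(img, x, y, frac_bits=8)               # 1/256-pixel resolution
print(f"reference: {exact:.3f}   2 frac bits: {coarse:.3f}   8 frac bits: {fine:.3f}")
```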


Kernel Development for Efficient Exponentiation via Powers-of-Two Approximation and Its Application to the SoftMax Function

This research introduced a kernel for accurate exponentiation via a powers-of-two approximation. Leveraging polynomial fitting and the Newton-Raphson method, our approach (developed with Aditya) optimized the computation of exponentiation, balancing precision and efficiency. The kernel's versatility extends beyond standard exponentiation to the SoftMax function.

e^x = 2^(x / ln 2)
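
The sketch below walks through this identity end to end: split x/ln 2 into integer and fractional parts, approximate 2^f with a small fitted polynomial, and use Newton-Raphson for the reciprocal in the SoftMax denominator. The polynomial degree, the reciprocal seeding, and the iteration counts are assumptions for illustration, not the kernel's actual parameters.

```python
import numpy as np

# "Offline" polynomial fit for 2^f on f in [0, 1).
_f = np.linspace(0.0, 1.0, 256)
_POLY = np.polyfit(_f, 2.0 ** _f, deg=3)

def exp2_approx(t: np.ndarray) -> np.ndarray:
    """2^t via integer split plus a degree-3 polynomial for the fractional part."""
    n = np.floor(t)
    f = t - n                                  # f in [0, 1)
    return np.polyval(_POLY, f) * np.exp2(n)   # exp2(n) is an exact power of two

def recip_newton(d: np.ndarray, iterations: int = 4) -> np.ndarray:
    """1/d by Newton-Raphson: y <- y * (2 - d*y), seeded with a power of two."""
    y = np.exp2(-np.ceil(np.log2(d)))          # seed so that d*y lies in (0.5, 1]
    for _ in range(iterations):
        y = y * (2.0 - d * y)
    return y

def softmax_approx(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)      # standard max subtraction for stability
    e = exp2_approx(x / np.log(2.0))           # e^x = 2^(x / ln 2)
    return e * recip_newton(e.sum(axis=-1, keepdims=True))

x = np.random.default_rng(0).normal(size=(2, 8))
ref = np.exp(x - x.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
print("max abs softmax error:", np.abs(softmax_approx(x) - ref).max())
```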



Academic Research

Fine-Tuning and Quantization Techniques for Enhanced Efficiency in LLMs for Task-Specific Code Generation

Convolutional Neural Networks based Dementia and Tumor Classification from MRI Brain Images

Paper Link . GitHub Code Link

