Definition
Model compression aims to derive a reduced model from an original while maintaining accuracy. The simplified model is smaller in size and/or lower in latency than the original. A size reduction means the compressed model has fewer and/or smaller parameters and therefore requires less RAM for execution, leaving more memory for other parts of the application. A latency reduction means the model takes less time to make a prediction, or inference, on an input, generally resulting in lower energy consumption at runtime [1].
For example, imagine a scenario where an AI diagnostic tool at a rural health clinic in Ebonyi State must analyze malaria blood samples through smartphone microscopy without relying on an internet connection to urban hospitals. The problem with deploying medical AI across primary healthcare centers is that clinics usually work with donated tablets and old smartphones with outdated processors. As a result, they have insufficient RAM to run complex models, limited storage already occupied by patient records, and an unreliable power supply for charging devices. This is where model compression becomes life-saving: it enables community health workers in remote villages to provide AI-assisted diagnoses using basic technology, potentially reducing maternal mortality and childhood disease fatalities significantly.
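The size reduction described above can be sketched with a toy example of uniform quantization, one common compression technique: storing each weight as an 8-bit integer instead of a 32-bit float cuts per-parameter storage by 4x. All values here are illustrative, not from [1].

```python
# Minimal sketch of post-training uniform quantization (illustrative values):
# float32 weights are mapped to int8, so each weight needs 1 byte instead of 4.

def quantize(weights, num_bits=8):
    """Uniformly quantize a list of floats to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for int8
    scale = max(abs(w) for w in weights) / qmax  # map the largest weight to qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.64]
q, scale = quantize(weights)        # q = [82, -127, 5, 64]
restored = dequantize(q, scale)     # close to the original weights
```

The restored weights carry a small rounding error, which is the accuracy trade-off the definition refers to: a quarter of the RAM in exchange for slightly perturbed parameters.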
Origin
Techniques for compressing and speeding up DNN models date back to around 2014. Between 2014 and 2021, research on DNN model compression resulted in the field being subdivided into six research categories: lightweight network structure design, neural architecture search (NAS), low-rank decomposition, network quantization, knowledge distillation (KD), and methods that combine several of these [3].
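Of these categories, low-rank decomposition is easy to illustrate with a parameter count: an m x n weight matrix is replaced by two factors of rank k, shrinking the layer whenever k is small relative to m and n. The layer sizes below are illustrative, not taken from [3].

```python
# Sketch: parameter savings from a rank-k factorization W ~= A @ B,
# where W is m x n, A is m x k, and B is k x n (illustrative sizes).

def lowrank_params(m, n, k):
    """Parameter counts for a dense m x n layer vs its rank-k factorization."""
    full = m * n               # original dense weight matrix
    factored = m * k + k * n   # two smaller factor matrices
    return full, factored

full, factored = lowrank_params(1024, 1024, 64)
# 1,048,576 parameters shrink to 131,072: an 8x reduction at rank 64
```

The quality of the approximation depends on how well a rank-64 matrix can capture the original weights, which is the accuracy trade-off these methods negotiate.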
Context and Usage
Model compression techniques are frequently employed in various applications, particularly in mobile and embedded systems, which have constrained resources. The healthcare, finance, and autonomous-systems industries leverage model compression to enable efficient real-time processing. Compressed models are also deployed on cloud services, enabling faster responses and lower operational costs.
Why it Matters
Most on-device machine learning engineers deal with the problem of deploying models on low-resource devices. After completing the standard ML pipeline of data collection, preprocessing, and designing a high-performance model, you may discover that the trained model is too heavy or resource-intensive for devices like mobile phones, IoT systems, or edge devices. Knowing the resource constraints of your target hardware and adapting your model to meet those requirements is vital; this process is referred to as model compression [4].
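Checking the target hardware's constraints can be as simple as comparing an estimated weight footprint against the device's RAM budget. A hedged sketch with hypothetical figures (not from [4]):

```python
# Sketch: does a model's weight footprint fit a device's RAM budget?
# All parameter counts and budgets below are hypothetical.

def model_size_mb(num_params, bytes_per_param=4):
    """Rough in-memory footprint of the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)

def fits(num_params, budget_mb, bytes_per_param=4):
    """True if the weight footprint is within the device's budget."""
    return model_size_mb(num_params, bytes_per_param) <= budget_mb

# A 25M-parameter model: ~95 MiB at float32, ~24 MiB at int8.
fits_fp32 = fits(25_000_000, 64)     # False: too big for a 64 MiB budget
fits_int8 = fits(25_000_000, 64, 1)  # True after 8-bit quantization
```

A real estimate would also account for activations, the runtime, and the rest of the application, but even this rough check shows why compression is often the deciding factor for on-device deployment.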
In Practice
SqueezeBits is a good real-life case study of model compression in practice. The company deploys compression techniques tailored to each client's target hardware constraints. In 2023, they used quantization, pruning, and knowledge distillation to compress the Stable Diffusion model, attaining a remarkable inference latency of under 7 seconds to generate a 512 x 512 image on a Galaxy S23 and under a second on an iPhone 14 Pro. They continually track the latest techniques to find newer, more effective ways to improve and speed up their clients' AI models [2].
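The knowledge distillation mentioned above trains a small "student" model to match a large "teacher's" softened output distribution. A minimal sketch of the distillation loss follows; it is a simplified textbook form, not SqueezeBits' implementation, and the logits are illustrative.

```python
import math

# Sketch of a knowledge-distillation loss: cross-entropy between the
# teacher's and student's temperature-softened output distributions.
# A temperature T > 1 smooths the probabilities, exposing the teacher's
# relative preferences among wrong classes.

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs vs the teacher's."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

loss = kd_loss([4.0, 1.0, 0.2], [3.5, 1.2, 0.3])  # smaller when student matches teacher
```

In practice this term is usually combined with the ordinary supervised loss on the true labels, and the student's smaller architecture is what delivers the size and latency reductions.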
Learn More
Related Model Training and Evaluation concepts:
- Loss Function: Mathematical measure of how far a model's predictions are from actual values
- Model Deployment: Process of integrating a trained model into production environments for real-world use
- Model Evaluation: Process of assessing how well a model performs on test data and other metrics
- Model Explainability: Techniques and methods for making AI model decisions transparent and understandable
- Model Interpretability: Ability to understand and explain how a model makes decisions
- Peterson, H. (2020). An Overview of Model Compression Techniques for Deep Learning in Space
- Cheon, S. (2024). 4 Types of AI Compression Methods You Should Know
- Lyu, Z., Yu, T., Pan, F., Zhang, Y., Luo, J., Zhang, D., Chen, Y., Zhang, B., & Li, G. (2023). A survey of model compression strategies for object detection
- Doost, S. A. (2024). Model Compression Techniques: An Introductory and Comparative Guide
