Tech Term Decoded: Model Compression

Definition

Model compression aims to produce a reduced model from an original trained model while maintaining accuracy. The compressed model is smaller in size and/or lower in latency than the original. A size reduction means the compressed model has fewer and/or smaller parameters and therefore requires less RAM to run, leaving more memory for the rest of the application. A latency reduction means the model takes less time to make a prediction, or inference, on an input, which generally results in lower energy consumption at runtime [1].
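As a concrete illustration of both reductions, here is a minimal sketch in PyTorch using its built-in dynamic int8 quantization. The toy network, layer sizes, and 100-run timing loop are illustrative assumptions, not details from the source:

```python
# Sketch: compare saved size and average latency of a float32 model
# against an int8 dynamically quantized copy. Toy model for illustration.
import io
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Compress: replace float32 Linear weights with int8 equivalents.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_bytes(m: nn.Module) -> int:
    # Approximate on-disk size by serializing the state dict to memory.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print("original :", size_bytes(model), "bytes")
print("quantized:", size_bytes(quantized), "bytes")  # roughly 4x smaller

x = torch.randn(1, 512)
for name, m in [("original", model), ("quantized", quantized)]:
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            m(x)
    print(name, "latency:", (time.perf_counter() - start) / 100, "s")
```

The roughly 4x size saving follows directly from storing each weight in 8 bits instead of 32; the latency gain depends on the hardware's int8 support.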

For example, imagine a scenario where an AI diagnostic tool at a rural health clinic in Ebonyi State must analyze malaria blood samples through smartphone microscopy without relying on an internet connection to urban hospitals. The problem with deploying medical AI across primary healthcare centers is that clinics usually work with donated tablets and old smartphones with outdated processors. As a result, they have insufficient RAM to run complex models, limited storage already occupied by patient records, and an unreliable power supply for charging devices.

That's where model compression becomes life-saving, enabling community health workers in remote villages to provide AI-assisted diagnoses using basic technology, potentially reducing maternal mortality and childhood disease fatalities significantly.

Figure: knowledge distillation (KD), a model compression technique [2].
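To make the pictured idea concrete, below is a minimal sketch of the standard KD training loss, in which a small "student" learns to match a large "teacher"'s softened outputs. The teacher/student sizes, temperature T, and 0.5 weighting are illustrative assumptions, not details from [2]:

```python
# Sketch: Hinton-style knowledge distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: student mimics the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

x = torch.randn(8, 20)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)   # teacher is frozen during distillation
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()             # gradients flow only into the student
```

After training, only the small student is deployed, which is what makes KD a compression technique rather than just a training trick.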

Origin

Techniques for compressing and speeding up DNN models date back to 2014. Between 2014 and 2021, research on DNN model compression resulted in the field being subdivided into six categories: lightweight network structure design, neural architecture search (NAS), low-rank decomposition, network quantization, knowledge distillation (KD), and hybrid methods that combine several of these [3].

Context and Usage

Model compression techniques are frequently employed in mobile and embedded systems, where resources are constrained. Industries such as healthcare, finance, and autonomous systems leverage model compression to enable efficient real-time processing. Compression is also used when deploying models on cloud services, enabling faster responses and lower operational costs.

Why it Matters

Most on-device machine learning engineers deal with the problem of deploying models on resource-constrained devices. After completing the standard ML pipeline of data collection, preprocessing, and designing a high-performance model, you may discover that the trained model is too heavy or resource-intensive for devices like mobile phones, IoT systems, or edge devices. Knowing the resource constraints of your target hardware and adapting your model to meet those requirements is vital; this process is referred to as model compression [4].
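As a rough sketch of that constraint check, one might measure a trained model against the target hardware's budget like this. The 2 MB and 10 ms budgets below are hypothetical placeholders, not figures from any real device:

```python
# Sketch: measure model size and per-inference latency, then compare
# against a (hypothetical) target device budget.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
model.eval()

MAX_BYTES = 2 * 1024 * 1024   # hypothetical 2 MB memory budget
MAX_LATENCY_S = 0.010         # hypothetical 10 ms latency budget

# Approximate in-memory size of the parameters.
size = sum(p.numel() * p.element_size() for p in model.parameters())

x = torch.randn(1, 128)
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(200):
        model(x)
latency = (time.perf_counter() - start) / 200

print(f"size {size} B (budget {MAX_BYTES} B), latency {latency:.4f} s")
if size > MAX_BYTES or latency > MAX_LATENCY_S:
    print("over budget: apply compression (quantization, pruning, KD, ...)")
```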

In Practice

SqueezeBits offers a real-life case study of model compression in practice. The company specializes in applying compression techniques tailored to each client's target hardware constraints. In 2023, they used quantization, pruning, and knowledge distillation to compress the Stable Diffusion model, attaining a remarkable inference latency of under 7 seconds for a 512 x 512 image on a Galaxy S23 and under a second on an iPhone 14 Pro. They continually track the latest techniques to find newer, more effective ways to improve and speed up AI models [2].
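SqueezeBits' actual pipeline is not public; purely as an illustration of one of the three techniques they cite, here is a minimal magnitude-pruning sketch using torch.nn.utils.prune, with an arbitrary 50% sparsity level:

```python
# Sketch: L1 magnitude pruning on a toy model.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Zero out the 50% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the pruning permanent

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.1%}")  # roughly half the weights are now zero
```

Pruning alone only zeroes weights; the size and latency wins come from sparse storage formats or hardware that can skip the zeros.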

Learn More

Related Model Training and Evaluation concepts:

  • Loss Function: Mathematical measure of how far a model's predictions are from actual values
  • Model Deployment: Process of integrating a trained model into production environments for real-world use
  • Model Evaluation: Process of assessing how well a model performs on test data and other metrics
  • Model Explainability: Techniques and methods for making AI model decisions transparent and understandable
  • Model Interpretability: Ability to understand and explain how a model makes decisions

References

  1. Peterson, H. (2020). An Overview of Model Compression Techniques for Deep Learning in Space.
  2. Cheon, S. (2024). 4 Types of AI Compression Methods You Should Know.
  3. Lyu, Z., Yu, T., Pan, F., Zhang, Y., Luo, J., Zhang, D., Chen, Y., Zhang, B., & Li, G. (2023). A Survey of Model Compression Strategies for Object Detection.
  4. Doost, S. A. (2024). Model Compression Techniques: An Introductory and Comparative Guide.

Kelechi Egegbara

Kelechi Egegbara is a Computer Science lecturer with over 12 years of experience, an award-winning Academic Adviser, a member of the Computer Professionals of Nigeria, and the founder of Kelegan.com. With a background in tech education, he has dedicated the later years of his career to making technology education accessible to everyone by publishing papers that explore how emerging technologies transform sectors such as education, healthcare, the economy, agriculture, governance, the environment, and photography. Beyond tech, he is passionate about documentaries, sports, and storytelling, interests that help him create engaging technical content. You can connect with him at kegegbara@fpno.edu.ng to explore the exciting world of technology together.
