J. Pablo Muñoz is a research scientist at Intel Labs, where he leads research on compression and fine-tuning techniques to improve model performance on the emerging visual AI systems team. Co-author Jinjie Yuan is a deep learning engineer specializing in natural language processing applications, and co-author Nilesh Jain is a principal engineer who leads research on emerging visual AI systems at Intel Labs.
Highlights
- Utilizing block pruning techniques, Intel Labs researchers developed the Mamba-Shedder solution to remove redundancies in Mamba-based models, improving their computational and memory efficiency.
- The resulting compressed models have a smaller footprint that accelerates inference, and the pruning process provides insights into how the model's different building blocks affect accuracy.
- Using recovery tuning during the post-compression stage effectively reduces the accuracy gap of pruned models, thereby bringing their performance closer to that of the original models.
Intel Labs researchers developed Mamba-Shedder, a novel compression method for selective structured state space models (SSMs) and their variants that achieves inference speedups of up to approximately 1.4x while maintaining model performance, making it suitable for real-time applications. Mamba-Shedder reduces parameter counts and computation costs in Mamba-based models by removing unnecessary components, making these models smaller and faster. The technique applies across tasks that require general language sequence modeling, and the resulting compressed models offer improved computational efficiency and scaling, especially for long sequences.
Selective structured SSMs such as Mamba have been proposed by the larger research community to address the inefficiencies of Transformer-based models. Transformers rely on self-attention mechanisms whose cost grows quadratically with sequence length, while structured SSMs use state-space representations that scale linearly. The Mamba architecture, with its S6 and state space duality (SSD) blocks (Mamba-2), has served as the foundation for many recent SSM-based models. SSD blocks bridge the more expressive attention mechanisms found in Transformers with the linear computational efficiency of SSMs. In turn, these models have inspired the development of robust hybrid architectures that combine the strengths of Transformers and SSMs. Hybrid models, such as Hymba, have demonstrated that they can compete with purely Transformer-based models while delivering improved efficiency, performance, and scalability.
Intel Labs researchers explored the compression of SSM-based models, particularly Mamba and its hybrids, demonstrating that redundant components can be removed with a minor impact on model performance.
How Mamba-Shedder Works
Mamba-Shedder is a novel approach that leverages a pre-trained Mamba-based model and a search space of subcomponents at various granularities, including entire Mamba blocks and SSM modules. In the case of hybrids, Mamba-Shedder explores the simultaneous removal of attention heads, channels in multilayer perceptrons (MLPs), or hybrid-head modules, as seen in Hymba. This process is illustrated in Figure 1. Once the search space has been constructed, Mamba-Shedder explores the iterative removal of subsets of components, providing valuable insights about inefficiencies present in the original models.
Mamba-Shedder employs a training-free approach to identify the least essential elements for removal. This method is inspired by similar strategies used in Transformer-based large language models. By focusing on the least critical components, Mamba-Shedder ensures that the pruning process has minimal impact on the model's overall performance. This approach is particularly beneficial for large-scale models.
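The sketch below illustrates the spirit of this training-free, iterative pruning loop in PyTorch. It assumes a model whose removable blocks live in an `nn.ModuleList` (referred to here as `model.backbone.layers`), a Hugging Face-style output with a `.logits` field, and a small calibration loader; these names, and the use of calibration perplexity as the importance signal, are illustrative assumptions rather than the exact Mamba-Shedder implementation.

```python
# Illustrative sketch of training-free, iterative block removal in the spirit of
# Mamba-Shedder. `model.backbone.layers`, `.logits`, and `calib_loader` are
# assumptions for this example, not the actual Mamba-Shedder code.
import math

import torch
import torch.nn as nn


@torch.no_grad()
def perplexity(model, calib_loader, device="cuda"):
    """Average next-token perplexity over a small calibration set."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in calib_loader:
        input_ids = batch["input_ids"].to(device)
        logits = model(input_ids).logits  # assumes an HF-style model output
        loss = nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_loss / total_tokens)


@torch.no_grad()
def prune_iteratively(model, calib_loader, num_to_remove, device="cuda"):
    """Repeatedly drop the block whose removal hurts calibration perplexity the least."""
    layers = model.backbone.layers  # assumed nn.ModuleList of candidate blocks
    for _ in range(num_to_remove):
        best_idx, best_ppl = None, float("inf")
        for idx, block in enumerate(layers):
            if isinstance(block, nn.Identity):
                continue                          # already removed
            layers[idx] = nn.Identity()           # temporarily bypass the block
            ppl = perplexity(model, calib_loader, device)
            layers[idx] = block                   # restore it
            if ppl < best_ppl:
                best_idx, best_ppl = idx, ppl
        layers[best_idx] = nn.Identity()          # permanently drop the least critical block
        print(f"Removed block {best_idx}, calibration perplexity = {best_ppl:.2f}")
    return model
```

In this simplified view, the block whose temporary removal degrades calibration perplexity the least is treated as the least essential and is dropped permanently before the next iteration.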
Figure 1. Mamba-Shedder targets specific components of Mamba-based models and examines the impact of removing the identified redundancies.
Removing Redundancies and Recovery Tuning of Pruned Models
Depending on the target Mamba-based model, Mamba-Shedder finds opportunities for component removal with a minor impact on the model’s accuracy, as observed in Figure 2 (more detailed results can be found in our paper). This pruning process can lead to improvements in computational efficiency and inference speed.
Figure 2. SSM removal on Mamba-2 models. The model can tolerate the removal of several components with a minor effect on its overall accuracy and perplexity.
Intel Labs researchers have also explored recovering downstream accuracy in pruned models by adding a post-compression stage known as recovery tuning. With this stage, Mamba-Shedder recovers much of the accuracy lost to pruning, reaching performance levels similar to the original, unpruned model while retaining a smaller memory and computational footprint. This highlights the benefit of a leaner, more efficient model that does not sacrifice accuracy.
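As a minimal sketch of what such a recovery-tuning stage can look like, the snippet below briefly fine-tunes the pruned model with a standard next-token prediction objective. The data loader, optimizer settings, and number of steps are illustrative assumptions; the exact recipe used in the Mamba-Shedder experiments is described in the paper.

```python
# Minimal sketch of post-compression recovery tuning: a short fine-tune of the
# pruned model to close the accuracy gap. `recovery_loader`, the learning rate,
# and the step budget are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


def recovery_tune(pruned_model, recovery_loader, steps=1000, lr=1e-5, device="cuda"):
    pruned_model.train()
    optimizer = torch.optim.AdamW(pruned_model.parameters(), lr=lr)
    step = 0
    while step < steps:
        for batch in recovery_loader:
            input_ids = batch["input_ids"].to(device)
            logits = pruned_model(input_ids).logits   # assumes an HF-style output
            loss = nn.functional.cross_entropy(       # next-token prediction loss
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= steps:
                break
    return pruned_model
```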
Accelerating AI Model Inference with Innovative Solutions
Mamba-Shedder’s findings are orthogonal to other compression techniques, such as weight compression. As a follow-up to the original Mamba-Shedder work, Intel Labs researchers are exploring the application of weight compression techniques to pruned models, resulting in additional acceleration during inference.
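The sketch below shows why the two techniques compose: pruning removes entire components, and weight compression then shrinks the weights that remain. It applies a simple per-output-channel symmetric int8 quantization to the surviving linear layers; this is a generic illustration, not the specific weight-compression method being explored at Intel Labs.

```python
# Illustrative only: a simple per-channel int8 weight quantization pass applied
# to the linear layers that survive pruning. The weight-compression techniques
# under exploration at Intel Labs may differ from this generic sketch.
import torch
import torch.nn as nn


@torch.no_grad()
def quantize_linear_weights(model):
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            # per-output-channel symmetric scale mapping weights into [-127, 127]
            scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
            w_int8 = torch.clamp(torch.round(w / scale), -127, 127)
            # store dequantized weights here for simplicity; a real deployment
            # would keep the int8 tensors plus the scales
            module.weight.data = w_int8 * scale
    return model
```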
The work on Mamba-Shedder by Intel Labs underscores the potential of selective structured state space models as efficient alternatives to Transformer-based models. By identifying and removing redundant components, Mamba-Shedder enhances the computational efficiency and inference speed of Mamba-based models. The recovery tuning stage further ensures that the pruned models maintain high accuracy, making them viable for practical applications. This research opens new avenues for optimizing model architectures and compression techniques, paving the way for more efficient and powerful AI models.
Mamba-Shedder’s code is available on Intel Labs’ GitHub space.
References
Dong, X., Fu, Y., Diao, S., Byeon, W., Chen, Z., Mahabaleshwarkar, A. S., Liu, S., Matthijs, V. K., Chen, M., Suhara, Y., Lin, Y., Kautz, J., & Molchanov, P. 2024. Hymba: A Hybrid-head Architecture for Small Language Models. arXiv. https://arxiv.org/abs/2411.13676
Gu, A., & Dao, T. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv. https://arxiv.org/abs/2312.00752
Muñoz, J. P., Yuan, J., & Jain, N. 2025. Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). https://aclanthology.org/2025.naacl-long.195/
Zhong, L., Wan, F., Chen, R., Quan, X., & Li, L. 2024. BlockPruner: Fine-grained Pruning for Large Language Models. arXiv. https://arxiv.org/abs/2406.10594