Learn how the DeepSeek-R1 distilled reasoning model performs and see how it works on Intel hardware
DeepSeek* just launched its first-generation reasoning model, DeepSeek-R1, along with several distilled versions. Although it may look like yet another strong model claiming to outperform existing models on complex tasks, the real excitement lies in its ability to transfer advanced reasoning to a small language model (SLM).
Large language models (LLMs) can be challenging for enterprises to test and deploy locally due to their high computational requirements, complex infrastructure needs, and dependence on large-scale hardware accelerators. As a result, applications often rely on external APIs, which offer a practical alternative and support a wide range of use cases. In other scenarios, factors such as operational constraints, strategic priorities, or specific deployment requirements drive the need to run models locally. Addressing these scenarios effectively requires solutions that optimize performance, scalability, and accessibility.
To address these challenges, SLMs with advanced reasoning capabilities appear to bridge the gap between large-scale performance and local or affordable deployment using more readily available compute resources. These small, yet highly capable, models offer complex reasoning while maintaining efficiency and optimizing resource consumption. Enterprises can use them as effective solutions for scalable, cost-effective, and reliable AI applications.
But how effective are they, really?
How Do Reasoning Models Work?
A reasoning model generates output differently from an LLM. An LLM generates its first token immediately based on statistical likelihood, optimizing for fluency and speed. In contrast, reasoning models may delay generation to plan intermediate steps, prioritizing logical accuracy over rapid response.
| Feature | LLMs | Reasoning Models |
| --- | --- | --- |
| Speed of first token | Fast | Delayed (due to planning) |
| Mechanism | Predicts tokens based on statistical likelihood | Uses intermediate reasoning before token selection |
| Fluency vs. accuracy | Prioritizes fluency | Prioritizes accuracy |
| Patterns vs. logical steps | Follows patterns from data | Follows logical steps |
This distinction becomes relevant as stronger reasoning capabilities in SLMs lead to better problem-solving, improved code generation, and more reliable AI assistants powered by techniques such as retrieval-augmented generation (RAG).
DeepSeek-R1 exemplifies this distinction with its innovative Group Relative Policy Optimization (GRPO) approach, which evaluates and improves its outputs autonomously without relying on traditional external reward models. Additionally, DeepSeek-R1 incorporates chain-of-thought (CoT) reasoning, allowing the model to break down complex tasks into small, logical steps, leading to more transparent and accurate outputs.
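To make the GRPO idea concrete, here is a minimal Python sketch of its core step, group-relative advantage estimation. This illustrates the published technique only, not DeepSeek's implementation, and it omits the clipped policy-ratio objective and KL penalty that the full algorithm also uses:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: each sampled answer is scored against the
    mean and standard deviation of its own group of samples, so no separate
    learned critic (value model) is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four completions sampled for one prompt, graded by a rule-based reward
# (1.0 = correct final answer, 0.0 = incorrect):
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [ 1. -1. -1.  1.]
```

Correct answers in a group get positive advantages and incorrect ones get negative advantages, so the policy is pushed toward whatever reasoning produced the better outputs.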
Testing DeepSeek 8B vs Llama 3.1 8B
We didn’t want to be mere spectators, so we put it to the test! Small models can run on laptops, so we used an AI PC with an Intel® Core™ Ultra 7 processor, which delivers high performance-per-watt AI inference. The AI PC laptop we used had 32 GB of RAM.
We used a DeepSeek-distilled SLM, DeepSeek-R1-Distill-Llama-8B. A distilled model is a small, optimized version of a larger AI model, trained to retain most of its capabilities while reducing the compute resource requirements.
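If you want to try the same distilled model on your own Intel hardware, a minimal sketch using Hugging Face Optimum for Intel with the OpenVINO backend might look like the following. This is an illustrative setup we are assuming for the post, not necessarily the exact configuration used in our test:

```python
# pip install optimum[openvino]
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on first load.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

prompt = ("Which weighs more: a pound of water, two pounds of bricks, "
          "a pound of feathers, or three pounds of air?")
inputs = tokenizer(prompt, return_tensors="pt")
# Reasoning models emit a long <think> trace, so allow plenty of new tokens.
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```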
Check out the video of this test:
With knowledge distillation, a large, high-performing model (the "teacher," DeepSeek-R1) transfers its knowledge to a smaller model (the "student," built on the Llama architecture). This process reduces size and computational requirements while preserving, as much as possible, the teacher's reasoning ability, making the model faster and more practical for local use. We compared it with Llama 3.1 8B, a leading general-purpose LLM in the same parameter-size class (and thus requiring similar deployment resources), and we evaluated their performance using a logic-based question.
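For intuition, the classic form of knowledge distillation trains the student to match the teacher's softened output distribution. The PyTorch sketch below shows that generic objective; note that DeepSeek's distilled models were actually produced by fine-tuning on reasoning samples generated by DeepSeek-R1, so treat this as background on the technique rather than their exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation: train the student to match the teacher's
    temperature-softened token distribution."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)
```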
Which weighs more: a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?
True Answer: Three pounds of air.
Llama 3.1 8B
A classic example of a trick question! In each case, the answer is... they all weigh the same amount.
So, in each case, the answer is simply "a pound" or the number of pounds mentioned.
DeepSeek-R1-Distill-Llama-8B
<think>Okay, so I'm trying to figure out which of these items weighs more. The options are: a pound of water, two pounds of bricks, a pound of feathers, and three pounds of air. Hmm, at first glance, it seems like some of these might be trickier than they look because they're not all physical objects you can easily compare. Let me start by breaking down each item. A "pound" is a unit of weight, so when someone says a pound of water, that's exactly 16 fluid ounces, right? Similarly, a pound of feathers would also be 16 fluid ounces since it's just the weight of the feathers in a pound. Two pounds of bricks sounds straightforward too; that's two times 16 ounces, so 32 ounces or 2 pounds. Now, the tricky part might be with three pounds of air. Air is something you can't hold like a solid object, but in terms of weight, it does have mass. So, three pounds of air would actually be equivalent to three pounds in weight, even though it's not something you can physically pick up. But wait, sometimes when people talk about air weight, they might think of it differently, especially in things like airplane weighings where air is considered negligible because it's so light or maybe not counted at all. Let me think about this again. A pound of water is 16 oz, two pounds of bricks would be 32 oz, a pound of feathers is also 16 oz. So far, two pounds of bricks seem to weigh the most in terms of ounces. But then there's three pounds of air, which would be 48 ounces if we're considering it as weight. Wait, that can't be right because three pounds is three times a pound, so each pound is 16 ounces. Therefore, three pounds would be 48 ounces. But wait, I might be confusing fluid ounces with weight ounces here. Actually, when dealing with air, sometimes it's measured in different units like cubic feet or liters, but the question specifically mentions pounds, which are a unit of weight. So, if we're talking about weight, three pounds of air would indeed be 48 ounces. However, in reality, when measuring air for weight, things get tricky because air is compressible and its density changes with altitude and temperature. In everyday terms, though, when someone talks about the weight of air, they might not mean it literally as a physical object but rather referring to atmospheric pressure or something else. But in this question, I think we're supposed to take "three pounds of air" at face value as 3 pounds, which would be 48 ounces. So comparing all options: 16 oz (water) and 16 oz (feathers), 32 oz (bricks), and 48 oz (air). So three pounds of air is the heaviest. But wait, that seems counterintuitive because we can't feel the weight of air as much as a physical object. Let me double-check. A pound of water is about 16 oz, which is roughly 0.45 kg. Two pounds of bricks would be 2 kg, and a pound of feathers is also 0.45 kg. Three pounds of air, if we consider standard conditions (like at sea level), the density of air is about 1.225 kg/m³. So three pounds in kilograms is 3 / 2.20462 ≈ 1.35 kg. Then, using density, volume would be mass divided by density: 1.35 / 1.225 ≈ 1.10 m³. But that's volume. Since we're comparing weights (mass), the air itself weighs 1.35 kg, which is about 3 pounds. Wait, I'm getting confused here. If three pounds of air has a mass of three pounds, then it's equivalent in weight to three pounds. So when someone says "three pounds of air," they mean that the air itself weighs three pounds. 
Therefore, comparing all options, the three-pound air is the heaviest because it's 3 pounds versus the others being 1 pound each or two pounds. But I might be overcomplicating this. The question is straightforward: which weighs more? So: - A pound of water = 1 lb - Two pounds of bricks = 2 lb - A pound of feathers = 1 lb - Three pounds of air = 3 lb So, the three-pound air would weigh the most at 3 lb. Wait, but sometimes in physics, air is considered to have negligible weight because it's a gas, but that's more about its effect on objects. Here, the question is asking about the weight of the air itself, so it's three pounds versus others being less. I think I've got it now. The three-pound air is the heaviest. </think>
The item that weighs the most is three pounds of air.
Step-by-Step Explanation:
Answer: Three pounds of air weighs more than the other items listed.
NOTE: The entire answer was intentionally added to the post so you can follow each model’s reasoning.
DeepSeek-R1 applied careful reasoning to correctly determine the answer, while Llama 3.1 8B made a logical misstep and reached the wrong conclusion. However, DeepSeek-R1 took longer to arrive at its conclusion, which increases response time. That trade-off matters for enterprise use cases, which need a model that understands the prompt in context, delivers accurate and contextually aware responses quickly, and integrates seamlessly into production environments without compromising scalability or efficiency.
Why Use a Small Language Model?
Although the reasoning results are impressive, it’s not just about having a model that can solve logic problems. For enterprises, the ideal solution is a model that balances speed, accuracy, and reasoning power, meeting the demands of applications while making the most of the wide variety of compute resources typically available across an enterprise, from CPUs to specialized accelerators. And when it comes to deploying applications, open source frameworks like the Open Platform for Enterprise AI (OPEA) let organizations seamlessly deploy, optimize, and scale AI models across diverse environments, from on-premises infrastructure to cloud and edge devices.
For enterprises, there are two key takeaways:
- Accuracy/reasoning: An SLM that excels at reasoning tasks can replace much larger models (for example, 70B parameters) in some use cases, making AI applications more efficient.
- Hardware needs: SLMs are light and can run even on regular PCs. This not only enables developers to test their AI applications on their laptops, but it also opens up a wide range of deployment possibilities, like more efficient external APIs and local deployments.
But there’s more to consider. Your first thought might be, “If I can run it on my PC, do I need a GPU anymore?” While this is partly true, in that an SLM does help you avoid costly hardware, there are nuances to consider.
When you deploy your model for your AI application, you will need to consider its scalability. In other words, how many users will interact with it, and how many requests will it handle?
At higher usage levels of your AI application, more powerful hardware becomes necessary. Whether that means the cloud, an on-site datacenter, or external APIs, Intel® Xeon® Scalable processors are widely available and capable of running SLM inference. For added capacity, you can add dedicated AI accelerators such as Intel® Gaudi®. Put simply, an 8B model will serve more users than a 70B model on the same hardware.
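A back-of-the-envelope calculation shows why. Model weights alone at 16-bit precision take about 2 bytes per parameter, so an 8B model needs roughly 16 GB versus roughly 140 GB for a 70B model, leaving far more headroom on the same hardware for concurrent requests. The sketch below ignores KV-cache and activation memory, which also grow with the number of users:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone (2 bytes/param = FP16/BF16)."""
    # params_billions * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    return params_billions * bytes_per_param

print(weight_memory_gb(8))   # ~16 GB for an 8B model
print(weight_memory_gb(70))  # ~140 GB for a 70B model
```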
It is truly a game-changer to be able to develop and test highly capable AI applications with SLMs using local AI PCs, with the flexibility to deploy to readily available datacenter or PC-based compute resources.
Conclusion
This blog showcased one of several tests we ran using multiple distilled DeepSeek-R1 models. We also tested mathematical operations from the CHAMP dataset. In most cases, DeepSeek-R1 outperformed other models on reasoning tasks.
However, its performance varied across our tests, highlighting opportunities for further enhancement. DeepSeek acknowledges these limitations in its paper, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Despite this, the progress made with DeepSeek-R1 is notable. It represents a step toward creating more cost-efficient AI, bringing developers closer to cutting-edge technology.
Did You Try It? Experience DeepSeek with Intel Hardware
The ideal solution balances speed, accuracy, and reasoning power, meeting enterprise demands while minimizing processing time and resource usage. With that in mind, explore DeepSeek for yourself on these Intel platforms:
- AI PC: Intel® Tiber™ AI Cloud provides access to AI-powered PC capabilities, enabling efficient on-device AI workloads with improved performance and responsiveness.
- Intel® Xeon® Scalable processors: Designed for high-performance computing, Xeon processors are available through multiple cloud service providers. Additionally, Intel Tiber AI Cloud offers access to cutting-edge Xeon-based AI infrastructure for enhanced scalability and workload optimization.
- Intel® Gaudi® 2 AI accelerators: Tailored to improve deep learning price-performance, Gaudi 2 is optimized for training and inference in large-scale AI models. See Denvr Dataworks for details on deployment and availability.
Check out the tests we performed:
- How Well Does DeepSeek Reason? I Put It to the Test on My AI PC
- How DeepSeek Applies Reasoning to Math Problems
About the Author
Ezequiel Lanza, Open Source AI Evangelist, Intel
Ezequiel Lanza is an Intel open source AI evangelist, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X and LinkedIn.