
Specialized Cognitive Experts Emerge in Large AI Reasoning Models


Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Man Luo, and Sungduk Yu are AI research scientists in the Multimodal Cognitive AI Lab, led by Vasudev Lal at Intel Labs. Chendi Xue is a machine learning engineer with the Product Software Group at Intel.

Highlights

  • Intel researchers found that DeepSeek-R1 demonstrates greater semantic specialization in expert routing compared to earlier mixture of experts (MoE) models.
  • The model exhibits structured cognitive processes, indicating expert specialization extends to reasoning strategies.
  • Intel research provides insights into optimizing MoE models for enhanced reasoning and efficiency.

Intel research scientists have found preliminary evidence of semantic specialization in the expert layers of DeepSeek-R1, a phenomenon that has not been observed before in mixture of experts (MoE) artificial intelligence (AI) models. In a newly published study, the researchers identified a connection between deep learning architecture and function in the powerful open-weight MoE model. Released earlier this year, the 671 billion parameter DeepSeek-R1 shows reasoning capabilities comparable to proprietary frontier models such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet.

MoE models are large language models (LLMs) with sparse feed-forward networks (FFNs). Instead of passing data through a dense all-to-all connection, where every neuron in one layer is connected to every neuron in the next, an MoE module first directs the input through a router that sends the data to a subset of smaller FFNs, referred to as experts. The router selects the most appropriate experts for each input. Past research has studied whether these experts specialize in particular types of inputs and found evidence largely of token-level specialization: for example, inputs composed of similar letters or characters were routed to similar sets of experts.
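To make the routing mechanism concrete, the following is a minimal sketch of a top-k MoE layer in PyTorch. The layer sizes, number of experts, and k value are illustrative placeholders, far smaller than DeepSeek-R1's configuration of eight active experts out of 256, and details such as load balancing and shared experts are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy sparse MoE layer: a router scores all experts per token,
    and only the top-k expert FFNs process each input."""
    def __init__(self, d_model=128, d_hidden=256, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # produces routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        gate_logits = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)  # pick top-k experts
        weights = F.softmax(weights, dim=-1)             # normalize gate weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):       # per-token loop for clarity, not speed
            for s in range(self.k):
                e = idx[t, s].item()
                out[t] += weights[t, s] * self.experts[e](x[t])
        return out, idx  # idx exposes the routing decisions for analysis

tokens = torch.randn(4, 128)        # four token representations
_, routing = TopKMoELayer()(tokens)
print(routing)                      # which experts each token was sent to
```

Returning the expert indices alongside the output matters here: which experts a token is sent to is exactly the signal the specialization analyses below examine.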


Figure 1: Diagram detailing DeepSeek-V3 (and R1) architecture. The top right box shows how the feed-forward network is broken down into a mixture of experts module. Image credit: DeepSeek-V3 Technical Report.

Using a pair of controlled experiments, Intel researchers found evidence that, in contrast to smaller models, inputs to DeepSeek-R1 are routed based on their meaning, a phenomenon the team refers to as semantic specialization. This emergent property reveals a link between form and function in large deep learning architectures and deepens the scientific community’s understanding of how these powerful AI models work.

Investigating Semantic Specialization

The first experiment employed a word sense disambiguation (WSD) task using polysemous words, words that have more than one meaning. In the test, the target word appears in two sentences with either the same meaning (sense) or differing senses. If a polysemous word is routed differently depending on its sense, this is evidence that routing is based on meaning. For example:

  • I keep my money in a bank.
  • I kicked the ball by the river bank.

At each MoE module, eight of a possible 256 experts are chosen to process the input. By comparing the rate of overlap between these eight selected experts when the word sense was the same versus when it differed, the scientists could determine whether the routing mechanism was informed by semantic meaning.
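As a concrete illustration of the comparison, here is a small sketch of one plausible overlap metric, assuming the top-8 expert indices for the target word have already been extracted at each MoE layer; the function and variable names are hypothetical, and the study's exact metric may differ.

```python
import numpy as np

def overlap_rate(route_a, route_b, k=8):
    """Mean fraction of shared experts across MoE layers.

    route_a, route_b: one set of k expert indices per MoE layer,
    recorded at the target word's position in each sentence.
    """
    return float(np.mean([len(a & b) / k for a, b in zip(route_a, route_b)]))

# Toy example with two MoE layers: 7 of 8 experts shared in each.
sent1 = [{3, 17, 42, 88, 101, 120, 200, 255}, {0, 4, 9, 16, 25, 36, 49, 64}]
sent2 = [{3, 17, 42, 88, 101, 120, 200, 254}, {0, 4, 9, 16, 25, 36, 49, 63}]
print(overlap_rate(sent1, sent2))  # 0.875
```

Under semantic specialization, this rate should be systematically higher for same-sense sentence pairs than for different-sense pairs.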

The results showed that for DeepSeek-R1, expert overlap is significantly higher when the word sense is the same than when it differs. In contrast, the rate of expert overlap differs less between the two cases in earlier MoE models such as Mixtral. This suggests that DeepSeek-R1 exhibits stronger semantic specialization than previous models, which may contribute to its performance.

Cognitive Specialization in Reasoning

In the second experiment, the scientists analyzed DeepSeek-R1's reasoning structure using the DiscoveryWorld agentic AI environment as a testbed. DiscoveryWorld is a large-scale suite of agentic environments that tests an agent's ability to carry out the scientific method. For example, the team used Reactor Lab, where the agent must tune the frequency of quantum crystals to achieve the final goal of activating a reactor. To succeed, the agent must formulate and test hypotheses using available tools, the in-game literature, and its own memory.


Figure 2: Visual depiction of DeepSeek-R1 playing the Reactor Lab environment in DiscoveryWorld. For the experiment, R1 used a text prompt with a structured description of the environment (not the visual observation above).

At a high level, DeepSeek-R1's reasoning output on DiscoveryWorld displays many indicators reminiscent of System 2 thinking, such as backtracking, self-evaluation, and situational awareness. In cognitive science, System 1 and System 2 refer to different modes of thinking. With System 1 thinking, AI models rely on fast pattern-recognition techniques to make quick decisions; System 2 thinking involves processing information slowly and methodically, similar to how humans reason through complex problems.

Figure 3. Chain of thought output from DeepSeek-R1 in the Reactor Lab environment in DiscoveryWorld.

To analyze DeepSeek-R1’s reasoning capabilities and expert specialization in cognitive processes, the scientists trained a sparse autoencoder (SAE), a tool used to understand the internal workings of LLMs by identifying the features a model uses to make predictions. The SAE mapped internal activations to reasoning patterns, showing how experts specialize in cognitive processes.
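For readers unfamiliar with SAEs, below is a minimal sketch of the technique in PyTorch; the dimensions, sparsity coefficient, and single training step are illustrative assumptions rather than the study's actual setup. The idea is to learn an overcomplete dictionary of features such that each activation vector is reconstructed from only a few active features, which can then be matched to interpretable patterns such as reasoning behaviors.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode activations into a wider, sparse feature space, then
    decode back; an L1 penalty on the codes encourages sparsity."""
    def __init__(self, d_act=1024, d_feat=4096):
        super().__init__()
        self.enc = nn.Linear(d_act, d_feat)
        self.dec = nn.Linear(d_feat, d_act)

    def forward(self, x):
        codes = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(codes), codes

# One illustrative training step on cached model activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 1024)  # stand-in for real LLM activations
opt.zero_grad()
recon, codes = sae(acts)
# Reconstruction loss plus L1 sparsity penalty on the codes.
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()
loss.backward()
opt.step()
```

Once trained, features that fire consistently during behaviors such as backtracking or self-evaluation can be cross-referenced with the experts active at the same token positions.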

The findings indicate that DeepSeek-R1 follows a structured reasoning approach, incorporating self-evaluation and hypothesis testing. The team also discovered cognitive specialization, where different experts are responsible for distinct reasoning processes: R1 consistently chose a small set of experts for the reasoning patterns identified by the SAE, indicating that experts are not just semantically specialized but also drive cognitive processes such as high-level reasoning.

About the Author
AI Research Scientist at Intel Labs. Previously at Princeton and Oxford.