Authors:
Sri Raj Aryan Karumuri, Sr Solutions Engineer, Intel Liftoff
Rahul Unnikrishnan Nair, Head of Engineering, Intel Liftoff
As visual branding becomes more pervasive, the ability to identify companies from logos and instantly retrieve relevant corporate information offers immense value, especially for market intelligence, brand analysis, and competitive research. This article walks through the implementation of a multimodal AI agent that combines image understanding and web search capabilities using local Hugging Face models, LangChain agents, and Intel® XPU-accelerated inference. By integrating tools like Moondream2 for vision-language understanding and DuckDuckGo Search, the system takes an image input (a company logo), identifies the brand, and returns structured information such as company name, headquarters, website, and domain expertise. This solution runs locally and is optimized for Intel hardware, making it suitable for secure, cost-effective deployment scenarios.
The Intel® Data Center GPU Max 1100: Powering Advanced AI
The performance described in this article, particularly the acceleration achieved using Intel Extension for PyTorch (IPEX), is enabled by powerful hardware like the Intel® Data Center GPU Max Series. The GPU (Max 1100) is available in the free Intel® Tiber™ AI Cloud JupyterLab environment and as dedicated instances:
- Compute Architecture (Xe-HPC):
- Xe-cores: 56 dedicated cores forming the foundation for GPU compute tasks.
- Intel® Xe Matrix Extensions (XMX) Engines: 448 engines providing deep systolic arrays optimized for accelerating the dense matrix and vector operations prevalent in AI and deep learning models.
- Vector Engines: 448 engines complementing the XMX units for broader parallel processing tasks.
- Ray Tracing Units: 56 units for hardware-accelerated ray tracing, enhancing visualization capabilities.
- Memory Hierarchy:
- High Bandwidth Memory (HBM2e): 48 GB of HBM2e memory delivers 1.23 TB/s of bandwidth, crucial for large datasets and complex models like those used in multimodal embeddings.
- Cache: Features 28 MB L1 and 108 MB L2 cache to keep data close to the compute units, minimizing latency.
- Connectivity:
- PCIe Gen 5: Utilizes a fast PCIe Gen 5 x16 host interface for high-speed data transfer between the CPU and GPU.
- Software Ecosystem (oneAPI):
The Max Series GPUs are designed to work seamlessly with the Intel oneAPI open, standards-based programming model. This allows developers to use frameworks such as Hugging Face Transformers, PyTorch, and Intel Extension for PyTorch, along with other libraries optimized for Intel architectures (CPUs and GPUs), to accelerate AI pipelines without proprietary lock-in.
What is this Code About?
This code walks through the implementation of a multimodal intelligent agent capable of identifying companies from logo images and retrieving structured business information like company name, headquarters, website, and industry. It uses LangChain to orchestrate reasoning steps and tools, and leverages Hugging Face models for both vision and language tasks—running efficiently on Intel® hardware using the Intel Extension for PyTorch (IPEX).
The main components of this code are:
- Multimodal Reasoning Agent (LangChain):
A LangChain agent manages the decision-making process, deciding when to invoke tools like image analysis and web search. It follows a structured reasoning pattern (ReAct) to determine intermediate actions before producing a final answer.
- LLM:
The Qwen/Qwen2.5-3B-Instruct model drives the agent's reasoning. It is an instruction-tuned text model that interprets tool outputs, plans the next action, and composes the final structured answer.
- Vision:
The vikhyatk/moondream2 model is used to interpret logos in images and extract brand names. It is a vision-language model that accepts image and text prompts, returning natural language descriptions or answers.
- Real-Time Web Search:
The agent uses DuckDuckGoSearchRun from LangChain's community toolset to fetch real-time company information such as official websites, headquarters, and business activities based on the detected brand name.
- Hardware Acceleration with Intel IPEX:
The code targets Intel's XPU device (Intel's PyTorch device type for its GPUs and accelerators) with IPEX optimizations, falling back to CPU where no XPU is available. This speeds up image processing and model inference, making the setup well suited to edge or resource-conscious environments. A minimal device-selection sketch follows this list.
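The article's code assumes an xpu device is present; as a hedged sketch of what the fallback selection could look like (torch.xpu becomes available once IPEX is imported):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the "xpu" device with PyTorch

# Prefer the Intel XPU when one is visible; otherwise fall back to CPU.
device = "xpu" if torch.xpu.is_available() else "cpu"
print(f"Running on: {device}")
if device == "xpu":
    print("Device:", torch.xpu.get_device_name(0))
```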
Use Cases and Applications
This multimodal agent setup is ideal for scenarios like:
- Market Intelligence & Competitive Analysis – Identify competitors by logo and retrieve insights in real-time.
- Brand Monitoring & Compliance – Recognize brands appearing in user-generated or monitored content.
- Enterprise Knowledge Systems – Enable internal search engines that accept logo images as input.
- Educational Demos for AI/ML Practitioners – Showcase the power of combining vision-language models with reasoning agents.
- Retail or Advertising Analytics – Automatically detect brands featured in ads, packaging, or social media imagery.
Requirements:
- torch – For running deep learning models used in image and text understanding.
- intel_extension_for_pytorch (IPEX) – Enables optimized PyTorch performance on Intel hardware, including support for XPU acceleration.
- transformers – For loading and using the vision-language model (Moondream2) and language model (Qwen 2.5 3B).
- langchain – Provides the agent framework that integrates tools like image analysis and web search for intelligent multi-step reasoning.
- langchain_community – Contains the DuckDuckGoSearchRun tool used to fetch real-time information from the internet.
- Pillow – Used to load and process image data for the image analysis tool.
Code Breakdown:
This project integrates image understanding and web search tools into a reasoning agent that can process an image (e.g., a company logo), identify the company, and retrieve detailed information such as its name, headquarters, and website.
Step 1: Import Necessary Libraries
The code starts by importing all the essential Python libraries required for building a multimodal AI agent.
# Core Libs
import os
from PIL import Image
# Torch with optimizations
import torch
import intel_extension_for_pytorch as ipex # For IPEX acceleration
# LangChain Core
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableSequence
# LangChain Agent Components
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
# LangChain Tool Extensions
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_huggingface import HuggingFacePipeline
# Transformers for Local LLM and Vision Model
from transformers import (
pipeline,
AutoTokenizer,
AutoModelForCausalLM,
AutoProcessor,
)
Step 2: Define the ImageAnalyzer Class:
This block defines an ImageAnalyzer class that uses a locally hosted vision-language model (vikhyatk/moondream2) to identify content in images—specifically logos. It preprocesses the image and prompt, runs inference on the image using the model, and returns a textual description. The class supports hardware acceleration by automatically selecting an Intel XPU if available.
class ImageAnalyzer:
    def __init__(self, model_name="vikhyatk/moondream2", device="xpu"):
        self.device = device
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            trust_remote_code=True,
            torch_dtype=torch.float16,
        ).to(self.device)
        self.model.eval()
        if self.device == "xpu":
            self.model = ipex.optimize(self.model, dtype=torch.float16, inplace=True)

    def analyze(self, image_path: str, prompt: str = "Identify the logo shown in this image?") -> str:
        image = Image.open(image_path).convert("RGB")
        with torch.inference_mode():
            answer_dict = self.model.query(image=image, question=prompt)
        return answer_dict["answer"]
Step 3: Define the Image Analysis Tool Function
This function image_tool_fn serves as a simple wrapper around the ImageAnalyzer class. It takes an image path as input, creates an instance of ImageAnalyzer, and returns the analysis result (e.g., a company name or logo description) by calling the analyze method.
def image_tool_fn(image_path: str) -> str:
    analyzer = ImageAnalyzer()
    return analyzer.analyze(image_path)
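One thing to note: image_tool_fn builds a fresh ImageAnalyzer, and therefore reloads the Moondream2 weights, on every call. If the tool will be invoked repeatedly, caching the instance avoids that cost. A minimal sketch using the class defined above (the helper name get_analyzer is ours, not part of the original code):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_analyzer() -> ImageAnalyzer:
    # Load the vision model once and reuse it across tool calls.
    return ImageAnalyzer()

def image_tool_fn(image_path: str) -> str:
    return get_analyzer().analyze(image_path)
```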
Step 4: Define Tools for Image Analysis and Web Search
In this step, two tools are created using LangChain's Tool class:
- ImageAnalyzer: Wraps the image analysis function to identify company logos in images.
- duckduckgo_search: Wraps the DuckDuckGo search tool to fetch company-related information like headquarters, website, and services.
image_tool = Tool(
    name="ImageAnalyzer",
    func=image_tool_fn,
    description="Analyze an image and identify the company/logo shown.",
)

search_tool = Tool(
    name="duckduckgo_search",
    func=DuckDuckGoSearchRun().run,  # pass the tool's run() method as the callable
    description="Search for company details like HQ, website, services etc.",
)
Step 5: Load Local Language Model (LLM) with Intel Hardware Acceleration
This function initializes and returns a Hugging Face language model pipeline optimized for inference on Intel hardware (e.g., XPU via IPEX):
- Model Used: “Qwen/Qwen2.5-3B-Instruct”, a lightweight instruction-tuned language model ideal for reasoning tasks.
- Precision: Uses “torch.float16” for performance efficiency.
- Hardware Acceleration: The model is moved to the xpu device, Intel's PyTorch device type for its GPUs and other accelerators.
- Pipeline Parameters:
- max_new_tokens: Limits response length.
- temperature, top_p, do_sample: Controls randomness and diversity in generation.
- repetition_penalty: Discourages repeated phrases.
The function returns a HuggingFacePipeline, which can be integrated directly with LangChain.
def load_local_llm():
    model_id = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")
    model = ipex.optimize(model, dtype=torch.float16)  # IPEX inference optimizations
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )
    return HuggingFacePipeline(pipeline=pipe)
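As a quick sanity check, the returned HuggingFacePipeline is a standard LangChain runnable, so it can be invoked directly (the prompt below is just an illustrative example):

```python
llm = load_local_llm()
# HuggingFacePipeline implements the LangChain Runnable interface.
print(llm.invoke("In one sentence, what does Intel Corporation do?"))
```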
Step 6: Define a Custom Prompt Template for the Agent
This function defines a structured prompt that guides the agent in how to interact with the tools (ImageAnalyzer and duckduckgo_search) and how to format its reasoning steps:
- PromptTemplate: Constructs the interaction format dynamically using available tools.
- Agent Instructions:
- Starts with a question.
- Thinks step-by-step (Thought:).
- Chooses an appropriate tool (Action:).
- Feeds the tool required input (Action Input:).
- Collects the tool’s response (Observation:).
- Repeats this until it derives the final answer.
- Final Answer: Should include company name, headquarters, website, and what the company does.
def create_custom_prompt():
    return PromptTemplate.from_template("""
You are an intelligent agent that helps identify companies from logos and find detailed company information.
Tools available:
{tools}
Use this format:
Question: the question to solve
Thought: your reasoning step
Action: tool to use, one of [{tool_names}]
Action Input: the input for the tool
Observation: the result
... (repeat Thought/Action/Action Input/Observation)
Thought: I now know the final answer
Final Answer: the complete answer with company name, HQ, website, services
Begin!
Question: {input}
""")
Step 7: Create the Agent Using Custom Tools and Prompt
This function creates the agent by combining:
- Tools:
- ImageAnalyzer: Analyzes images to detect logos.
- DuckDuckGoSearch: Searches the web for company information.
- Local Language Model: Handles reasoning and generates responses.
- Prompt: Structures the agent's workflow to identify the company and gather details.
- LLM Chain: Links the prompt with the language model for decision-making.
- Agent Initialization: The “initialize_agent” function creates the agent, setting parameters such as:
- Agent Type: Uses CHAT_ZERO_SHOT_REACT_DESCRIPTION, which allows the agent to react to unstructured queries and decide what actions to take.
- Handling Parsing Errors: Ensures that errors are appropriately handled during the reasoning process.
- Max Iterations: Limits the number of steps the agent will take during an interaction.
The agent identifies logos, searches for company info, and provides a detailed response.
def create_agent():
    tools = [image_tool, search_tool]
    llm = load_local_llm()
    prompt = create_custom_prompt()
    chain = prompt | llm
    return initialize_agent(
        tools=tools,
        llm=llm,
        agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
        llm_chain=chain,
        handle_parsing_errors=True,
        verbose=True,
        max_iterations=5,
    )
Step 8: Running the Agent for Logo Analysis and Company Details
This block performs the following:
- Image Analysis: The image (logo1.png) is analyzed to detect the company name using image_tool_fn.
- Query Generation: A query is created to fetch details like company name, headquarters, website, and services.
- Agent Execution: The agent is created and invoked to gather company information.
- Output: The final details (company name, headquarters, website, services) are displayed.
if __name__ == "__main__":
    image_path = "./logo1.png"
    print("🧠 Analyzing image...")
    result = image_tool_fn(image_path)
    print("Detected logo/company name:", result)
    query = f"Identify the company shown in the image as '{result}' and give me its details like name, headquarters, website, and what it does."
    agent = create_agent()
    print("\nRunning agent...")
    response = agent.invoke(query)
    print("\nFinal Answer:\n", response['output'])
Output:
Given Input Image-1:
Output:
[Output Text]: The company shown in the image is Intel Corporation, a technology company based in Santa Clara, California, USA, known for producing microprocessors, flash memory, and other semiconductor products. Their website is www.intel.com.
Given Input Image-2:
Output:
[Output Text]: The logo shown in the provided image belongs to Toro Company, a US-based manufacturer of outdoor power equipment with its headquarters in Bloomington, Minnesota. Their official website is www.toro.com.
Given Input Image-3:
Output:
[Output Text]: The company shown in the image is Apple Inc., a leading technology company based in Cupertino, California, known for producing consumer electronics, computer software, and online services under the brand name Apple. Some of its popular products include iPhones, iPads, Macs, and Apple Watches.
Complete Runnable Code:
Below is the complete Python script for the use case discussed in this article. It demonstrates how to build a multimodal AI agent capable of identifying company logos from images and providing detailed company information. The script utilizes an image analysis tool, powered by the vikhyatk/moondream2 model, to extract logos from images. Additionally, it integrates a DuckDuckGo search tool to retrieve company-specific details such as name, headquarters, website, and services. By combining these tools with a local language model (Qwen/Qwen2.5-3B-Instruct), the agent can reason and provide comprehensive answers. The system is designed to leverage Intel’s hardware acceleration via IPEX to optimize performance, especially for demanding image analysis and text processing tasks.
Prerequisites:
Ensure you have the necessary libraries installed. The required libraries are torch, intel-extension-for-pytorch, langchain_core, langchain_community, langchain_huggingface, Pillow, duckduckgo-search, pyvips_binary, pyvips and transformers. You can install them using pip:
```
# check whether the required packages are already installed:
pip show torch intel-extension-for-pytorch langchain_core langchain_community langchain_huggingface Pillow transformers duckduckgo-search pyvips_binary pyvips
# if any are missing, install them via:
pip install torch intel-extension-for-pytorch langchain_core langchain_community langchain_huggingface Pillow transformers duckduckgo-search pyvips_binary pyvips
```
(Note: Ensure you install intel-extension-for-pytorch specifically for your hardware/environment if not using Intel Tiber AI Cloud.)
You can copy the entire code block below, save it as a Python file (e.g., multimodal_search.py), and run it to see the multimodal agent in action. (Note: Make sure a logo image exists at the path referenced in the script, e.g., ./logo3.png.)
# Core Libs
import os
from PIL import Image
# Torch with optimizations
import torch
import intel_extension_for_pytorch as ipex # For IPEX acceleration
# LangChain Core
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableSequence
# LangChain Agent Components
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
# LangChain Tool Extensions
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_huggingface import HuggingFacePipeline
# Transformers for Local LLM and Vision Model
from transformers import (
pipeline,
AutoTokenizer,
AutoProcessor,
AutoModelForCausalLM,
)
class ImageAnalyzer:
    def __init__(self, model_name="vikhyatk/moondream2", device="xpu"):
        self.device = device
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            trust_remote_code=True,
            torch_dtype=torch.float16,
        ).to(self.device)
        self.model.eval()
        if self.device == "xpu":
            self.model = ipex.optimize(self.model, dtype=torch.float16, inplace=True)

    def analyze(self, image_path: str, prompt: str = "Identify the logo shown in this image?") -> str:
        image = Image.open(image_path).convert("RGB")
        with torch.inference_mode():
            answer_dict = self.model.query(image=image, question=prompt)
        return answer_dict["answer"]
def image_tool_fn(image_path: str) -> str:
    analyzer = ImageAnalyzer()
    return analyzer.analyze(image_path)
image_tool = Tool(
    name="ImageAnalyzer",
    func=image_tool_fn,
    description="Analyze an image and identify the company/logo shown.",
)

search_tool = Tool(
    name="duckduckgo_search",
    func=DuckDuckGoSearchRun().run,  # pass the tool's run() method as the callable
    description="Search for company details like HQ, website, services etc.",
)
def load_local_llm():
    model_id = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")
    model = ipex.optimize(model, dtype=torch.float16)
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )
    return HuggingFacePipeline(pipeline=pipe)
def create_custom_prompt():
    return PromptTemplate.from_template("""
You are an intelligent agent that helps identify companies from logos and find detailed company information.
Tools available:
{tools}
Use this format:
Question: the question to solve
Thought: your reasoning step
Action: tool to use, one of [{tool_names}]
Action Input: the input for the tool
Observation: the result
... (repeat Thought/Action/Action Input/Observation)
Thought: I now know the final answer
Final Answer: the complete answer with company name, HQ, website, services
Begin!
Question: {input}
""")
def create_agent():
    tools = [image_tool, search_tool]
    llm = load_local_llm()
    prompt = create_custom_prompt()
    chain = prompt | llm
    return initialize_agent(
        tools=tools,
        llm=llm,
        agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
        llm_chain=chain,
        handle_parsing_errors=True,
        verbose=True,
        max_iterations=5,
    )
if __name__ == "__main__":
    image_path = "./logo3.png"
    print("🧠 Analyzing image...")
    result = image_tool_fn(image_path)
    print("Detected logo/company name:", result)
    query = f"Identify the company shown in the image as '{result}' and give me its details like name, headquarters, website, and what it does."
    agent = create_agent()
    print("\nRunning agent...")
    response = agent.invoke(query)
    print("\nFinal Answer:\n", response['output'])
Note on Model’s Output:
While the model is designed to generate useful responses based on patterns learned from large-scale datasets, it's important to keep the following in mind:
- Probabilistic Nature: Responses are generated based on probability and context. This means the model may occasionally produce inaccurate or imprecise information.
- Training Data Limitations: The quality and relevance of outputs depend heavily on the data used during training. Outdated or incomplete data may lead to gaps in knowledge or misleading answers.
- Version Differences: Outputs may vary across different model versions due to updates in architecture, training methods, or datasets. Consistency isn't guaranteed between versions.
Future Directions:
While this implementation demonstrates a powerful approach to building multimodal agents, there are several directions for further enhancement:
- Component Specialization: Each component (vision analysis, search, reasoning) could be further specialized and optimized for its specific task.
- Distributed Architecture: The system could evolve toward a more distributed architecture where components communicate through well-defined interfaces.
- Enhanced Output Processing: More sophisticated output processing could provide even cleaner, more structured results for end users (a minimal sketch follows this list).
- Additional Modalities: Beyond images and text, the system could incorporate other modalities like audio or video analysis.
This implementation serves as a foundation that can be extended and enhanced as requirements evolve and new capabilities become available.
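As a small illustration of the output-processing direction, the agent's free-text final answer could be post-processed into a structured record. This is a hedged sketch, not part of the original implementation; the field names and regex are assumptions:

```python
import re
from dataclasses import dataclass

@dataclass
class CompanyInfo:
    raw_answer: str    # the agent's full final answer
    website: str = ""  # first URL-like token found, if any

def parse_final_answer(text: str) -> CompanyInfo:
    # Minimal heuristic: extract the first URL-looking token.
    # A production system might instead ask the LLM for JSON and validate it.
    match = re.search(r"(?:https?://)?www\.[\w.-]+\.[a-z]{2,}", text, re.IGNORECASE)
    return CompanyInfo(raw_answer=text, website=match.group(0) if match else "")

# Usage: info = parse_final_answer(response["output"])
```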
Conclusion:
This script demonstrates how to build an intelligent multimodal agent that can identify companies from logos and fetch real-time company information using tools like a visual language model and web search. Leveraging LangChain’s agent framework, a local Hugging Face model for reasoning, and Intel’s hardware acceleration (via IPEX), this system enables seamless integration of image understanding and internet search. This use case is particularly useful in brand monitoring, market intelligence, and intelligent document processing, where visual data and factual lookups need to be combined in real-time.
Try it Yourself on Intel® Tiber™ AI Cloud: You can run this code and explore the performance of the Intel Data Center GPU Max 1100 directly. Intel Tiber™ AI Cloud offers:
- Free JupyterLab Environment: Get hands-on access to a Max 1100 GPU for training and experimentation by creating an account at cloud.intel.com and launching a GPU-accelerated notebook from the "Training" section.
- Virtual Machines & Bare Metal: Access single Max 1100 GPU VMs (starting at $0.39/hr/card) or powerful multi-GPU systems connected via high speed bridges. PoC credits are available for qualifying AI startups via the Liftoff program. Find more details on the Intel Tiber™ AI Cloud Pricing Page.
Related resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
6th Gen Intel® Xeon® Scalable Processor - Latest generation of enterprise server processors
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads