Bottom line for me, though, is the hardware is now playing nice. Time to build the software stack on top: select and tune models, and add in RAG. So a lot of work ahead. But early results? Very promising.
From a businessman and end-user perspective, there really needs to be more easily accessible documentation. The reality is that without using ChatGPT, Grok, and Gemma I might not have figured this out, although the lspci checking was my idea. Getting all the Docker switches right was all AI. It shouldn't be that hard.
For example, the GitHub readme lists LLMs that are validated, yet gives no parameters to get them to run. That would be really helpful to have. It's akin to taking a chemical formulation to production in my business: we keep lots of notes on how we did it. So the information should be there; just make it public, i.e., "these are the Python parameters we passed to load the model."
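Something like this is all I'm asking for: a short note per validated model. This is only a sketch of the format I mean, assuming the vLLM Python API the containers are built around; the model name and every value below are placeholders, not Intel's validated settings:

```python
# Hypothetical "how we loaded it" note -- all values are placeholders, not validated settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # one of the validated models from the readme
    tensor_parallel_size=8,       # one shard per B60 GPU in this box
    dtype="float16",              # precision used for the run
    max_model_len=8192,           # context length the run was set up for
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM is allowed to claim
)

outputs = llm.generate(
    ["Summarize our batch record retention policy in two sentences."],
    SamplingParams(temperature=0.2, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

One block like that per validated model would save people like me days of trial and error.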
Lastly, is there an AI or Pro card specific forum I am missing? It seems to me that would be better for everyone than being lumped in with gaming questions.
Thank you.
Hi 24MYP,
Thank you for contacting Intel Technical Support regarding your Intel LLM Scaler setup with multiple B60 cards. I can see you've made significant progress getting your hardware working and are now ready to build your software stack.
To better assist you with your LLM Scaler configuration and provide the most relevant guidance, I'd like to understand your specific requirements:
- What is your primary goal with this LLM setup? Are you focusing on inference, training, or both?
- What type of workloads are you planning to run? For example, are you working with specific model sizes, batch processing requirements, or real-time applications?
- What are your performance requirements? Are you optimizing for throughput, latency, or a balance of both?
- What is your intended use case? Is this for research, commercial applications, or development purposes?
- How many B60 cards are you running in your current setup?
Understanding your specific goals and requirements will help me connect you with the right resources and provide more targeted answers to your questions about Ubuntu Server vs Desktop performance, kernel validation, and documentation parameters.
Your feedback about documentation accessibility is valuable, and I want to make sure I address your needs effectively.
Best regards,
Dean R.
Intel Customer Support Technician
I am running four Maxsun dual-GPU B60 cards on a dual Xeon Gold 6430 system with 256 GB of system RAM.
Goals: use this machine to tune and update 4-5 smaller models and run initial inference. Later I will roll out a cluster of servers with 1-2 of the dual cards each to host models.
Use case is for my business. We are a small coatings and adhesives company that also makes OTC products. I will have models for several use cases.
1) Regulatory compliance
2) Lab R&D
3) Process improvement
4) General AI tuned to work in our company's realm.
5) I will likely also use a model to help me build an FDA-compliant database for production batch records.
Main goal is to keep our IP off of public AI.
So basically, with 8 total GPUs, this system is the big boy of the group: it will do tuning, set up RAG, and test out models for fit in their intended end uses. Later I expect I'll need at least three servers to comfortably handle the load, but I'm going to build them one at a time, likely on W790 boards.
The servers can easily run Ubuntu Server; this machine needs to be both, but it's a workstation first. I expect that once I have models tuned for each domain, I may be hosting with it during the week and tuning on weekends.
I need to be really clear about my abilities. I am a PC and Linux hobbyist mainly, but I did build our first company network and domain in 2000. Today it's a five-node PVE cluster with Ceph running five networks (I isolated Ceph and Ceph monitoring). Basically I'm self-taught, but I have effectively designed and operated my company's IT infrastructure for 27 years. I say that to say this: I kinda need "dumbed down" answers.
As for models, right now I'm tinkering with Mixtral and Llama just to get started. I can't get gpt-oss to run; it's missing the model type in its config file.
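In case anyone wants to reproduce that, the quickest check is just reading the model's config.json for the key (this is the generic Hugging Face layout, nothing gpt-oss specific; the path below is a placeholder for wherever the weights were downloaded):

```python
import json

# Placeholder path -- point it at the downloaded model snapshot.
with open("/models/gpt-oss/config.json") as f:
    cfg = json.load(f)

# transformers/vLLM pick the architecture from this key; if it's absent,
# the model won't load until the config (or the loader) is updated.
print(cfg.get("model_type", "<model_type missing>"))
```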
In a nutshell, we are a small business with a smaller IT budget than a large corporation. I see the Arc Battlematrix system as a perfect and cost-effective solution for companies like mine that can't afford $30k AI accelerators every few years. I like the scaler part of the Intel solution because I can infer that I can upgrade to the next series at a pace that matches my budget. My PVE cluster runs perfectly on older dual Xeon platforms; those are the servers I would begin to replace one by one with Gen5 PCIe systems as I continue my rollout. But I'm at step one: picking models and gathering tuning data.
Hope that helps. I kind of think I'm your target audience?
Hi 24MYP,
Thank you for the detailed information about your impressive 4x Maxsun dual GPU B60 setup for your coatings and adhesives business. Your use case for keeping AI in-house while maintaining cost-effectiveness is exactly what these solutions are designed for.
Before I provide specific guidance, I need to clarify a few technical details:
- What operating system are you currently running on the dual Xeon system?
- Have you successfully run any models yet on your current setup, and if so, what performance are you seeing?
Best regards,
Dean R.
Intel Customer Support Technician
I am running Kubuntu 25.04.
I am seeing excellent results. Here is a link to an X post I made last night after running a concurrent user test. https://x.com/CCdscustom33902/status/1988114266025849332?t=f9hpme7P5s-zSYcw5bbG7Q&s=19
I got those results on Mixtral 8x7b Instruct while also having Llama 3.1 70b loaded into VRAM. Xpu-smi showed about 95 W per GPU under that load.
Hi 24MYP,
Thank you for providing the additional technical details about your setup. I can see you're running Kubuntu 25.04 and achieving excellent results with your 4x Maxsun dual GPU B60 configuration. Your concurrent user test results with Mixtral 8x7b Instruct while having Llama 3.1 70b loaded into VRAM at approximately 95W per GPU are quite impressive.
Your business use case for keeping AI in-house while maintaining cost-effectiveness with the Intel Arc B60 cards is exactly the type of implementation these solutions are designed to support. I understand your questions about Ubuntu Server vs Desktop performance implications and kernel validation for your specific configuration.
I need to check this internally to provide you with accurate information about the performance differences between Ubuntu Server and Desktop distributions for your LLM Scaler setup, as well as details about validated kernel versions and documentation resources.
I will get back to you once I have the information available regarding your specific technical questions about Ubuntu performance optimization and kernel compatibility.
Thank you for your patience while I research this matter.
Best regards,
Dean R.
Intel Customer Support Technician
Hi @24MYP,
Let me address your questions:
- While we don't expect any limitations on AMD host platforms, the software enablement timeline that we shared for 2025 is for validated Linux containers on Intel Xeon-W series (2000 and 3000) and Intel Xeon-SP host platforms. So, you are welcome to use any other host of your choice, but we won't be able to help outside of what we have validated.
- From https://github.com/intel/llm-scaler/tree/main/vllm#11-install-bare-metal-environment,:~:text=platform_docker_file-,1.1%20Install%20Bare%20Metal%20Environment,-First%2C%20install%20a, both Ubuntu 25.04 Desktop (on Xeon-W) and Ubuntu 25.04 Server (on Xeon-SP) are validated.
- Kernels outside of the default one used by the validated Ubuntu distros are outside of our current validation effort. (see https://github.com/intel/llm-scaler/blob/main/vllm/FAQ.md#can-i-update-the-kernel-version-or-other-drivers-of-ubuntu-to-get-the-latest-fixes)
- For specific instructions on how to use each model, you can refer to the model's site. For example: https://huggingface.co/mistralai/Ministral-8B-Instruct-2410#:~:text=28.4-,Usage%20Examples,-vLLM%20(recommended).
- There is currently no specific Intel community forum for Arc Pro, but we will take it into consideration to create one. In the meantime, the best place to get help with Project Battlematrix is by posting here: https://github.com/intel/llm-scaler/issues
Also, these are a couple of links that you might find useful:
https://github.com/intel/llm-scaler/blob/main/vllm/FAQ.md
https://github.com/intel/llm-scaler/blob/main/vllm/KNOWN_ISSUES.md
Regards,
Esteban R
I am not using AMD CPUs. I am using Intel Xeon Gold 6430 CPUs, two of them.
I did wonder about cross NUMA performance. My motherboard layout places two cards/four GPUs per CPU.
I am seeing about 15-16 t/s on Llama 3.1 Instruct. Good, but lower than expected. Could this be a cross-NUMA issue?
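For reference, the check I had in mind is just reading the NUMA node that sysfs reports for each card (the PCI addresses below are placeholders; substitute whatever lspci shows for the B60s):

```python
# Map each B60 PCI device to the NUMA node (CPU socket) it hangs off.
from pathlib import Path

# Placeholder bus IDs -- use the addresses lspci reports for the B60 GPUs.
gpu_pci_addrs = ["0000:17:00.0", "0000:65:00.0", "0000:ca:00.0", "0000:e3:00.0"]

for addr in gpu_pci_addrs:
    node = Path(f"/sys/bus/pci/devices/{addr}/numa_node").read_text().strip()
    print(f"{addr} -> NUMA node {node}")  # -1 means the platform didn't report affinity
```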
I did find a flaw in the Docker "latest" image and had to revert to the older one on GitHub. It gave me a speedup to 18 t/s, but it produced hallucinations, not actual answers. I posted this flaw on the GitHub issues page. After reverting I am back to a smooth but somewhat slow 15 t/s. Feel free to check my issue on the official GitHub for more.
Also, Ubuntu 25.10 is out, but I am holding off on the upgrade until someone can confirm it will work and not cause issues.
I am also waiting for updated drivers to address performance as well as fan curves; xpu-smi seems to be in its infancy here.
That said, the system does work well and predictably.
Thank you.
Token throughput can be affected by multiple factors such as CPU performance, the number of GPUs being used, etc. My advice is to benchmark your system following the instructions here: https://github.com/intel/llm-scaler/tree/main/vllm#15-benchmarking-the-service:~:text=the%20API%20key.-,1.5%20Benchmarking%20the%20Service,-vllm%20bench%20serve and post the results at https://github.com/intel/llm-scaler/issues, where the developers of Project Battlematrix can take a look at them and advise you directly.
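The linked benchmark gives the authoritative numbers, but as a quick client-side sanity check (a rough sketch, assuming the container exposes the usual vLLM OpenAI-compatible endpoint; the port and model name below are placeholders), something like this gives an approximate tokens-per-second figure:

```python
# Rough tokens/sec sanity check against a vLLM OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

start = time.time()
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # whatever model the server is hosting
    messages=[{"role": "user", "content": "Explain NUMA in three sentences."}],
    max_tokens=256,
    stream=True,
)
chunks = sum(1 for _ in stream)  # roughly one token per streamed chunk
print(f"~{chunks / (time.time() - start):.1f} tokens/s (approximate)")
```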
Regards,
Esteban R