
How Prediction Guard Delivers Trustworthy AI on Intel® Gaudi® 2 AI Accelerators

Adam_Wolf
Employee

With the growing use of open-source tools and software at the enterprise level, particularly for generative AI and large language models (LLMs), it is important to discuss the essential strategies and technologies required to implement secure, scalable, and efficient LLMs for enterprise applications. In this Intel webinar led by Daniel Whitenack, Ph.D., founder of Prediction Guard, and Rahul Unnikrishnan Nair, Engineering Lead at Intel Liftoff for Startups, we discuss the critical aspects of deploying LLMs using open models, ensuring data privacy, and maintaining high accuracy.

See the video: How Prediction Guard Delivers Trustworthy AI on Intel® Gaudi® 2 AI Accelerators


Key Requirements for Enterprise LLM Adoption

The webinar identifies three core requirements for successful enterprise LLM adoption: the use of open models, ensuring data privacy, and maintaining high accuracy. Open models, such as Llama 3 and Mistral, provide enterprises with the ability to download model weights and access inference code, offering greater control and customization. This contrasts with closed models, which are accessed through APIs without transparency into underlying processes. Ensuring data privacy is paramount, particularly as enterprises handle sensitive information such as personally identifiable information (PII) and protected health information (PHI); compliance with standards like HIPAA is often necessary in such cases. High accuracy is also essential, requiring robust mechanisms to validate LLM outputs against ground truth data and mitigate hallucinations, i.e., outputs that are grammatically correct and coherent but factually false or misleading.

 

[Figure: Enterprise expectations for LLM adoption]

Challenges with Closed Models

Closed models, such as those provided by OpenAI and Cohere, present several challenges. Enterprises cannot see how their inputs and outputs are processed, leading to potential biases and errors. Without transparency, users may encounter moderation errors and latency fluctuations without understanding the causes. Additionally, prompt injection attacks can exploit closed models to leak sensitive data, posing significant security risks. These issues underscore the importance of using open models for enterprise applications.

 

Prediction Guard's Approach

Prediction Guard’s platform addresses these challenges through a combination of secure hosting, robust safeguards, and performance optimizations. Secure hosting is achieved by hosting models in private environments within Intel® Tiber™ Developer Cloud, leveraging Intel® Gaudi® 2 AI accelerators for enhanced performance and cost-efficiency. Input filters are used to block prompt injections and mask or replace PII before it reaches the LLM. Output validators ensure the factual consistency of LLM outputs by comparing them against ground truth data.
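To make these safeguards concrete, here is a minimal, illustrative sketch in Python of an input filter and an output validator. It is not Prediction Guard's implementation: the regex patterns, the mask_pii and check_factuality helpers, and the overlap threshold are simplified assumptions standing in for production-grade PII detection and factual-consistency models.

    import re

    # Hypothetical input filter: mask common PII patterns before the prompt
    # reaches the LLM. Production systems typically combine NER models with rules.
    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def mask_pii(prompt: str) -> str:
        for label, pattern in PII_PATTERNS.items():
            prompt = pattern.sub(f"[{label}]", prompt)
        return prompt

    # Hypothetical output validator: flag responses that drift too far from a
    # ground truth reference (a stand-in for a trained factual-consistency model).
    def check_factuality(response: str, reference: str, threshold: float = 0.5) -> bool:
        response_tokens = set(response.lower().split())
        reference_tokens = set(reference.lower().split())
        if not response_tokens:
            return False
        overlap = len(response_tokens & reference_tokens) / len(response_tokens)
        return overlap >= threshold

    prompt = mask_pii("Summarize the claim filed by jane.doe@example.com, phone 555-123-4567.")
    # ...send `prompt` to the hosted LLM, then gate the response:
    # is_consistent = check_factuality(response, reference_document)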

 

Migrating to Intel® Gaudi® 2

Prediction Guard’s migration to Intel® Gaudi® 2 processors was executed in several phases, each addressing specific technical requirements and optimizations. During the initial migration phase (July to September 2023), custom model servers were deployed on bare metal Intel® Gaudi® 2 hardware. Using Optimum Habana, Prediction Guard swapped standard Hugging Face* classes for Gaudi® 2-optimized versions. Dynamic batching was implemented to handle bursts of usage, and static shapes were managed to optimize inference efficiency.

[Figure: Initial migration phase]
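A minimal sketch of this pattern is shown below, assuming a Gaudi machine with the Habana PyTorch bridge and the optimum-habana package installed; the model name, prompt, and padding length are placeholders, and exact generation options vary by optimum-habana release.

    import torch
    import habana_frameworks.torch.core  # loads the Habana PyTorch bridge so the "hpu" device is available
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

    # Swap the standard Hugging Face model classes for Gaudi-optimized implementations.
    adapt_transformers_to_gaudi()

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder open model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # enable padding for static shapes

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")
    model.eval()

    # Pad every prompt to a fixed length so the accelerator sees static shapes
    # instead of recompiling its graph for each new sequence length.
    inputs = tokenizer(
        "What safeguards matter most for enterprise LLM deployments?",
        return_tensors="pt",
        padding="max_length",
        max_length=256,
    ).to("hpu")

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=128)

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))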

The optimization phase (September 2023 to April 2024) involved load balancing across multiple Gaudi® 2 machines, optimizing prompt handling by bucketing similar-sized prompts and padding them for better throughput, and transitioning to the TGI Gaudi framework for streamlined model server management.

[Figure: Optimization phase]
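The bucketing idea can be sketched in a few lines: prompts are grouped by the smallest bucket that fits them, then padded to that bucket's length so every batch shares a single static shape. The bucket boundaries and helper names below are assumptions for illustration, not Prediction Guard's actual values.

    from collections import defaultdict

    # Hypothetical bucket boundaries in tokens; real values are tuned per workload.
    BUCKETS = [128, 256, 512, 1024]

    def bucket_for(length: int) -> int:
        """Return the smallest bucket that a prompt of this length fits into."""
        for size in BUCKETS:
            if length <= size:
                return size
        return BUCKETS[-1]  # longer prompts would need truncation in practice

    def group_prompts(tokenized_prompts):
        """Group tokenized prompts by bucket so each batch shares one padded shape."""
        grouped = defaultdict(list)
        for prompt in tokenized_prompts:
            grouped[bucket_for(len(prompt))].append(prompt)
        return grouped

    def pad_batch(prompts, size, pad_id=0):
        # Left-padding is typical for decoder-only generation; shown here for illustration.
        return [[pad_id] * (size - len(p)) + p for p in prompts]

    grouped = group_prompts([[101, 7, 9], [101] * 300, [101] * 40])
    batches = {size: pad_batch(prompts, size) for size, prompts in grouped.items()}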

In the current scaling phase (April 2024 to present), Prediction Guard has migrated to Kubernetes-based infrastructure within Intel® Tiber™ Developer Cloud, combining CPU and Gaudi node groups. Deployment automation, uptime and performance monitoring, and Cloudflare integration for DDoS protection and CDN services have also been implemented.

[Figure: Scaling phase]
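As a rough sketch of what targeting a mixed CPU/Gaudi cluster can look like, the following uses the official Kubernetes Python client to request the habana.ai/gaudi resource exposed by the Habana device plugin and pin model-server pods to a Gaudi node group. The image name, labels, namespace, and replica count are hypothetical, not Prediction Guard's configuration.

    from kubernetes import client, config

    config.load_kube_config()  # assumes a kubeconfig for the IKS cluster is available

    container = client.V1Container(
        name="llm-server",
        image="registry.example.com/llm-server:latest",  # placeholder image
        # Request one Gaudi accelerator via the Habana device plugin resource.
        resources=client.V1ResourceRequirements(limits={"habana.ai/gaudi": "1"}),
    )

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="llm-server"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "llm-server"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "llm-server"}),
                spec=client.V1PodSpec(
                    containers=[container],
                    node_selector={"node-group": "gaudi"},  # hypothetical node-group label
                ),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)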

Performance and Cost Benefits

The transition to Gaudi® 2 yielded significant improvements. Prediction Guard achieved a 2x increase in throughput for enterprise workloads and a 10x reduction in compute costs compared to its previous GPU solutions. Reducing latency to sub-200ms time-to-first-token places Prediction Guard at the forefront of industry performance. These cost savings were achieved without sacrificing performance, showcasing the cost-efficiency and scalability of Gaudi® 2.

 

Technical Insights and Recommendations

The speakers emphasized that a robust enterprise AI solution requires more than just access to an LLM API. Ensuring the accuracy and trustworthiness of outputs involves rigorous validation against ground truth data. Integrating sensitive data necessitates strong privacy and security measures, making data handling a critical consideration in AI system design. Prediction Guard’s phased approach to optimizing Gaudi® 2 usage provides a model for other developers. Starting with validating basic functionality, then incrementally optimizing and scaling based on performance metrics and user feedback, is key to successful deployment.

 

More on Technical Implementation

During the initial migration phase, managing static shapes involved configuring model servers to handle variable prompt lengths by padding them to predetermined sizes, optimizing memory and compute usage. Dynamic batching allowed the system to accumulate a window of requests and process them in bulk, improving throughput and reducing latency. In the optimization phase, load balancing across multiple Gaudi® 2 servers was implemented to manage traffic efficiently and avoid bottlenecks. Refining the handling of input prompts by categorizing them into buckets based on size and padding within each bucket further boosted performance. Transitioning to the TGI Gaudi framework streamlined model server management.
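A stripped-down version of such a dynamic batching loop might look like the sketch below; the window length, batch cap, and run_batch placeholder are illustrative assumptions rather than Prediction Guard's production code.

    import queue
    import threading
    import time

    request_queue: "queue.Queue[str]" = queue.Queue()

    def run_batch(prompts):
        # Placeholder for the actual Gaudi-backed batched inference call.
        print(f"processing batch of {len(prompts)} prompts")

    def batching_loop(window_s: float = 0.05, max_batch: int = 16):
        """Collect requests for a short window (or until max_batch), then run them together."""
        while True:
            batch = [request_queue.get()]  # block until at least one request arrives
            deadline = time.monotonic() + window_s
            while len(batch) < max_batch and time.monotonic() < deadline:
                try:
                    batch.append(request_queue.get(timeout=deadline - time.monotonic()))
                except queue.Empty:
                    break
            run_batch(batch)

    threading.Thread(target=batching_loop, daemon=True).start()
    for i in range(20):
        request_queue.put(f"prompt {i}")
    time.sleep(0.5)  # give the batcher time to drain the queue in this demo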

In the scaling phase, deploying an Intel Kubernetes Service (IKS) cluster that combines CPU and Gaudi node groups facilitated scalable and resilient deployment. Automation for deployment processes and implementing monitoring tools ensured high availability and performance. Configuring inference parameters and managing key-value caches optimized model serving efficiency.

 

Practical Implementation Tips

For developers and enterprises looking to implement similar AI solutions, starting with open models to retain control and customization capabilities is recommended. Ensuring that sensitive data is handled securely and in compliance with relevant standards is critical. Adopting a phased approach to optimization, starting with basic functionality and gradually refining performance based on metrics and feedback, is also key to successful deployment. Lastly, leveraging frameworks like Optimum Habana and TGI Gaudi can streamline integration and optimization efforts.

 

Conclusion

Prediction Guard’s comprehensive approach, in collaboration with Intel, showcases how enterprises can deploy secure, scalable, and efficient AI solutions. By leveraging Intel® Gaudi® 2 and Intel® Tiber™ Developer Cloud, Prediction Guard provides a robust platform for enterprise AI adoption, addressing critical concerns around control, customization, data privacy, and accuracy. The technical insights and practical recommendations shared in the webinar offer valuable guidance for developers and enterprises navigating the complexities of LLM deployment.

We also encourage you to check out Intel’s other AI Tools and framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

 

About the Speakers

Daniel Whitenack

Founder and CEO, Prediction Guard

Daniel Whitenack (aka Data Dan) is a Ph.D.-trained data scientist and founder of Prediction Guard. He has more than ten years of experience developing and deploying machine learning models at scale, and he has built data teams at two startups and an international NGO with 4,000+ staff. Daniel co-hosts the Practical AI podcast, has spoken at conferences around the world, and occasionally teaches data science/analytics at Purdue University.

 

Rahul Unnikrishnan Nair

Engineering Lead, Intel® Liftoff for Startups

As the Engineering Lead at Intel® Liftoff, Rahul brings his extensive experience in applied AI and engineering to mentor early-stage AI startups. With over a decade of experience in machine learning and applied deep learning, including significant work in generative AI, his dedication lies in helping these startups transform their innovative ideas into fully-fledged, market-ready products with a strong emphasis on use-case-driven, practical engineering and optimization.

 

About the Author
AI Software Marketing Engineer creating insightful content about the cutting-edge AI and ML technologies and software tools coming out of Intel.