Posted on behalf of Harvey Johnson
I’m Harvey Johnson, a second-year ECE master’s student at the University of Nottingham and an Intel® Student Ambassador for oneAPI. Ever since I learnt to read and write, I have been fascinated by computers and how computer programs work, and more specifically by how to make code run fast. Over the years I have progressed from programming in Python to mostly C/C++, with some assembly mixed in where necessary. I program as a hobby in my spare time and work on whatever interests me, so I mostly do small projects or make moderate contributions to larger ones.
What it’s like being an Intel® Student Ambassador for oneAPI
One of the reasons I joined the Intel® Student Ambassador Program for oneAPI was to get the inside track on software. The program has proved very beneficial for this: we get training and support resources on oneAPI and Intel hardware that are not available to the public, along with regularly scheduled conference calls where the material presented is interesting and informative. I also had the opportunity to appear on a podcast hosted by Intel engineers, where Josh and I talked about our experiences navigating the world of accelerated computing and how we landed on the solution we use for Gavin. To crown it all, I was awarded the 2022 oneAPI Student Ambassador of the Year.
You may be wondering: what is Gavin? This article mentions Gavin a lot. Gavin is a small project aimed at producing a semi-coherent NLP/LLM that Josh and I have been working on over the past few years. I don’t touch most of the fancy AI stuff and instead prefer to solve problems on the back end, such as dataset pre-processing and distribution for training. Gavin is not meant to be state of the art or high performing; it is a vehicle for us to learn about various aspects of programming and to have fun testing things out.
Dataset storage and pre-processing are important aspects of AI/ML model training. A high-quality dataset needs to be curated, and it needs to be accessible in a reasonable amount of time. Previously, Gavin used base64-encoded files to store pickled versions of the encoded data. These were cumbersome to load and slowed training considerably because of the large serialization/deserialization (SerDes) overhead and the mechanisms that were in place to handle the data.
I solved this problem by completely rewriting how we handle our training data for Gavin, using some simple tricks to decrease file size and allow near-zero-overhead access to the samples during training. The first step was to overhaul how the dataset is stored, using a custom file type that stores the samples in mixed precision so that each sample uses the lowest number of bits possible.
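As a rough illustration of the mixed-precision idea (not Gavin's actual file spec, which is not described here), the sketch below packs each sample with the narrowest unsigned integer width that fits its largest value. The 5-byte header layout and the names `pack_sample` and `unpack_sample` are assumptions made up for this example.

```python
import struct

# Hypothetical width table: (largest value that fits, struct code, bytes per value)
_WIDTHS = [(0xFF, "B", 1), (0xFFFF, "H", 2), (0xFFFFFFFF, "I", 4)]
_CODES = {width: code for _, code, width in _WIDTHS}
_HEADER = struct.Struct("<BI")  # width tag (1 byte) + value count (4 bytes)


def pack_sample(tokens):
    """Pack one sample of non-negative ints using the narrowest width that fits."""
    if any(t < 0 for t in tokens):
        raise ValueError("token IDs must be non-negative")
    largest = max(tokens, default=0)
    for limit, code, width in _WIDTHS:
        if largest <= limit:
            body = struct.pack(f"<{len(tokens)}{code}", *tokens)
            return _HEADER.pack(width, len(tokens)) + body
    raise ValueError("token ID does not fit in 32 bits")


def unpack_sample(blob):
    """Inverse of pack_sample: read the header, then decode at the stored width."""
    width, count = _HEADER.unpack_from(blob, 0)
    return list(struct.unpack_from(f"<{count}{_CODES[width]}", blob, _HEADER.size))
```

With this layout, a sample whose values all fit in one byte costs one byte per value, while a sample containing a single large ID is stored at four bytes per value; only that sample pays the wider cost.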
Once the file spec was designed, I created a library of C++ functions to read, write and modify the file, with bindings for Python. The next step was to encapsulate the file in a class that exposes itself much like a NumPy array, so that the programmer can interact with the file as if it were an array stored in memory. This was done because most operations on the data were not mathematical but simple reads and writes, and the servers Gavin was trained on can sustain tens of GB/s of reads and writes from SSDs, so the abstraction adds minimal overhead.
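The array-like wrapper can be sketched in pure Python as below. This is only an illustration under simplifying assumptions: the real library is C++ with Python bindings and uses mixed precision, whereas this hypothetical `FileBackedArray` class assumes a non-empty file of fixed-width little-endian 16-bit values and supports integer indices only.

```python
import mmap
import struct


class FileBackedArray:
    """Array-like view over a binary file of unsigned 16-bit values (assumed layout)."""

    ITEM = struct.Struct("<H")

    def __init__(self, path):
        # Map the whole file; reads and writes go through the OS page cache.
        self._fh = open(path, "r+b")
        self._map = mmap.mmap(self._fh.fileno(), 0)

    def __len__(self):
        return len(self._map) // self.ITEM.size

    def _offset(self, index):
        # Translate an element index (negative allowed) into a byte offset.
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        return index * self.ITEM.size

    def __getitem__(self, index):
        return self.ITEM.unpack_from(self._map, self._offset(index))[0]

    def __setitem__(self, index, value):
        self.ITEM.pack_into(self._map, self._offset(index), value)

    def close(self):
        self._map.flush()
        self._map.close()
        self._fh.close()
```

Code that uses the object can then write `arr[i]` or `arr[i] = value` without knowing that the data lives on disk, which is the point of the abstraction.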
These modifications not only reduced the size of our dataset but also allowed much faster iteration on the dataset and fine-grained control over loading it through TF generator-style abstractions.
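A generator-style loader over any indexable dataset might look like the following minimal sketch. The function name `batch_generator` and its parameters are hypothetical and not taken from Gavin's code; the `start` and `stop` bounds stand in for the kind of fine-grained control described above.

```python
def batch_generator(samples, batch_size, start=0, stop=None):
    """Lazily yield lists of samples from samples[start:stop], batch_size at a time."""
    stop = len(samples) if stop is None else min(stop, len(samples))
    for begin in range(start, stop, batch_size):
        end = min(begin + batch_size, stop)
        # Index one sample at a time so a file-backed object only needs __getitem__.
        yield [samples[i] for i in range(begin, end)]
```

Because batches are produced on demand, only the current batch needs to be in memory. In TensorFlow, a generator like this could be wrapped with `tf.data.Dataset.from_generator`, although whether Gavin does exactly that is not stated here.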
The next step is modifying the tokenization algorithm. This work is still ongoing, and various methods are being tested. The goal is to maximize the amount of data that can be represented by a single encode and to minimize the total number of encodes produced. Methods such as BPE and WordPiece have been implemented and evaluated for their performance and memory efficiency, both in building the vocab and in encoding/decoding the data.
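For readers unfamiliar with BPE, here is a minimal textbook-style sketch of how merges are learnt. It is not Gavin's implementation; the function names and the whitespace-based word splitting are assumptions made for illustration. Each merge replaces the most frequent adjacent pair of symbols with a single new symbol, so fewer encodes are needed to represent the same text.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency; return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` in every word with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words
```

On the toy corpus `"aaab aaab ab"`, two merges reduce the total symbol count from 10 to 5, which shows the trade-off being tuned: a larger vocab in exchange for fewer encodes per sample.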
How Intel Tools Have Helped
Intel’s oneAPI toolkits have been invaluable in helping to optimize the dataset pipeline for Gavin. First, I must acknowledge that building everything custom is no longer the optimal approach; we did it that way because Gavin’s purpose is to be a vehicle for learning.
Intel® VTune™ Profiler has proven very helpful in showing where the bottlenecks in our code are. It shows how long certain function calls take and what percentage of function calls each function accounts for, which lets us track and modify the algorithms to minimize branching and thus improve performance, and to find pain points in execution time and target them for optimization. This, coupled with smart algorithm design and a philosophy of less is more (in terms of functions), has made the toolset remarkably fast compared to what used to be in place for handling the dataset. A good example is the file class, which has functions to serialize the data. The initial implementation was much slower than the final one because of how it did precision checks. Intel® VTune™ Profiler indicated that the function was taking a considerable amount of time to execute, which prompted a closer look at the inefficient code.
Another aspect of oneAPI we decided to use for dataset management was SYCL. We used it for our implementation of the BPE algorithm, where I attempted to offload the work to the GPU to speed up building the encodes. This was successful: a single 3090 was ~11x faster than a 12900KS at building encodes, because the BPE algorithm lends itself very well to vectorization. Although this is no longer our preferred method due to the move to WordPiece-style encoding, it is still a good example of where, for a time, SYCL proved advantageous. It allowed us to iterate rapidly on datasets and, more importantly, to ingest and package the massive amount of raw data we use (multiple TBs) on relatively cheap consumer hardware in a bearable amount of time.
Although we currently rely on Intel tools for only a portion of the project, they have allowed us to explore options for optimization and to identify areas of improvement in our code. They provided an 11x speedup for a portion of our dataset pre-processing workflow and let us build scalable and, more importantly, cross-vendor code that can be deployed to all the hardware we use for testing.
Learn more about the Intel® Student Ambassador Program for oneAPI
Learn about oneAPI, a simplified, unified, cross-architecture programming model: software.intel.com/oneapi