The Intel RecSys2023 team ranked #2 in the industry track of RecSys Challenge 2023, presenting a state-of-the-art privacy-preserving recommendation system with graph-enhanced feature engineering.
The team of nine machine learning and data engineering experts participated in the challenge for the third consecutive year, after ranking #13 in 2021 and #4 in 2022. The results demonstrate the potential of applying Intel data and AI products and reference solutions to real-world recommendation system challenges, an area that represents a multi-billion-dollar industry.
The RecSys Challenge has become a key event at the ACM Conference on Recommender Systems. It has attracted thousands of participants from industry and academia and has allowed researchers and practitioners to benchmark their work against each other in a friendly and open setting. This year's challenge was organized by ShareChat, IIM Visakhapatnam, Huawei, and Amazon. Based on the data provided by ShareChat, the goal was to predict the probability that an advertisement impression resulted in click-through and, subsequently, an installation.
In pursuit of Intel's "AI Everywhere" initiative, the team published the solution as an open source project on GitHub.
Solution Overview
Privacy Preserved Feature Engineering
Because the dataset was privacy-preserving, the semantics of the individual features were not provided, so the team could not directly apply classical feature engineering methods. Inspired by the Intel Auto Feature Engineering Workflow, the team proposed a novel feature engineering pipeline designed specifically for privacy-preserving recommendation systems, which extracts underlying information from the feature distributions to enrich the features' expressiveness. The proposed pipeline included three main steps: (1) data analysis and classification, which analyzes the basic properties of the features and classifies them into several roles, setting up the baseline for later feature engineering; (2) massive feature engineering, which adopts and improves existing feature engineering methods for categorical features, dense features, and time features; and (3) feature selection, which selects the minimum set of features required for final training and prediction. Full details are in the team's paper.
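To make the three steps concrete, here is a minimal sketch in plain Python. It is not the team's pipeline: the function names, the `max_categories` threshold, the use of count encoding, and the "drop constant columns" selection rule are all assumptions chosen for illustration, and the column names `f_1` to `f_3` are placeholders for anonymized features.

```python
from collections import Counter

def classify_columns(rows, max_categories=3):
    """Step 1 (assumed rule): label a column 'categorical' when it has few
    distinct values, otherwise 'dense'. No feature semantics are needed."""
    roles = {}
    for col in rows[0]:
        distinct = {row[col] for row in rows}
        roles[col] = "categorical" if len(distinct) <= max_categories else "dense"
    return roles

def add_count_encoding(rows, roles):
    """Step 2 (one example technique): add '<col>_count', the number of rows
    that share the same value in each categorical column."""
    out = [dict(row) for row in rows]
    for col, role in roles.items():
        if role != "categorical":
            continue
        counts = Counter(row[col] for row in rows)
        for row in out:
            row[f"{col}_count"] = counts[row[col]]
    return out

def drop_constant_features(rows):
    """Step 3 (deliberately simple selection): drop single-valued columns."""
    keep = [c for c in rows[0] if len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]

# Toy anonymized data: nothing is known about what f_1..f_3 mean.
rows = [
    {"f_1": 7, "f_2": 0.11, "f_3": 1},
    {"f_1": 7, "f_2": 0.52, "f_3": 1},
    {"f_1": 9, "f_2": 0.73, "f_3": 1},
    {"f_1": 4, "f_2": 0.94, "f_3": 1},
]
roles = classify_columns(rows)
features = drop_constant_features(add_count_encoding(rows, roles))
```

In this toy run, `f_3` and its count feature are constant and get dropped, while `f_1_count` survives as a new feature derived purely from the value distribution.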
Figure 1. Privacy-preserved feature engineering overview.
Graph Neural Network (GNN) Enhancement
GNNs are commonly used in recommendation systems. A GNN can learn a representation of each node in a graph, known as a node embedding. The team proposed GNN-enhanced feature engineering inspired by the Intel GNN and Analytics Workflow and the Intel Fraud Detection Reference Kit. This solution generates two graphs, a bipartite graph and a similarity graph, based on the role identification. A GNN is then employed to learn the relationships in both a self-supervised and a supervised way before finally generating an embedding for each impression. These embeddings serve as new features to enhance the dataset. As designed and implemented, this enhancement efficiently captures the underlying information from different roles and impressions, further improving final accuracy.
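The post does not spell out how the two graphs are built, so the following is only an illustrative sketch of one plausible construction. It assumes the bipartite graph links each impression to the values it takes in a few role columns, and that the similarity graph links impressions sharing at least `min_shared` of those values. The function names, the `min_shared` threshold, and the column names are hypothetical, and the GNN training step itself is omitted.

```python
from collections import defaultdict
from itertools import combinations

def build_bipartite_edges(impressions, role_columns):
    """Connect each impression node to one node per (column, value) pair."""
    edges = []
    for imp_id, row in impressions.items():
        for col in role_columns:
            edges.append((imp_id, (col, row[col])))
    return edges

def build_similarity_edges(bipartite_edges, min_shared=2):
    """Link two impressions when they share at least `min_shared` value nodes."""
    members = defaultdict(set)
    for imp_id, value_node in bipartite_edges:
        members[value_node].add(imp_id)
    shared = defaultdict(int)
    for imps in members.values():
        # Count, for every impression pair, how many value nodes they share.
        for a, b in combinations(sorted(imps), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, n in shared.items() if n >= min_shared)

# Toy impressions with two anonymized role columns.
impressions = {
    "imp_0": {"f_2": "a", "f_4": "x"},
    "imp_1": {"f_2": "a", "f_4": "x"},
    "imp_2": {"f_2": "a", "f_4": "y"},
}
bipartite = build_bipartite_edges(impressions, ["f_2", "f_4"])
similarity = build_similarity_edges(bipartite)
```

Edge lists like these could then be loaded into a graph library and passed to a GNN to produce per-impression embeddings, which is the part of the workflow this sketch leaves out.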
Figure 2. Bipartite graph and similarity graph.
Scalable and Extensible End-to-End Ensemble
In the final stage of the challenge, the team developed a comprehensive ensemble method to further improve the solution's performance while maintaining the scalability and efficiency of each sub-task. With this ensemble strategy, the solution can expand seamlessly when new datasets or models are introduced, without retraining the original models. The method also capitalizes on the advantages of multiple models, further enhancing overall performance. The submission was an ensemble of three models, each a different gradient-boosted decision tree model trained on a different feature set generated from the two feature engineering methods. By taking the weighted sum of the three model outputs, the ensemble achieved a Normalized Cross Entropy (NCE) score of 5.89, which resulted in the #2 ranking on the final leaderboard.
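As a rough illustration of the weighted-sum step, the sketch below blends three sets of predicted probabilities and scores the result. The weights and predictions are made-up values, not the team's. The metric shown is one common definition of normalized cross entropy (log loss divided by the log loss of always predicting the base rate); the challenge's exact scaling is not given in this post, so the sketch does not attempt to reproduce the 5.89 figure.

```python
import math

def weighted_ensemble(predictions, weights):
    """Blend per-model probabilities with a normalized weighted sum."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n)
    ]

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def normalized_cross_entropy(labels, probs):
    """Log loss relative to a baseline that always predicts the base rate.
    Assumes the labels contain both classes; lower is better, 1.0 = baseline."""
    base_rate = sum(labels) / len(labels)
    baseline = log_loss(labels, [base_rate] * len(labels))
    return log_loss(labels, probs) / baseline

labels = [1, 0, 1, 0]
model_outputs = [          # hypothetical outputs of three models
    [0.9, 0.2, 0.8, 0.1],
    [0.7, 0.4, 0.6, 0.3],
    [0.8, 0.3, 0.7, 0.2],
]
weights = [0.5, 0.3, 0.2]  # illustrative weights only
blended = weighted_ensemble(model_outputs, weights)
score = normalized_cross_entropy(labels, blended)
```

Because the blend only needs each model's output, a new model can be added by appending its predictions and a weight, without retraining the existing models.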
Figure 3. The overall ensemble pipeline.
Riding the Reference Workflows
Intel AI Reference Kits focus on solving domain-specific problems across various industries. Each kit includes model code, training data, instructions for the machine learning pipeline, libraries, and oneAPI components for cross-architecture performance. Intel's submission for this challenge leveraged the two workflows below:
Intel Auto Feature Engineering Workflow
The Auto Feature Engineering Workflow was used to facilitate iterative data processing and feature engineering. This workflow automatically analyzes feature attributes and generates new features for tabular datasets to improve data expressivity, training accuracy, and developer efficiency. Its feature analysis and feature engineering utilities were especially useful in the privacy-preserving feature engineering effort.
Intel GNN and Analytics Workflow
Graph Neural Networks (GNNs) are effective models for generating node/edge embeddings that can be used as rich features to improve the accuracy of various tasks. The end-to-end Intel GNN and Analytics Workflow reads tabular data, transforms it into a graph format, and then uses a GNN to learn embeddings that can be used as rich features in a downstream task. These capabilities have shown excellent performance in fraud detection applications and were leveraged in this challenge to help enhance feature engineering.
The Team
The Intel RecSys2023 Team: Xue, Chendi; Wang, Xinyao; Zhou, Yu; Zhang, Jian; Palangappa, Poovaiah M; Motwani, Ravi H; Brugarolas Brufau, Rita; Kakne, Aasavari Dhananjay; Ding, Ke.