Intel® Distribution for Python*
Engage in discussions with community peers related to Python* applications and core computational packages.

Stackoverflow implicit feedback recommendation system

IVerg1

I am trying to build a user-item recommendation engine based on the favourite votes that Stackoverflow users give to questions.

The objective:

To build a webpage / IDE plugin where the user receives his top N recommended questions based on:

- his previous favourite votes on Stackoverflow

- the programming language he is currently using (this will be a filter on the question tag, e.g. only questions tagged java)
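The two requirements above can be separated: any model produces a score per question, and the language filter is applied on top. A minimal sketch of that last step, assuming the scores already exist; the function name `top_n`, the score values and the tag lookup are all hypothetical and only serve as an illustration:

```python
def top_n(scores, question_tags, n, tag=None):
    """Return the n highest-scoring question ids, optionally limited to one tag.

    scores:        dict question_id -> model score (higher is better)
    question_tags: dict question_id -> set of tag names
    """
    candidates = [
        (score, question)
        for question, score in scores.items()
        if tag is None or tag in question_tags.get(question, set())
    ]
    # Sort by descending score; break ties by question id for a stable result.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [question for _, question in candidates[:n]]


# Hypothetical scores for one user and a hypothetical tag lookup.
example_scores = {100: 0.9, 101: 0.8, 102: 0.4, 103: 0.7}
example_tags = {100: {"java"}, 101: {"python"}, 102: {"java"}, 103: {"java"}}

java_top2 = top_n(example_scores, example_tags, 2, tag="java")
```

Filtering after scoring keeps the model independent of the language, so one trained model can serve every tag.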

The input data:

I am using the Stackexchange data dump, which can be found here: https://archive.org/download/stackexchange. From there I've extracted the data that I thought would be useful:

Votes table (each User - Question pair represents a favourite vote for the question from the user):

UserId - QuestionId

Tags table:

QuestionId - TagId
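As an illustration of how the Votes pairs could be turned into something a recommender can use, here is a minimal sketch that streams the pairs into two lookup tables. The CSV layout, the header names `UserId` and `QuestionId`, and the sample rows are assumptions; the real export may look different.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed "UserId,QuestionId" CSV layout.
VOTES_CSV = """UserId,QuestionId
1,100
1,101
2,100
2,102
3,101
"""


def load_favourites(handle):
    """Read (user, question) pairs row by row into two set-valued indexes."""
    by_user = defaultdict(set)      # user id -> questions the user favourited
    by_question = defaultdict(set)  # question id -> users who favourited it
    for row in csv.DictReader(handle):
        user, question = int(row["UserId"]), int(row["QuestionId"])
        by_user[user].add(question)
        by_question[question].add(user)
    return by_user, by_question


by_user, by_question = load_favourites(io.StringIO(VOTES_CSV))
```

Reading one row at a time avoids holding the raw file in memory, although the two indexes themselves still grow with the number of votes.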

I also have a lot of details about each user/question which would make sense in a content-based approach. The only content I have used so far is the question tags.

Problems/Properties of the data:

- the data consists of implicit feedback -> a user either marked a question as favourite or he didn't (binary problem 0/1)

- the data set is quite large, so training and evaluating a model takes a lot of time (the votes CSV file is a few GB)

Progress so far:

So far I've tried a few different approaches, most of them are some sort of collaborative filtering:

- the first thing I tried was using cosine similarity to get top N question - question recommendations, just to test if the results are better than random

- then I tried Spark's Alternating Least Squares Matrix Factorisation model, but the results were also mediocre, because I am using implicit feedback data and ALS in its default configuration is built for explicit data

- I've also tried another MF model with a Bayesian Personalised Ranking loss function, which is better suited to implicit data. The library I used here is LightFM and the evaluation metric is ROC AUC: https://www.kaggle.com/iancuv/lightfm-demo?scriptVersionId=3670161
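To make the first and last items above concrete, here is a minimal sketch, not the code from the linked notebook and not LightFM's implementation. With 0/1 data, the cosine similarity of two questions reduces to the number of shared users divided by the square root of the product of their vote counts, and a per-user ROC AUC is the fraction of (held-out favourite, other question) pairs that the scores rank in the right order. All data and names below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two 0/1 vectors stored as sets of user ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def user_auc(scores, positives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for q, s in scores.items() if q in positives]
    neg = [s for q, s in scores.items() if q not in positives]
    if not pos or not neg:
        return float("nan")  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical training data: question id -> users who favourited it.
favourited_by = {100: {1, 2}, 101: {1, 3}, 102: {2}, 103: {2, 3}}
train_favourites = {100, 101}  # favourites of user 1 seen during training
held_out = {103}               # a favourite of user 1 hidden for evaluation

# Score every unseen question by its summed similarity to the known favourites.
scores = {
    q: sum(cosine(users, favourited_by[f]) for f in train_favourites)
    for q, users in favourited_by.items()
    if q not in train_favourites
}
auc = user_auc(scores, held_out)
```

An AUC of 0.5 corresponds to random ordering, so this gives a direct way to check whether a similarity baseline is better than random before moving on to a factorisation model.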

Open questions / suggestions:

Do you have any suggestions of some other approaches I should use?

How would you approach this problem?

What preprocessing of the data makes sense to achieve better results?

Are any of the techniques mentioned above a good choice for this problem?

Would a purely content-based approach make sense?

If yes, how can I improve the results?

I should also mention (you probably figured it out) that I'm a CS student, new to the AI/machine learning field. The only applications I've built in the past involved simple regression or classification, nothing as complicated as an implicit feedback recommendation system. I know the problem and questions above are very specific, but any help is very much appreciated.

Useful links:

http://lyst.github.io/lightfm/docs/home.html - LightFM 1.14 documentation

https://spark.apache.org/docs/latest/api/python/ - Spark Python API docs (PySpark)

https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/ - Alternating Least Squares (Data Science Made Simpler)

https://arxiv.org/pdf/1205.2618.pdf - Bayesian Personalised Ranking MF

http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf

 

idata
Employee
Hi,

 

We are closing this discussion since we do not handle these types of questions in our community.

 

If you have a question about Intel specific AI frameworks/tools, we would be happy to address your queries.

 

Thanks & Regards,

 

Sandhiya