Reddit Relevant XKCD
So I embarked on a machine learning quest as a senior project. With all these advances in AI, I wanted to have some fun myself :P. You can view it here.
I was inspired by Dan Zhang and Megan Ruthven’s Relevant XKCD finder. They used ExplainXKCD to populate their dataset. It’s pretty good, but could it be better?
I wondered if Reddit’s comment corpus could help me build a more accurate model. Once in a while I’d see users helpfully reply to a comment with a “Relevant XKCD” link. With all that data out there, I thought it would be a good way to try out machine learning techniques.
I used Python and the scikit-learn library to create my multi-class text-classification model.
Essentially, I wanted my model to take a series of words (a comment) and output a number: the ID of the most relevant XKCD comic. Each comic ID acts as one class.
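To make the idea concrete, here is a minimal sketch of that kind of classifier in scikit-learn. It is not the actual project code: the four training comments are made up for illustration, the helper name predict_xkcd is hypothetical, and the TF-IDF plus MultinomialNB pipeline is just one reasonable assumption about how the pieces could fit together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data (assumption): each comment is paired with the ID of the
# XKCD comic that someone might have linked in reply to it.
comments = [
    "someone is wrong on the internet again",
    "I cannot go to bed, someone is wrong on the internet",
    "little bobby tables dropped the students database",
    "sanitize your database inputs or lose your tables",
]
xkcd_ids = [386, 386, 327, 327]

# Turn each comment into TF-IDF word features, then fit a multinomial
# Naive Bayes classifier that treats every comic ID as a separate class.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(comments, xkcd_ids)


def predict_xkcd(text):
    """Return the predicted XKCD ID for one piece of text (hypothetical helper)."""
    return int(model.predict([text])[0])


print(predict_xkcd("someone on the internet is wrong"))
print(predict_xkcd("who dropped the database tables"))
```

With real Reddit data the label set would be much larger (one class per linked comic), which is where model choice and memory usage start to matter.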
Here are my posts (that I may or may not write):
Retrieving Reddit data from Google BigQuery
Finding the Right Model (MultinomialNB and SGDClassifier – SKlearn)
Loading the Model (and Complaining about Memory Usage)