Machine Learning

Final Thoughts About RelevantXKCD project

Kevin Fang

20 May 2017 — 2 min read


After I wrapped the model in a simple Django server and posted it on Reddit in /r/xkcd, I got some great feedback!

For the Model

As /u/drcopus helpfully pointed out, I should've done data augmentation on the original training set. I could've taken the comments, substituted synonyms, and even split each comment into sentences so that every sentence became its own training example.

He also mentioned that I should've fine-tuned an existing neural network rather than creating one from scratch. I thought the <1% result was fishy, so now I know that it just needs to run longer (which means that training would take a very long time).

Essentially, building a model from scratch takes a lot of time, and it needs a lot of processing time before progress can be accurately gauged.
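To make the augmentation idea concrete, here is a minimal sketch of the two tricks (sentence splitting and synonym substitution). It is not code from the project: the `SYNONYMS` table, the function names, and the `(text, comic_id)` example format are all assumptions made for illustration, and a real setup would probably pull synonyms from a proper thesaurus.

```python
import re

# Hypothetical toy synonym table (an assumption, not the project's data).
SYNONYMS = {
    "funny": ["hilarious", "amusing"],
    "comic": ["strip"],
}

def split_sentences(comment):
    """Split a comment into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in parts if p]

def substitute_synonyms(sentence, synonyms=SYNONYMS):
    """Return variants of `sentence`, each with one word swapped for a synonym."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        core = word.rstrip(".,!?")      # word without trailing punctuation
        suffix = word[len(core):]       # the trailing punctuation, if any
        for alt in synonyms.get(core.lower(), []):
            variants.append(" ".join(words[:i] + [alt + suffix] + words[i + 1:]))
    return variants

def augment(comment, comic_id, synonyms=SYNONYMS):
    """Turn one (comment, comic_id) pair into several training examples."""
    examples = []
    for sentence in split_sentences(comment):
        examples.append((sentence, comic_id))
        for variant in substitute_synonyms(sentence, synonyms):
            examples.append((variant, comic_id))
    return examples
```

With this sketch, a call like `augment("This comic is funny. Relevant!", 327)` turns one comment into five examples for the same comic: the two original sentences plus three synonym variants of the first one.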

Flaws

I think the model works pretty well, although I've noticed that the top result usually isn't the "best" one, mainly because people tended to search with one-word queries...

Also, not all comics were equally represented. Luckily, of the 1773 comics in the training data, most (1716) had at least 2 training examples. But fewer than 50% had 20 or more. This biased the system, since more training examples = more vocabulary = higher probability of that comic showing up.

2 or more times: 1716 (96.78%)
5 or more times: 1467 (82.74%)
20 or more times: 793 (44.72%)
50 or more times: 392 (22.10%)
200 or more times: 116 (6.54%)
1000 or more times: 12 (0.67%)
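A breakdown like the one above can be computed with a simple counter. The sketch below is not the original script: `coverage_by_threshold` is a hypothetical name, and it assumes the training data can be reduced to a flat list with one comic ID per training example.

```python
from collections import Counter

def coverage_by_threshold(comic_ids, thresholds=(2, 5, 20, 50, 200, 1000)):
    """For each threshold, count the comics with at least that many examples.

    `comic_ids` holds one comic ID per training example (an assumed format).
    Returns a list of (threshold, number_of_comics, percent_of_comics) tuples.
    """
    counts = Counter(comic_ids)   # comic ID -> number of training examples
    total = len(counts)           # comics that appear at least once
    rows = []
    for threshold in thresholds:
        n = sum(1 for c in counts.values() if c >= threshold)
        percent = 100.0 * n / total if total else 0.0
        rows.append((threshold, n, percent))
    return rows
```

Comics with no examples at all never enter the counter, so they are left out of the percentages, the same way the list of missing IDs is kept separate from the numbers above.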

Here is a list of all comic IDs that had NO data (and therefore would never be shown). These aren't part of the comic percentages above.

Some of these are really good, but obviously they aren't the popular ones.

187 213 223 347 372 437 474 510 536 618 711 744 812 823 825 930 991 999 1006 1359 1466 1522 1556 1574 1596 1631 1648 1651 1699 1713 1733 1746 1754 1762 1778 1780 1783 1784 1798 1800 1802 1805 1811 anything higher than 1811
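Finding the gaps is a set difference between every comic ID and the IDs that show up in the training data. This is only a sketch under assumptions: `comics_without_data` and its parameters are hypothetical names, and `latest_id` stands for whatever the newest comic number was when the data was collected.

```python
def comics_without_data(seen_ids, latest_id):
    """Return the comic IDs from 1 to latest_id that have no training examples.

    `seen_ids` is any iterable of comic IDs found in the training data
    (duplicates are fine); `latest_id` is an assumed upper bound.
    """
    seen = set(seen_ids)
    return [comic_id for comic_id in range(1, latest_id + 1) if comic_id not in seen]
```

For example, `comics_without_data([1, 2, 4, 4, 6], 7)` returns `[3, 5, 7]`.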

Slightly Unrelated Stuff...

So over that week, I compiled the top queries:

Everything else was quite narrow, so I decided against showing those queries. The top queries were expected, because the top 3 Reddit comments said:

I had a good laugh at the 'dragon' result

Try "desolate"

It also shows that Bobby Tables is a classic (sql).

<side> Funnily enough, a week later XKCD's newest comic was about Machine Learning. Pretty much what I did lol. </side>
