Loading the Model (and Complaining about Memory Usage)

How I loaded the files in Python

I used joblib to save the vocabulary, the tf-idf transformer, and the trained classifier:

from sklearn.externals import joblib  # in newer scikit-learn, use: import joblib

# save the vocabulary, the tf-idf weights, and the trained classifier
feature_list = count_vect.get_feature_names()
model = "model105"
joblib.dump(feature_list, model + '_vocabulary.pkl')
joblib.dump(tfidf_transformer, model + '_transform.pkl')
joblib.dump(clf, model + '.pkl', compress=9)  # compress=9: smallest file on disk, slowest to load

and joblib to load everything back into memory:

from sklearn.externals import joblib  # in newer scikit-learn, use: import joblib
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

# rebuild the vectorizer from the saved vocabulary instead of refitting it
count_vect = CountVectorizer(tokenizer=LemmaTokenizer(),
                             vocabulary=joblib.load('model105_vocabulary.pkl'))
tfidf_transformer = joblib.load('model105_transform.pkl')
clf = joblib.load('model105.pkl')
clf.densify()  # convert sparse coefficients to dense for faster prediction
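
With everything loaded, classifying a new document is the usual scikit-learn transform chain. A quick sketch (the actual labels depend on how the classifier was trained):

docs = ["An example document to classify."]
counts = count_vect.transform(docs)          # map tokens to the saved vocabulary
tfidf = tfidf_transformer.transform(counts)  # apply the saved tf-idf weights
print(clf.predict(tfidf))                    # predicted label(s)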

I used LemmaTokenizer to tokenize the text. (It's actually a lemmatizer rather than a stemmer: it maps each word to its dictionary form instead of just chopping off suffixes.)
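
The difference is easy to see side by side; a small comparison (needs NLTK's wordnet corpus downloaded):

from nltk.stem import WordNetLemmatizer, PorterStemmer

wnl = WordNetLemmatizer()
stemmer = PorterStemmer()
print(wnl.lemmatize("studies"))   # 'study'  - a real dictionary form
print(stemmer.stem("studies"))    # 'studi'  - a truncated stem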

Memory Usage

It turns out that the model, 300MB on disk, took around 800MB of RAM once loaded. That ruled out Heroku: the slug size limit is 500MB, and a free dyno only gets 512MB of RAM. Very frustrating. The only way to host this was a real server on AWS, Google Cloud or DigitalOcean.
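
If you want to measure this yourself, here's a rough sketch using psutil (the RSS delta is approximate, since the interpreter allocates memory for other reasons too):

import os
import joblib
import psutil

process = psutil.Process(os.getpid())
before = process.memory_info().rss          # resident memory before loading
clf = joblib.load('model105.pkl')
after = process.memory_info().rss           # resident memory after loading
print("Model RAM usage: ~%d MB" % ((after - before) / 1024 / 1024))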

I decided to wrap the model in a Django website and voilà, I'm done!
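
For reference, the core of that wrapper is small. A minimal sketch of a views.py, assuming the pickle files above sit next to it and that LemmaTokenizer lives in a local tokenizers module (that module name, the predict view, and the text query parameter are all illustrative; the URL wiring is omitted):

# views.py -- minimal sketch, not the full project
import joblib
from django.http import JsonResponse
from sklearn.feature_extraction.text import CountVectorizer

from .tokenizers import LemmaTokenizer  # hypothetical module holding the class above

# load once at import time, so every request reuses the same objects
count_vect = CountVectorizer(tokenizer=LemmaTokenizer(),
                             vocabulary=joblib.load('model105_vocabulary.pkl'))
tfidf_transformer = joblib.load('model105_transform.pkl')
clf = joblib.load('model105.pkl')

def predict(request):
    text = request.GET.get('text', '')
    tfidf = tfidf_transformer.transform(count_vect.transform([text]))
    return JsonResponse({'label': str(clf.predict(tfidf)[0])})

Loading at module level means each worker process pays the 800MB once, rather than reloading the model on every request.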