Text classification with scikit-learn: how to get a representation of a new document from a pickle model


Eugenio

I have a binary document classifier that takes a tf-idf representation of a set of training documents and applies logistic regression to it:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

I save the model with pickle and use it to classify new documents:

text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)

How can I get the representation (features + frequencies) that the model uses for this new document without explicitly computing it?

EDIT: To explain better what I'm trying to get: I used predict_proba, and as I understand it, new documents are represented as vectors of term frequencies (according to the rules used in the stored model), and these frequencies are multiplied by the coefficients learned by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of the new document as used by predict_proba?

I am using sklearn v 0.19

Vivek Kumar

As mentioned in my comment, you need to access the TfidfVectorizer from inside the pipeline. This can easily be done with:

tfidfVect = text_model.named_steps['vect']

Now you can use the vectorizer's transform() method to get the tf-idf values:

tfidf_vals = tfidfVect.transform(new_document)

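tfidf_vals will be a sparse matrix with one row per document in new_document (a single row for a single document) and one column per term in the fitted vocabulary; the non-zero entries are the tf-idf values of the terms found in new_document. To check which term each column corresponds to, use tfidfVect.get_feature_names() (renamed get_feature_names_out() in later scikit-learn releases).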
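To make this concrete, here is a minimal sketch that pairs each term found in a new document with its tf-idf value and the coefficient the classifier learned for it. The training data, the helper name describe_document, and the default TfidfVectorizer settings are assumptions made for illustration; in your case text_model would come from the pickle file instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, made up purely for illustration.
X_train = [
    "good movie great plot",
    "great acting good fun",
    "bad movie boring plot",
    "boring acting bad script",
]
y_train = [1, 1, 0, 0]

# Stand-in for the pickled pipeline from the question.
text_model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
text_model.fit(X_train, y_train)


def describe_document(model, document):
    """Return {term: (tfidf_value, coefficient)} for one raw text document.

    Hypothetical helper, not part of scikit-learn.
    """
    vect = model.named_steps["vect"]
    clf = model.named_steps["clf"]
    # transform() expects an iterable of documents, so wrap the string in a list.
    row = vect.transform([document])  # sparse matrix of shape (1, n_features)
    # Newer releases provide get_feature_names_out(); older ones get_feature_names().
    if hasattr(vect, "get_feature_names_out"):
        terms = vect.get_feature_names_out()
    else:
        terms = vect.get_feature_names()
    coo = row.tocoo()  # COO format exposes column indices and stored values
    return {terms[j]: (v, clf.coef_[0, j]) for j, v in zip(coo.col, coo.data)}


new_document = "good plot but boring unknownword"
for term, (value, coef) in sorted(describe_document(text_model, new_document).items()):
    print(term, round(float(value), 3), round(float(coef), 3))
```

Words that were not in the training vocabulary (here "but" and "unknownword") simply do not appear in the result. Summing value * coefficient over the returned terms and adding clf.intercept_[0] gives the score returned by text_model.decision_function([new_document]), which predict_proba then maps to a probability.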
