Text classification with scikit-learn: how to get a representation of a new document from a pickle model

Eugenio

I have a binary document classifier that takes a tf-idf representation of a set of training documents and applies logistic regression to it:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

I save the model with pickle and then use it to classify new documents:

import pickle
with open('text_model.pkl', 'rb') as f:
    text_model = pickle.load(f)
results = text_model.predict_proba(new_document)

How can I get the representation (features + frequencies) that the model uses for this new document without explicitly computing it?

EDIT: Let me try to explain better what I'm after. When I call predict_proba, I assume the new document is represented as a vector of term frequencies (according to the rules stored in the model), and that these frequencies are multiplied by the coefficients learned by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of the new document as used by predict_proba?

I am using scikit-learn 0.19.

Vivek Kumar

As I mentioned in my comment, you need to access the TfidfVectorizer inside the pipeline. This can easily be done with:

tfidfVect = text_model.named_steps['vect']

Now you can use the vectorizer's transform() method to get the tf-idf values (new_document should be an iterable of strings, e.g. a one-element list):

tfidf_vals = tfidfVect.transform(new_document)

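tfidf_vals will be a sparse matrix with one row per document passed in (so a single row here) and one column per term in the vocabulary learned during training. The non-zero entries are the tf-idf weights of the terms found in new_document. To check which term each column corresponds to, use tfidfVect.get_feature_names() (in scikit-learn 1.0 and later the method is get_feature_names_out()).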
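To tie this back to the question, here is a minimal sketch of pulling the terms and tf-idf weights for a new document out of a fitted pipeline and seeing how they combine with the logistic regression coefficients. The training texts, labels, and the names index_to_term, term_weights, and contributions are made up for illustration, and the pipeline is fitted inline rather than loaded from text_model.pkl. It maps columns to terms through vocabulary_ instead of get_feature_names(), which should behave the same on 0.19 and on newer releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical), standing in for X_train / y_train.
X_train = [
    "good movie great acting",
    "great plot good fun",
    "bad movie boring plot",
    "boring acting bad script",
]
y_train = [1, 1, 0, 0]

text_model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
text_model.fit(X_train, y_train)

# transform() expects an iterable of strings, so wrap the single document in a list.
new_document = ["good plot but unseen words"]

vect = text_model.named_steps["vect"]
clf = text_model.named_steps["clf"]

# Sparse matrix of shape (1, n_features); COO format exposes column indices directly.
row = vect.transform(new_document).tocoo()

# Column index -> term, built from the fitted vocabulary.
index_to_term = {idx: term for term, idx in vect.vocabulary_.items()}

# Non-zero entries are the terms of the new document that the model knows about.
# Words never seen during training are silently dropped.
term_weights = {index_to_term[int(j)]: float(v) for j, v in zip(row.col, row.data)}

# Per-term contribution to the decision value of the binary classifier:
# tf-idf weight times the learned coefficient for that term.
contributions = {
    term: weight * float(clf.coef_[0][vect.vocabulary_[term]])
    for term, weight in term_weights.items()
}

for term in sorted(term_weights):
    print(term, round(term_weights[term], 4), round(contributions[term], 4))
```

Summing the contributions and adding clf.intercept_[0] gives the value returned by text_model.decision_function(new_document). predict_proba applies the logistic function to that value, so the understanding described in the question's EDIT is right for the binary case, with tf-idf weights in place of raw term frequencies.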
