scikit-learn: how to use two different datasets for training and testing

to die

I am trying to use two different datasets, one as the training set and one as the test set. With the following code, I get this error:

File "main.py", line 84, in main_test
    X2 = tf_transformer.transform(word_counts2)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 1020, in transform
    n_features, expected_n_features))
ValueError: Input has n_features=1293 while the model has been trained with n_features=1625

def main_test(path = None):
    dir_path = path or 'dataset'
    files = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files.data)
    word_counts = util.bagOfWords(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    tf_transformer.fit(word_counts)
    X = tf_transformer.transform(word_counts)

    dir_path = 'testset'
    files2 = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files2.data)
    word_counts2 = util.bagOfWords(files2.data)
    # tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    # tf_transformer.fit(word_counts2)
    X2 = tf_transformer.transform(word_counts2)

    clf = sklearn.svm.LinearSVC()

    test_classifier(X, files.target, clf, X2, files2.target, test_size=0.2, y_names=files.target_names, confusion=False)


def test_classifier(X, y, clf, X2, y2, test_size=0.4, y_names=None, confusion=False):
    X_train, X_test, y_train, y_test = X, X2, y, y2
    clf.fit(X_train, y_train)
    # clf.fit(X_test, y_test)
    y_predicted = clf.predict(X_test)

    print colored('Classification report:', 'magenta', attrs=['bold'])
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)

Ibraim Ganiev

This happens because when you call

word_counts2 = util.bagOfWords(files2.data)

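it builds counts over the test set's own vocabulary, which includes words the TF-IDF transformer never saw in the training set. Those words have no inverse document frequencies, and the resulting matrix has a different number of columns (1293) than the one the transformer was fitted on (1625).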

You need to generate counts only for the words that appear in the training set. CountVectorizer can help here: fit it on the training documents, then reuse the same fitted instance to transform the test documents.
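A minimal sketch of that idea, assuming the raw documents are available as lists of strings (as `files.data` and `files2.data` would be after `load_files`). The toy documents and labels below are hypothetical stand-ins, and `CountVectorizer` replaces the question's own `util.bagOfWords` helper, whose implementation is not shown:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for files.data / files.target and files2.data.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham (illustrative labels)
test_docs = [
    "buy cheap pills today",         # "today" never appears in training
    "monday project meeting moved",  # "moved" never appears in training
]

# Learn the vocabulary from the training documents only.
count_vectorizer = CountVectorizer()
train_counts = count_vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer: unseen test words are ignored, so the
# test matrix has exactly the same columns as the training matrix.
test_counts = count_vectorizer.transform(test_docs)

# Fit TF-IDF on the training counts, then only transform the test counts.
tf_transformer = TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(train_counts)
X_test = tf_transformer.transform(test_counts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
y_predicted = clf.predict(X_test)

print(X_train.shape, X_test.shape)
```

The point is that every fitted object (vectorizer, TF-IDF transformer, classifier) is fitted on the training data alone and only applied to the test data, so both matrices share one feature space.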
