scikit-learn: how to use two different datasets for training and testing


to die

I am trying to use two different datasets as the training set and the test set, respectively. But with the code below, I get this error:

  File "main.py", line 84, in main_test
    X2 = tf_transformer.transform(word_counts2)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 1020, in transform
    n_features, expected_n_features))
ValueError: Input has n_features=1293 while the model has been trained with n_features=1625

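import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.svm
from termcolor import colored  # assumption: colored() comes from termcolor
import util  # the asker's own helper module (not shown in the question)


def main_test(path=None):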
    dir_path = path or 'dataset'
    files = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files.data)
    word_counts = util.bagOfWords(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    tf_transformer.fit(word_counts)
    X = tf_transformer.transform(word_counts)

    dir_path = 'testset'
    files2 = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files2.data)
    word_counts2 = util.bagOfWords(files2.data)
    # tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    # tf_transformer.fit(word_counts2)
    X2 = tf_transformer.transform(word_counts2)

    clf = sklearn.svm.LinearSVC()

    test_classifier(X, files.target, clf, X2, files2.target, test_size=0.2, y_names=files.target_names, confusion=False)


def test_classifier(X, y, clf, X2, y2, test_size=0.4, y_names=None, confusion=False):
    X_train, X_test, y_train, y_test = X, X2, y, y2
    clf.fit(X_train, y_train)
    # clf.fit(X_test, y_test)
    y_predicted = clf.predict(X_test)

    print colored('Classification report:', 'magenta', attrs=['bold'])
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)
Ibraim Ganiev

The error occurs because, when you call

word_counts2 = util.bagOfWords(files2.data)

it produces counts over the test set's own vocabulary, including words that the TF-IDF transformer never saw in the training set, and those words have no inverse document frequencies.
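To make the cause concrete, here is a minimal sketch of the same mismatch. It assumes that util.bagOfWords builds a fresh vocabulary on every call (the helper is not shown in the question, so this is a guess), and it uses two independent CountVectorizer instances as a stand-in for it. The documents are made-up toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-ins for files.data and files2.data (hypothetical documents).
train_docs = ["buy cheap pills now", "project meeting on monday"]
test_docs = ["buy pills online"]

# Two independent vectorizers: each one learns its own vocabulary,
# so the two count matrices have different columns.
train_counts = CountVectorizer().fit_transform(train_docs)
test_counts = CountVectorizer().fit_transform(test_docs)

# The transformer learns one IDF weight per training column.
tfidf = TfidfTransformer(use_idf=True).fit(train_counts)

try:
    tfidf.transform(test_counts)  # column count differs from training
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(train_counts.shape[1], test_counts.shape[1])
print(error_message)
```

The exact wording of the ValueError depends on the scikit-learn version, but the reason is the same as in the traceback from the question: the number of columns at transform time does not match the number seen at fit time.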

You need to generate counts only for the words that appear in the training set. CountVectorizer may help here: fit it on the training data, then reuse the same fitted instance to transform the test data.
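A minimal sketch of that approach, under the assumption that CountVectorizer can replace util.bagOfWords in your code. The documents and labels are made-up toy data standing in for files.data, files.target and files2.data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Toy stand-ins for the training and test sets (hypothetical documents).
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project meeting notes",
]
train_labels = [1, 0, 1, 0]
test_docs = ["buy pills online today", "agenda for the project meeting"]

# Fit the vocabulary on the training data only.
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_docs)

# Reuse the fitted vectorizer: words not in the training
# vocabulary (e.g. "online") are simply ignored.
test_counts = count_vect.transform(test_docs)

# Learn IDF weights on the training counts, apply them to both sets.
tfidf = TfidfTransformer(use_idf=True)
X_train = tfidf.fit_transform(train_counts)
X_test = tfidf.transform(test_counts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
predicted = clf.predict(X_test)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, so tf_transformer.transform no longer raises. The key point is that fit (or fit_transform) is called only on the training data, and only transform is called on the test data.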
