scikit-learn: how to use two different datasets for training and testing
I am trying to use one dataset as the training set and a different one as the test set. With the following code, I get this error:
  File "main.py", line 84, in main_test
    X2 = tf_transformer.transform(word_counts2)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 1020, in transform
    n_features, expected_n_features))
ValueError: Input has n_features=1293 while the model has been trained with n_features=1625
def main_test(path=None):
    dir_path = path or 'dataset'
    files = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files.data)
    word_counts = util.bagOfWords(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    tf_transformer.fit(word_counts)
    X = tf_transformer.transform(word_counts)

    dir_path = 'testset'
    files2 = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files2.data)
    word_counts2 = util.bagOfWords(files2.data)
    # tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    # tf_transformer.fit(word_counts2)
    X2 = tf_transformer.transform(word_counts2)

    clf = sklearn.svm.LinearSVC()
    test_classifier(X, files.target, clf, X2, files2.target, test_size=0.2, y_names=files.target_names, confusion=False)


def test_classifier(X, y, clf, X2, y2, test_size=0.4, y_names=None, confusion=False):
    X_train, X_test, y_train, y_test = X, X2, y, y2
    clf.fit(X_train, y_train)
    # clf.fit(X_test, y_test)
    y_predicted = clf.predict(X_test)
    print colored('Classification report:', 'magenta', attrs=['bold'])
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)
Ibraim Ganiev
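This happens because when you call

word_counts2 = util.bagOfWords(files2.data)

it apparently builds counts from the test set's own vocabulary. That vocabulary includes words the TF-IDF transformer never saw in the training set, so it has no inverse document frequencies for them, and the number of columns no longer matches (1293 versus 1625 in your error).

You need to generate counts only for the words that appear in the training set. CountVectorizer may help here: fit it on the training data, then reuse the same fitted instance to transform the test data.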
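
A minimal sketch of that idea, assuming CountVectorizer replaces your util.bagOfWords helper (I have not seen its code, so this is a stand-in, not a drop-in replacement). The toy documents and labels below are made up for illustration; in your case they would be files.data, files.target and files2.data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for files.data / files.target / files2.data.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = [1, 0, 1, 0]
test_docs = ["buy pills online", "agenda for the project review"]

# Learn the vocabulary from the training set only.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer on the test set: words that were not in the
# training vocabulary are ignored, so the column count stays the same.
test_counts = vectorizer.transform(test_docs)

# Fit TF-IDF on the training counts, then apply it to both sets.
tf_transformer = TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(train_counts)
X_test = tf_transformer.transform(test_counts)

# Train on one dataset and predict on the other.
clf = LinearSVC()
clf.fit(X_train, train_labels)
y_predicted = clf.predict(X_test)
```

The key point is that fit (or fit_transform) is called only on the training data, while the test data only ever goes through transform. That keeps X_train and X_test at the same number of features, which is what the ValueError was complaining about.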