Unstable accuracy of Gaussian mixture model classifier from sklearn
I have some data from two different speakers (MFCC features for speaker recognition). Each speaker has 60 vectors of 13 features (120 vectors in total), and each vector is labeled with its speaker (0 or 1). I need to display the results in a confusion matrix, but the GaussianMixture model from sklearn is not stable: on every run of the program I get a different score (sometimes the accuracy is 0.4, sometimes 0.7...). I don't know what I'm doing wrong, because I created SVM and k-NN models in exactly the same way and they work fine (stable accuracy, around 0.9). Do you know what I'm doing wrong?
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score, confusion_matrix

gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train)  # X_train are the MFCC vectors, y_train are the labels

ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))
Short answer: you should simply not use a GMM for classification.

Long answer...
From an answer to a related thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):

Gaussian Mixture is not a classifier. It is a density estimation method, and expecting its components to magically align with your classes is not a good idea. [...] GMM simply tries to fit a mixture of Gaussians to your data, but nothing forces it to place them according to the labels (which are not even provided in the fit call). From time to time this will work, but only for trivial problems, where the classes are so well separated that even Naive Bayes would work; in general, however, it is simply an invalid tool for the problem.
And a comment by the answerer himself (again, emphasis in the original):

As stated in the answer, GMM is not a classifier, so asking whether you are using a "GMM classifier" correctly is impossible to answer. Using a GMM as a classifier is incorrect by definition; there is no "valid" way of using it in such a problem, as this is not what the model is designed for. What you could do is build a proper generative model per class. In other words, construct your own classifier where you fit one GMM per label, and then use the assigned probabilities to do the actual classification. Then it is a proper classifier. See github.com/scikit-learn/scikit-learn/pull/2468
(For what it's worth, you may want to notice that the answerer is a research scientist at DeepMind, and the very first recipient of the machine-learning gold badge here at SO.)
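To make the suggested approach concrete, here is a minimal sketch of such a per-class generative classifier: one GMM is fitted per label, and prediction picks the class with the highest per-class log-likelihood plus log-prior. The class name GMMClassifier and the hyperparameter choices are my own illustrative assumptions, not code from the linked answer or PR:

import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """Illustrative sketch: fit one GMM per class, then predict the
    class with the highest log-likelihood plus log-prior."""

    def __init__(self, n_components=2, covariance_type='diag', random_state=0):
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_, self.log_priors_ = [], []
        for c in self.classes_:
            Xc = X[y == c]  # only the data belonging to class c
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type=self.covariance_type,
                                  random_state=self.random_state).fit(Xc)
            self.models_.append(gmm)
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def predict(self, X):
        # score_samples gives the per-sample log-likelihood under each class GMM
        scores = np.stack([lp + gmm.score_samples(X)
                           for gmm, lp in zip(self.models_, self.log_priors_)],
                          axis=1)
        return self.classes_[np.argmax(scores, axis=1)]

Usage then mirrors your original code, GMMClassifier().fit(X_train, y_train).predict(X_test), and, since each GMM sees only the data of its own class, the labels now actually constrain the model.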
To elaborate further (which is also why I didn't simply flag the question as a duplicate):
It is true that there is an example titled GMM classification in the scikit-learn documentation:

Demonstration of Gaussian mixture models for classification.
I guess this did not exist back in 2017, when the answer quoted above was written. But dig into the provided code and you will see that the GMM models are actually used there in the way proposed by lejlot above: there is no statement of the form classifier.fit(X_train, y_train) - all usage is of the form classifier.fit(X_train), i.e. no actual labels are used.
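In fact, if I remember the example correctly, the labels there enter only indirectly, to initialize the component means before the unsupervised fit. Roughly along these lines (a paraphrase from memory with toy stand-in data, not a verbatim copy of the example):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
n_classes = 2
# Toy stand-ins for the MFCC data in the question
X_train = np.vstack([rng.normal(0, 1, (60, 13)), rng.normal(3, 1, (60, 13))])
y_train = np.repeat([0, 1], 60)

classifier = GaussianMixture(n_components=n_classes, covariance_type='full')
# The labels are used only here, to initialize one component mean per class...
classifier.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                  for i in range(n_classes)])
classifier.fit(X_train)  # ...while the fit itself never sees y_train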
This is exactly what we would expect from a clustering algorithm (which is indeed what a GMM is), and not from a classifier. It is true, again, that scikit-learn offers the option of also providing labels in the GMM fit method:

fit(self, X, y=None)
which you have actually used here (and which, again, probably did not exist back in 2017, as the answers quoted above imply); but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share of practices that may look sensible from a purely programming perspective, but make very little sense from a modeling perspective).
A final remark: although fixing the random seed (as suggested in the comments) may appear to "work", trusting a "classifier" that gives accuracies anywhere between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
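If you want to see the instability for yourself, a quick check along the following lines (using your own X_train, y_train, X_test, y_test split) makes the point; note that y_train is silently ignored by fit anyway:

from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture

# Re-fit the same (mis)used "classifier" under different seeds: the accuracy
# swings with the random initialization of the mixture components.
for seed in range(5):
    gmm = GaussianMixture(n_components=2, covariance_type='diag',
                          random_state=seed).fit(X_train, y_train)
    print(seed, accuracy_score(y_test, gmm.predict(X_test)))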