Understanding Gaussian Mixture Models
I'm trying to understand the results of the scikit-learn Gaussian Mixture Model implementation. See the example below:
#!/opt/local/bin/python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))
# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))
# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))
# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))
# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(),
                   gmm.weights_.ravel()):
    gmm_y_sum += gauss_function(x=gmm_x, amp=w, x0=m, sigma=np.sqrt(c))
# Normalize so that integral is 1
gmm_y_sum /= np.trapz(gmm_y_sum, gmm_x)
# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
# density=True replaces the "normed" keyword, which newer matplotlib releases no longer accept
ax.hist(samples, bins=50, density=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum")
# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")
# Draw legend
plt.legend()
plt.show()
Here I first generate a sample distribution made up of three Gaussians and then fit a Gaussian mixture model to these data. Next, I want to calculate the probability density at some given input. Conveniently, the scikit-learn implementation provides the score_samples method to do this. Now I am trying to understand these results. I always thought that I could take the Gaussian parameters from the GMM fit and construct the same distribution by summing the components and then normalizing the integral to 1. However, as you can see in the plot, the result of score_samples (red line) fits the original data (blue histogram) perfectly, while the manually constructed distribution (black line) does not. I'm trying to understand what's wrong with my idea and why I can't construct the distribution myself by summing the Gaussians given by the GMM fit. Thanks so much for your input!
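In case someone wants to know the same thing in the future: each individual component must be normalized to unit area before it is multiplied by its weight. Normalizing only the final sum does not work, because components with different widths then contribute the wrong relative areas. The corrected script: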
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))
# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))
# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))
# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))
# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_.ravel()):
    # Normalize each component to unit area first, then apply its mixture weight
    gauss = gauss_function(x=gmm_x, amp=1, x0=m, sigma=np.sqrt(c))
    gmm_y_sum += gauss / np.trapz(gauss, gmm_x) * w
# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
# density=True replaces the "normed" keyword, which newer matplotlib releases no longer accept
ax.hist(samples, bins=50, density=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum", linestyle="dashed")
# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")
# Make legend
plt.legend()
plt.show()
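As a side note, the per-component normalization does not have to be done numerically. A Gaussian with standard deviation sigma has unit area when it carries the prefactor 1 / (sigma * sqrt(2 * pi)), so the mixture density is the sum of weight times that unit-area Gaussian. Below is a minimal sketch of this check, not a definitive implementation. The helper name normal_pdf, the fixed seed and random_state=0 are my own assumptions, added only to make the run reproducible.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def normal_pdf(x, mean, sigma):
    """Unit-area Gaussian density (hypothetical helper, not part of scikit-learn)."""
    return np.exp(-(x - mean) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))


# Same three-component sample as above; the seed is an assumption for reproducibility
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(-0.5, 0.2, 2000),
    rng.normal(-0.1, 0.07, 5000),
    rng.normal(0.2, 0.13, 10000),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001, random_state=0)
gmm.fit(samples.reshape(-1, 1))

x = np.linspace(-2, 1.5, 5000)

# score_samples returns the log of the density, hence the exp
density_sklearn = np.exp(gmm.score_samples(x.reshape(-1, 1)))

# Weighted sum of unit-area Gaussians; for 1-D data each covariance is a variance
density_manual = np.zeros_like(x)
for mean, var, weight in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    density_manual += weight * normal_pdf(x, mean, np.sqrt(var))

max_abs_diff = np.max(np.abs(density_manual - density_sklearn))
print("largest difference between the two curves:", max_abs_diff)
```

Because the weights sum to 1 and every component already integrates to 1, no final rescaling of the sum is needed. Unlike the trapz-based version, this one also does not depend on the plotting grid covering the tails of every component.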