How to use "Dirichlet Process Gaussian Mixture Model" in Scikit-Learn? (n_components?)


Oka

My understanding of "Infinite Mixture Models with Dirichlet Processes as Prior Distributions for Number of Clusters" is that the number of clusters is determined by the convergence of the data to a certain number of clusters.

This R implementation, https://github.com/jacobian1980/ecostates, determines the number of clusters in this way. However, the R implementation uses a Gibbs sampler, and I'm not sure whether that affects this.

What confuses me is the n_components parameter. The documentation says: n_components: int, default 1: Number of mixture components. If the number of components is determined by the data and the Dirichlet process, what is this parameter for?


Ultimately, I am trying to get:

(1) Cluster assignment of each sample;

(2) a probability vector for each cluster; and

(3) Likelihood/log-likelihood of each sample.

It looks like (1) is the predict method and (3) is the score method. However, the output of (1) depends entirely on the n_components hyperparameter.

My apologies if this is a naive question. I'm pretty new to Bayesian programming and noticed that Scikit-learn has a Dirichlet Process implementation that I'd like to try.


Here is the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here is a usage example: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

Here is my naive usage:

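import pandas as pd
from sklearn.mixture import DPGMM

# Load the samples (rows) and features (columns); the first column is the index
X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
# n_components is the value in question here
Mod_dpgmm = DPGMM(n_components=3)
Mod_dpgmm.fit(X)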
Raphael Valle

As @maxymoo mentioned in the comments, n_components is a truncation parameter: it is an upper bound on the number of components, not the number the model ends up using.
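To make the truncation idea concrete, here is a minimal sketch. It assumes a recent scikit-learn, where DPGMM is no longer available (it was deprecated in 0.18 and removed in 0.20) and BayesianGaussianMixture with weight_concentration_prior_type="dirichlet_process" takes its place. The synthetic three-blob data, the prior value of 0.1, and the 1% weight threshold are my own choices for illustration, not anything from the question.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data (assumption for illustration): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[-6.0, 0.0], [0.0, 6.0], [6.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(200, 2)) for c in centers])

# n_components is only the truncation level (an upper bound), not the answer.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,  # concentration parameter (arbitrary choice)
    max_iter=500,
    random_state=0,
).fit(X)

weights = model.weights_
# Count components that carry a non-negligible share of the data.
# The 1% cutoff is an arbitrary threshold chosen for this sketch.
n_effective = int(np.sum(weights > 0.01))
print(np.round(weights, 3))
print("components with weight > 1%:", n_effective)
```

If the fit behaves as expected, most of the ten weights should shrink toward zero, so the number of components actually in use comes from the data rather than from n_components.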

In the Chinese restaurant process, which is related to the stick-breaking representation used in sklearn's DP-GMM, a new data point joins an existing cluster k with probability |k| / (n - 1 + alpha) and starts a new cluster with probability alpha / (n - 1 + alpha). The parameter alpha can be interpreted as the concentration parameter of the Dirichlet process, and it affects the final number of clusters.
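The seating rule above can be simulated in a few lines of NumPy. This is only a sketch of the Chinese restaurant process itself, not of what sklearn does internally; the function name simulate_crp and the alpha values are hypothetical choices for illustration.

```python
import numpy as np

def simulate_crp(n_points, alpha, seed=0):
    """Seat n_points customers one by one; return the list of cluster sizes."""
    rng = np.random.default_rng(seed)
    sizes = []  # sizes[k] is |k|, the number of points already in cluster k
    for n in range(1, n_points + 1):
        # Existing cluster k: |k| / (n - 1 + alpha); new cluster: alpha / (n - 1 + alpha)
        probs = np.array(sizes + [alpha], dtype=float) / (n - 1 + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)   # start a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
    return sizes

small_alpha = simulate_crp(500, alpha=0.1)
large_alpha = simulate_crp(500, alpha=10.0)
print("alpha=0.1 ->", len(small_alpha), "clusters")
print("alpha=10  ->", len(large_alpha), "clusters")
```

A larger alpha makes the "new cluster" branch more likely at every step, which is why it tends to produce more clusters for the same number of points.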

Unlike R's implementation that uses Gibbs sampling, sklearn's implementation of DP-GMM uses variational inference. This may be related to the difference in results.
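For the three outputs asked about in the question, a sketch along these lines may help. It assumes the current BayesianGaussianMixture API rather than the old DPGMM class, and it uses made-up two-blob data. In this API, predict gives hard assignments, predict_proba gives each sample's probability vector over the components, and score_samples gives per-sample log-likelihoods (score returns their mean).

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Made-up data for illustration: two 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[-4.0, 0.0], scale=0.8, size=(150, 2)),
    rng.normal(loc=[4.0, 0.0], scale=0.8, size=(150, 2)),
])

model = BayesianGaussianMixture(
    n_components=8,  # truncation level, not the final cluster count
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

labels = model.predict(X)         # (1) cluster assignment of each sample
proba = model.predict_proba(X)    # (2) per-sample probabilities over components
log_lik = model.score_samples(X)  # (3) log-likelihood of each sample

print(labels[:5])
print(np.round(proba[:2], 3))
print(np.round(log_lik[:5], 3))
```

Because inference here is variational rather than Gibbs sampling, these are point estimates from a single optimization run, so results may differ from a sampler-based R implementation.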

A detailed Dirichlet Process tutorial can be found here.
