Clustering is a technique for grouping data points into sets called clusters, which makes the structure of a dataset easier to visualize. Many datasets are available from sources across the internet, but visualizing such data as distinct groups requires applying a machine learning model. In this tutorial, you will learn the concepts behind clustering algorithms used in machine learning. By the end of this tutorial, you will understand the following topics.
- What is Clustering?
- What is the Gaussian Mixture Model?
- Scikit-Learn’s Estimator API.
- Basics of API.
- Practical implementation of Iris Clustering using the Gaussian Mixture Model.
If you are new to Python programming or want to learn more about coding, you can take the Python Training Certification course available online.
What is Clustering?
Clustering is the process of partitioning a set of data objects into meaningful sub-classes. A cluster is a collection of data objects that are similar to one another and are therefore treated collectively as one group.
In the above picture, the original unclustered data is colored blue. After clustering, the same data points are colored blue, green, and red according to the group they belong to, and each group of same-colored points is called a cluster. Clustering does not separate or alter any data; it only identifies distinct groups within the original data, which the coloring makes visible.
A clustering algorithm attempts to find distinct groups of data without reference to any labels. In this section, we apply clustering to the Iris dataset using a powerful clustering method called a Gaussian mixture model (GMM). A GMM attempts to model the data as a collection of Gaussian blobs.
What is the Gaussian mixture model?
A Gaussian mixture model (GMM) is a type of probabilistic model. It assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. These parameters are estimated from training data using the iterative expectation-maximization (EM) algorithm, or by maximum a posteriori (MAP) estimation from a well-trained prior model. GMMs are very useful for modeling data that comes from several groups.
Mathematically, a Gaussian mixture model is a parametric probability density function represented as a weighted sum of Gaussian component densities. A GMM with M components is written as follows.
$$p(x \mid \lambda) = \sum_{k=1}^{M} w_k \, g(x \mid \mu_k, \Sigma_k)$$
Here λ denotes the full set of parameters, x is an observation, g(x | μ_k, Σ_k) is the k-th Gaussian component density with mean vector μ_k and covariance matrix Σ_k, and w_k is the weight assigned to that component, with the weights summing to 1. The parameters of a GMM are:
- The mean vectors of each component.
- The covariance matrices of each component, and
- The weights associated with each component.
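As a minimal sketch of the weighted-sum definition above, the snippet below evaluates a two-component mixture density by hand and compares it with the value reported by Scikit-Learn's `GaussianMixture.score_samples`, which returns the log of the same quantity. The synthetic data and the variable names are illustrative assumptions and are not part of the original tutorial.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two blobs (illustrative assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.8, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# Weighted sum of the component densities, one term per component:
# weight * Gaussian density with that component's mean and covariance
manual_density = np.zeros(len(X))
for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
    manual_density += w * multivariate_normal(mean=mu, cov=cov).pdf(X)

# score_samples returns the log-density of each sample under the fitted model
sklearn_density = np.exp(gmm.score_samples(X))

weights_sum = gmm.weights_.sum()
densities_match = np.allclose(manual_density, sklearn_density)
print(weights_sum, densities_match)
```

The fitted attributes `weights_`, `means_`, and `covariances_` correspond to the three parameter groups listed above.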
A linear combination of Gaussians with diagonal covariance matrices is capable of modeling the correlations between feature vector elements. Another feature of the Gaussian mixture model is its ability to form smooth approximations to arbitrarily shaped densities. Gaussian mixture models are also used for density estimation and are considered among the most statistically mature techniques applied to clustering.
Python has a range of libraries that implement machine learning algorithms. Scikit-Learn is one of the most widely used and provides a large number of algorithms behind a uniform interface. This section gives an overview of the Scikit-Learn API and of the API elements that machine learning code relies on.
1). Scikit-Learn’s Estimator API
The Scikit-Learn API is designed with the following principles.
- Consistency: All objects share a common interface from a limited set of methods.
- Inspection: All specified parameter values are public attributes.
- Limited Object Hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.
- Composition: Machine learning tasks are expressed as sequences of fundamental algorithms and Scikit-Learn makes use of this wherever possible.
- Sensible defaults: When models require user-specified parameters, the library defines appropriate default values.
These principles make Scikit-Learn easy to use. Every machine learning algorithm in Scikit-Learn is implemented through the Estimator API, which provides a consistent interface for a wide range of machine learning applications.
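The principles listed above can be observed directly on estimator objects. The following is a small sketch, assuming `GaussianMixture` and `PCA` as the example estimators and `make_pipeline` as one way to express composition; the variable names are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline

# Inspection: hyperparameters passed to the constructor are public attributes
gmm = GaussianMixture(n_components=3, covariance_type="full")
n_components = gmm.n_components
params = gmm.get_params()

# Sensible defaults: unspecified hyperparameters fall back to library defaults
default_gmm = GaussianMixture()
default_covariance_type = default_gmm.covariance_type

# Consistency: different estimators expose the same fit() method
estimators = [PCA(n_components=2), GaussianMixture(n_components=3)]
all_have_fit = all(hasattr(est, "fit") for est in estimators)

# Composition: estimators can be chained into a single pipeline object
pipeline = make_pipeline(PCA(n_components=2), GaussianMixture(n_components=3))
step_names = [name for name, _ in pipeline.steps]
print(n_components, default_covariance_type, all_have_fit, step_names)
```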
2). Basics of the API
The steps in using the Scikit-Learn API are as follows:
- Choose a class of models by importing the appropriate estimator class from Scikit-Learn.
- Choose model hyperparameters by instantiating this class with desired values.
- Arrange data into a features matrix and target vector.
- Fit the model to your data by calling the fit() method of the model instance.
- Apply the model to new data:
- For supervised learning, we often predict labels for unknown data using the predict() method.
- For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
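The steps above can be sketched for both the supervised and the unsupervised case. This example assumes the copy of the Iris data bundled with Scikit-Learn (`load_iris`) and uses `GaussianNB` and `PCA` purely as stand-in estimators; neither choice comes from the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# Arrange data into a features matrix X and a target vector y
iris = load_iris()
X, y = iris.data, iris.target

# Supervised: choose a class, instantiate it, fit on (X, y), then predict labels
classifier = GaussianNB()
classifier.fit(X, y)
predicted_labels = classifier.predict(X[:5])

# Unsupervised: no target vector is passed to fit(); transform() maps the data
reducer = PCA(n_components=2)
reducer.fit(X)
X_reduced = reducer.transform(X)
print(predicted_labels.shape, X_reduced.shape)
```

In both cases the sequence is the same: import, instantiate, fit, then apply. Only the arguments to fit() and the method used afterwards differ.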
3). Implementation steps while coding with Python
Let us now implement the steps of Scikit-Learn API to the Gaussian mixture model as follows.
- Choose the model class.
- Instantiate the model with hyperparameters.
- Fit to data.
- Determine cluster labels.
“GaussianMixture” is the clustering model used in the Python code, and it is imported from the “sklearn.mixture” package.
# 1. Choose the model class
from sklearn.mixture import GaussianMixture
The model is created by calling GaussianMixture() with the following two hyperparameters.
- n_components, which defines the number of mixture components to use, set here to 3.
- covariance_type, set here to 'full' so that each component has its own full covariance matrix.
# 2. Instantiate the model with hyperparameters
model = GaussianMixture(n_components=3,covariance_type='full')
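To illustrate what the covariance_type hyperparameter controls, the sketch below fits the same model with each of the four values that Scikit-Learn accepts and records the shape of the fitted covariances_ attribute. It assumes the copy of the Iris data bundled with Scikit-Learn and a fixed random_state for repeatability; both are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data  # 150 samples, 4 features

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    model = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    model.fit(X)
    # 'full': one matrix per component; 'tied': one matrix shared by all;
    # 'diag': one variance vector per component; 'spherical': one variance each
    shapes[cov_type] = model.covariances_.shape
print(shapes)
```

The 'full' setting is the most flexible of the four, at the cost of estimating the most parameters.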
The model is now ready to be fitted. The feature values of the Iris data are stored in the variable “X_iris” and passed to the method “.fit(X_iris)”. No target “y” is passed, because the ‘species’ column was dropped and those labels are not used to fit the model.
# 3. Fit to data. Notice y is not specified!
model.fit(X_iris)
The cluster labels are then predicted with the method “.predict(X_iris)”, as shown in the code below, and stored in the variable “y_gmm”.
y_gmm = model.predict(X_iris)
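Because a GMM is a probabilistic model, each hard label returned by predict() has a soft counterpart. The sketch below, which assumes the copy of the Iris data bundled with Scikit-Learn and a fixed random_state, uses predict_proba() to obtain the probability of each sample belonging to each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X_iris = load_iris().data
model = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
model.fit(X_iris)

# Hard assignment: one component index per sample
y_gmm = model.predict(X_iris)

# Soft assignment: one probability per sample and component
probs = model.predict_proba(X_iris)
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)

# The hard label is the component with the highest probability
hard_matches_soft = np.array_equal(y_gmm, probs.argmax(axis=1))
print(probs.shape, rows_sum_to_one, hard_matches_soft)
```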
We now add the cluster labels to the “Iris” DataFrame, which is built in the practical implementation section below, and then use Seaborn to plot the results.
# 4. Determine cluster labels
iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species', col='cluster', fit_reg=False)
The resulting plot of the “Iris” dataset appears as shown in the picture below.
Practical Implementation of Iris Clustering using Gaussian Mixture Model
Now let us run the complete code to see the clusters. Before executing the program, download the dataset file “Iris.csv” here and check that the following packages are installed in Python.
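import seaborn as sns
import pandas as pd
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Load the dataset; adjust the path to wherever Iris.csv was saved
df = pd.read_csv('Iris.csv')
# Keep only the feature columns
X_iris = df.drop('Species', axis=1)
# Reduce the features to two principal components for plotting
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
df['PCA1'] = X_2D[:, 0]
df['PCA2'] = X_2D[:, 1]
# Fit the Gaussian mixture model and predict a cluster label for each row
model = GaussianMixture(n_components=3, covariance_type='full')
model.fit(X_iris)
y_gmm = model.predict(X_iris)
df['cluster'] = y_gmm
# Plot one panel per cluster, with points colored by the true species
sns.lmplot(x='PCA1', y='PCA2', data=df, hue='Species', col='cluster', fit_reg=False)
plt.show()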
The program uses the “PCA” model from the “sklearn.decomposition” module of the Scikit-Learn package to apply dimensionality reduction.
Pandas reads the “Iris” dataset into a data frame from the specified location on the PC. The call drop(‘Species’,axis=1) drops the “Species” column and stores the remaining columns in the variable “X_iris”. Next, the PCA() model is created with “n_components=2” to keep two principal components and is stored in the variable “model”. The model is fitted to the “X_iris” data by passing it to the “.fit()” method. The “.transform()” method then projects “X_iris” onto the two components, producing a matrix with one row per sample and two columns, which is stored in the variable “X_2D”.
In the next step, X_2D[:,0] retrieves the first column of the X_2D matrix, and these values are stored in the data frame column df[‘PCA1’]. Similarly, X_2D[:,1] retrieves the second column of the X_2D matrix, and these values are stored in the column df[‘PCA2’]. These two columns provide the coordinates that are plotted with the “lmplot()” method of the “seaborn” library.
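The shapes involved in the PCA step can be checked with a short sketch. It assumes the copy of the Iris data bundled with Scikit-Learn instead of the “Iris.csv” file, so that it runs without a download; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Fit PCA and project the four features onto two principal components
pca = PCA(n_components=2)
pca.fit(X_iris)
X_2D = pca.transform(X_iris)

input_shape = X_iris.shape   # one row per flower, one column per feature
output_shape = X_2D.shape    # same number of rows, two columns

# Fraction of the total variance captured by each retained component
explained = pca.explained_variance_ratio_
print(input_shape, output_shape, explained.sum())
```

The number of rows is unchanged by the projection; only the number of columns is reduced.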
As in the previous section, the same steps are used here to fit the Gaussian mixture model and label the clusters. A cluster label of 0, 1, or 2 is predicted for every row of the “Iris” dataset. In the run described here, label 0 corresponds to the rows of “Iris-versicolor”, 1 to “Iris-virginica”, and 2 to “Iris-setosa”; the numbering is assigned arbitrarily by the algorithm and can differ between runs. The labels are stored in the variable “y_gmm” and then in the column df[‘cluster’], so that the column name can be passed to the “lmplot()” method to display the result.
The method “lmplot(x, y, hue, col, data, fit_reg)” takes the following parameters.
- x: Name of the column plotted on the x-axis, here ‘PCA1’.
- y: Name of the column plotted on the y-axis, here ‘PCA2’.
- hue: Name of the column whose values determine the color of the points. Here we use ‘Species’, which defines three subsets: “Iris-setosa”, “Iris-versicolor”, and “Iris-virginica”.
- data: The dataset to plot. In this case, we pass the entire data frame stored in the variable “df”.
- fit_reg: Setting it to True draws a regression line on each chart. Since that is not required here, we set it to False.
- col: Name of the column whose values split the data into separate panels laid out as columns of the grid. Here we use ‘cluster’, so each cluster is displayed in its own panel.
Finally, the “.show()” method of the “matplotlib” package displays the chart for visualizing the data.
By splitting the data by cluster number, we can see how well the GMM algorithm has recovered the underlying labels. The setosa species is separated cleanly into its own cluster, while there is a small amount of mixing between versicolor and virginica. This shows that the measurements of these flowers are distinct enough that a simple clustering algorithm can automatically identify the presence of the different species.
Thus we have seen how the clusters in a dataset, which may be retrieved from a variety of data sources, can be identified and visualized by practically implementing the Gaussian mixture model.
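The agreement between clusters and species can also be summarized numerically. The following sketch assumes the copy of the Iris data bundled with Scikit-Learn and a fixed random_state; it builds a species-by-cluster contingency table with pandas and computes the adjusted Rand index, a score that does not depend on how the clusters happen to be numbered. The exact counts may vary with the random seed and library version.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

iris = load_iris()
X_iris = iris.data
species = pd.Series(iris.target_names[iris.target], name="species")

model = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
model.fit(X_iris)
clusters = pd.Series(model.predict(X_iris), name="cluster")

# Rows are the true species, columns are the predicted cluster numbers
table = pd.crosstab(species, clusters)

# 1.0 means perfect agreement; values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, clusters)
print(table)
print(round(ari, 3))
```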
This is Manikanth, currently working as a Content Developer at HKR Trainings. I am passionate about doing research over various technical domains and publishing articles and end-user documents.