Hello,
In this section, I conducted a clustering analysis to identify patterns within the dataset. Here’s a summary of the steps and outcomes:
To prepare the data for clustering, I started by standardizing the dataset. Standardization rescales each variable to zero mean and unit variance, which matters because K-means relies on Euclidean distances and would otherwise be dominated by the variables with the largest ranges. I used the `StandardScaler` from `sklearn.preprocessing` for this and excluded the “Year” and “Month” columns from the scaling step.
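The scaling step could look roughly like the sketch below. The DataFrame contents and the `feature_a` / `feature_b` column names are hypothetical stand-ins, since the actual dataset is not shown here; only “Year”, “Month”, and `StandardScaler` come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the real dataset and its feature names are not shown.
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2022, 2022],
    "Month": [1, 7, 1, 7, 1, 7],
    "feature_a": [10.0, 12.5, 9.0, 30.0, 31.5, 28.0],
    "feature_b": [200.0, 180.0, 210.0, 50.0, 65.0, 40.0],
})

# Leave the calendar columns out of the scaling step.
feature_cols = [c for c in df.columns if c not in ("Year", "Month")]

# fit_transform learns each column's mean and standard deviation,
# then returns the standardized values as a NumPy array.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

Keeping `df` untouched and working on a separate `X_scaled` array makes it easy to attach cluster labels to the unscaled, human-readable values later.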
To choose the number of clusters for the K-means algorithm, I employed the Elbow Method. This method involves running K-means clustering for a range of cluster numbers (from 1 to 10) and calculating the Within-cluster Sum of Squares (WCSS) for each. The WCSS is the sum of squared distances between each data point and the centroid of its assigned cluster, so lower values mean tighter clusters. Because the WCSS generally keeps shrinking as clusters are added, the minimum is not the target. I plotted the WCSS against the number of clusters, and the point at which the decrease starts to level off (the “elbow”) suggests a reasonable number of clusters.
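A minimal sketch of the Elbow Method loop follows. It runs on synthetic data from `make_blobs` as a stand-in for the standardized dataset, and setting `n_init=10` is an assumption made here to keep behavior consistent across scikit-learn versions. In scikit-learn, the WCSS of a fitted model is exposed as the `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized dataset (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

k_values = list(range(1, 11))  # cluster numbers 1 through 10
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X_scaled)
    wcss.append(model.inertia_)  # inertia_ is the within-cluster sum of squares

for k, value in zip(k_values, wcss):
    print(f"k={k:2d}  WCSS={value:.2f}")
```

Plotting `wcss` against `k_values`, for example with matplotlib's `plt.plot(k_values, wcss, marker="o")`, gives the elbow curve described above.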
Based on the Elbow Method analysis, I performed K-means clustering with two different cluster numbers: 3 and 4. The `KMeans` class from `sklearn.cluster` was used for this purpose. The models were initialized with the “k-means++” method, which spreads the initial centroids apart and tends to converge faster and more reliably than purely random initialization, and a fixed random state (`random_state=42`) was set for reproducibility.
I then fitted each model and assigned data points to clusters in a single step using the `fit_predict` method. This produced two sets of cluster labels, one with 3 clusters and another with 4 clusters.
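The two models described above might be set up as in this sketch. The data is again a synthetic stand-in, and `n_init=10` is an assumption; `init="k-means++"` and `random_state=42` follow the description.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized dataset (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans_3 = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
kmeans_4 = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)

# fit_predict fits the model and returns one integer label per row.
labels_3 = kmeans_3.fit_predict(X_scaled)
labels_4 = kmeans_4.fit_predict(X_scaled)

print(sorted(set(labels_3.tolist())))  # label values used by the 3-cluster model
print(sorted(set(labels_4.tolist())))  # label values used by the 4-cluster model
```

The label values themselves are arbitrary identifiers, so cluster 0 in the 3-cluster model does not necessarily correspond to cluster 0 in the 4-cluster model.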
I added the cluster information back to the original dataset for further analysis. This allowed me to understand which cluster each data point belonged to, providing insights into the distinct groups or patterns within the data.
To provide a glimpse of the results, I displayed the first few rows of each dataset with cluster information. This inspection offers a preliminary view of how data points are distributed among clusters.
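Attaching the labels to the original data and previewing the result could be sketched as follows. The DataFrame, the `feature_a` / `feature_b` columns, and the `Cluster` column name are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the real dataset and its feature names are not shown.
df = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022],
    "Month": [1, 5, 9, 1, 5, 9, 1, 5],
    "feature_a": [10.0, 11.0, 30.0, 31.0, 50.0, 52.0, 70.0, 71.5],
    "feature_b": [5.0, 6.0, 40.0, 42.0, 8.0, 7.0, 45.0, 44.0],
})
feature_cols = [c for c in df.columns if c not in ("Year", "Month")]
X_scaled = StandardScaler().fit_transform(df[feature_cols])

# Work on copies so the original DataFrame stays unchanged.
df_3 = df.copy()
df_3["Cluster"] = KMeans(
    n_clusters=3, init="k-means++", n_init=10, random_state=42
).fit_predict(X_scaled)

df_4 = df.copy()
df_4["Cluster"] = KMeans(
    n_clusters=4, init="k-means++", n_init=10, random_state=42
).fit_predict(X_scaled)

print(df_3.head())  # first five rows with 3-cluster labels
print(df_4.head())  # first five rows with 4-cluster labels
```

Since `head()` only shows the first rows, `df_3["Cluster"].value_counts()` gives a fuller picture of how many data points fall into each cluster.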
Clustering analysis helps identify inherent structures or groups within the dataset, enabling a deeper understanding of data patterns and trends. It can be a valuable tool for segmentation and decision-making in various domains.