Project 4: Clustering - 11/20/22

Cherny Devireddy
Nov 21, 2022

Problem:

I want to find out how countries with a higher happiness score differ from those with a lower score. If specific features are associated with a higher score, identifying them could show lower-scoring countries where to focus in order to raise their own score. I will use clustering to group countries by these features and see how the groups relate to a higher or lower happiness score.

Clustering and How it Works:

The method I will be using is k-means clustering. In a general sense, clustering groups data points by similarity, usually measured as the distance between them.

K-means clustering works by using centroids to group data points together. At first, the centroids are chosen randomly and each data point is assigned to its nearest centroid. Each centroid is then moved to the mean of the points assigned to it, which gives a better grouping than the first pass. This assign-and-update process is repeated until the data points no longer switch clusters.

Agglomerative clustering works by repeatedly merging the closest points (and then the closest groups of points), which builds a hierarchical clustering model. Merging continues until the desired number of clusters remains. The groupings can be inspected with a dendrogram, which records the order in which clusters were merged.
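The k-means loop described above can be sketched in a few lines of plain Python. This is only an illustration of the assign-and-update idea, not the code from the project notebook; the function names, the `seed` and `max_iters` parameters, and the toy 2-D points are all assumptions made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (labels, centroids), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Because the starting centroids are random, the cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.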

Data:

This is the Kaggle notebook I will be referencing:

https://www.kaggle.com/code/avnika22/world-happiness-report-eda-clustering

The data I will be using is from the 2019 World Happiness Report. The dataset includes GDP per capita (the sum of gross value added by all resident producers in the economy plus any product taxes not included in the valuation of output, divided by midyear population), social support, generosity, and other features.

Data Understanding & Visualization:

To understand the data, I used a pair plot, a scatterplot, and a heatmap to see how the overall rank correlated with the other features in the dataset, as well as how individual features related to the rest of the dataset. Since the original problem was to figure out which features had the highest correlation with the overall rank, I used the Kaggle notebook as a guide and split the countries in different ways. I also picked out the features that looked like they had the highest correlation. This helped me visualize the data and later model it based on what I found.

I also visualized the features separately because some of their values were too small to show up in the combined visualization.
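A heatmap of this kind is built from pairwise correlation values. As a rough sketch of what those numbers mean, the snippet below computes the Pearson correlation between the score and two features by hand. The rows are toy values invented for this example, not figures from the report, and the column names are assumed to match the dataset's headers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy rows (hypothetical values, NOT taken from the 2019 report).
rows = [
    {"Score": 7.5, "GDP per capita": 1.40, "Generosity": 0.20},
    {"Score": 6.8, "GDP per capita": 1.30, "Generosity": 0.10},
    {"Score": 5.9, "GDP per capita": 1.05, "Generosity": 0.25},
    {"Score": 5.1, "GDP per capita": 0.90, "Generosity": 0.12},
    {"Score": 4.2, "GDP per capita": 0.60, "Generosity": 0.22},
    {"Score": 3.4, "GDP per capita": 0.35, "Generosity": 0.15},
]

scores = [row["Score"] for row in rows]
correlations = {
    feature: pearson([row[feature] for row in rows], scores)
    for feature in ("GDP per capita", "Generosity")
}

for feature, r in correlations.items():
    print(f"{feature}: {r:.2f}")
```

A value near 1 or -1 means the feature moves closely with the score, while a value near 0 means there is little linear relationship in the sample.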

Pre-processing the data:

I wanted to remove some features/columns so that I could base my modeling on the features that looked most important in my visualizations. I dropped freedom to make life choices and generosity because I saw no clear relationship between them and whether a country had a high, mid, or low score.
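Dropping those two columns could look something like the sketch below. It works on a list of dictionaries, as you would get from `csv.DictReader`, rather than on whatever data structure the notebook actually uses. The column names are assumed to match the 2019 CSV headers, and the rows are placeholders, not real data.

```python
# Columns to drop (names assumed to match the dataset's headers).
DROPPED = {"Freedom to make life choices", "Generosity"}


def drop_columns(rows, dropped):
    """Return copies of the rows without the columns named in `dropped`."""
    return [
        {name: value for name, value in row.items() if name not in dropped}
        for row in rows
    ]


# Placeholder rows (hypothetical values, NOT taken from the report).
rows = [
    {"Country or region": "Country A", "Score": 7.0, "Social support": 1.5,
     "Freedom to make life choices": 0.6, "Generosity": 0.2},
    {"Country or region": "Country B", "Score": 4.0, "Social support": 0.9,
     "Freedom to make life choices": 0.5, "Generosity": 0.2},
]

reduced = drop_columns(rows, DROPPED)
print(sorted(reduced[0]))
```

The original rows are left untouched, so the dropped columns are still available if they are needed for a later comparison.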

Modeling (Clustering):

I decided to use k-means clustering because it splits the data into clusters based on similar features, or the lack thereof. I did not end up using agglomerative clustering because I wasn't trying to merge my data points step by step into a hierarchy and read the result off a dendrogram. Agglomerative clustering would not have given me the result I was looking for.

Storytelling (Clustering Analysis):

What I found from k-means modeling is that with 3 clusters, my data points were grouped the way I expected. I checked this by choosing 2 data points from each category during my data visualization; after clustering, those data points ended up divided into the matching clusters.
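A spot check like this one can be written down as a small helper. The sketch below assumes the cluster assignments are available as a dictionary from country name to cluster number, and that the hand-picked reference points are grouped by category; the function name and the placeholder countries are made up for illustration.

```python
def check_reference_points(labels, references):
    """Check hand-picked reference points against cluster assignments.

    labels:     dict mapping country name -> cluster id
    references: dict mapping category -> list of country names

    Returns True only if every category's countries share one cluster
    and no two categories share the same cluster.
    """
    cluster_of_category = {}
    for category, countries in references.items():
        clusters = {labels[country] for country in countries}
        if len(clusters) != 1:
            return False  # picks from one category were split up
        cluster_of_category[category] = clusters.pop()
    # Each category must map to its own distinct cluster.
    return len(set(cluster_of_category.values())) == len(references)


# Placeholder names and labels (hypothetical, not results from the project).
labels = {
    "Country A": 2, "Country B": 2,
    "Country C": 0, "Country D": 0,
    "Country E": 1, "Country F": 1,
}
references = {
    "high": ["Country A", "Country B"],
    "mid": ["Country C", "Country D"],
    "low": ["Country E", "Country F"],
}

print(check_reference_points(labels, references))
```

With only 2 points per category this is a sanity check rather than a measure of accuracy, but it does catch the case where a cluster mixes categories.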

I was able to answer my original question of which factors play a role in a country's overall happiness score. The three major factors I found were social support, GDP per capita, and healthy life expectancy. Perceptions of corruption could also affect the happiness score, since the higher-ranked and lower-ranked countries differed widely in those values. Two features that did not really correlate with the happiness score were generosity and freedom to make life choices.

Impact:

The goal of this project was to find the features that had the most impact on the overall happiness score. This matters because it can help the countries that fell into the mid or low categories. It is also important to understand what a specific country can improve on so that its citizens can live better lives.

References:

Code:

https://github.com/ChernyDevireddy1/ITCS3162/blob/main/Project%204%20-%20Clustering.ipynb
