
My Work

Restaurant Recommendation System

Group Project


Introduction

The problem we are trying to solve is that we want to help people efficiently pick a restaurant based on their interests, such as taste or mood. We believe that a recommendation system that suggests restaurants in Philadelphia would save people a lot of time and energy. Picking a place to eat can be challenging since there are so many different restaurants in the Philadelphia area. This being the case, a utility such as this restaurant recommendation system can help with several aspects of convenience, such as cost and time efficiency, distance, and, as previously stated, energy. The questions to be answered through this project are: can this be accomplished by building a recommendation system, and what is needed in terms of modeling to achieve that goal?


Data

The data we are pulling is from a website called Yelp. Yelp is an online platform that allows users to rate and review restaurants; their reviews are made public so other users can read them and make judgments. Yelp provides data from their service through the “Yelp Open Dataset”. We are using filtered data that is specific to large metropolitan areas; this filtered data was obtained from GitHub user unclebrod. When users on Yelp review a restaurant, they give a 1–5 star rating that is then averaged for the restaurant. We plan to use this average and the number of reviews to do our analysis. This data is already in a simple-to-use format, so no extensive cleaning or scraping will need to be done.


Methods

Pre-processing:

  • The Yelp dataset was already heavily cleaned and almost all of the data was relevant to our application.

  • First, we filtered the data to include only restaurants in Philadelphia, PA.

  • Then we kept only the restaurant's name, star rating, location (latitude & longitude), city, and categories (type of restaurant); a minimal filtering sketch follows this list.

  • Using the user's latitude & longitude, we can calculate the distance between the user and each restaurant, which will be a factor in our recommendation.
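
A minimal sketch of the filtering described above, assuming the Yelp business data is loaded with pandas (the file name is a placeholder, and the column names follow the Yelp Open Dataset, so they may need adjusting):

```python
import pandas as pd

# Placeholder file name; the Yelp Open Dataset business file is newline-delimited JSON.
businesses = pd.read_json("yelp_academic_dataset_business.json", lines=True)

# Keep only Philadelphia businesses and the columns we plan to use.
philly = businesses[businesses["city"] == "Philadelphia"]
philly = philly[["name", "stars", "review_count", "latitude", "longitude", "city", "categories"]]
```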

Visualizations:

  • We wanted to visualize the data to get a better understanding, creating bar charts to see distributions and lists to see the restaurants in an orderly manner.






Stars: Plotting the position of the restaurants over a map of Philadelphia, we can get a good idea of where most of the restaurants are and their associated rating.
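
A minimal sketch of this kind of plot, reusing the `philly` DataFrame from the pre-processing sketch (no basemap is drawn here; the points are simply plotted by longitude and latitude and colored by their star rating):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
points = plt.scatter(philly["longitude"], philly["latitude"],
                     c=philly["stars"], cmap="viridis", s=10, alpha=0.6)
plt.colorbar(points, label="Stars")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Philadelphia restaurants by star rating")
plt.show()
```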



Distance: Since we have the latitude and longitude of each restaurant, we can use that to calculate the distance between the user and each restaurant. The closer the restaurant is to the user, the more likely it will be to show up higher on the list of recommendations.
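
A minimal sketch of that calculation using the haversine formula, again reusing the `philly` DataFrame; the user coordinates below are just a placeholder near Center City:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~3958.8 miles

# Placeholder user location (Center City Philadelphia).
user_lat, user_lon = 39.9526, -75.1652
philly["distance_miles"] = haversine_miles(user_lat, user_lon,
                                           philly["latitude"], philly["longitude"])
```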


Modeling

We decided to model using K-means clustering to see what kinds of features of specific restaurants people tend to enjoy, which gives us insight into their food preferences.
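
A minimal sketch of the clustering step, assuming the numeric features built in the earlier sketches (`stars`, `review_count`, `distance_miles`) and the five clusters we settled on below:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = philly[["stars", "review_count", "distance_miles"]].dropna()

# Standardize so large review counts don't dominate the distance metric.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
philly.loc[features.index, "cluster"] = kmeans.fit_predict(scaled)
```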


Based on our model, we found that there is no correlation between the number of reviews a restaurant has and how far away it is. You might expect people to review restaurants in closer proximity to them more often, but it turns out that is not the case.


We also saw that the model shows a correlation between stars and review count: the highest ratings were found when the review count was over 2,000.

Our final cluster after choosing 5 clusters:


Evaluation

Although our main model was clustering, we also tried the regression approach from project 3 and evaluated it with regression metrics rather than classification metrics such as accuracy or F1 score. Unfortunately, the R-squared value for our regression model is dismally low, which may have something to do with the lack of features and the subjectivity of the data.

Accurate to Yelp’s review:

Evaluating performance for this model is difficult because its predictions are subjective and depend on many different variables, such as where the ‘map’ is drawn when looking at Google or Yelp. However, our model consistently gave highly reviewed, highly rated restaurants, all within a reasonable distance from the user.


Storytelling and Conclusion

Our initial goal was to help people effectively choose a restaurant based on their interests in taste or mood. We didn't fully achieve that goal because we weren't able to utilize the categories of the restaurants (Mexican, fast food, Chinese, American, etc.) to see what interests someone had in their food and restaurant choices. Future steps we could improve on are implementing the food categories successfully within our project, as well as incorporating the distance and reviews in relation to the categories. Another insight we gained throughout the project is how difficult it is to make use of the different categories; without them, we were not able to complete our initial goal. Lastly, we tried using a regression model to make predictions, but the results we obtained were very poor. K-means clustering gave us the best results and was very closely aligned with what we saw on Yelp and Google. If we had a little more time to work on the project, we think a Random Forest Classifier would work even better.


Impact

There are numerous ways that our project could impact communities and users alike. Primarily, we would be impacting restaurants and other small businesses through the recommendations that our models make. Depending on the popularity of our model, there could be a significant change in the number of customers a restaurant sees, whether that is an increase or a decrease. Our project would also impact the users of the restaurant recommender system, with potential for both positive and negative effects. A positive impact would be our model recommending a restaurant that the user enjoys and that is also near them. A potential negative impact would be our model recommending a restaurant that is far away and/or one that they will not enjoy.

We must weigh the ethical concerns of implementing a model like this because of its potential to harm or negatively affect a business that does not deserve it. There is potential for our model to unfairly affect a business by not recommending it as often as it should, which could decrease the number of customers it receives and therefore its revenue. To prevent this ethical harm, we must do extensive testing of our model before making it public.




Code + Data + References

References:

Problem:

I want to solve the problem of figuring out how countries with a higher happiness score differ from those with a lower score. If there are specific features that enable countries to have a higher score, then identifying those features might enable lower-scoring countries to implement them and raise their own scores. I will use clustering to see how the features break down and how they affect whether a country has a higher or lower happiness score.


Clustering and How it Works:

The clustering that I will be using is k-means clustering. In a general sense, clustering is grouping data points by their similarities and distance.

  • K-means clustering works by using centroids to group data points together. At first, the centroids are chosen randomly; then the centroids are adjusted based on the data points to get a better grouping after each pass. This process is repeated until data points no longer switch clusters.

  • Agglomerative clustering works by combining the closest points together, creating a hierarchical clustering model. The closest points keep merging until the desired number of clusters is reached. The groupings can be inspected using the dendrogram that is built to keep track of the merges. (A minimal sketch of both methods follows this list.)
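
A minimal sketch of both methods on small synthetic data, just to illustrate the API calls (the points here are made up and not from either project):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
# Three loose groups of 2-D points.
points = np.vstack([rng.normal(center, 0.5, size=(30, 2))
                    for center in ([0, 0], [4, 4], [0, 5])])

# K-means: centroids are refined iteratively until points stop switching clusters.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Agglomerative: the closest points/clusters keep merging until 3 clusters remain.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(points)

# Dendrogram showing the merge history.
dendrogram(linkage(points, method="ward"))
plt.show()
```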


Data:

This is the notebook I will be referencing from kaggle:

The data I will be using is from the 2019 World Happiness Report. The dataset itself has information about GDP per capita (the sum of gross value added by all resident producers in the economy plus any product taxes not included in the valuation of output, divided by midyear population), social support, generosity, etc…


Data Understanding & Visualization:

In order to understand the data, I used a pair plot, a scatterplot, and a heatmap to see how the overall rank correlated with the other features in the dataset, as well as how specific features correlated with each other. Since the original problem was to figure out which features had the highest correlation with the overall rank, I utilized the Kaggle resource I found and was able to split the countries in different ways. I also chose the features that looked like they had the highest correlation. This really helped me visualize the data and eventually model it based on my findings later on.
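
A minimal sketch of these visualizations, assuming the 2019 report has been loaded into a DataFrame (the file and column names follow the Kaggle CSV and may need adjusting):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("2019.csv")  # 2019 World Happiness Report (file name assumed)

# Pair plot of the score against a few candidate features.
sns.pairplot(df[["Score", "GDP per capita", "Social support", "Healthy life expectancy"]])
plt.show()

# Heatmap of correlations between all numeric columns.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```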



I also decided to visualize the features separately because some of them were too small to show up on the combined visualization.




Pre-processing the data:

I wanted to get rid of some features/columns so that I could base my modeling on the features that I found to be most important in my visualizations. I got rid of freedom to make life choices and generosity because I felt they showed no correlation with the high, mid, or low scores.
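
A minimal sketch of this step, continuing from the DataFrame loaded in the visualization sketch (column names follow the Kaggle CSV):

```python
# Drop the two features that showed no clear correlation with the happiness score.
df = df.drop(columns=["Freedom to make life choices", "Generosity"])
```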


Modeling (Clustering):

I decided to use K-means clustering because it helps me split the data into clusters based on similar features (or a lack thereof). I did not end up using agglomerative clustering because I wasn't trying to merge all my data points together and create a dendrogram; agglomerative clustering would not have given me the result I was looking for.
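
A minimal sketch of the clustering, continuing from the pre-processed DataFrame; three clusters matches the grouping described below:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = df[["GDP per capita", "Social support",
               "Healthy life expectancy", "Perceptions of corruption"]]

# Standardize the features so they contribute on a comparable scale.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(scaled)
```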



Storytelling (Clustering Analysis):

What I found from k-means modeling is that when I used 3 clusters, my data points were accurately clustered. I double-checked this because, during my data visualization, I had chosen 2 data points from each category, and after clustering, those data points ended up accurately divided into the corresponding clusters.

I was able to answer my original question of what factors play a role in the overall score of a country's happiness. Three major factors I found were social support, GDP per capita, and healthy life expectancy. One factor that could also be an impact on the happiness score is perceptions of corruption since the higher-ranked countries and the lower-ranked countries had a big difference in those values. Two features that did not really correlate with the happiness score are generosity and freedom to make life choices.


Impact:

The impact of this project is to help find the features that have the most influence on the overall happiness score. It is important to figure this out because it can really help the countries that fall into the mid or low categories. It is also important to understand what a specific country can improve on so that it can become better and its citizens can live better lives.


References:


Code:


Problem and Dataset

The problem that I am trying to solve in this project is to see if there is any correlation between house prices and the year in which they are sold. In order to successfully draw a conclusion, I have found a dataset that includes price, year sold, and many other variables that could be helpful in predicting future house prices.

Below is the dataset I chose:


Regression and How it works

Regression is used to find the relationship between variables. It is the “process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set we have, with the belief that those outputs would fall on the line” (Source). A problem that linear regression can solve is figuring out which stocks to invest in. Linear regression is used to handle regression problems and provides a continuous output (Source). There are three methods to evaluate regression models. One method is mean squared error, which works by measuring how close the data points are to the regression line: “It is a risk function corresponding to the expected value of the squared error loss. Mean square error is calculated by taking the average, specifically the mean, of errors squared from data as it relates to a function” (Source).
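
As a small worked illustration of mean squared error (the numbers here are made up, not from the housing dataset):

```python
import numpy as np

actual = np.array([200_000, 250_000, 300_000])     # true prices
predicted = np.array([210_000, 240_000, 310_000])  # model's predictions

# MSE = mean of the squared differences between actual and predicted values.
mse = np.mean((actual - predicted) ** 2)
print(mse)  # 100000000.0 -> each prediction is off by 10,000, squared and averaged
```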


Experiment 1: Data understanding

Before diving into the project, some pre-processing steps need to be taken in order to gain an understanding of the data, such as the following (a minimal sketch appears after the list):

  • Eliminating null values

  • Getting rid of irrelevant data/columns

  • Standardizing the data types and column names

  • Resetting/adjusting the indexes to complete pre-processing
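
A minimal sketch of these steps on a generic housing DataFrame (the file name and column names are placeholders, not the actual dataset's):

```python
import pandas as pd

df = pd.read_csv("house_prices.csv")              # placeholder file name

df.columns = df.columns.str.strip().str.lower()   # standardize column names
df = df.dropna()                                  # eliminate null values
df = df[["state", "year", "price"]]               # keep only the relevant columns
df = df.reset_index(drop=True)                    # reset the index after filtering
```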

In order to check if there are already any existing patterns, it is important to create simple visualizations to display the data. Some example visualizations that I plan on using are:

  • Line graph

  • Pair plot

Experiment 1: Pre-processing

During the data understanding, I found that some states were more affected by COVID than others, and some states recovered faster than others. I also realized that my visualization might not have shown a greater increase or recovery after COVID because there is a lack of data, since it is still a very recent event. There was also a dip during the 2008 recession, when prices went down and the average was lower than in other years. For the first experiment, I want to train and test the dataset based on the price and years of Massachusetts because it shows both of these events and their recovery.


Experiment 1: Modeling

I created a linear regression model by splitting my data into training data and testing data.

Experiment 1: Evaluation

To evaluate my model I used mean absolute error, mean square error, and root mean square error.
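
A minimal sketch of the modeling and evaluation for Experiment 1, continuing from the placeholder DataFrame in the pre-processing sketch and filtering to Massachusetts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

ma = df[df["state"] == "MA"]                      # placeholder state label
X, y = ma[["year"]], ma["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print(mae, mse, rmse)
```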


Experiment 2

For the second experiment, I want to use the average price per year, disregarding the specific states. This is because, during Experiment 1, I noticed that my model was not behaving the way I intended it to and was not displaying the regression line in a way that shows price increases and decreases over the years.

During my data-understanding step, I created a variable called df_avg that takes the average price per year. This is what I will use to build my linear regression model for this experiment.
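
A minimal sketch of building df_avg, continuing from the placeholder DataFrame used earlier:

```python
# Average price per year across all states.
df_avg = df.groupby("year", as_index=False)["price"].mean()
```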

I modeled, visualized, and evaluated my second experiment. The results show that my numbers look a lot more normal. For example, my model had an R-squared value of 0.58, meaning that 58% of my data was explained by the model, whereas in Experiment 1 only 0.6% was explained. Additionally, during my evaluation, I noticed a difference in the mean absolute error, mean square error, and root mean square error, because the numbers looked much more reasonable in Experiment 2.



Experiment 3

For Experiment 3, I decided to do some further pre-processing of the data. Since I used the year as the target, my visualization looked quite odd, with house prices going down as the years progress, which meant I needed to remove some outliers. I ended up removing 2000 because it was an outlier, 2008 because of the recession, and 2020–2022 because of COVID and because 2022 doesn't have enough data since the year has not ended yet.

I also wanted to try a different model to show a better visualization. I chose polynomial regression because I didn't just want a straight line; I wanted something that would show the big dips and big peaks. This led to a more accurate model because (1) I removed outliers and (2) I used a different regression model.
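
A minimal sketch of the polynomial fit on the yearly averages, with the outlier years dropped first (degree 3 is an arbitrary illustrative choice, continuing from the df_avg sketch above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Drop the outlier years: 2000, the 2008 recession, and 2020-2022 (COVID / incomplete data).
outlier_years = [2000, 2008, 2020, 2021, 2022]
trimmed = df_avg[~df_avg["year"].isin(outlier_years)]

X, y = trimmed[["year"]], trimmed["price"]

# Expand the single year feature into polynomial terms, then fit a linear model on them.
X_poly = PolynomialFeatures(degree=3).fit_transform(X)
poly_model = LinearRegression().fit(X_poly, y)
```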

During my evaluation, I found the mean absolute error, mean square error, and root mean square error to also be more accurate, with less error than the previous experiment in all three metrics.



Impact

The impact of this project could be quite significant in providing insight into how the general housing market is experiencing shifts (or a lack thereof). It could be helpful to predict what housing prices will look like by analyzing the trends of the regression model, similar to how stocks are predicted. This project has the positive impacts mentioned above, but it can also have negative ones. One negative impact could be that if the regression model is too simplified, has inaccurate data, or is based on too little data, then it will not accurately depict the regression of the housing market.


Conclusion

Throughout this project, I learned how to build regression models with data through pre-processing, visualizations, and evaluation methods. During the first experiment, I realized through the evaluation methods I used that my numbers were totally off. I also figured out that I needed to remove a feature from my data because it was messing up my visualizations. In my second experiment, I eliminated the states feature by calculating the mean prices per year, which combined all the states' info into one variable. This helped tremendously with the model, and I had more reasonable numbers during my evaluation. For the third experiment, I saw that I needed to remove some of the outliers in my data and try a model that wasn't just a straight line. This gave me a better understanding of my data and produced a more accurate model. Overall, I am happy with my results and the models that I was able to make.


References


Code

