Project 3: Regression - 10/23/22

Cherny Devireddy
Oct 23, 2022

Problem and Dataset

In this project, I want to see whether there is any correlation between house prices and the year in which the houses are sold. To draw a conclusion, I found a dataset that includes price, year sold, and many other variables that could be helpful in predicting future house prices.

Below is the dataset I chose:

https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

Regression and How it works

Regression is used to find the relationship between variables. It is the “process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set we have, with the belief that those outputs would fall on the line” (Source). One problem that linear regression can be applied to is figuring out which stocks to invest in. Linear regression is used to handle regression problems and provides a continuous output (Source). Three common metrics for evaluating regression models are mean absolute error, mean squared error, and root mean squared error. Mean squared error measures how close the data points are to the regression line. “It is a risk function corresponding to the expected value of the squared error loss. Mean square error is calculated by taking the average, specifically the mean, of errors squared from data as it relates to a function” (Source).
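For reference, the mean squared error described above can be written as a formula. The symbols are chosen here for illustration: n is the number of observations, y_i is an observed value, and \hat{y}_i is the value the regression line predicts for it.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```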

Experiment 1: Data understanding

Before diving into the project, I need to take some pre-processing steps in order to gain an understanding of the data:

Eliminating null values
Getting rid of irrelevant data/columns
Standardizing the data types and column names
Resetting/adjusting the indexes to complete pre-processing
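The steps above can be sketched with pandas roughly as follows. This is a minimal illustration on a tiny made-up table, not the notebook's actual code: the column names ("Price", "Sold Date", "Street") and the helper name preprocess are assumptions made for the example.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (column names are assumptions).
raw = pd.DataFrame({
    "Price": [250000.0, None, 410000.0, 325000.0],
    "State": ["Massachusetts", "Massachusetts", "New York", "Massachusetts"],
    "Sold Date": ["2008-06-01", "2015-03-12", None, "2021-09-30"],
    "Street": ["1 Main St", "2 Oak Ave", "3 Elm Rd", "4 Pine Ln"],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Standardize column names: lowercase with underscores.
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Drop a column that is not needed for the price/year question.
    df = df.drop(columns=["street"])
    # Eliminate rows with null values in the columns we rely on.
    df = df.dropna(subset=["price", "sold_date"])
    # Standardize data types and derive the year sold.
    df = df.assign(year=pd.to_datetime(df["sold_date"]).dt.year.astype(int))
    # Reset the index so it is contiguous again after dropping rows.
    return df.reset_index(drop=True)

clean = preprocess(raw)
print(clean)
```

Keeping the steps in one function makes it easy to rerun the same cleaning whenever the raw data is reloaded.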

To check whether the data already shows any patterns, it is important to create simple visualizations. Some example visualizations that I plan on using are:

Line graph
Pair plot

Experiment 1: Pre-processing

During data understanding, I found that some states were more affected by covid than others and that some states recovered faster than others. I also realized that my visualization might not show a larger increase or recovery after covid because there is a lack of data; it is still a very recent event. There was also a dip around the 2008 recession, when prices went down and the average price was lower than in other years. For the first experiment, I want to train and test on the prices and years for Massachusetts because that state shows both of these events and their recoveries.

Experiment 1: Modeling

I created a linear regression model by splitting my data into training data and testing data.
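As a rough sketch of what splitting the data and fitting a straight line involves, the example below uses NumPy's least-squares fit on synthetic numbers. It is an illustration under assumptions, not the notebook's code: the data is generated, the 80/20 split ratio is assumed, and a library such as scikit-learn could be used for the same steps.

```python
import numpy as np

# Synthetic (year, price) pairs for illustration only; not taken from the dataset.
rng = np.random.default_rng(0)
years = np.arange(2001, 2021, dtype=float)
prices = 200_000 + 8_000 * (years - 2001) + rng.normal(0, 5_000, size=years.size)

# Shuffle the row indices, then hold out 20% of the rows for testing (assumed ratio).
indices = rng.permutation(years.size)
n_test = int(0.2 * years.size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Centre the years so the least-squares fit is numerically well conditioned.
year_offset = years[train_idx].mean()
slope, intercept = np.polyfit(years[train_idx] - year_offset, prices[train_idx], deg=1)

# Predict prices for the held-out years using the fitted line.
predicted = slope * (years[test_idx] - year_offset) + intercept
print(f"slope per year: {slope:.0f}")
```

The model only ever sees the training rows, so the predictions for the held-out years give an honest picture of how well the line generalizes.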

Experiment 1: Evaluation

To evaluate my model, I used mean absolute error, mean squared error, and root mean squared error.
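To make the three metrics concrete, here is a small plain-Python sketch that computes them from scratch. The function name regression_errors and the sample numbers are made up for illustration; they are not results from this project.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)   # mean absolute error
    mse = sum(r * r for r in residuals) / len(residuals)    # mean squared error
    rmse = math.sqrt(mse)                                    # root mean squared error
    return mae, mse, rmse

# Made-up prices (in thousands of dollars), purely for illustration.
actual = [300.0, 320.0, 350.0, 400.0]
predicted = [310.0, 315.0, 340.0, 420.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize a single large miss more heavily than MAE does, and RMSE has the advantage of being in the same units as the price.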

Experiment 2

For the second experiment, I want to use the average of the price per year disregarding the specific states. This is because, during experiment 1, I noticed that my model was not behaving the way I intended it to. Also, it was not displaying the regression line in a way that shows price increases and decreases over the years.

During my data-understanding step, I created a variable called df_avg that holds the average price per year. This is what I will use to build my linear regression model for this experiment.
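A yearly average like df_avg can be built with a pandas group-by. The sketch below is an assumption about how that might look, using made-up listings and assumed column names ("year", "price"); it is not copied from the notebook.

```python
import pandas as pd

# Made-up listings; the column names "year" and "price" are assumptions.
listings = pd.DataFrame({
    "state": ["Massachusetts", "New York", "Massachusetts", "New Jersey"],
    "year": [2019, 2019, 2020, 2020],
    "price": [400000.0, 600000.0, 450000.0, 350000.0],
})

# Average price per year across all states: one row per year.
df_avg = listings.groupby("year", as_index=False)["price"].mean()
print(df_avg)
```

Grouping by year alone is what removes the state from the picture: every state's listings for a given year are folded into a single mean.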

I modeled, visualized, and evaluated my second experiment. The results show that my numbers look a lot more normal. For example, the R-squared value of my model was .58, meaning that about 58% of the variation in the data was explained by my model, whereas in Experiment 1 only .6% was explained. Additionally, during my evaluation, the mean absolute error, mean squared error, and root mean squared error all looked far more reasonable in Experiment 2.
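For readers unfamiliar with R-squared, the following plain-Python sketch shows how the value is computed: one minus the ratio of the model's squared error to the squared error of always predicting the mean. The function name r_squared and the numbers are illustrative only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of a baseline that always predicts the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up values, purely for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r_squared(actual, predicted)
print(score)
```

A score near 1 means the model does much better than the mean baseline, while a score near 0 means it does barely better than predicting the average every time.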

Experiment 3

For experiment 3, I decided to do some further pre-processing of the data. Because I used Year as the target, my visualization looks quite odd: the house price appears to go down as the years progress, which means I might need to remove some outliers. I ended up removing 2000 because it was an outlier, 2008 because of the recession, and 2020-2022 because of covid; 2022 also doesn't have enough data since the year has not ended yet.

I also wanted to try a different model to get a better visualization. I chose polynomial regression because I didn't just want a straight line; I wanted something that would show the big dips and big peaks. This led to a more accurate model for two reasons: I removed outliers, and I used a different regression model.
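The sketch below illustrates the two changes in this experiment: dropping the excluded years and fitting a curve instead of a straight line. It uses NumPy's polyfit on synthetic yearly averages; the data, the polynomial degree of 2, and the variable names are all assumptions for illustration, not values from the notebook.

```python
import numpy as np

# Synthetic yearly averages for illustration; not values from the dataset.
years = np.arange(2000, 2023)
avg_price = 300_000 + 4_000 * (years - 2000) - 150 * (years - 2011) ** 2

# Drop the years excluded in this experiment: 2000, 2008, and 2020-2022.
excluded = [2000, 2008, 2020, 2021, 2022]
keep = ~np.isin(years, excluded)
x, y = years[keep].astype(float), avg_price[keep].astype(float)

# Centre the years before fitting to keep the polynomial fit well conditioned.
x_centred = x - x.mean()
degree = 2  # assumed degree; higher degrees bend more but can overfit
coeffs = np.polyfit(x_centred, y, deg=degree)

# Evaluate the fitted curve at the same years.
fitted = np.polyval(coeffs, x_centred)
print(coeffs)
```

A higher degree lets the curve follow more dips and peaks, but with only a handful of yearly points it can also start chasing noise, so the degree is worth choosing conservatively.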

During my evaluation, I found that the mean absolute error, mean squared error, and root mean squared error were all lower than in the previous experiment.

Impact

This project could provide useful insight into how the general housing market is shifting, or failing to shift. Analyzing the trends of the regression model could help predict what housing prices will look like in the future, which is similar to how stock prices are predicted. Along with these positive impacts, the project could also have negative ones. For example, if the regression model is too simplified, or is built on inaccurate data or too little data, it will not accurately depict the trends of the housing market.

Conclusion

Throughout this project, I learned how to build regression models with data through pre-processing, visualizations, and evaluation methods. During the first experiment, the evaluation methods I used showed me that my numbers were totally off. I also figured out that I needed to remove a feature from my data because it was messing up my visualizations. In my second experiment, I eliminated the state feature by calculating the mean price for each year, which combined all the states' information into one variable. This helped tremendously with the model, and I had more reasonable numbers during my evaluation. For the third experiment, I saw that I needed to remove some of the outliers in my data and try a model that wasn't just a straight line. This gave me a better understanding of my data and produced a more accurate model. Overall, I am happy with my results and the models that I was able to make.

References

Code

https://github.com/ChernyDevireddy1/ITCS3162/blob/main/Project%203.ipynb
