
My Work

Problem:

The overproduction of garbage is a worldwide problem. Although we try to mitigate it with recycling and composting, it is often hard to sort through all the trash and tell what is recyclable and what is compostable. This is especially hard in a landfill, for example. To help solve this problem, I developed, trained, and tested a model that determines whether an image of waste shows organic or recyclable material.


Data:

I found my data on Kaggle. This is the link:

This dataset contains images labeled as either organic or recyclable, which can be sorted with the developed model.


Pre-processing the data:

A lot of the pre-processing involved importing the images into Jupyter Notebook. I had trouble reading zip files, for example extracting them so I could read the images by their file names. I also had trouble locating the base path where my data was stored. After I realized there was a way to pull the data directly from Kaggle, I faced another issue where I couldn't properly read in the location of the Kaggle data. I used this notebook as a reference for processing the data: https://www.kaggle.com/code/beyzanks/waste-classification-with-cnn
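
The extraction step can be sketched roughly as below, assuming the Kaggle archive has been downloaded locally (the path and folder layout here are illustrative, not the dataset's exact structure):

```python
import zipfile
from pathlib import Path

def extract_images(zip_path, dest_dir):
    """Extract an image archive and return the extracted file paths,
    so each image can then be read by its file name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Collect every extracted file, recursing into class subfolders
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

With the files extracted once, later cells can build the list of image paths from `dest_dir` instead of re-reading the zip.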


Data Understanding/Visualization:

To understand the data, I added a label to each image and counted how many images fall into each category (organic and recyclable). To visualize this, I used a pie chart showing the size of each of the two categories; it revealed that there are more organic images than recyclable images. I then applied the labels to a random group of nine images to confirm that my visualizations were working and that the data was being processed correctly.
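
The counting step behind the pie chart amounts to tallying labels and converting them to percentage shares; a minimal sketch (the helper name is my own, not from the original notebook):

```python
from collections import Counter

def category_shares(labels):
    """Count images per category and compute the percentage shares
    that a pie chart would display."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}
    return counts, shares
```

The `shares` dict can be passed straight to a plotting call as the slice sizes, with `counts.keys()` as the slice labels.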


Modeling:

First, I used an image data generator to apply random transformations (augmentation) to the training images, which helps the model generalize. I then trained and evaluated the model with fit_generator, which gave an accuracy of 0.83 on my dataset, a fairly decent score. After that, I displayed the training results as a histogram.
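
Conceptually, the generator's random transformations look like the toy NumPy sketch below (this illustrates the idea only, not Keras internals; the rescale factor and flip probability mirror common generator settings, not necessarily the ones used here):

```python
import numpy as np

def augment(image, rng):
    """Toy version of what an image data generator does per image:
    rescale pixel values to [0, 1] and flip horizontally half the time."""
    out = image.astype("float32") / 255.0   # like rescale=1./255
    if rng.random() < 0.5:
        out = out[:, ::-1, :]               # random horizontal flip
    return out
```

Because each epoch sees differently transformed copies of the same images, the model is less likely to memorize exact pixel layouts.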


Evaluation:

My model has an accuracy of 0.83, or about 83%. To test it, I had the model display the predicted label and the actual label for each image so I could manually check the accuracy. In this process I found the model was generally predicting correctly whether images were organic or recyclable; for example, 12 of the 15 images I displayed (80%) were labeled correctly.
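
The manual check above reduces to comparing the two label lists, which can be sketched as:

```python
def manual_accuracy(predicted, actual):
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For the spot check described above, 12 matches out of 15 displayed images gives 0.8, close to the model's overall 0.83.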


The Story:

Throughout this project, I learned that it is important to pick issues you are passionate about solving. I try to be environmentally conscious when possible, but when it is not possible I feel pressure that I am hurting the planet and causing more garbage to end up in landfills. To address this, I created a model that garbage collection companies and landfills could implement to sort through trash and see what can be salvaged instead of ending up in a landfill. If this model is implemented, it will take pressure off people in public areas who are unable to compost (organic materials) or recycle (recyclable materials), because they will know the model is doing it for them.


Impact:

The positive impact of this solution is that it addresses the trash-sorting problem and could help reduce global waste if the model were implemented by landfills or garbage collection companies. A negative impact is that if the model does not work properly, the trash will be sorted incorrectly; and if an item does not resemble anything in the dataset, the algorithm will not know what to do with it.


References:


Code:


Question:

The question I want to answer with this data is how much house sale prices have increased over the years in each state.


Data:

I found the following two datasets on Kaggle because I couldn't find suitable datasets on other data websites. To combine them, I will match on address and sale price so that I can append the second dataset to the first.
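
The append step can be sketched with pandas, assuming both frames have already been given the same column names (the names `price` and `state` here are illustrative):

```python
import pandas as pd

def combine_listings(df1, df2):
    """Append the second dataset to the first, keeping only the
    columns the two frames share."""
    shared = [c for c in df1.columns if c in df2.columns]
    return pd.concat([df1[shared], df2[shared]], ignore_index=True)
```

`ignore_index=True` renumbers the rows so the combined frame has a clean index.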

I also found this dataset from the US Census Bureau website:


Pre-processing the data:

The steps that I followed to pre-process my data:

  • Getting rid of irrelevant data and only keeping columns such as price, address, state/region, and sold_date/scraped_at in each dataset.

  • Dropping null values to get a better understanding of the valid data. The first dataset had 466,763 null values for sold_date and 71 null values for price; the second dataset had only one null value, which was for price.

  • I standardized the data to make sure both datasets had the same column names. This made it more cohesive. This will also make it more convenient to combine both datasets later on.

  • I converted the date-sold column from an object to a datetime type so that it was easier to extract the year for both datasets. I then extracted the year from that column and reassigned it back to the date-sold column.

  • I also reset the index numbers after removing irrelevant rows.

  • Since the first dataset had full state names instead of the two-letter abbreviations, I changed them to match the second dataset.

  • I had to remove all rows that became null after mapping states to two-letter abbreviations, because some of the locations were not US states.

  • I got rid of the remaining irrelevant data, such as the address, because I found no use for it after filtering my data.

  • I made sure to make my jupyter notebook easy to understand and added comments.
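
The steps above can be sketched as a single cleaning function; the column names and the (deliberately tiny) state-abbreviation mapping are illustrative stand-ins for the real ones:

```python
import pandas as pd

# Hypothetical mapping; the real dictionary covers all 50 states
STATE_ABBREV = {"California": "CA", "Texas": "TX", "New York": "NY"}

def preprocess(df):
    """Keep relevant columns, drop nulls, reduce the sale date to a year,
    abbreviate state names, drop non-US rows, and reset the index."""
    out = df[["price", "state", "sold_date"]].dropna()
    out["sold_date"] = pd.to_datetime(out["sold_date"]).dt.year
    out["state"] = out["state"].map(STATE_ABBREV)
    out = out.dropna(subset=["state"])      # locations that are not US states
    return out.reset_index(drop=True)
```

Running each dataset through the same function is also what keeps the column names standardized for the later append.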


Data Understanding/Visualization:

  • In order to understand the data I tried to visualize it in many ways. There was a lot of trial and error.

  • Firstly, I wanted to do a geographic map where you can see the different states and when you hover over the state you'll be able to see the average price of the houses that were sold for a particular year. Unfortunately, this did not work as planned because I wanted to see the data for more than just one year at a time.

  • Then, after having my work reviewed by a peer, I saw there was a better way to visualize the data: a line plot with a column encoded as color. This was a good way to show all three components (year, state, and price), but it got really messy and was not as organized as I would have liked; it was hard to tell the different years apart.

  • Then I looked at heatmaps, which seemed very organized and made all the information clear to see. This was quite challenging to create and code because it involved many different components; I ran into a lot of errors, and it took me a long time to figure out the syntax and incorporate my data.

  • The heatmap kept giving me errors so I had to settle for a pair plot. This did not show much of what I was looking for but it was the only one that worked with my data.

    • After visualizing the data with this pair plot, I saw that there was an outlier in the range 2001-2003. I did some trial and error to remove this outlier for a better visualization. After removing it, it was easier to see the rest of the data.
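
The heatmap that gave trouble boils down to a year-by-state grid of average prices, which pandas can build with `pivot_table`; the sketch below also includes a simple price cap like the one used to drop the outlier (column names and the cutoff are illustrative):

```python
import pandas as pd

def price_grid(df, price_cap=None):
    """Build the year-by-state table of average sale prices that a
    heatmap would display, optionally dropping prices above price_cap."""
    data = df if price_cap is None else df[df["price"] <= price_cap]
    return data.pivot_table(index="sold_date", columns="state",
                            values="price", aggfunc="mean")
```

The resulting grid can be handed directly to a heatmap function, and the same cap-based filter is one simple way to remove an extreme outlier before plotting.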


The Story:

  • I confirmed my assumption that house prices have increased over the years, although there were some ups and downs in the visualization (the most significant dip being in 2010, when the housing market crashed). I believe the data is fairly accurate with regard to prices in different states, as seen in dataframe 2, and there is a correlation between years and prices in dataframe 1. Overall, I was able to answer my original question.


Impact:


  • The answer to this question can be useful for predicting market trends and for checking whether there is a correlation between time periods and house prices. Information such as inflation and market crashes would also need to be examined separately to fully answer the question, but that context is missing from the initial analysis I will be doing based on this dataset alone.


References:


Code:
