
My Work

Problem:

The overproduction of garbage is a worldwide problem. Although we try to mitigate it with recycling and composting, it is often hard to sort through all the trash and tell what is recyclable and what is compostable. This is especially hard in a landfill, for example. To help solve this problem, I developed, trained, and tested a model that determines whether an image of waste shows organic or recyclable material.


Data:

I found my data on Kaggle. This is the link:

This dataset contains images labeled as either organic or recyclable, which can be sorted with the developed model.


Pre-processing the data:

A lot of the pre-processing involved importing the images into Jupyter Notebook. I had trouble reading zip files, for example extracting them so I could read the images by their file names. I also had trouble locating the base path where my data was stored. After I realized there was a way to pull the data directly from Kaggle, I faced another issue where I couldn't properly read in the location of the Kaggle data. I used this notebook as a reference for processing the data: https://www.kaggle.com/code/beyzanks/waste-classification-with-cnn
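
The extraction step can be sketched roughly as below, assuming the Kaggle archive has been downloaded locally (the path and folder layout here are illustrative, not the dataset's exact structure):

```python
import zipfile
from pathlib import Path

def extract_images(zip_path, dest_dir):
    """Extract an image archive and return the extracted file paths,
    so each image can then be read by its file name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Collect every extracted file, recursing into class subfolders
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

With the files extracted once, later cells can build the list of image paths from `dest_dir` instead of re-reading the zip.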


Data Understanding/Visualization:

To understand the data, I added a label to each image and counted how many images fall into each category (organic and recyclable). To visualize this, I used a pie chart showing the size of each of the two categories; it revealed that there are more organic images than recyclable images. I then applied the labels to a random group of nine images to confirm that my visualizations were working and that the data was being processed correctly.
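
The counting step behind the pie chart amounts to tallying labels and converting them to percentage shares; a minimal sketch (the helper name is my own, not from the original notebook):

```python
from collections import Counter

def category_shares(labels):
    """Count images per category and compute the percentage shares
    that a pie chart would display."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}
    return counts, shares
```

The `shares` dict can be passed straight to a plotting call as the slice sizes, with `counts.keys()` as the slice labels.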


Modeling:

First, I used an image data generator to apply random transformations (augmentation) to the training images, which helps the model generalize. I then trained and evaluated the model with fit_generator, which gave an accuracy of 0.83 on my dataset, a fairly decent score. After that, I displayed the training results as a histogram.
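
Conceptually, the generator's random transformations look like the toy NumPy sketch below (this illustrates the idea only, not Keras internals; the rescale factor and flip probability mirror common generator settings, not necessarily the ones used here):

```python
import numpy as np

def augment(image, rng):
    """Toy version of what an image data generator does per image:
    rescale pixel values to [0, 1] and flip horizontally half the time."""
    out = image.astype("float32") / 255.0   # like rescale=1./255
    if rng.random() < 0.5:
        out = out[:, ::-1, :]               # random horizontal flip
    return out
```

Because each epoch sees differently transformed copies of the same images, the model is less likely to memorize exact pixel layouts.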


Evaluation:

My model has an accuracy of 0.83, or about 83%. To test it, I had the model display the predicted label and the actual label for each image so I could manually check the accuracy. In this process I found the model was generally predicting correctly whether images were organic or recyclable; for example, 12 of the 15 images I displayed (80%) were labeled correctly.
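
The manual check above reduces to comparing the two label lists, which can be sketched as:

```python
def manual_accuracy(predicted, actual):
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For the spot check described above, 12 matches out of 15 displayed images gives 0.8, close to the model's overall 0.83.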


The Story:

Throughout this project, I learned that it is important to pick issues you are passionate about solving. I try to be environmentally conscious when possible, but when it is not possible I feel pressure that I am hurting the planet and causing more garbage to end up in landfills. To address this, I created a model that garbage collection companies and landfills could implement to sort through trash and see what can be salvaged instead of ending up in a landfill. If this model is implemented, it will take pressure off people in public areas who are unable to compost (organic materials) or recycle (recyclable materials), because they will know the model is doing it for them.


Impact:

The positive impact of this solution is that it addresses the trash-sorting problem and could help reduce global waste if the model were implemented by landfills or garbage collection companies. A negative impact is that if the model does not work properly, the trash will be sorted incorrectly; and if an item does not resemble anything in the dataset, the algorithm will not know what to do with it.


References:


Code:


Question:

The question I want to answer with this data is how much house sale prices have increased over the years in each state.


Data:

I found the following two datasets on Kaggle because I couldn't find suitable datasets on other data websites. To combine them, I will match on address and sale price so that I can append the second dataset to the first.
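
The append step can be sketched with pandas, assuming both frames have already been given the same column names (the names `price` and `state` here are illustrative):

```python
import pandas as pd

def combine_listings(df1, df2):
    """Append the second dataset to the first, keeping only the
    columns the two frames share."""
    shared = [c for c in df1.columns if c in df2.columns]
    return pd.concat([df1[shared], df2[shared]], ignore_index=True)
```

`ignore_index=True` renumbers the rows so the combined frame has a clean index.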

I also found this dataset from the US Census Bureau website:


Pre-processing the data:

The steps that I followed to pre-process my data:

  • Getting rid of irrelevant data and only keeping columns such as price, address, state/region, and sold_date/scraped_at in each dataset.

  • Dropping null values to get a better understanding of the valid data. The first dataset had 466,763 null values for sold_date and 71 null values for price; the second dataset had only one null value, which was for price.

  • I standardized the data to make sure both datasets had the same column names. This made it more cohesive. This will also make it more convenient to combine both datasets later on.

  • I converted the date-sold column from an object to a datetime type so that it was easier to extract the year for both datasets. I then extracted the year from that column and reassigned it back to the date-sold column.

  • I also reset the index numbers after removing irrelevant rows.

  • Since the first dataset had full state names instead of the two-letter abbreviations, I changed them to match the second dataset.

  • I had to remove all rows that became null after mapping states to two-letter abbreviations, because some of the locations were not US states.

  • I got rid of the remaining irrelevant data, such as the address, because I found no use for it after filtering my data.

  • I made sure to make my jupyter notebook easy to understand and added comments.
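
The steps above can be sketched as a single cleaning function; the column names and the (deliberately tiny) state-abbreviation mapping are illustrative stand-ins for the real ones:

```python
import pandas as pd

# Hypothetical mapping; the real dictionary covers all 50 states
STATE_ABBREV = {"California": "CA", "Texas": "TX", "New York": "NY"}

def preprocess(df):
    """Keep relevant columns, drop nulls, reduce the sale date to a year,
    abbreviate state names, drop non-US rows, and reset the index."""
    out = df[["price", "state", "sold_date"]].dropna()
    out["sold_date"] = pd.to_datetime(out["sold_date"]).dt.year
    out["state"] = out["state"].map(STATE_ABBREV)
    out = out.dropna(subset=["state"])      # locations that are not US states
    return out.reset_index(drop=True)
```

Running each dataset through the same function is also what keeps the column names standardized for the later append.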


Data Understanding/Visualization:

  • In order to understand the data I tried to visualize it in many ways. There was a lot of trial and error.

  • Firstly, I wanted to do a geographic map where you can see the different states and when you hover over the state you'll be able to see the average price of the houses that were sold for a particular year. Unfortunately, this did not work as planned because I wanted to see the data for more than just one year at a time.

  • Then, after having my work reviewed by a peer, I saw there was a better way to visualize the data: a line plot with a column encoded as color. This was a good way to show all three components (year, state, and price), but it got really messy and was not as organized as I would have liked; it was hard to tell the different years apart.

  • Then I looked at heatmaps, which seemed very organized and made all the information clear to see. This was quite challenging to create and code because it involved many different components; I ran into a lot of errors, and it took me a long time to figure out the syntax and incorporate my data.

  • The heatmap kept giving me errors so I had to settle for a pair plot. This did not show much of what I was looking for but it was the only one that worked with my data.

    • After visualizing the data with this pair plot, I saw that there was an outlier in the range 2001-2003. I did some trial and error to remove this outlier for a better visualization. After removing it, it was easier to see the rest of the data.
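
The heatmap that gave trouble boils down to a year-by-state grid of average prices, which pandas can build with `pivot_table`; the sketch below also includes a simple price cap like the one used to drop the outlier (column names and the cutoff are illustrative):

```python
import pandas as pd

def price_grid(df, price_cap=None):
    """Build the year-by-state table of average sale prices that a
    heatmap would display, optionally dropping prices above price_cap."""
    data = df if price_cap is None else df[df["price"] <= price_cap]
    return data.pivot_table(index="sold_date", columns="state",
                            values="price", aggfunc="mean")
```

The resulting grid can be handed directly to a heatmap function, and the same cap-based filter is one simple way to remove an extreme outlier before plotting.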


The Story:

  • I confirmed my assumption that house prices have increased over the years, although there were some ups and downs in the visualization (the most significant dip being in 2010, when the housing market crashed). I believe the data is fairly accurate with regard to prices in different states, as seen in dataframe 2, and there is a correlation between years and prices in dataframe 1. Overall, I was able to answer my original question.


Impact:


  • The answer to this question can be useful for predicting market trends and for checking whether there is a correlation between time periods and house prices. Information such as inflation and market crashes would also need to be examined separately to fully answer the question, but that context is missing from the initial analysis I will be doing based on this dataset alone.


References:


Code:
