Project Gallery

reducing individual-level food waste

Food waste is an under-appreciated problem, and a massive one at the household level: nearly 21 percent of the food supply that reaches market is discarded by individuals through cooking loss, moisture loss, mold or rot, and plate waste. It is certainly a problem I encounter on a weekly basis: when I plan weekly menus, I often end up with leftover raw ingredients that no recipe used.

I wanted to apply my data science tools to build a system that recommends other recipes sharing those same ingredients: instead of watching them rot in the back of the fridge, I could make a second meal with the leftovers. To that end, I built a recommendation engine that finds similar recipes with overlapping ingredient lists.
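The core idea can be sketched in a few lines: represent each recipe as a bag of its ingredients, then rank other recipes by vector similarity. This is a minimal illustration with made-up recipes and ingredients (none of the names or data below come from the actual project):

```python
# A minimal sketch of an ingredient-overlap recommender (hypothetical data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each recipe is represented as a bag of its ingredients.
recipes = {
    "chicken tikka masala": "chicken yogurt tomato garlic ginger cream",
    "chicken stir fry": "chicken soy_sauce garlic ginger scallion",
    "caprese salad": "tomato mozzarella basil olive_oil",
}
names = list(recipes)
vectors = CountVectorizer(binary=True).fit_transform(recipes.values())
similarity = cosine_similarity(vectors)

def recommend(name):
    """Return the recipe whose ingredient list overlaps most with `name`."""
    idx = names.index(name)
    scores = similarity[idx].copy()
    scores[idx] = -1  # exclude the recipe itself
    return names[scores.argmax()]

print(recommend("chicken tikka masala"))  # shares chicken, garlic, ginger
```

In practice the ingredient text would first need NLP cleanup (normalizing "2 cloves garlic, minced" down to "garlic"), which is where most of the real work lives.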

Concepts: Natural Language Processing; Recommender System; Agglomerative Clustering; Unsupervised Learning

 

Lightning talk: topic modeling & Latent Dirichlet Allocation

One of the projects for General Assembly's Data Science Immersive program was a short presentation on topic modeling, specifically Latent Dirichlet Allocation (LDA). It was a review exercise: explain a technical concept to a non-technical audience. It strengthened my own understanding, and I took it as an opportunity to illustrate how you might use LDA in the real world: examining my inbox.

To illustrate the insights topic modeling can offer into a corpus, I imagined a set of documents including "The Hungry, Hungry Caterpillar," "Dune," "The Hobbit," and "A Game of Thrones."

Concepts: Natural Language Processing; Email; Latent Dirichlet Allocation; Topic Modeling; Lightning Talk; Unsupervised Learning

 

modeling community engagement with Reddit posts

I built a web scraper to collect information about popular posts on Reddit and the subreddits those posts were drawn from. I gathered data and metadata about each post, including title, number of comments, score (net upvotes minus downvotes), author, source subreddit, and post time. Using the data I gathered over a week's span, my goal was to identify the characteristics of Reddit posts that drew above-expected engagement (number of comments) from readers.

During my initial analysis, I found that post engagement was heavily skewed: a post at the 75th percentile of comment counts on r/all had received only 61 comments, but comment counts exploded above that threshold. I therefore framed the task as predicting which posts would land above the 75th percentile of comments. My models ranged from about 71% accuracy (subreddit features alone) up to my most complex model, which nearly broke 80% (78.9%).
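That classification setup can be sketched as follows: compute the 75th-percentile threshold, label posts above it, and fit a classifier on text features from the titles. The titles and comment counts below are hypothetical, and this uses logistic regression on titles alone where the real project combined many more feature types:

```python
# A simplified sketch of the above-threshold classifier (hypothetical data).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "AITA for eating my roommate's leftovers",
    "TIL the Eiffel Tower grows in summer",
    "My cat sitting in a box",
    "Breaking: major earthquake reported downtown",
] * 25
comments = np.array([12, 250, 40, 800] * 25)

# Label 1 if a post exceeds the 75th percentile of comment counts.
threshold = np.percentile(comments, 75)
y = (comments > threshold).astype(int)

# TF-IDF title features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, y)
print(model.score(titles, y))  # training accuracy on this toy dataset
```

Because only about a quarter of posts are labeled positive, accuracy alone can mislead (always predicting "below threshold" scores 75%), which is why the unbalanced-classes work listed below mattered.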

Concepts: Random Forest Classifiers; Term Frequency-Inverse Document Frequency Vectorization; Natural Language Processing; Logistic Regression; K-Nearest Neighbors; Webscraping; Regex; XPath; Crontab; Amazon Web Services; Cross-Validation; Exploratory Data Analysis; Bagging; Unbalanced Classes; Feature Engineering