Predicting SNVs in the human genome with convolutional neural networks

March 28th, 2018

Training a CNN on human next-generation sequencing (NGS) data to identify single nucleotide variants. I wrote a mini report summarizing the approach, and the github repo has all the gory details.

Check it out here: deepSNV

LSTM Text Prediction: Anna Karenina

October 16th, 2017

Here I trained a three-layer stacked LSTM on Tolstoy's epic Anna Karenina and used it to generate text. After ~200 epochs the model had learned "alexey alexandrovitch", "stepan", and other character names from the story(!).
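
Before any characters can reach an LSTM, the raw text has to be turned into numbers. A minimal sketch of the standard character-level one-hot encoding step is below; `one_hot_encode` is a hypothetical helper for illustration, not the actual code from the post:

```python
import numpy as np

def one_hot_encode(text, vocab=None):
    """One-hot encode a string at the character level.

    Each character becomes a row vector with a single 1 at its
    vocabulary index -- the typical input format for a char-level LSTM.
    """
    if vocab is None:
        vocab = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(vocab)}
    encoded = np.zeros((len(text), len(vocab)), dtype=np.float32)
    for pos, ch in enumerate(text):
        encoded[pos, char_to_idx[ch]] = 1.0
    return encoded, vocab

x, vocab = one_hot_encode("anna karenina")
# 13 characters, each mapped to a one-hot row over a 7-character vocabulary
```

Sliding fixed-length windows of these rows then form the training sequences, with the next character as the target.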

Tensorflow Explorations

September 16th, 2017

I typically use the Keras API for deep learning projects because it has a simple, readable syntax and it allows me to quickly try out model ideas. However, I have recently become interested in Tensorflow because it provides a framework for building custom, non-standard network structures. Additionally, I like Tensorflow because I feel it forces me to understand on a deeper level exactly what each layer and function in a neural network is doing to transform features. I recently built some simple feed-forward, recurrent, and stacked LSTM neural networks in Tensorflow and trained them on the ubiquitous Iris and MNIST data sets to establish frameworks that I can build on for more complicated projects in the future.

Check it out here: Tensorflow Explorations

Instacart Basket Prediction

September 16th, 2017

I recently completed DS450, Deriving Knowledge from Data at Scale, at the University of Washington. Our final project was based on a recent Instacart Kaggle competition, the goal being to predict the grocery basket items for returning customers based on their previous orders. This was an interesting project because it was a bit open ended and there were many ways to approach it. My final model was a three-layer feed-forward neural network. The performance of this model was greatly improved by implementing dropout and by using the Adam optimizer. In fact, Adam seems to outperform plain stochastic gradient descent in all of my projects. This project was really fun, and I only wish I had more time to dig deeper into the data and the model.
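
The dropout trick that helped here can be sketched in a few lines of numpy. This is a generic illustration of "inverted" dropout (the variant most frameworks implement), not the actual project code:

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random `rate` fraction of units during
    training, and rescale the survivors by 1/keep_prob so the expected
    activation is unchanged at test time."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 8))           # a batch of hidden-layer activations
h_train = dropout(h, rate=0.5, rng=rng)            # noisy, rescaled
h_test = dropout(h, rate=0.5, rng=rng, training=False)  # unchanged
```

Because each forward pass sees a different random subnetwork, the model cannot rely on any single co-adapted unit, which is why dropout acts as a regularizer.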

Linear Regression with PySpark on Google Cloud Platform

August 4th, 2017

The goal of this project is to generate a simple linear regression model to predict taxi ride tip amount using a dataset containing 18 features and 77,106,102 observations. These features include typical taxi ride attributes like trip distance, pickup and dropoff time, and tolls paid. The key technical challenge of this project was the size of the data set: over 77 million rows! To handle it, I set up a mini Spark cluster containing four worker CPUs. I found that data frame operations like counting and aggregating for the purpose of calculating summary statistics completed in less than five minutes, demonstrating that Spark and cluster computing are an awesome tool for wrestling large data sets like this one.

Building a Decision Tree for Wine Classification

July 4th, 2017

Here I build a decision tree to classify 6497 wines as either red or white. Using brute-force feature engineering and cross-validation, I was able to improve the model's accuracy.
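
The basic tree-plus-cross-validation loop looks something like the sketch below. The red/white wine CSV from the post isn't bundled with scikit-learn, so the library's built-in wine data set (three cultivars rather than two colors) stands in here purely for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# A shallow tree; max_depth is the kind of knob cross-validation tunes
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validated accuracy -- one score per held-out fold
scores = cross_val_score(tree, X, y, cv=5)
print(scores.mean())
```

Sweeping `max_depth` (or engineered feature subsets) and comparing the mean cross-validation score is the "brute force" part.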

Classifying Dog Breeds Based On an Image With Convolutional Neural Networks

June 6, 2017

Humans are excellent at image classification. Show us a packed grocery store produce aisle and we can instantly pick out lettuce and bananas from hundreds of other options. Even though apples and pomegranates are both red, spherical, and contain a stem, we can easily distinguish them visually. In fact, image classification is so simple for our brains that we take it for granted. Computers, on the other hand, have had a more difficult time with image classification.

A picture is simply a multidimensional array of numbers to a classification algorithm, and extracting features and information from millions (or billions) of pixels is a formidable challenge. Advances in machine learning and computing power are enabling computers to approach human image classification accuracies. These technologies will have a powerful impact on the medical imaging field; here, however, I turn to a much more important problem: dog breed classification. For this project I built logistic regression and convolutional neural network (CNN) models to classify dog breeds based on an image.

(Check out all the details of this project on my github)

Implementing an Artificial Neural Network From Scratch

April 29, 2017

Descriptions of artificial neural networks usually include terms like 'deep learning', 'hidden layer', and 'mimicking the neural connectivity of the mammalian brain'. Terms like these can make neural networks seem mysterious and slightly intimidating to the uninitiated. In an attempt to both demystify and teach myself artificial neural networks I decided to build one from scratch. Many open source libraries like Tensorflow exist for building large scale, efficient neural networks, and the model presented here is certainly not going to compete with them for high performance machine learning. Rather, this program allowed me to deconstruct an artificial neural network to its bare-bones components and learn how they work.
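
Those bare-bones components amount to a forward pass, a loss, and backpropagation via the chain rule. A compressed numpy sketch in the same spirit (not the post's actual implementation) trains a tiny two-layer network on XOR, the classic problem a single linear layer cannot solve:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))   # hidden layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))   # output layer

lr, losses = 1.0, []
for _ in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # backward pass: chain rule through the MSE loss and both sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)
```

Every deep learning library is, at its core, an industrialized version of this loop.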

Naive Bayes Classification

April 23, 2017

Construction of a Naive Bayes Classifier from scratch and prediction of whether athletes belong to the Dodgers or Lakers.
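
The core of a Gaussian Naive Bayes classifier fits in a few lines: per-class feature means and variances, plus an argmax over log posteriors. The sketch below uses made-up height/weight numbers as a stand-in for the post's Dodgers-vs-Lakers roster data; none of the values are real measurements:

```python
import numpy as np

# Hypothetical height (in) / weight (lb) rows: class 0 = Dodgers, 1 = Lakers
X = np.array([[72, 190], [73, 200], [74, 210], [71, 185],
              [79, 220], [81, 240], [78, 215], [82, 250]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def fit_gnb(X, y):
    """Store (mean, variance, prior) per class -- the whole 'fit' step."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return stats

def predict_gnb(stats, x):
    """Argmax over log prior + sum of per-feature Gaussian log-likelihoods.
    The 'naive' part: features are treated as independent given the class."""
    def log_post(c):
        mean, var, prior = stats[c]
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + ll
    return max(stats, key=log_post)

stats = fit_gnb(X, y)
print(predict_gnb(stats, np.array([80.0, 230.0])))  # tall and heavy -> 1
```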

Support Vector Machines: Predicting Restaurant Yelp Scores

April 13, 2017

Here I analyze Yelp restaurant reviews and build a SVM model to predict the number of stars for a restaurant based on its attributes. Want a higher Yelp score? Serve coffee and tea!

L1 and L2 Regularized Logistic Regression

April 1, 2017

Here I build L1 and L2 regularized logistic regression models to classify vaccine-receiving patients and evaluate the differences between these two regularization methods.
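
The headline difference between the two penalties is easy to demonstrate: L1 drives many coefficients exactly to zero (feature selection), while L2 only shrinks them. A small scikit-learn sketch on synthetic data (a stand-in for the vaccine data set) shows the effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: only 4 of 20 features actually carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# Same model, same regularization strength, different penalty
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# Count exactly-zero coefficients under each penalty
print(int(np.sum(l1.coef_ == 0)), int(np.sum(l2.coef_ == 0)))
```

The L1 model zeroes out most of the uninformative features; the L2 model keeps them all with small weights.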

Visualizing and Clustering Seattle Crime Reports

March 26, 2017

Are some neighborhoods more or less safe than others? Here I analyze 1000 recent Seattle Police Department reports and group neighborhoods by crime types using PCA and k-means clustering. Also, Tableau visualizations presented in this post allow us to pinpoint Seattle crime clusters.
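The PCA-then-cluster pipeline can be sketched as follows; the neighborhood-by-crime-type count matrix here is randomly generated as a stand-in for the actual SPD report data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical counts: 30 neighborhoods x 8 crime categories
counts = rng.poisson(lam=5.0, size=(30, 8)).astype(float)

scaled = StandardScaler().fit_transform(counts)          # put features on one scale
coords = PCA(n_components=2).fit_transform(scaled)       # project to 2-D for plotting
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
```

Each neighborhood gets a 2-D coordinate (great for a scatter plot or Tableau map) and a cluster label grouping it with neighborhoods that have similar crime profiles.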

Gradient Boosted Regression: Predicting Future Earnings of College Students

March 14, 2017

Is college worth the investment? In this analysis I use the college scorecard data set to quantify the cost, debt, and future earnings and payback of attending post secondary institutions. I also built a gradient boosted regression model to predict what a student's future wages will be based on SAT score, the state of their post secondary institution, and additional features. This is a long analysis, and if you read the entire thing let me know and I'll buy you your drink of choice!

Introduction to Gradient Boosting Machine Learning

March 12, 2017

Some theoretical aspects of gradient boosting machine learning.
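The central idea of gradient boosting, that each new weak learner is fit to the residuals (the negative gradient of squared error) of the current ensemble, can be shown in a short from-scratch sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # noisy sine curve

# Start from a constant prediction, then repeatedly fit a shallow tree
# to the residuals and add it with a small learning rate.
lr = 0.1
prediction = np.full_like(y, y.mean())
mse_per_stage = []
for _ in range(50):
    residuals = y - prediction                       # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += lr * tree.predict(X)
    mse_per_stage.append(float(np.mean((y - prediction) ** 2)))
```

Each stage nudges the ensemble toward the targets, so the training error shrinks steadily; the learning rate and tree depth control how aggressively.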

Singular Value Decomposition of a JPEG

March 2, 2017

An introduction to SVD and how it can be used to decompose and compress an image.
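The compression trick boils down to keeping only the largest singular values. A minimal numpy sketch, using a random matrix as a stand-in for a grayscale image's pixel array:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))  # stand-in for a grayscale JPEG's pixel matrix

# Factor the image: img = U @ diag(s) @ Vt, with s sorted descending
U, s, Vt = np.linalg.svd(img, full_matrices=False)

def low_rank(k):
    """Rank-k approximation: keep only the k largest singular values."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

err_10 = np.linalg.norm(img - low_rank(10))
err_40 = np.linalg.norm(img - low_rank(40))
# keeping more singular values -> smaller reconstruction error
```

For a rank-k approximation of an m x n image you only store k(m + n + 1) numbers instead of mn, which is the compression payoff.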

Text Mining Twitter Streams in R Part 2: #NPR vs #NRA

February 26, 2017

Sentiment analysis of 2000 tweets containing either #NPR or #NRA.

Text Mining Twitter Streams in R Part 1: Feb 22nd 2017 Seattle Tweets

February 22, 2017

Introduction to text mining of Twitter streams filtered for content in R.
See what Seattle was up to on Twitter on Feb 22, 2017.

Titanic Survival and Random Forests

February 20, 2017

Supervised machine learning model for prediction of Titanic passenger survival.

My first post!

December 20, 2016

Mystery photo and introduction