March 28th, 2018
Training a CNN on human NGS genome sequencing data to identify single nucleotide variants. Check out the mini report I wrote for an overview and the github repo for all the gory details:
Check it out here: deepSNV
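As a rough sketch of the idea (the real model and input encoding are in the repo; the window size, channel count, and layer sizes here are illustrative assumptions), a 1D convolutional network over one-hot encoded sequence windows around a candidate site might look like:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical input: 41-bp windows around a candidate site,
# one-hot encoded over {A, C, G, T} -> shape (window, 4).
window, channels = 41, 4

model = keras.Sequential([
    layers.Conv1D(32, kernel_size=5, activation="relu",
                  input_shape=(window, channels)),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(site is a true SNV)
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Sanity check on random data shaped like a batch of encoded windows.
probs = model.predict(np.random.rand(8, window, channels), verbose=0)
```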
October 16th, 2017
Here I trained a 3-layer stacked LSTM on Tolstoy's epic Anna Karenina and generated text. After ~200 epochs the model was able to learn "alexey alexandrovitch", "stepan", and other character names from the story. (!)
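The character-level model can be sketched in Keras roughly like this (the vocabulary size, window length, and unit counts are placeholders, not the exact values I used):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 60, 40  # illustrative: distinct characters, window length

# Three stacked LSTM layers; the first two return full sequences so the
# next layer sees one vector per time step.
model = keras.Sequential([
    layers.LSTM(128, return_sequences=True, input_shape=(seq_len, vocab_size)),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),  # next-character distribution
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Each training example is a one-hot window of characters and the target is
# the next character; sampling repeatedly from the softmax generates text.
preds = model.predict(np.zeros((2, seq_len, vocab_size)), verbose=0)
```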
September 16th, 2017
I typically use the Keras API for deep learning projects because it has a simple, readable syntax and allows me to quickly try out model ideas. However, I have recently become interested in Tensorflow because it provides a framework for building custom, non-standard network structures. Additionally, I like Tensorflow because it forces me to understand on a deeper level exactly what each layer and function in a neural network is doing to transform features. I recently built some simple feed forward, recurrent, and stacked LSTM neural networks in Tensorflow and trained them on the ubiquitous Iris and MNIST data sets to establish frameworks that I can build on for more complicated projects in the future.
Check it out here: Tensorflow Explorations
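In the spirit of that deeper-level understanding: a fully connected layer in raw Tensorflow is just a matrix multiply, a bias add, and a nonlinearity (the dimensions here are arbitrary placeholders, e.g. the four Iris features):

```python
import tensorflow as tf

tf.random.set_seed(0)

# A dense layer "by hand": y = relu(xW + b).
n_features, n_hidden = 4, 8
W = tf.Variable(tf.random.normal([n_features, n_hidden], stddev=0.1))
b = tf.Variable(tf.zeros([n_hidden]))

def dense_relu(x):
    # Every Keras Dense layer reduces to exactly this transformation.
    return tf.nn.relu(tf.matmul(x, W) + b)

x = tf.random.normal([5, n_features])  # a batch of 5 examples
h = dense_relu(x)
```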
September 16th, 2017
I recently completed DS450, Deriving Knowledge from Data at Scale, at the University of Washington. Our final project was based on a recent Instacart Kaggle competition, the goal being to predict the grocery basket items for returning customers based on their previous orders. This was an interesting project because it was a bit open ended and there were many ways to approach it. My final model was a three layer feed forward neural network. The performance of this model was greatly improved by implementing dropout and by using the Adam optimizer. In fact, Adam seems to outperform standard stochastic gradient descent optimizers in all of my projects. This project was really fun, and I only wish I had more time to dig deeper into the data and the model.
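The gist of that final model, with the two tweaks that helped most (dropout and Adam), looks roughly like this in Keras; the feature count and layer widths are placeholders for the real engineered order-history features:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 20  # placeholder for the engineered order-history features

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(n_features,)),
    layers.Dropout(0.5),   # dropout noticeably reduced overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # P(item appears in next order)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

p = model.predict(np.random.rand(4, n_features), verbose=0)
```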
August 4th, 2017
The goal of this project is to generate a simple linear regression model to predict taxi ride tip amount using a data set containing 18 features and 77,106,102 observations. These features include typical taxi ride attributes like trip distance, pickup and dropoff time, and tolls paid. The key technical challenge of this project was the size of the data set: over 77 million rows! To handle it I set up a mini Spark cluster containing four worker CPUs. I found that data frame operations like counting and aggregating for the purpose of calculating summary statistics completed in less than five minutes, demonstrating that Spark and cluster computing are an awesome tool for wrestling large data sets like this one.
July 4th, 2017
Here I build a decision tree to classify 6497 wines as either red or white. Using brute-force feature engineering and cross validation I was able to improve the accuracy.
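A minimal version of the baseline classifier, using synthetic stand-ins for the chemical attributes since the actual wine data lives in the repo:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two chemical attributes (e.g. volatile acidity,
# total sulfur dioxide), which separate red from white fairly well.
n = 600
X_red = rng.normal(loc=[0.5, 55.0], scale=[0.15, 30.0], size=(n // 2, 2))
X_white = rng.normal(loc=[0.28, 138.0], scale=[0.10, 40.0], size=(n // 2, 2))
X = np.vstack([X_red, X_white])
y = np.array([1] * (n // 2) + [0] * (n // 2))  # 1 = red, 0 = white

# Cross-validated accuracy of a depth-limited tree.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
mean_acc = cross_val_score(tree, X, y, cv=5).mean()
```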
June 6, 2017
Humans are excellent at image classification. Show us a packed grocery store produce aisle and we can instantly pick out lettuce and bananas from hundreds of other options. Even though apples and pomegranates are both red, spherical, and contain a stem, we can easily distinguish them visually. In fact, image classification is so simple for our brains that we take it for granted. Computers, on the other hand, have had a more difficult time with image classification.
A picture is simply a multidimensional array of numbers to a classification algorithm, and extracting features and information from millions (or billions) of pixels is a formidable challenge. Advances in machine learning and computing power are enabling computers to approach human image classification accuracies. These technologies will have a powerful impact on the medical imaging field, however, here I turn to a much more important problem – dog breed classification. For this project I built logistic regression and convolutional neural network (CNN) models to classify dog breeds based on an image.
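A stripped-down version of the CNN architecture (image resolution, breed count, and filter counts here are placeholders; the actual models are on github):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

img, n_breeds = 64, 10  # placeholders: input resolution and number of breeds

model = keras.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(img, img, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_breeds, activation="softmax"),  # one probability per breed
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

preds = model.predict(np.random.rand(2, img, img, 3), verbose=0)
```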
(Check out all the details of this project on my github)
April 29, 2017
Descriptions of artificial neural networks usually include terms like 'deep learning', 'hidden layer', and 'mimicking the neural connectivity of the mammalian brain'. Terms like these can make neural networks seem mysterious and slightly intimidating to the uninitiated. In an attempt to both demystify and teach myself artificial neural networks I decided to build one from scratch. Many open source libraries like Tensorflow exist for building large scale, efficient neural networks, and the model presented here is certainly not going to compete with them for high performance machine learning. Rather, this program allowed me to deconstruct an artificial neural network to its bare-bones components and learn how they work.
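Those bare-bones components boil down to a forward pass, a loss, and backpropagation via the chain rule. A minimal numpy sketch (not the exact code from the post) of one hidden layer trained on XOR, a problem that genuinely needs the hidden layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR is not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))

    # Backward pass: chain rule through the squared error and both sigmoids.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update.
    lr = 0.5
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)
```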
April 23, 2017
Construction of a Naive Bayes Classifier from scratch and prediction of whether athletes belong to the Dodgers or Lakers.
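The classifier itself is only a few lines once the per-class feature distributions are estimated. A Gaussian version on made-up height/weight numbers (the real post uses actual player rosters; these means and labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up (height_cm, weight_kg) samples: basketball players run taller
# and heavier than baseball players.
lakers = rng.normal([200, 100], [8, 10], size=(50, 2))
dodgers = rng.normal([185, 88], [6, 9], size=(50, 2))

def fit_gaussian_nb(groups):
    """Per class: prior, feature means, feature variances."""
    n_total = sum(len(g) for g in groups)
    return [(len(g) / n_total, g.mean(axis=0), g.var(axis=0)) for g in groups]

def predict(params, x):
    """Pick the class maximizing log prior + sum of Gaussian log-likelihoods."""
    scores = []
    for prior, mu, var in params:
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores.append(np.log(prior) + ll)
    return int(np.argmax(scores))

params = fit_gaussian_nb([lakers, dodgers])  # class 0 = Lakers, 1 = Dodgers
label = predict(params, np.array([202.0, 103.0]))  # a tall, heavy athlete
```

The "naive" part is the sum inside `predict`: each feature's likelihood is computed independently, ignoring any correlation between height and weight.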
April 13, 2017
Here I analyze Yelp restaurant reviews and build a SVM model to predict the number of stars for a restaurant based on its attributes. Want a higher Yelp score? Serve coffee and tea!
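The modeling step looks roughly like this (the attribute names and the synthetic rating rule are stand-ins for the real Yelp attribute data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic binary restaurant attributes (e.g. serves_coffee, serves_tea,
# takes_reservations, ...) and a star rating that depends on the first two.
n, n_attrs = 400, 6
X = rng.integers(0, 2, size=(n, n_attrs)).astype(float)
stars = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.random(n) > 1.5, 4, 3)

X_tr, X_te, y_tr, y_te = train_test_split(X, stars, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```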
April 1, 2017
Here I build L1 and L2 regularized logistic regression models to classify vaccine receiving patients and evaluate the differences between these two regularization methods.
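The core difference between the two penalties is easy to see on synthetic data: L1 drives uninformative coefficients exactly to zero, while L2 only shrinks them (the feature counts and regularization strength here are arbitrary, not the values from the vaccine analysis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 20 features, but only the first 3 actually influence the label.
n, p = 500, 20
X = rng.normal(size=(n, p))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# L1 produces a sparse model; L2 keeps (small) weights on every feature.
l1_zeros = int(np.sum(l1.coef_ == 0))
l2_zeros = int(np.sum(l2.coef_ == 0))
```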
March 26, 2017
Are some neighborhoods more or less safe than others? Here I analyze 1000 recent Seattle Police Department reports and group neighborhoods by crime types using PCA and k-means clustering. Also, Tableau visualizations presented in this post allow us to pinpoint Seattle crime clusters.
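The grouping step itself is a short pipeline: standardize the per-neighborhood crime-type counts, project them with PCA, then cluster in the reduced space. A sketch on synthetic counts (the real post uses the 1000 SPD reports):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a neighborhoods x crime-type count matrix:
# two groups with clearly different crime profiles.
group_a = rng.poisson([30, 5, 2, 1, 8], size=(15, 5))
group_b = rng.poisson([4, 20, 15, 6, 2], size=(15, 5))
counts = np.vstack([group_a, group_b]).astype(float)

X = StandardScaler().fit_transform(counts)
coords = PCA(n_components=2).fit_transform(X)  # 2D, easy to plot on a map
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
```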
March 14, 2017
Is college worth the investment? In this analysis I use the college scorecard data set to quantify the cost, debt, and future earnings and payback of attending post secondary institutions. I also built a gradient boosted regression model to predict what a student's future wages will be based on SAT score, the state of their post secondary institution, and additional features. This is a long analysis, and if you read the entire thing let me know and I'll buy you your drink of choice!
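The modeling step, stripped of all the scorecard feature engineering, looks like this (the features and the earnings formula here are synthetic placeholders for SAT score, cost, state, etc.):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for SAT score, cost of attendance, and a state index,
# with earnings that depend on them plus noise.
n = 1000
sat = rng.uniform(800, 1600, n)
cost = rng.uniform(5_000, 60_000, n)
state = rng.integers(0, 50, n).astype(float)
earnings = 20_000 + 30 * (sat - 800) + 0.1 * cost + rng.normal(0, 3_000, n)

X = np.column_stack([sat, cost, state])
X_tr, X_te, y_tr, y_te = train_test_split(X, earnings, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=0)
gbr.fit(X_tr, y_tr)
r2 = gbr.score(X_te, y_te)  # held-out R^2
```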
March 12, 2017
Some theoretical aspects of gradient boosting machine learning.
March 2, 2017
An introduction to SVD and how it can be used to decompose and compress an image.
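The compression idea in a nutshell: keep only the top k singular values and vectors, and the reconstruction error shrinks as k grows while the storage cost stays small. A random matrix stands in for the image here:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "image": any grayscale image is just such a matrix.
A = rng.random((64, 48))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    """Best rank-k approximation (Eckart-Young): keep top k singular triples."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm reconstruction error drops as more components are kept.
err_5 = np.linalg.norm(A - rank_k(5))
err_20 = np.linalg.norm(A - rank_k(20))

# Storage for rank k: k * (64 + 48 + 1) numbers vs 64 * 48 for the original.
```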
February 26, 2017
Sentiment analysis of 2000 tweets containing either #NPR or #NRA.
February 22, 2017
Introduction to text mining of twitter streams filtered for content in R.
See what Seattle was up to on twitter on Feb 22, 2017.
February 20, 2017
Supervised machine learning model for prediction of Titanic passenger survival.