Whether or not to attend college is a question that millions of high school seniors face every year. Those choosing to attend further have to decide between two and four year schools, courses of study, and how to deal with the financial aspect of paying for years of training. The government contributed over 100 billion dollars (1) in appropriations and grants to finance higher education, including financial aid packages. Despite this, it seems that many students are leaving school with considerable amounts of student debt. Because of the massive financial impact that secondary education has on the economy the government created a College Scoreboard ‘to provide reliable and unbiased information about college performance’ (1). In the making of this College Scoreboard data was compiled from the last 20 years pertaining to student financial returns on their college investment. This quite large data set (2) contains key metrics like admissions rates, post graduation earnings of graduates, and the demographics of the students attending a school.

In this analysis we evaluate the financial implications of attending a post secondary school. Specifically, we analyze the cost and the earnings ten years after leaving a post secondary school to try to determine what factors most influence earnings after attending a post secondary institution. Finally, we developed a supervised machine learning algorithm to predict future earnings based on an institution’s features.


Study Evaluation and Hypothesis

The goal of this observational study was to increase transparency and help parents and students understand the cost of college versus the outcomes and returns for their post secondary investments (3). The data (to be discussed in greater detail below) contains features such as:

The data were obtained from direct reporting from institutions, data pulled from federal financial aid, and tax information (3) and they are reported for every year between 1996 and 2016. Before moving forward with analysis of this data set it is important to identify potential bias because this will influence what types of questions we can ask.

A major bias introduced in this study by the collectors of the data is that the data is limited to students that are Title IV recipients (3). A Title IV award includes government loans and grants (4), and these are need based awards. Thus, it is important to keep in mind that the conclusions we attempt to make using this data set are not general to all students, but rather only those that qualified for, or perhaps more importantly had access to, financial aid.

A second bias in this study is that it only pertains to schools that received financial aid. Over 80% of schools receive financial aid (5), so this bias is not as severe as the one mentioned earlier. However, moving forward we should remember that this data set is limited to students that received financial aid at a subset of available schools.


The hypothesis for the study presented here is that a student’s future earnings are maximized by attending the best school they can get into, despite the cost.

A better school has:

Here we will use the collegescorecard data set to make a data driven evalauation of this hypothesis.

Importing Data to a SQLlite Database

This publicly available data used for this study (2015-2016 year) is a relatively large data set containing 7703 observations and 1743 features, so it makes sense to organize it in a SQL database to increase query speed. To do this we will use SQlite. First the data, a csv file, was downloaded (‘Most recent data’) from the following source:

Next, the data was loaded into a SQLite database. It was found that SQLite will only accept 999 columns (features) at a time, which is a problem since there are 1743 columns in the data set, so we had to delete ones that are unlikely to be meaningful in this analysis.

# Create College Scorecard Database in RAM
college.db <- dbConnect(SQLite(), dbname = ":memory:")
dskdb <- dbConnect(SQLite(), dbname = "collegeScorecard.sqlite")
sqliteCopyDatabase(dskdb, college.db)
## [1] TRUE
# Add data tables to the DB <- read.csv('rawData/MERGED2015_16_PP.csv', header=TRUE) <-[,-c(442:1350)]
dbWriteTable(college.db, "collegeData",
## [1] TRUE
## [1] "collegeData"

The majority of the features that were excluded include the percentage of students that died subsetted into various criteria and data describing student transfers.

Exploratory Data and Analysis

Now that the data is loaded into a database table that we can query we can identify features that will be of interest moving forward:

There are three things to consider when evaluating the financial aspects of attending a college:

  1. the cost
  2. the amount of debt to be taken on and
  3. the future earnings.

Below we perform exploratory data analysis of these three aspects.

Cost of Attending College

We will begin exploratory data analysis by evaluating the cost of attending post secondary schooling. First, we look at the geographical location of the institutions included in the study.

# pull college cost and location data from the db and convert to numeric
costLoc <- dbSendQuery(college.db, "SELECT COSTT4_A, LATITUDE, LONGITUDE FROM collegeData")
costLoc <- fetch(costLoc)
title <- c('AverageCost', 'LATITUDE', 'LONGITUDE')
colnames(costLoc) <- title
costLoc$AverageCost <- as.numeric(costLoc$AverageCost)
costLoc$LATITUDE <- as.numeric(costLoc$LATITUDE)
costLoc$LONGITUDE <- as.numeric(costLoc$LONGITUDE)

# get a map of the USA from gogle
map<-get_map(location='united states', zoom=4, maptype = "terrain", source='google',color='color')

# plot the lat/long institution data over the google map
ggmap(map, extent="device") + geom_point(aes(x=LONGITUDE, y=LATITUDE), data=costLoc, alpha=.25, na.rm = T, col="dodgerblue3")