In this example, we will go through a machine learning project from beginning to end: getting the data, visualizing features with graphs, fitting models, and tuning parameters to improve performance.
This example uses the classic 1990 California Housing dataset and follows the approach of O'Reilly's "Hands-On Machine Learning with Scikit-Learn and TensorFlow", albeit in R code.
There are several steps that can be involved in a data science/machine learning project. Canonically, there are five:
1. Get the data
2. Explore and visualize data for insights
3. Clean the data for machine learning algorithms
4. Select and train a model
5. Tune the parameters (where possible) to enhance performance
If you are working in a professional setting, there are equally important steps that may follow, such as:
6. Present your findings and solutions
7. Create, launch and maintain a scalable system
In this example, we will be using the 1990 California Housing Prices dataset from the StatLib repository. This is obviously not the most recent data (guess how much the real estate market has boomed - and busted - since then). You can easily find more recent or similar datasets (the Boston Housing data being one of the more popular) in online repositories.
For starters, this dataset contains information such as population, median income, median housing price, and total rooms for each geographical unit defined by the US Census Bureau (each unit typically has a population of 600 to 3,000 people).
link <- "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
download.file(link, destfile = "./cal_housing.tgz")
untar("cal_housing.gz")
cal_housing <- read.csv("~/CaliforniaHousing/cal_housing.data")
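A quick caveat: the raw cal_housing.data file in the StatLib tarball ships without a header row (its column names live in a separate .domain file) and does not include the ocean_proximity column you will see below. If you run into that, one alternative (a sketch only, assuming the Hands-On Machine Learning book's GitHub repository still hosts the dataset at this path) is to grab the ready-made CSV instead:

# assumed URL: the housing.csv bundled with the handson-ml repository
csv_link <- "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
download.file(csv_link, destfile = "./housing.csv")
cal_housing <- read.csv("./housing.csv", stringsAsFactors = TRUE)  # keep ocean_proximity as a factor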
Note that in any data project, it is vital to frame the problem before diving into the dataset. The first question to ask is what exactly the business objective is. For the purpose of this tutorial, let's say our objective is to build a model that predicts the median house price for each geographical unit.
First, we can take a quick look at our dataset. A few handy functions that I find particularly useful are:
dim(cal_housing)
#[1] 20640 10
summary(cal_housing$total_bedrooms)
#Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#1.0 296.0 435.0 537.9 647.0 6445.0 207
head(cal_housing)
longitude | latitude | housing_median_age | total_rooms | total_bedrooms |
---|---|---|---|---|
-122.23 | 37.88 | 41 | 880 | 129 |
-122.22 | 37.86 | 21 | 7099 | 1106 |
-122.24 | 37.85 | 52 | 1467 | 190 |
-122.25 | 37.85 | 52 | 1274 | 235 |
population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|
322 | 126 | 8.3252 | 452600 | NEAR BAY |
2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
496 | 177 | 7.2574 | 352100 | NEAR BAY |
558 | 219 | 5.6431 | 341300 | NEAR BAY |
Snapshot of the dataset using head() function
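Another handy quick-look function worth mentioning (not shown in the snapshot above) is str(), which gives a compact overview of the whole data frame in one go:

str(cal_housing)   # prints the dimensions, each column's class, and its first few values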
There are 20,640 entries in our dataset. This sounds like a lot but is still pretty small by machine learning, and especially deep learning, standards. Notably, total_bedrooms only has 20,433 entries, with 207 values missing. Dealing effectively with missing entries will be discussed later.
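To count the missing values per column at a glance, rather than reading them off each summary() call, a one-liner like this works:

colSums(is.na(cal_housing))   # total_bedrooms should report 207 missing values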
All variables are numerical, except for ocean_proximity, which is a factor. In R, factors are categorical variables stored internally as integer codes with character labels. We see that ocean_proximity is a factor with 5 levels, namely:
levels(cal_housing$ocean_proximity)
#"1H OCEAN" "INLAND" "ISLAND" "NEAR BAY" "NEAR OCEAN"
Internally, R stores the integer values 1, 2, 3, 4, 5 and maps the character strings (in alphabetical order, unless I reorder them) to these values, i.e. 1 = <1H OCEAN, 2 = INLAND, 3 = ISLAND, etc. More specifically, when fitting a model, R automatically converts any categorical variable stored as a factor into a series of dummy variables telling us whether an entry belongs to a specific level (1) or not (0).
This is a particularly useful feature in R that makes building models a whole lot easier, compared to Python's scikit-learn where you have to do the encoding by hand.
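To make the dummy-variable idea concrete, here is a minimal sketch using base R's model.matrix(), which applies the same treatment-contrast encoding that modelling functions such as lm() use internally:

# expand ocean_proximity into 0/1 indicator columns; the first level is dropped as the reference category
head(model.matrix(~ ocean_proximity, data = cal_housing))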
library(MASS)   # provides truehist()
# drop the categorical column and draw a histogram of each numerical attribute
cal_housing_num <- subset(cal_housing, select = -c(ocean_proximity))
par(mfrow = c(3, 3))
invisible(lapply(names(cal_housing_num), function(col_name)
  truehist(cal_housing_num[, col_name], main = paste("Histogram of", col_name), xlab = NA)))
Histograms of the numerical attributes
You can also plot the data to get a good feel for what each attribute looks like. For example, we can see that slightly over 800 districts have a median_house_value of about $500,000.
More importantly, I hope you have recognized some of the more important findings these histograms reveal, such as the cap on median_house_value at the top of its range (the spike of districts at about $500,000) and the heavily right-skewed distributions of several attributes.
Visualizing data is an art. Drawing histograms as we have done above is one way to do it. Depending on your dataset, there can be many different ways to present the data in an insightful manner. More often than not, a good representation of the dataset uncovers new insights and opens new channels for discovery and experimentation. In R, a very useful tool for data visualization is ggplot2. There are plenty of ggplot tutorials online, so I'm not going to go deep into the actual syntax.
Back to our housing dataset, since we have geographical information such as latitude and longitude, it is usually a good idea to draw a scatter plot to visualize our data.
library(ggplot2)
# map circle size to population and colour to median house value (as described below)
g <- ggplot(cal_housing, aes(x = longitude, y = latitude, size = population, colour = median_house_value))
g + geom_point(alpha = 0.4) + scale_color_distiller(palette = "Spectral")
In the above plot, the radius of each circle represents the geographical unit's population and the colour represents the house price. I hope you can see the California-like shape emerge from all the dots. The plot tells us that there must be a strong correlation between housing prices and location or population density. Quite interestingly, the ocean_proximity variable may be useful, yet Northern California coastal houses seem more affordable than those in SoCal. This means that the rules are not straightforward, and it is up to the machine learning algorithm to detect these hidden patterns.
Another technique we can use is to compute the standard correlation coefficient between our target variable (median_house_value) and the other variables. In a nutshell, the standard correlation coefficient measures the strength and direction of the linear relationship between two variables. By pitting every other variable against median_house_value, we can see which variables are strongly correlated with our target and are therefore likely to be useful for prediction.
cor(subset(cal_housing, select = -c(ocean_proximity)), use = "pairwise.complete.obs")[,"median_house_value"]
Variable | Correlation with house value |
---|---|
longitude | -0.04596662 |
latitude | -0.14416028 |
housing_median_age | 0.10562341 |
total_rooms | 0.13415311 |
total_bedrooms | 0.04968618 |
population | -0.02464968 |
households | 0.06584265 |
median_income | 0.68807521 |
median_house_value | 1.00000000 |
Correlation of median_house_value against all other variables
The correlation ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation. When it is close to -1, there is a strong negative correlation. 0 means no linear correlation.
We can see that there is a strong correlation between median house value and median income. Is it a coincidence that houses of higher value usually belong to those with higher income? Maybe.
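To see that relationship rather than just read off the number, here is a quick sketch of a scatter plot of the two variables (reusing ggplot2 from the map above):

# median_income vs median_house_value; a low alpha reduces overplotting across 20,640 districts
ggplot(cal_housing, aes(x = median_income, y = median_house_value)) + geom_point(alpha = 0.1)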
Now that we have gained some insights about the dataset (the distribution of income, correlations between variables, etc.), maybe it's time for us to start applying models to our dataset? Not yet. Remember that our dataset is still relatively messy, with missing values (total_bedrooms), categorical variables, and so on. This brings us to the next step, Data Cleaning (Data Wrangling), which usually takes up 70-80% of the time spent on a data science project. The second part of the series will show you how to approach this step methodically and effectively. Click here for part 2 of the series.
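Before you head over, here is a tiny preview of that step (a sketch only; part 2 goes through the options properly): one common way to deal with the missing total_bedrooms values is to fill them in with the column median.

# impute missing total_bedrooms with the median of the observed values
med_bedrooms <- median(cal_housing$total_bedrooms, na.rm = TRUE)
cal_housing$total_bedrooms[is.na(cal_housing$total_bedrooms)] <- med_bedrooms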
The entire R script for this project can be found on my GitHub page.