1.1 Introduction

Zillow has realized that its housing market predictions aren’t as accurate as they could be because they do not factor in enough local intelligence. Our group has been selected to build a better predictive model of home prices for San Francisco. Because so many variables influence home values in San Francisco, prediction is a difficult task. Even though gathering and processing the raw data may take substantial time and energy, the challenge is an interesting one.

For this project, we bring geospatial analysis and machine learning techniques together to build our model. The goal of geospatial prediction is to borrow the experience from one place and test the extent to which that experience generalizes to another place. The machine learning process is divided into five steps, each discussed briefly below.

  • Data wrangling: The first step of the process is to gather and compile the appropriate data, often from multiple disparate sources, into one consistent dataset. This means the analyst has to understand the nature of the data, how it was collected, and how to massage the raw data into useful predictive features.

  • Exploratory analysis: Thoughtful exploratory analysis often leads to more useful predictive models. In addition, careful exploratory research makes the analysis more interpretable for non-technical clients.

  • Feature engineering: Feature engineering is what separates a great model from a good one. Features are the variables used to predict the outcome of interest - in this case home prices. The more the analyst can convert data into useful features, the better her model will perform. There are two key considerations for doing this well.

  • Feature selection: A geospatial prediction project can have hundreds of candidate features in its dataset. It is wiser to select the useful features than to include all of them in the model. Feature selection is the process of whittling down all the possible features into a concise and parsimonious set that optimizes for accuracy and generalizability.

  • Model estimation and validation: In this project, we will develop a home price prediction algorithm using linear regression models. To avoid bias and inaccuracy, we will apply several methods to validate the model for accuracy and generalizability in the following report.
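The estimation and validation step can be illustrated with a minimal sketch. It is not the model used in this report: it assumes a single hypothetical predictor (living area in square feet), synthetic made-up prices, and k-fold cross-validation as one possible validation method. The function names `fit_simple_ols` and `k_fold_mae` are introduced here for illustration only.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def k_fold_mae(xs, ys, k=5, seed=0):
    """Return the mean absolute error (MAE) on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold.
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        intercept, slope = fit_simple_ols(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        fold_mae = mean(
            abs(ys[i] - (intercept + slope * xs[i])) for i in held_out
        )
        errors.append(fold_mae)
    return errors


# Synthetic, made-up data (an assumption for illustration):
# living area in square feet and sale price in dollars.
rng = random.Random(42)
sqft = [rng.uniform(600, 3000) for _ in range(200)]
price = [200_000 + 500 * s + rng.gauss(0, 50_000) for s in sqft]

fold_errors = k_fold_mae(sqft, price, k=5)
print("MAE per fold:", [round(e) for e in fold_errors])
print("Mean MAE:", round(mean(fold_errors)))
```

If the per-fold errors are similar to each other, that suggests the model generalizes across subsets of the data; a large spread would suggest it does not.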

Overall, the home value model we built can predict approximately 85 percent of the total homes sold from 2012 to 2015, with an error of about 180,000 dollars. Given how difficult home price prediction is, we think this model performs well.

2. Data

2.1 Gathering the Data and Data Wrangling

Our first geospatial machine learning model will be trained on home price data from San Francisco - one of the largest cities in the United States, featuring some of the nation’s best travel destinations and a very robust technology sector. In this section, we create a mapTheme, read the data, and re-project the data to State Plane (2227), a coordinate system measured in feet.
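One practical reason to work in a coordinate system measured in feet is that straight-line distances between points come out directly in feet. The sketch below illustrates this with made-up coordinates. The points, the `parks` list, and the helper names are hypothetical and not taken from the project data.

```python
import math

FEET_PER_MILE = 5280


def distance_ft(a, b):
    """Euclidean distance between two (x, y) points whose units are feet."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_distance_ft(home, amenities):
    """Distance in feet from a home to the closest point in amenities."""
    return min(distance_ft(home, p) for p in amenities)


# Hypothetical projected coordinates in feet (not real locations).
home = (6_000_000.0, 2_100_000.0)
parks = [(6_003_000.0, 2_104_000.0), (6_010_000.0, 2_100_000.0)]

d = nearest_distance_ft(home, parks)
print(f"Nearest park: {d:.0f} ft ({d / FEET_PER_MILE:.2f} miles)")
```

The same calculation on unprojected longitude and latitude values would return degrees, which are not a consistent unit of distance.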

2.1.1 Dependent Variable: San Francisco Home Price Data

Next, we loaded one shapefile, sp_original, which contains 10131 house sale data points from 2012 to 2015. A map of San Francisco house prices based on this data set is presented below.