Zillow's Zestimate, and my ensemble of regressors for prediction on feature-rich data

“The Zillow Prize contest competition, sponsored by Zillow, Inc. (“Sponsor”) is open to all individuals over the age of 18 at the time of entry. The competition will contain two rounds, one public and one private. Each round will have separate datasets, submission deadlines and instructions on how to participate. The instructions on how to participate in each round are listed below. Capitalized terms used but not defined herein have the meanings assigned to them in the Zillow Prize competition Official Rules.”

For a full description of the competition, datasets, evaluation, and prizes, visit https://www.kaggle.com/c/zillow-prize-1

My first entry to this competition, a stacked ensemble of regressors, is available here: https://www.kaggle.com/jamesdhope/zillow-ensemble-of-regressors-0-065

Short summary: the stacked ensemble makes use of the scikit-learn RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor and AdaBoostRegressor, as well as a Support Vector Machine. We also make use of xgboost to perform regression over the outputs of the first-level ensemble; this second-level model makes the final predictions on a set of circa 3 million houses, each with 23 features, for 6 points in time (that's nearly 18 million predictions!).

Whilst there is room for improvement in preprocessing, including better strategies for handling missing data (of which there is a lot!) and finding the hyperparameters that lead to an optimal model, this machine learning pipeline is easily adapted to making predictions on feature-rich data in any other context.

Now let's walk through the code in some more detail. As described above, the first level of the stacked ensemble combines the scikit-learn RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor and AdaBoostRegressor with a Support Vector Machine, and xgboost performs regression over the outputs of that first level. So we start out by importing the libraries we will need.
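
For reference, the imports could look something like this (the exact list depends on the notebook and your sklearn version):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.svm import SVR
from sklearn.model_selection import KFold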

We also need to load in the training and test datasets that Zillow has provided us.
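
The file names below are the ones Zillow supplies on Kaggle; adjust the paths to wherever you keep the data.

# Transaction-level training data: parcelid, logerror (the target) and transaction date
train_df = pd.read_csv('../input/train_2016_v2.csv', parse_dates=['transactiondate'])

# Property-level features for every parcel
properties = pd.read_csv('../input/properties_2016.csv')

# Attach the property features to each training transaction
train = train_df.merge(properties, how='left', on='parcelid')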

Next, we will one-hot encode some of the features. For some features, it makes sense to assume that missing data means the feature is absent, so we can map NaN values to 0.
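
As an illustration, something along these lines (the particular columns chosen here are an assumption, not necessarily the ones handled in the notebook):

# Flag-like columns where a missing value simply means the property lacks the feature
for col in ['hashottuborspa', 'fireplaceflag', 'taxdelinquencyflag']:
    train[col] = train[col].apply(lambda x: 0 if pd.isnull(x) else 1)

# One-hot encode a genuinely categorical feature
train = pd.get_dummies(train, columns=['airconditioningtypeid'], dummy_na=True)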

There are some columns which appear to need consolidating into a single feature.
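
For example, the fireplace information arrives as both a count and a flag, and the two could be folded into a single count (a hypothetical consolidation, shown only to illustrate the idea):

# Where only the flag is set, record at least one fireplace, then keep a single column
train.loc[train['fireplacecnt'].isnull() & (train['fireplaceflag'] == 1), 'fireplacecnt'] = 1
train = train.drop(['fireplaceflag'], axis=1)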

And we can also assume some more friendly feature names.
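
A rename along these lines gives the friendlier names used in the snippets below; the left-hand names are the original Zillow column names (treat the exact mapping as an assumption):

train = train.rename(columns={
    'bathroomcnt': 'bathroom_count',
    'bedroomcnt': 'bedroom_count',
    'roomcnt': 'room_count',
    'fireplacecnt': 'fireplace_count',
    'garagecarcnt': 'garage_car_count',
    'garagetotalsqft': 'garage_square_feet',
    'calculatedfinishedsquarefeet': 'square_feet',
    'lotsizesquarefeet': 'lot_size_square_feet',
    'yearbuilt': 'year_built',
    'taxamount': 'tax_amount',
    'taxvaluedollarcnt': 'tax_value',
    'structuretaxvaluedollarcnt': 'structure_tax_value',
    'landtaxvaluedollarcnt': 'land_tax',
})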

We also need to impute values for missing features. We can impute the median feature value across most features as a starting point.

# Impute zero for NaN for these features
train['fireplace_count'] = train['fireplace_count'].fillna(0).astype(float)

# Impute median value for NaN for these features
median_impute_cols = [
    'bathroom_count', 'bedroom_count', 'room_count', 'tax_amount', 'land_tax',
    'tax_value', 'structure_tax_value', 'garage_square_feet', 'garage_car_count',
    'square_feet', 'year_built', 'lot_size_square_feet', 'longitude', 'latitude',
]
for col in median_impute_cols:
    train[col] = train[col].fillna(train[col].median()).astype(float)

Now on to Feature Selection. We will drop features where the volume of missing data exceeds a certain threshold. These features were not considered for imputation above.
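
One simple way to do this is to compute the fraction of missing values per column and drop anything above a threshold (the 60% cut-off here is illustrative):

# Fraction of missing values per column
missing_fraction = train.isnull().sum() / len(train)

# Drop any feature where more than 60% of the values are missing
cols_to_drop = missing_fraction[missing_fraction > 0.60].index
train = train.drop(columns=cols_to_drop)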

We can now look at the Pearson correlation between features using the Seaborn library. This helps with feature reduction, as ideally we want as few features as possible going into the regression, and we might consider removing some more features here that are highly correlated with one another.
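
The heatmap itself only takes a couple of lines:

# Pearson correlation matrix of the numeric features, rendered as a heatmap
corr = train.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr, vmax=1.0, square=True, cmap='viridis')
plt.show()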

It’s also a good idea to scale the data at this point. I’ve left this out for brevity but you can refer to the full code if you are unsure how to do this.
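
For completeness, a minimal sketch of one common approach (not necessarily what the notebook does), assuming logerror is kept aside as the target:

from sklearn.preprocessing import StandardScaler

# Scale the numeric feature columns to zero mean and unit variance;
# the identifier and the target are excluded
feature_cols = [c for c in train.select_dtypes(include=[np.number]).columns
                if c not in ['parcelid', 'logerror']]
scaler = StandardScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
# The same fitted scaler should then be applied to the test feature matrices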

Now a little preparation before we build our models. We'll create a class called SklearnHelper that wraps the methods (such as train, predict and fit) common to all of the Sklearn regressors. This cuts out redundancy, as we won't need to write the same methods five times to invoke five different models.
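
A minimal version of such a wrapper might look like this (a sketch; the notebook's version may differ in how it handles seeds and feature importances):

class SklearnHelper(object):
    """Give every sklearn model the same train / predict / fit interface."""

    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})
        # Only seed the estimators that support it (SVR, for example, does not)
        if 'random_state' in clf().get_params():
            params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        return self.clf.fit(x, y).feature_importances_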

We'll also define a function for cross-validation, which deserves a little explanation. The function is passed the model, the training set and the test sets (one for each of the six time periods). It splits the training data into NFOLDS (five) folds; for each fold it trains the model on the remaining folds, predicts the held-out fold, and also predicts every test set with that fold's model. It then averages the test-set predictions across the five folds for each time period. The held-out (out-of-fold) predictions on the training data become the inputs to the second-layer model.

def get_oof(clf, x_train, y_train, x_test_201610, x_test_201611, x_test_201612, x_test_201710, x_test_201711, x_test_201712):
    oof_train = np.zeros((ntrain,))
    
    oof_test_201610 = np.zeros((ntest,))
    oof_test_201611 = np.zeros((ntest,))
    oof_test_201612 = np.zeros((ntest,))
    oof_test_201710 = np.zeros((ntest,))    
    oof_test_201711 = np.zeros((ntest,))
    oof_test_201712 = np.zeros((ntest,))
    
    oof_test_skf_201610 = np.empty((NFOLDS, ntest))
    oof_test_skf_201611 = np.empty((NFOLDS, ntest))
    oof_test_skf_201612 = np.empty((NFOLDS, ntest))
    oof_test_skf_201710 = np.empty((NFOLDS, ntest))
    oof_test_skf_201711 = np.empty((NFOLDS, ntest))
    oof_test_skf_201712 = np.empty((NFOLDS, ntest))
    
    #train_index: indices of the training split for this fold
    #test_index: indices of the held-out (validation) split for this fold
     
    for i, (train_index, test_index) in enumerate(kf):
        #break the dataset down into two sets, train and test
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        
        clf.train(x_tr, y_tr)
        
        #make a prediction on the held-out subset of the training data
        oof_train[test_index] = clf.predict(x_te)
        
        #use the model trained on this fold to predict the entire test set for each time period
        oof_test_skf_201610[i, :] = clf.predict(x_test_201610)
        oof_test_skf_201611[i, :] = clf.predict(x_test_201611)
        oof_test_skf_201612[i, :] = clf.predict(x_test_201612)
        oof_test_skf_201710[i, :] = clf.predict(x_test_201710)
        oof_test_skf_201711[i, :] = clf.predict(x_test_201711)
        oof_test_skf_201712[i, :] = clf.predict(x_test_201712)
    
    #take an average of all of the folds
    oof_test_201610[:] = oof_test_skf_201610.mean(axis=0)
    oof_test_201611[:] = oof_test_skf_201611.mean(axis=0)
    oof_test_201612[:] = oof_test_skf_201612.mean(axis=0)
    oof_test_201710[:] = oof_test_skf_201710.mean(axis=0)
    oof_test_201711[:] = oof_test_skf_201711.mean(axis=0)
    oof_test_201712[:] = oof_test_skf_201712.mean(axis=0)
    
    return oof_train.reshape(-1, 1), oof_test_201610.reshape(-1, 1), oof_test_201611.reshape(-1, 1), oof_test_201612.reshape(-1, 1), oof_test_201710.reshape(-1, 1), oof_test_201711.reshape(-1, 1), oof_test_201712.reshape(-1, 1)

Next we'll create dicts to hold all of our model parameters.
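
Something along these lines, with one dict per model (the parameter values here are illustrative rather than the tuned values from the notebook):

# Random Forest parameters
rf_params = {'n_jobs': -1, 'n_estimators': 100, 'max_depth': 6, 'min_samples_leaf': 2}

# Extra Trees parameters
et_params = {'n_jobs': -1, 'n_estimators': 100, 'max_depth': 8, 'min_samples_leaf': 2}

# Gradient Boosting parameters
gb_params = {'n_estimators': 100, 'max_depth': 5, 'min_samples_leaf': 2}

# AdaBoost parameters
ada_params = {'n_estimators': 100, 'learning_rate': 0.75}

# Support Vector Regressor parameters
svr_params = {'kernel': 'linear', 'C': 0.025}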

We’ll now create our models.
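
Each model is then wrapped in the helper (SEED is a hypothetical constant used to seed the estimators that support it):

SEED = 0  # reproducibility seed

rf = SklearnHelper(clf=RandomForestRegressor, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesRegressor, seed=SEED, params=et_params)
gb = SklearnHelper(clf=GradientBoostingRegressor, seed=SEED, params=gb_params)
ada = SklearnHelper(clf=AdaBoostRegressor, seed=SEED, params=ada_params)
svr = SklearnHelper(clf=SVR, seed=SEED, params=svr_params)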

And now train the models…
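
Training amounts to calling get_oof once per model, assuming x_train, y_train and the six x_test_* matrices are the numpy arrays built from the prepared data. Shown here for two of the five models; the remaining calls follow the same pattern:

rf_oof_train, rf_oof_test_201610, rf_oof_test_201611, rf_oof_test_201612, \
    rf_oof_test_201710, rf_oof_test_201711, rf_oof_test_201712 = get_oof(
        rf, x_train, y_train,
        x_test_201610, x_test_201611, x_test_201612,
        x_test_201710, x_test_201711, x_test_201712)

et_oof_train, et_oof_test_201610, et_oof_test_201611, et_oof_test_201612, \
    et_oof_test_201710, et_oof_test_201711, et_oof_test_201712 = get_oof(
        et, x_train, y_train,
        x_test_201610, x_test_201611, x_test_201612,
        x_test_201710, x_test_201711, x_test_201712)

# ...and likewise for gb, ada and svr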

Finally, with the models trained, we have now reached the end of the first layer of our ensemble. We can now extract the features for further analysis.

The 23 features we obtain for each model are as follows:

Let’s have a look at how important these features are for each model.
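
A sketch using the helper's feature_importances method; the SVR is excluded because it exposes no feature importances, and feature_cols is assumed to be the list of input feature names in the same order as the columns of x_train:

rf_importances = rf.feature_importances(x_train, y_train)
et_importances = et.feature_importances(x_train, y_train)
gb_importances = gb.feature_importances(x_train, y_train)
ada_importances = ada.feature_importances(x_train, y_train)

feature_importances = pd.DataFrame({
    'feature': feature_cols,
    'random_forest': rf_importances,
    'extra_trees': et_importances,
    'gradient_boost': gb_importances,
    'adaboost': ada_importances,
})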

Now let's take the mean of the feature importances across the four models.
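
Continuing the hypothetical feature_importances frame from above:

# Average importance of each feature across the four tree-based models
feature_importances['mean'] = feature_importances[
    ['random_forest', 'extra_trees', 'gradient_boost', 'adaboost']].mean(axis=1)
feature_importances.sort_values('mean', ascending=False).head(10)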

We can now build a new dataframe to hold these features, and train a regression model using xgboost on these features as our second layer.
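
A sketch of that second layer: the out-of-fold training predictions from each base model become the columns of the new training matrix, the averaged test predictions are stacked the same way for each date, and an XGBoost regressor is fitted on top (parameters illustrative):

# Second-layer training data: one column per first-level model
x_train_l2 = np.concatenate(
    (rf_oof_train, et_oof_train, gb_oof_train, ada_oof_train, svr_oof_train), axis=1)

# Second-layer test data for one of the six dates (the others follow the same pattern)
x_test_l2_201610 = np.concatenate(
    (rf_oof_test_201610, et_oof_test_201610, gb_oof_test_201610,
     ada_oof_test_201610, svr_oof_test_201610), axis=1)

gbm = xgb.XGBRegressor(
    n_estimators=200, max_depth=4, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8)
gbm.fit(x_train_l2, y_train)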

Now to make our final predictions…
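
Finally, the second-layer model is applied to the stacked test matrix for each date and the results written out in the competition's submission format (a sketch, assuming the test matrices were built in the same parcel order as the properties frame):

submission = pd.DataFrame({'ParcelId': properties['parcelid']})

# One column of predictions per target date
submission['201610'] = gbm.predict(x_test_l2_201610)
submission['201611'] = gbm.predict(x_test_l2_201611)
submission['201612'] = gbm.predict(x_test_l2_201612)
submission['201710'] = gbm.predict(x_test_l2_201710)
submission['201711'] = gbm.predict(x_test_l2_201711)
submission['201712'] = gbm.predict(x_test_l2_201712)

submission.to_csv('submission.csv', index=False)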

James