Linear Ensemble

Posted on August 17, 2017 by Jader Martins

Linear Ensemble of Regressors for Price Estimates

Pricing an item well before it enters the market is essential for good acceptance and sales. Offering a product below the market price does not give you good returns, but a very high price does not appeal to buyers either; regression models are a great help in making this pricing decision. The predictive performance of composite models compared to simple models has been remarkable in many areas 1. Simple models use a single machine learning algorithm, whereas composite models combine the predictions of two or more algorithms in an attempt to improve the prediction. In this post I will present efficient ways to combine models to minimize the error of Boston house price predictions.
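To make that idea concrete before touching the real data, here is a minimal sketch with hypothetical toy data (not the Boston dataset used below): two regressors are fit independently and their predictions are simply averaged.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Toy data: 100 samples, 3 features, a noisy linear target.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.randn(100)

# Two "simple" models fitted independently.
pred_a = Ridge().fit(X, y).predict(X)
pred_b = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).predict(X)

# The "composite" prediction: just average the two.
ensemble = (pred_a + pred_b) / 2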

Preparing the Data

Here I will use the famous Boston house price dataset, but the technique discussed here can be extended to pricing almost anything. First I import and load the dataset into the "boston" variable using Pandas, the Python module famous for data analysis focused on finance. The dataset ships with the Scikit-Learn module, which we will use throughout this post for machine learning; it provides tools ranging from data handling to a full machine learning pipeline. We will also use the Numpy module.

%config InlineBackend.figure_formats = ['svg']
%matplotlib inline

from sklearn.datasets import load_boston
import numpy as np
import pandas as pd

boston = load_boston()

df = pd.DataFrame(
    np.column_stack([boston.data, boston.target]),
    columns=np.r_[boston.feature_names, ['MEDV']])
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Here I load my data into the df variable and show the first 5 lines with the head command.

We have information such as the crime rate of the region, the age of the properties, and so on. Although it is not the focus of this post, the distribution of our data can make things very difficult for our regressor, so I will apply some simple feature engineering to bring the distribution closer to normal; in future posts feature engineering, and how to use it to improve your predictions, will be explained in detail. First, let's compare the distribution we want to predict with the same distribution "normalized" by log(x + 1) (adding one to the value avoids problems with zeros).
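As a quick aside on the "+1": columns such as ZN and CHAS contain zeros, and the plain logarithm blows up at zero while log1p handles it gracefully. A minimal check:

import numpy as np

# log(x) explodes toward -inf as x approaches zero, while log1p(0) is simply 0.
print(np.log(1e-9))    # a value near zero gives a very large negative number
print(np.log1p(0.0))   # 0.0, zeros are handled gracefully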

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", palette="coolwarm")
df.plot.box(figsize=(10,3), patch_artist=True);

First I load the charting libraries that I will use throughout the text and set the style and color palette for the charts. Then I set up a prices dataframe with two columns of values: one with the untransformed price and the other with the price transformed by the log1p ($\log(x + 1)$) function.

prices = pd.DataFrame({"Price":df["MEDV"], "log(Price + 1)":np.log1p(df["MEDV"])})

prices.hist(color="#F1684E", bins=50, figsize=(10,3))
plt.ylabel("Amount");

We can see that the transformed distribution is less spread out and a little closer to a normal distribution. Python also has a statistical function that helps us evaluate whether this is necessary: the skewness measure which, in the spirit of the Box-Cox transformation, gives us clues about the degree of asymmetry of each column.

from scipy.stats import skew

for col in df.keys():
    sk = skew(df[col])
    if sk > 0.75:
        print(col, sk)

CRIM 5.207652387859715
ZN 2.219063057148425
CHAS 3.395799292642519
DIS 1.0087787565152246
RAD 1.0018334924536951
LSTAT 0.9037707431346133
MEDV 1.104810822864635

A Bit of Feature Engineering

The Box-Cox test 2 tells us that a skew above 0.75 can be linearized with the log(x+1) function, making the distribution more normal; below that, I can keep the value as it is, with no need for modification. Let's look at the before and after of applying this function to our distributions. (I suppressed some variables so as not to clutter the chart too much.)

dfnorm = (df - df.mean()) / (df.std())
for x in ["CRIM", "ZN", "CHAS","MEDV"]:
    sns.kdeplot(dfnorm[x])
plt.title("Distrution Mean Value")
plt.xlabel("Price")
plt.ylabel("Amount");

for col in df.keys():
    sk = skew(df[col])
    if sk > 0.75:
        df[col] = np.log1p(df[col])
dfnorm = (df - df.mean()) / (df.std())
for x in ["CRIM", "ZN", "CHAS","MEDV"]:
    sns.kdeplot(dfnorm[x])
plt.title("Distribution Mean Value")
plt.xlabel("Price")
plt.ylabel("Amount");

We can see that the distributions became much more centered and tend toward a Gaussian distribution 3, which will be excellent for fitting our estimators 4. Since the logarithm and the x + 1 functions are bijective, we will be able to return to our original values as soon as we finish fitting the model.
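As a short sketch of that inversion (np.expm1 is the inverse of np.log1p; the price value below is hypothetical):

import numpy as np

# expm1(x) = exp(x) - 1 undoes log1p(x) = log(x + 1),
# so a prediction made on the transformed scale maps straight back to a price.
price = 24.0  # a hypothetical MEDV value
assert np.isclose(np.expm1(np.log1p(price)), price)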

Simplifying Our Data

Our data may still be too complex; the scale they are on, and perhaps an excess of information, can keep our model from doing its best. Here I will apply min-max scaling to the variables, a transformation that is also reversible and can be undone on the final price, as long as I store the parameters of my transformation.

df.std()

CRIM         1.022731
ZN           1.620831
INDUS        6.860353
CHAS         0.176055
NOX          0.115878
RM           0.702617
AGE         28.148861
DIS          0.413390
RAD          0.751839
TAX        168.537116
PTRATIO      2.164946
B           91.294864
LSTAT        0.539033
MEDV         0.386966
dtype: float64

It is visible that some variables are extremely dispersed; we can change this with the following formula:


$$ z_i=\frac{x_i-\min(x)}{\max(x)-\min(x)} $$

Thus our variables will be between zero and one, making the prediction simpler.

dfmin, dfmax = df.min(), df.max()
df = (df - df.min())/(df.max()-df.min())
df.std()

CRIM       0.227615
ZN         0.351200
INDUS      0.251479
CHAS       0.253994
NOX        0.238431
RM         0.134627
AGE        0.289896
DIS        0.227300
RAD        0.297672
TAX        0.321636
PTRATIO    0.230313
B          0.230205
LSTAT      0.202759
MEDV       0.180819
dtype: float64

Great!!
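A short sketch of how the saved parameters undo the scaling later on, assuming df, dfmin, dfmax and numpy from the cells above (and remembering that MEDV is still on the log(x + 1) scale after the inversion):

# Invert the min-max scaling for the target column using the saved dfmin/dfmax,
# then invert the log1p transform to get back to the original price scale.
medv_log = df["MEDV"] * (dfmax["MEDV"] - dfmin["MEDV"]) + dfmin["MEDV"]
medv_price = np.expm1(medv_log)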

All ready

With our data preparation finished after so much work, we can now fit our models. Get used to it: handling the data is what will consume most of your time in a machine learning project. But first, let's take a final look at how the variables ended up distributed. I will use Pandas' built-in boxplot function; if you have any doubt about what this chart represents, see here.

df.plot.box(figsize=(10,3), patch_artist=True);

As already discussed in other posts, we should split the data into a training set and a test set, so we can train our model and then measure how well it predicts unseen cases. Read this article for a better understanding.

Here we use Scikit-Learn's built-in function to split the data; for additional information on all of the function's parameters I suggest consulting the official documentation. As the first argument I pass my X, the attributes, and as the second my y, the value I want to predict; finally I pass an integer as random_state to make my results reproducible by fixing the function's random processes.

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest =\
    train_test_split(df.drop('MEDV', axis=1).values, df['MEDV'].values, random_state=201)

Now we will import our two models: the first is LightGBM, a gradient boosting algorithm that has been proving extremely efficient in competitions, and the second is the famous Ridge regression. We will evaluate our models by mean squared error.

from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

Here I do a light hyperparameter search with GridSearchCV, looking for the combination of hyperparameters that gives the best prediction. I then fit the model to the training data and, once it is trained, predict the data it has not seen and evaluate its performance with the metric mentioned above.

params = {'alpha': np.linspace(0.1,1,200),
          'random_state':[2020]}

model1 = GridSearchCV(estimator = Ridge(), param_grid = params, cv=5)
model1.fit(xtrain,ytrain)
linpred = model1.predict(xtest)

err1 = mean_squared_error(linpred, ytest)
print(err1)

0.00745856883004946

params = {'reg_alpha': np.linspace(0,1,10),
          'gamma': np.linspace(0,1,1),
          'reg_lambda': np.linspace(0,1,1)}

model2 = GridSearchCV(estimator = LGBMRegressor(), param_grid = params, cv=5)
model2.fit(xtrain, ytrain)
lgbmpred = model2.predict(xtest)

err2 = mean_squared_error(lgbmpred, ytest)
print(err2)

0.005040440132637956

Very good results, but can we make them even better?! Let's check whether our predictions have a low correlation with each other.

predictions = pd.DataFrame({"LightGBM":np.expm1(lgbmpred), "Ridge":np.expm1(linpred)})
predictions.plot(x = "LightGBM", y = "Ridge", kind = "scatter", color="#85C8DD");

As already explained, combining predictions that have a low correlation with each other tends to improve the final result significantly. Visually there is some disagreement between the two models; let's now look at that in numbers.

from scipy import stats
_, _, r_value, _, std_err = stats.linregress(np.expm1(lgbmpred),np.expm1(linpred))
print(r_value, std_err)

0.9193021766109413 0.03313351573076193

Since our r-value is not too high (< 0.95), we can benefit from combining the estimates. We now reach the initial motivation of this post: combining models to increase predictive performance. I will test three ways of combining the predictions: weighted mean, simple mean, and harmonic mean.

err3 = mean_squared_error(lgbmpred * 0.8 + linpred * 0.2, ytest) # weighted mean
err4 = mean_squared_error((lgbmpred + linpred)/2, ytest) # mean
err5 = mean_squared_error(stats.hmean([lgbmpred, linpred]), ytest) # harmonic mean
print(err3, err4, err5)

0.004830881999425605 0.005166404680258313 0.004927370731820139
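The 0.8/0.2 weights above were picked by hand. As a rough sketch (reusing lgbmpred, linpred and ytest from the cells above), one could sweep the weight and keep the best value, ideally on a separate validation set rather than the test data:

# Sweep the mixing weight between the two models and keep the lowest error.
weights = np.linspace(0, 1, 101)
errors = [mean_squared_error(w * lgbmpred + (1 - w) * linpred, ytest) for w in weights]
best_w = weights[int(np.argmin(errors))]
print(best_w, min(errors))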

Excellent, there was a significant improvement, but how significant?

1-err3/err2

0.041575363995579706

That's right, a 4% improvement over our best single estimator, quite significant for something so simple. Improvements like this over high-performance algorithms are of the utmost importance in the data science world, and might even help us jump thousands of positions toward the top of a competition worth $1.2 million 5.

Concluding

The main purpose of this publication was to demonstrate that a simple combination of two models can significantly impact the prediction error. Along the way I also applied some data treatments whose impact on the error may surprise you; try evaluating the models without some of the treatments applied here. In future publications, each technique seen here will be explained in more detail.

References

Inverse Variance

Bootstrap_aggregating Wikipedia

Regularized Linear Models Kernel


  1. Polikar, R. (2006). “Ensemble based systems in decision making”. IEEE Circuits and Systems Magazine. 6 (3): 21–45. doi:10.1109/MCAS.2006.1688199

  2. http://www.itl.nist.gov/div898/handbook/eda/section3/eda336.htm

  3. https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va

  4. https://stats.stackexchange.com/questions/18844/when-and-why-should-you-take-the-log-of-a-distribution-of-numbers

  5. https://www.kaggle.com/c/zillow-prize-1