Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
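
The normal draws above are unseeded, so y changes from run to run. For reproducible output one could fix NumPy's seed before sampling; a minimal addition (the particular seed value is arbitrary, not from the original notebook):

np.random.seed(12345)  # arbitrary fixed seed for reproducibility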

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     600.0
Date:                Tue, 28 Feb 2017   Prob (F-statistic):           7.17e-37
Time:                        21:33:42   Log-Likelihood:                -9.5061
No. Observations:                  50   AIC:                             27.01
Df Residuals:                      46   BIC:                             34.66
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0395      0.104     48.462      0.000       4.830       5.249
x1             0.4963      0.016     30.946      0.000       0.464       0.529
x2             0.5288      0.063      8.388      0.000       0.402       0.656
x3            -0.0199      0.001    -14.100      0.000      -0.023      -0.017
==============================================================================
Omnibus:                        0.820   Durbin-Watson:                   2.161
Prob(Omnibus):                  0.664   Jarque-Bera (JB):                0.796
Skew:                           0.038   Prob(JB):                        0.672
Kurtosis:                       2.387   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
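
Beyond the printed summary, the results object exposes the same quantities programmatically; a short sketch using standard statsmodels results attributes:

print(olsres.params)      # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)         # their standard errors
print(olsres.conf_int())  # 95% confidence intervals by default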

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.54314027   5.03335422   5.48246584   5.86165443   6.15250055
   6.35001224   6.46344521   6.51478214   6.53512103   6.5595658
   6.6214589    6.74690321   6.95047339   7.23282134   7.58056907
   7.96850666   8.36373421   8.73106718   9.0388171    9.26399793
   9.39610256   9.43882912   9.4094734    9.33608718   9.25286939
   9.19454737   9.19067122   9.26075883   9.41108913   9.63367047
   9.90755367  10.20227387  10.48285582  10.71556023  10.87342706
  10.94070491  10.91543903  10.80979238  10.64804735  10.46261622
  10.28871591  10.15858195  10.09617224  10.11323087  10.20735861
  10.36240705  10.55113177  10.73966852  10.893097    10.9811782 ]
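
As a quick consistency check (not part of the original notebook), in-sample predictions should match the fitted values stored on the results object:

print(np.allclose(ypred, olsres.fittedvalues))  # expected: True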

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.97090663  10.82164236  10.55412406  10.21561251   9.8683195
   9.57417627   9.37967053   9.30446465   9.33658167   9.43533763]
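
Point predictions carry no uncertainty information. For interval estimates, the results object's get_prediction method returns an object whose summary_frame includes both confidence and prediction intervals; a minimal sketch (column names as in current statsmodels releases):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, and obs_ci_* columns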

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform, i.e., we don't want any expansion magic from using **2.
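
To verify which columns patsy builds from this formula, one can inspect the design matrix directly; a minimal sketch using patsy's dmatrix (patsy is the formula engine statsmodels uses):

from patsy import dmatrix
dm = dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(dm.design_info.column_names)  # matches the parameter index in the next cell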

In [8]:
res.params
Out[8]:
Intercept           5.039514
x1                  0.496312
np.sin(x1)          0.528829
I((x1 - 5) ** 2)   -0.019855
dtype: float64

Now we only have to pass the single variable, and we get the transformed right-hand-side variables automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.970907
1    10.821642
2    10.554124
3    10.215613
4     9.868320
5     9.574176
6     9.379671
7     9.304465
8     9.336582
9     9.435338
dtype: float64
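
The formula interface supports interval estimates for new data as well: get_prediction accepts the same dict of raw variables and applies the stored formula transformations, assuming a reasonably recent statsmodels version:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # point predictions with interval columns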