# pandas - Python - Rolling window OLS Regression estimation


For my evaluation, I have a dataset, available at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), in the following format. The third column (Y) is the true value, which is what I want to predict (estimate).

``````
time      X  Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239
``````

I want to run a rolling `OLS regression estimation` with a window of, for example, 5 observations, and I have tried it with the following script.

``````
#!/usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# NOTE: pd.stats.ols.MovingOLS only exists in pandas versions before 0.20.0
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=5, intercept=True)
df['Y_hat'] = model.y_predict

print(df['Y_hat'])
print(model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)
``````

The summary of the regression analysis is shown below.

``````
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <X> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:           -inf

Rmse:              0.0000

F-stat (1, 3):        nan, p-value:        nan

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
X     0.0000     0.0000       1.97     0.1429     0.0000     0.0000
intercept   239.0000     0.0000 14567091934632472.00     0.0000   239.0000   239.0000
---------------------------------End of Summary---------------------------------
``````

I want to do a backward prediction of `Y` at `t+1` (i.e. predict the next value of `Y` from the previous values, `p(Y)t+1`) and compute the mean squared error (`MSE`) of those predictions. For example, if we look at row 5, the value of `X` is 2 and the value of `Y` is 10. Let's say the predicted value (`p(Y)t+1`) is 6; the squared error for that row will then be `(10-6)^2`. How can we do this using either `statsmodels` or `scikit-learn`, since `pd.stats.ols.MovingOLS` was removed in `Pandas` version 0.20.0 and I can't find any reference?

Here is an outline of rolling OLS with statsmodels that should work for your data. Simply use `df=pd.read_csv('estimated_pred.csv')` instead of my randomly generated df:

``````
import pandas as pd
import numpy as np
import statsmodels.api as sm

#random data; for your own data use df=pd.read_csv('estimated_pred.csv') instead
df=pd.DataFrame(np.random.normal(size=(500,3)),columns=['time','X','Y'])
df=df.dropna() #drop rows with nans before fitting
window = 5

df['a']=np.nan #constant
df['b1']=np.nan #beta1
df['b2']=np.nan #beta2
for i in range(window,len(df)):
    temp=df.iloc[i-window:i,:] #the `window` rows before row i
    #fit Y ~ const + time + X on this window; has_constant='add' always adds the constant column
    RollOLS=sm.OLS(temp['Y'],sm.add_constant(temp[['time','X']],has_constant='add')).fit()
    df.iloc[i,df.columns.get_loc('a')]=RollOLS.params['const']
    df.iloc[i,df.columns.get_loc('b1')]=RollOLS.params['time']
    df.iloc[i,df.columns.get_loc('b2')]=RollOLS.params['X']

#The following line gives you predicted values in a row, given the PRIOR row's estimated parameters
df['predicted']=df['a'].shift(1)+df['b1'].shift(1)*df['time']+df['b2'].shift(1)*df['X']
``````

I store the constant and betas, but there are a number of ways to approach predicting. You can use your fitted model object (mine is `RollOLS`) and its `.predict()` method, or multiply it out yourself, which is what I did in the final line (this is easier here because the number of variables is fixed and known, so you can do simple column math all in one go).
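To get the error measure asked about in the question, one option is to compare each prediction with the true `Y` and average the squared differences. The sketch below uses a tiny hand-made frame rather than the real data; the `predicted` values (including the 6 from the question's example) and the `sq_err` column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative values only: Y is the true value, predicted is a one-step-ahead
# prediction (NaN where no earlier window exists yet).
df = pd.DataFrame({
    'Y':         [10.0, 10.0, 10.0, 10.0],
    'predicted': [np.nan, 6.0, 9.0, 10.0],
})

# Squared error per row, e.g. (10 - 6)^2 for the second row.
df['sq_err'] = (df['Y'] - df['predicted']) ** 2

# Mean squared error over the rows that have a prediction;
# Series.mean() skips NaN by default.
mse = df['sq_err'].mean()

print(df)
print('MSE:', mse)
```

The same two lines (`sq_err` and `mse`) should apply unchanged to a frame that already has a `predicted` column from the rolling regression.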

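To do predictions with sm as you go, though, it would look like this:

``````
predict_x=np.random.normal(size=(20,2)) #20 new rows of (time, X)
RollOLS.predict(sm.add_constant(predict_x,has_constant='add')) #uses the most recently fitted window
``````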

But keep in mind that if you ran the above code in sequence, the predicted values would use the model of the last window only. If you want to use a different model, you can save the models as you go, or predict values within the for loop. Note that you can also get fitted values with `RollOLS.fittedvalues`, so if you are smoothing data, pull and save `RollOLS.fittedvalues.iloc[-1]` for each iteration in the loop (positional access, since `fittedvalues` is a pandas Series here).

To help see how to use this for your own data, here is the tail of my df after the rolling regression loop is run:

``````
    time        X           Y           a           b1          b2
495 0.662463    0.771971    0.643008    -0.0235751  0.037875    0.0907694
496 -0.127879   1.293141    0.404959    0.00314073  0.0441054   0.113387
497 -0.006581   -0.824247   0.226653    0.0105847   0.0439867   0.118228
498 1.870858    0.920964    0.571535    0.0123463   0.0428359   0.11598
499 0.724296    0.537296    -0.411965   0.00104044  0.055003    0.118953
``````