In this post we are talking about * autoregressive models* and their application to a financial world. This model follows the idea that the next value of the serie is related with the p previous values.

## Definition of p-order autoregressive model

An *autoregressive model**,* or *AR**,* is a type of modelling that explains predicted variables as a linear combination of the last p observed values plus a constant and an error. We can write it as follows:

This model is both really simple and difficult to use because it requires the serie that we want to model to be stationary. This feature is difficult to find in reality because mean and variance have to be stationary throughout time. There are many upgrades of this model that will be introduced at the end of the post.

In python there are some libraries that help to us to define these kind of models.

from statsmodels.tsa.ar_model import AR from statsmodels.graphics import tsaplots

The first library (AR) is useful to simulate the AR model and the second one (tsaplots) allows us to plot different features like autocorrelation or partial autocorrelation.

### Downloading our data

We need to import more libraries and functions to develop our code:

from pandas.io.data import DataReader from datetime import datetime import pandas as pd import numpy as np import matplotlib.pyplot as plt

The first function (DataReader) is used to download data from Yahoo. We only need to know the ticker of the financial series that we want to use. In this case we are using two important indexes in the world, Standard & Poors 500 (^GSPC) and Euro Stoxx 50 (^STOXX50E). Using DataReader we can download the asset of both indexes between two dates and plot the close prices starting at 100:

## Downloading both indexes from Yahoo. ini = datetime(1998,1,1) end = datetime.today() spx = DataReader('^GSPC', 'yahoo', ini, end) stx = DataReader('^STOXX50E', 'yahoo', ini, end) ## Indexes spot. spx = spx['Close']/spx['Close'][0]*100 stx = stx['Close']/stx['Close'][0]*100 # Colours (S&P 500, Euro Stoxx 50). c = ['#e41a1c','#377eb8'] ## Spot comparison. f = plt.figure() ax = f.add_subplot(111) ax.plot(spx.index, spx, label='S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500', c=c[0], lw=1.) ax.plot(stx.index, stx, label='Euro Stoxx 50', c=c[1], lw=1.) ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=2, mode='expand', borderaxespad=0.)

We can see that spot of the both indexes is not stationary and it looks like both series are related (you can see more about it in “Let’s make a deal”: from TV shows to identifying trends). One series that could be stationary is the logarithmic returns of a financial series, because it should have zero mean (approx) throughout time. To calculate this series and show differences between logarithmic and simple returns, you can read ¿Por qué usar rendimientos logarítmicos?. Before we can plot a comparison between the logarithmic returns of both indexes, function and libraries must be imported.

## Logarithmic daily returns. lr_spx = np.log(np.divide(spx[1:], spx[0:-1])) lr_stx = np.log(np.divide(stx[1:], stx[0:-1])) ## Logarithmic daily returns comparison (two subplots). # S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500 logarithmic daily returns plot. f = plt.figure() ax = f.add_subplot(211) ax.plot(lr_spx.index, lr_spx, c=c[0], lw=1.) ax.set_title('S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500 logarithmic daily return') # Y-axis ticks to percentage. vals = ax.get_yticks() ax.set_yticklabels(['{:3.0f}%'.format(x*100) for x in vals]) # Euro Stoxx 50 logarithmic daily returns plot. ax = f.add_subplot(212) ax.plot(lr_stx.index, lr_stx, c=c[1], lw=1.) ax.set_title('Euro Stoxx 50 logarithmic daily return') # Y-axis ticks to percentage. vals = ax.get_yticks() ax.set_yticklabels(['{:3.0f}%'.format(x*100) for x in vals])

There are some groups of high volatility instead of a stationary variance. It will be a problem because we’re not accomplishing the main feature of this model (stationarity).

### Autoregressive model features

Despite this, we can continue choosing another important feature of this model, the model’s order. To help us choose we can plot the partial autocorrelation that measures the degree of association between the series and itself, without the effect of a set of controlling random variables. Like Marty and Doc in *Back to the Future Part II,* we stop our time travel on 21th of October, 2015 and using tsaplots we can plot pacf (partial autocorrelation function):

# Last date in which the return is known. ini_spx = lr_spx.index.get_loc(datetime(2015,10,21)) ini_stx = lr_stx.index.get_loc(datetime(2015,10,21)) # Partial autocorrelation. tsaplots.plot_pacf(lr_spx[:ini_spx+1], lags=20, alpha=.05) tsaplots.plot_pacf(lr_stx[:ini_stx+1], lags=20, alpha=.05)

The correlation between an index and itself is really low. Our trip to the past is not very important because the following return is loosely correlated with the last returns. There are some different ways to model a series.

: Using a fixed window (between 1st of January, 1998 and 21th of October, 2015) we can configure a model and use it to predict all future returns. This way is the worst because the returns used are out of date if we predict many returns.*Fixed window*: Using the last k returns we can configure a model and use it to predict today’s return. K needs to be large (3 or 5 years, for example).*Rolling window*: Using all knowing data (all returns until today) we can configure a model and use it only to predict today’s return. This is a mix of the others.*Expansive window*

In this post we are using **expansive window**, but you can test the model using the other alternatives. There are many methods to select the optimal lag length (order) of the model:

: It’s a measure of relative quality for a statistical model given a set of data. The model with the lowest AIC is selected. Let L be the maximum value of likelihood function for the model and the number of estimated parameters in the model (k):*Akaike Information Criterion or aic*

: it is a criterion for model selection among a finite set of models. The model with the lowest BIC is selected. Given L, k and n (the number of observations):*Bayesian Information Criterion or bic*

: It is an alternative method to AIC and BIC. The model with the lowest HQC is selected. Given L, k and n:*Hannan-Quinn Information Criterion or hqc*

## Modelling with AR(p)

In this post we are using Bayesian Information Criterion (bic). You can test the other criterion and find the differences in predictions. To define an * expansive window using bic* on Standard&Poors 500 logarithmic returns we need to write a loop like:

# Modelling S&amp;amp;P 500 lr_simu_spx = lr_spx.copy() s_spx = np.zeros((2,1)) for t in range(ini_spx+1,len(spx)-1): # model = AR(lr_spx[:ini_spx+1]) # Fixed window # model = AR(lr_spx[t-780:t]) # Rolling window model = AR(lr_spx[:t]) # Expansive window result = model.fit(maxlag=20, ic='bic') lr_simu_spx[t] = model.predict(result.params)[-1] # lr_simu_spx[t] = np.dot(result.model.X[-1,:],result.params) if np.equal(np.sign(lr_simu_spx[t]),np.sign(lr_spx[t])): s_spx[1] = s_spx[1]+1 else: s_spx[0] = s_spx[0]+1

We need to do the same in Euro Stoxx 50 and plot the comparison between the real and the predict returns:

xl = 'Real LogReturns' # X-axis label yl = 'AR LogReturns' # Y-axis label m = 'o' # Marker ms = 15 # Marker size ## Scatter plot. f = plt.figure() ax = f.add_subplot(121) ax.plot(lr_spx[ini_spx+1:], lr_simu_spx[ini_spx+1:], c=c[0], lw=0, marker=m, ms=ms) xmin,xmax = ax.get_xlim() ax.set_ylim((xmin,xmax)) ax.set_xlabel(xl) ax.set_ylabel(yl) ax.set_title('S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500 LogReturns comparison', size=25) ax = f.add_subplot(122) ax.plot(lr_stx[ini_stx+1:], lr_simu_stx[ini_stx+1:], c=c[1], lw=0, marker=m, ms=ms) xmin,xmax = ax.get_xlim() ax.set_ylim((xmin,xmax)) ax.set_xlabel(xl) ax.set_ylabel(yl) ax.set_title('Euro Stoxx 50 LogReturns comparison', size=25)

It’s easy to see that real logarithmic returns are bigger than predicted logarithmic returns. Let’s see if the sign of predicted returns is like the real sign:

l = 'Non equal', 'Equal' # Labels e = (0.05, 0.05) # Explode prc = '%3.2f%%' ## Pie chart f = plt.figure() ax = f.add_subplot(121) c = ('#fcbba1','#de2d26') # S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500 colours. ax.pie(s_spx, labels=l, explode=e, autopct=prc, startangle=90, colors=c) ax.set_title('S&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;P 500 Equal LogReturns Sign') ax = f.add_subplot(122) c = ('#c6dbef','#3182bd') # Euro Stoxx 50 colours. ax.pie(s_stx, labels=l, explode=e, autopct=prc, startangle=90, colors=c) ax.set_title('Euro Stoxx 50 Equal LogReturns Sign')

**Modelling our data with autoregressive model gets closer success to a random sign than a good approach of the predicting returns**. It is not a good result, but we expected that when we saw the logarithmic returns and the low partial autocorrelation.

## Model upgrades

As I said at the beginning of the post, there are many upgrades of this model, because this it’s difficult to apply in reality. We can find models like:

.*Autoregressive-Moving Average model or ARMA*.*Autoregressive Integrated Moving Average model or ARIMA*.*Autoregressive conditional heteroskedasticity or ARCH*.*Generalized Autoregressive conditional heteroskedasticity or GARCH*

You can find out more about these models in the paper called “Introduction to (G)ARCH Models in Time Series Econometrics”