Time Series Analysis in Python with statsmodels

We introduce the new time series analysis features of scikits.statsmodels. This includes descriptive statistics, statistical tests and several linear model classes, autoregressive, AR, autoregressive moving-average, ARMA, and vector autoregressive models VAR.


Introduction
Statsmodels is a Python package that provides a complement to SciPy for statistical computations including descriptive statistics and estimation of statistical models. Beside the initial models, linear regression, robust linear models, generalized linear models and models for discrete data, the latest release of scikits.statsmodels includes some basic tools and models for time series analysis. This includes descriptive statistics, statistical tests and several linear model classes: autoregressive, AR, autoregressive movingaverage, ARMA, and vector autoregressive models VAR. In this article we would like to introduce and provide an overview of the new time series analysis features of statsmodels. In the outlook at the end we point to some extensions and new models that are under development.
Time series data comprises observations that are ordered along one dimension, that is time, which imposes specific stochastic structures on the data. Our current models assume that observations are continuous, that time is discrete and equally spaced and that we do not have missing observations. This type of data is very common in many fields, in economics and finance for example, national output, labor force, prices, stock market values, sales volumes, just to name a few.
In the following we briefly discuss some statistical properties of the estimation with time series data, and then illustrate and summarize what is currently available in statsmodels.

Ordinary Least Squares (OLS)
The simplest linear model assumes that we observe an endogenous variable y and a set of regressors or explanatory variables x, where y and x are linked through a simple linear relationship plus a noise or error term y t = x t β + ε t In the simplest case, the errors are independently and identically distributed. Unbiasedness of OLS requires that the regressors and errors be uncorrelated. If the errors are additionally normally distributed and the regressors are non-random, then the resulting OLS or maximum likelihood estimator (MLE) of β is also normally distributed in small samples. We obtain the same result, if we consider consider the distributions as conditional on x t when they are exogenous random variables. So far this is independent whether t indexes time or any other index of observations. When we have time series, there are two possible extensions that come from the intertemporal linkage of observations. In the first case, past values of the endogenous variable influence the expectation or distribution of the current endogenous variable, in the second case the errors ε t are correlated over time. If we have either one case, we can still use OLS or generalized least squares GLS to get a consistent estimate of the parameters. If we have both cases at the same time, then OLS is not consistent anymore, and we need to use a non-linear estimator. This case is essentially what ARMA does.

Linear Model with autocorrelated error (GLSAR)
This model assumes that the explanatory variables, regressors, are uncorrelated with the error term. But the error term is an autoregressive process, i.e.
Linear Model with lagged dependent variables (OLS, AR, VAR) This group of models assume that past dependent variables, y t−i , are included among the regressors, but that the error term are not serially correlated E(ε t , ε s ) = 0, for t = s y t = a 1 y t−1 + a 2 y t−1 + ... + a k y t−k + x t β + ε t Dynamic processes like autoregressive processes depend on observations in the past. This means that we have to decide what to do with the initial observations in our sample where we do nnt observe any past values.
The simplest way is to treat the first observation as fixed, and analyse our sample starting with the k-th observation. This leads to conditional least squares or conditional maximum likelihood estimation. For conditional least squares we can just use OLS to estimate, adding past endog to the exog. The vector autoregressive model (VAR) has the same basic statistical structure except that we consider now a vector of endogenous variables at each point in time, and can also be estimated with OLS conditional on the initial information. (The stochastic structure of VAR is richer, because we now also need to take into account that there can be contemporaneous correlation of the errors, i.e. correlation at the same time point but across equations, but still uncorrelated across time.) The second estimation method that is currently available in statsmodels is maximum likelihood estimation. Following the same approach, we can use the likelihood function that is conditional on the first observations. If the errors are normaly distributed, then this is essentially equivalent to least squares. However, we can easily extend conditional maximum likelihood to other models, for example GARCH, linear models with generalized autoregressive conditional heteroscedasticity, where the variance depends on the past, or models where the errors follow a non-normal distribution, for example Student-t distributed which has heavier tails and is sometimes more appropriate in finance.
The second way to treat the problem of intial conditions is to model them together with other observations, usually under the assumption that the process has started far in the past and that the initial observations are distributed according to the long run, i.e. stationary, distribution of the observations. This exact maximum likelihood estimator is implemented in statsmodels for the autoregressive process in statsmodels.tsa.AR, and for the ARMA process in statsmodels.tsa.ARMA.

Autoregressive Moving average model (ARMA)
ARMA combines an autoregressive process of the dependent variable with a error term, moving-average or MA, that includes the present and a linear combination of past error terms, an ARMA(p,q) is defined as E(ε t , ε s ) = 0, for t = s y t = µ + a 1 y t−1 + ... + a k y t−p + ε t + b 1 ε t−1 + ... + b q ε t−q As a simplified notation, this is often expressed in terms of lagpolynomials as L is the lag or shift operator, L i x t = x t−i , L 0 = 1. This is the same process that scipy.lfilter uses. Forecasting with ARMA models has become popular since the 1970's as Box-Jenkins methodology, since it often showed better forecast performance than more complex, structural models. Using OLS to estimate this process, i.e. regressing y t on past y t−i , does not provide a consistent estimator. The process can be consistently estimated using either conditional least squares, which in this case is a non-linear estimator, or conditional maximum likelihood or with exact maximum likelihood. The difference between conditional methods and exact MLE is the same as described before. statsmodels provides estimators for both methods in tsa.ARMA which will be described in more detail below.
Time series analysis is a vast field in econometrics with a large range of models that extend on the basic linear models with the assumption of normally distributed errors in many ways, and provides a range of statistical tests to identify an appropriate model specification or test the underlying assumptions.
Besides estimation of the main linear time series models, statsmodels also provides a range of descriptive statistics for time series data and associated statistical tests. We include an overview in the next section before describing AR, ARMA and VAR in more details. Additional results that facilitate the usage and interpretation of the estimated models, for example impulse response functions, are also available.

OLS, GLSAR and serial correlation
Suppose we want to model a simple linear model that links the stock of money in the economy to real GDP and consumer price index CPI, example in Greene (2003, ch. 12). We import numpy and statsmodels, load the variables from the example dataset included in statsmodels, transform the data and fit the model with OLS: import numpy as np import scikits.statsmodels.api as sm tsa = sm.tsa # as shorthand np.log(mdata['cpi'])]) exog = sm.add_constant(exog, prepend=True) res1 = sm.OLS(endog, exog).fit() print res1.summary() provides the basic overview of the regression results. We skip it here to safe on space. The Durbin-Watson statistic that is included in the summary is very low indicating that there is a strong autocorrelation in the residuals. Plotting the residuals shows a similar strong autocorrelation.
As a more formal test we can calculate the autocorrelation, the Ljung-Box Q-statistic for the test of zero autocorrelation and the associated p-values: To see how many autoregressive coefficients might be relevant, we can also look at the partial autocorrelation coefficients Similar regression diagnostics, for example for heteroscedasticity, are available in statsmodels.stats.diagnostic. Details on these functions and their options can be found in the documentation and docstrings.
The strong autocorrelation indicates that either our model is misspecified or there is strong autocorrelation in the errors. If we assume that the second is correct, then we can estimate the model with GLSAR. As an example, let us assume we consider four lags in the autoregressive error. mod2 = sm.GLSAR(endog, exog, rho=4) res2 = mod2.iterative_fit () iterative_fit alternates between estimating the autoregressive process of the error term using tsa.yule_walker, and feasible sm.GLS.
Looking at the estimation results shows two things, the parameter estimates are very different between OLS and GLS, and the autocorrelation in the residual is close to a random walk: This indicates that the short run and long run dynamics might be very different and that we should consider a richer dynamic model, and that the variables might not be stationary and that there might be unit roots.

Stationarity, Unit Roots and Cointegration
Loosely speaking, stationarity means here that the mean, variance and intertemporal correlation structure remains constant over time. Non-stationarities can either come from deterministic changes like trend or seasonal fluctuations, or the stochastic properties of the process, if for example the autoregressive process has a unit root, that is one of the roots of the lag polynomial is on the unit circle. In the first case, we can remove the deterministic component by detrending or deseasonalization. In the second case we can take first differences of the process, Differencing is a common approach in the Box-Jenkins methodology and gives rise to ARIMA, where the I stands for integrated processes, which are made stationary by differencing. This lead to a large literature in econometrics on unit-root testing that tries to distinguish deterministic trends from unit roots or stochastic trends. statsmodels provides the augmented Dickey-Fuller test. Monte Carlo studies have shown that it is often the most powerful of all unit roots test.
To illustrate the results, we just show two results. Testing the log of the stock of money with a null hypothesis of unit roots against an alternative of stationarity around a linear trend, shows an adf-statistic of -1.5 and a p-value of 0.8, so we are far away from rejecting the unit root hypothesis: tsa.adfuller(endog, regression="ct")[:2] (-1.561, 0.807) If we test the differenced series, that is the growth rate of moneystock, with a Null hypothesis of Random Walk with drift, then we can strongly reject the hypothesis that the growth rate has a unit root (p-value 0.0002) tsa.adfuller(np.diff(endog), regression="c")[:2] (-4.451, 0.00024)

ARMA processes and data
The identification for ARIMA(p,d,q) processes, especially choosing the number of lagged terms, p and q, to include, remains partially an art. One recommendation in the Box-Jenkins methodology is to look at the pattern in the autocorrelation (acf) and partial autocorrelation (pacf) functions scikits.statsmodels.tsa.arima_process contains a class that provides several properties of ARMA processes and a random process generator. As an example, statsmodels/examples/tsa/arma_plots.py can be used to plot autocorrelation and partial autocorrelation functions for different ARMA models.
This allows easy comparison of the theoretical properties of an ARMA process with their empirical counterparts. For example, define the lag coefficients for an ARMA(2,2) process, generate a random process and compare observed and theoretical pacf: We can see that they are very close in a large generated sample like this. ArmaProcess defines several additional methods that calculate properties of ARMA processes and to work with lag-polynomials: acf, acovf, ar, ar_roots, arcoefs, arma2ar, arma2ma, arpoly, from_coeffs, from_estimation, gener-ate_sample, impulse_response, invertroots, isinvertible, isstationary, ma, ma_roots, macoefs, mapoly, nobs, pacf, periodogram. The sandbox has a FFT version of some of this to look at the frequency domain properties.

Filtering
We have recently implemented several filters that are commonly used in economics and finance applications. The three most popular method are the Hodrick-Prescott, the Baxter-King filter, and the Christiano-Fitzgerald. These can all be viewed as approximations of the ideal band-pass filter; however, discussion of the ideal bandpass filter is beyond the scope of this paper. We will [briefly review the implementation details of each] give an overview of each of the methods and then present some usage examples.
The Hodrick-Prescott filter was proposed by Hodrick and Prescott [HPres], though the method itself has been in use across the sciences since at least 1876 [Stigler]. The idea is to separate a time-series y t into a trend τ t and cyclical compenent ζ t y t = τ t + ζ t The components are determined by minimizing the following quadratic loss function where τ t = y t − ζ t and λ is the weight placed on the penalty for roughness. Hodrick and Prescott suggest using λ = 1600 for quarterly data. Ravn and Uhlig [RUhlig] suggest λ = 6.25 and λ = 129600 for annual and monthly data, respectively. While there are numerous methods for solving the loss function, our implementation uses scipy.sparse.linalg.spsolve to find the solution to the generalized ridge-regression suggested in Danthine and Girardine [DGirard]. Baxter and King [BKing] propose an approximate band-pass filter that deals explicitly with the periodicity of the business cycle. By applying their band-pass filter to a time-series y t , they produce a series y * t that does not contain fluctuations at frequencies higher or lower than those of the business cycle. Specifically, in the time domain the Baxter-King filter takes the form of a symmetric moving average where a k = a −k for symmetry and ∑ K k=−K a k = 0 such that the filter has trend elimination properties. That is, series that contain quadratic deterministic trends or stochastic processes that are integrated of order 1 or 2 are rendered stationary by application of the filter. The filter weights a k are given as follows a j = B j + θ for j = 0, ±1, ±2, . . . , ±K where θ is a normalizing constant such that the weights sum to zero with the periodicity of the low and high cut-off frequencies given by P L and P H , respectively. Following Burns and Mitchell's [] pioneering work which suggests that US business cycles last from 1.5 to 8 years, Baxter and King suggest using P L = 6 and P H = 32 for quarterly data or 1.5 and 8 for annual data. The authors suggest setting the lead-lag length of the filter K to 12 for quarterly data. The transformed series will be truncated on either end by K. Naturally the choice of these parameters depends on the available sample and the frequency band of interest.
The last filter that we currently provide is that of Christiano and Fitzgerald [CFitz]. The Christiano-Fitzgerald filter is again a weighted moving average. However, their filter is asymmetric about t and operates under the (generally false) assumption that y t follows a random walk. This assumption allows their filter to approximate the ideal filter even if the exact time-series model of y t is not known. The implementation of their filter involves the calculations of the weights in T −t andB t−1 are linear functions of the B j 's, and the values for t = 1, 2, T − 1, and T are also calculated in much the same way. See the authors' paper or our code for the details. P U and P L are as described above with the same interpretation. Moving on to some examples, the below demonstrates the API and resultant filtered series for each method. We use series for unemployment and inflation to demonstrate 2. They are traditionally thought to have a negative relationship at business cycle frequencies.
We have implemented Denton's modified method. Originally proposed by Denton [Denton] and improved by Cholette [Cholette]. To take the example of turning an annual series into a quarterly one, Denton's method entails finding a benchmarked series X t that solves That is, the sum of the benchmarked series must equal the annual benchmark in each year. In the above A y is the annual benchmark for year y, I t is the high-frequency indicator series, and β is the last year for which the annual benchmark is available. If T > 4β , then extrapolation is performed at the end of the series. To take an example, given the US monthly industrial production index and quarterly GDP data, from 2009 and 2010, we can construct a benchmarked monthly GDP series >>> gdp_m = tsa.interp.dentonm(iprod_m, gdp_q, freq="qm")

Modeling multiple time series: Vector autoregressive (VAR) models
It is common in finance, economics, and other fields to model relationships among multiple time series. For example, an economist may wish to understand the impact of monetary policy on inflation and unemployment. A widely used tool for analyzing multiple time series is the vector autoregressive (VAR) model. At each time point we observe a K-vector Y t of data points, one for each time series. These can be modeled similar to an AR process as above In this case, the coefficients A i are square matrices. As with prior models, the error ε t is typically assumed to be normally distributed and uncorrelated over time. This model can be estimated by MLE or equivalently by OLS, either case as a single regression or by noticing that the model decouples into K separate linear regressions, one for each time series.
We have recently written a comprehensive implementation of VAR models on stationary data following [Lütkepohl]. In addition to estimation, we are also interested in • Analysis of impulse responses (the effect of a unit shock to one variable on all of the others) • Statistical tests: whiteness and normality of residuals, Granger-causality • Lag order selection: how many lags Y t−i to include • Multi-step forecasting and forecast error variance decomposition We will illustrate some of the VAR model features on the macrodata data set in statsmodels. Here is a code snippet to load the data and fit the model with two lags of the log-differenced data: mdata = sm.datasets.macrodata.load().data mdata = mdata [['realgdp','realcons','realinv']] names = mdata.dtype.names data = mdata.view((float,3)) data = np.diff(np.log(data), axis=0) model = VAR(data, names=names) res = model.fit (2) As with most other models in statsmodels, res.summary() provides a console output summary of the estimated coefficients and other standard model diagnostics. The data itself can be visualized by a number of plotting functions: >>> res.plot_sample_acorr() Impulse responses can be generated and plotted thusly: >>> irf = res.irf(10) # 10 periods >>> irf.plot() n-step ahead forecasts can similarly be generated and plotted: >>> res.plot_forecast (5) The forecast error variance decomposition can also be computed and plotted like so >>> res.fevd().plot()   Obviously we are just providing a flavor of some of the features available for VAR models. The statsmodels documentation has a more comprehensive treatment of the feature set. We plan to continue implementing other related models for multiple time series, such as the vector error correction models (VECM) for analyzing cointegrated (non-stationary) time series data. Other more sophisticated models in the recent time series literature could also be implemented within the same framework.
Conclusions statsmodels development over the last few years has been focused on building correct and tested implementations of the standard suite of econometric models available in other statistical computing environments, such as R. However, there is still a long road ahead before Python will be on the same level librarywise with other computing environments focused on statistics and econometrics. We believe that, given the wealth of powerful scientific computing and interactive research tools coupled with the excellent Python language, statsmodels can make Python become a premier environment for doing applied statistics and econometrics work. Future work will need to integrate all of these tools to create a smooth and intuitive user experience comparable to industry standard commercial and open source statistical products.
We have built a foundational set of tools for several ubiquitous classes of time series models which we hope will go a long way toward meeting the needs of applied statisticians and econometricians programming in Python.