In this post, we will go through how to generate autoregressive data in Python, which is useful for debugging models for sequential prediction like recurrent neural networks.
When you’re building a machine learning model, it’s often helpful to check that it works on simple problems before moving on to complicated ones. I’ve found this is especially useful for debugging neural networks.
One example of a simple problem is fitting autoregressive data. That is, data of the form

$$x_t = a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_n x_{t-n} + w_t$$

The above example is called an AR(n) process. Basically, each datapoint $x_t$ depends only on the previous $n$ datapoints and the distribution of the noise $w_t$.
The process, then, is defined by
1. The coefficients $a_1, \dots, a_n$,
2. The initial values of x, $x_0, \dots, x_{n-1}$, and
3. The distribution of the noise $w_t$.
Generating realistic-looking time series data: Stable and unstable poles
The catch is that for your data to look reasonable, these parameters have to satisfy certain conditions. The most important condition is that the AR process has to be stable. That is, the poles (the roots of the equation $z^n - a_1 z^{n-1} - \dots - a_n = 0$, where $a_1, \dots, a_n$ are the AR coefficients) all have to have magnitude less than one. If not, something like this might happen:
We won’t go into detail as to why the poles have to have magnitude less than one here, but you can think of it like this: if you repeatedly multiply numbers with magnitude greater than one together, the magnitude of your end result will keep increasing. (If you want to learn more about poles, you can check out Brian Douglas’ videos on control theory.)
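As a rough illustration of the stability condition, the sketch below computes the poles for a set of AR coefficients with `np.roots` and checks their magnitudes. The helper names `poles` and `is_stable` and the example coefficients are made up for illustration. It assumes `ar_coeffs[0]` multiplies the most recent datapoint, `ar_coeffs[1]` the one before it, and so on.

```python
import numpy as np


def poles(ar_coeffs):
    """Roots of z^n - a_1 z^(n-1) - ... - a_n for AR coefficients a_1..a_n."""
    polynomial = np.append(1.0, -np.asarray(ar_coeffs, dtype=float))
    return np.roots(polynomial)


def is_stable(ar_coeffs):
    """True if every pole has magnitude strictly less than one."""
    return bool(np.max(np.abs(poles(ar_coeffs))) < 1)


# Illustrative coefficients (not taken from the post)
print(is_stable([0.5, 0.3]))  # True: poles are roughly 0.85 and -0.35
print(is_stable([1.2, 0.3]))  # False: one pole is roughly 1.41
```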
But if you have no noise and your poles have magnitude less than one, the data will converge to zero. Similar to the previous case, if you repeatedly multiply numbers with magnitude less than one together, your end result would go to zero.
This doesn’t look like typical time-series data (e.g. stock prices) at all. Fortunately, adding noise solves this problem. Think of it as increasing the magnitude of your product each time before multiplying it by a new number with magnitude less than one:
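To see both behaviours side by side, here is a minimal sketch that simulates a stable process once without noise and once with noise. The function `simulate_ar`, its parameters and the coefficients `[0.5, 0.3]` are assumptions chosen for illustration, not the code from the post.

```python
import numpy as np


def simulate_ar(coeffs, init, num_steps, noise_std=0.0, rng=None):
    """Simulate an AR(n) series; coeffs[0] multiplies the most recent value."""
    rng = np.random.default_rng() if rng is None else rng
    coeffs = np.asarray(coeffs, dtype=float)
    n = len(coeffs)
    x = np.zeros(n + num_steps)
    x[:n] = init
    for t in range(n, n + num_steps):
        # x[t - n:t][::-1] is (x_{t-1}, ..., x_{t-n})
        x[t] = np.dot(x[t - n:t][::-1], coeffs) + noise_std * rng.standard_normal()
    return x


coeffs = [0.5, 0.3]  # illustrative stable coefficients
quiet = simulate_ar(coeffs, init=[1.0, 1.0], num_steps=200)
noisy = simulate_ar(coeffs, init=[1.0, 1.0], num_steps=200,
                    noise_std=1.0, rng=np.random.default_rng(0))

print(np.abs(quiet[-10:]).max())  # essentially zero: the noiseless series has died out
print(noisy[-50:].std())          # clearly non-zero: noise keeps the series moving
```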
A tip to make your tests fairer
Here’s a tip – it really helps to generate $3n$ more datapoints than you need and cut the first $3n$ datapoints. This is because the first $n$ datapoints are likely to be distributed very differently from the others, since they were initialised randomly rather than generated by the process. These effects will continue till at least the $2n$th datapoint. So if you want to assess your model fairly and prevent it from trying to disproportionately fit to the first datapoints, take out the first $3n$ datapoints.
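One way to see the effect of the initial values is to simulate many independent runs and compare the variance across runs at each time step. The sketch below does this for an illustrative AR(1) process with coefficient 0.5, then trims the first $3n$ points. All names and parameter values here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
coeffs = np.array([0.5])  # illustrative AR(1) coefficient
n = len(coeffs)
num_datapoints, burn_in, num_runs = 50, 3 * n, 20000

# One row per independent run; the first n columns are the random initial values
x = np.zeros((num_runs, n + burn_in + num_datapoints))
x[:, :n] = rng.standard_normal((num_runs, n))
for t in range(n, x.shape[1]):
    # Lagged values (x_{t-1}, ..., x_{t-n}) for every run, plus N(0, 1) noise
    x[:, t] = x[:, t - n:t][:, ::-1] @ coeffs + rng.standard_normal(num_runs)

var_per_step = x.var(axis=0)
print(var_per_step[:5].round(2))   # variance across runs at the first few time steps
print(var_per_step[-5:].round(2))  # variance across runs once the process has settled

# Discard the burn-in so the remaining points are closer to stationary
trimmed = x[:, burn_in:]
```

How many points need to be cut depends on how quickly the process forgets its initial values, so it is worth checking a plot like this for your own coefficients.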
Code to generate AR(n) data
Now, all we have to do is implement this in code. Fortunately for you, I’ve already done it! Here are the key portions of the code. The full code is on GitHub.

import numpy as np


# The base class name and the method names marked "assumed" are not shown in
# this excerpt and are filled in as assumptions; see the full code.
class ARData(TimeSeriesData):  # base class (assumed name) sets up self.y, self.bayes_preds, etc.
    """Class to generate autoregressive data."""

    def __init__(self, *args, coeffs=None, **kwargs):
        self.given_coeffs = coeffs
        super(ARData, self).__init__(*args, **kwargs)
        if coeffs is not None:
            # An AR(n) process has one coefficient per previous datapoint
            self.num_prev = len(coeffs)

    def generate_data(self):  # method name assumed
        self.generate_coefficients()
        self.generate_initial_points()
        # + 3*self.num_prev because we want to cut first (3*self.num_prev) datapoints later
        # so dist is more stationary (else initial num_prev datapoints will stand out as diff dist)
        for i in range(self.num_datapoints + 3 * self.num_prev):
            # Generate y value if there was no noise
            # (equivalent to Bayes predictions: predictions from oracle that knows true parameters (coefficients))
            self.bayes_preds[i + self.num_prev] = np.dot(self.y[i:self.num_prev + i][::-1], self.coeffs)
            # Add noise
            self.y[i + self.num_prev] = self.bayes_preds[i + self.num_prev] + self.noise()
        # Cut first (3*self.num_prev) points so dist is roughly stationary
        self.bayes_preds = self.bayes_preds[3 * self.num_prev:]
        self.y = self.y[3 * self.num_prev:]

    def generate_coefficients(self):  # method name assumed
        if self.given_coeffs is not None:
            self.coeffs = self.given_coeffs
        else:
            filter_stable = False
            # Keep generating coefficients until we come across a set of coefficients
            # that correspond to stable poles
            while not filter_stable:
                true_theta = np.random.random(self.num_prev) - 0.5
                coefficients = np.append(1, -true_theta)
                # check if magnitude of all poles is less than one
                if np.max(np.abs(np.roots(coefficients))) < 1:
                    filter_stable = True
            self.coeffs = true_theta

    def generate_initial_points(self):  # method name assumed
        # Initial datapoints distributed as N(0,1)
        self.y[:self.num_prev] = np.random.randn(self.num_prev)

    def noise(self):
        # Noise distributed as N(0, self.noise_var): scale by the standard deviation
        return np.sqrt(self.noise_var) * np.random.randn()

# Generate AR(5) process data
stable_ar = ARData(num_datapoints=50, num_prev=5, noise_var=1)
Using this to debug models
How does this help debug models? Firstly, the data is simple, so it doesn’t take long to train, and if the model can’t learn AR(5), it likely won’t be able to learn more complicated patterns. I’ve found this particularly useful for debugging recurrent neural networks.
Secondly, since you know the distribution of the data, you can compare model performance to the Bayes error. The Bayes error is the expected error that an oracle which knew the true distribution would make. In this case, it would be the error from predicting using the true AR coefficients, so the only error left is the part caused by the noise. This can serve as a baseline to compare your model performance to.
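As a sketch of what this baseline looks like in practice, the example below simulates an AR(2) series, records the oracle's predictions, and compares the oracle's mean squared error with that of a naive "predict the previous value" model. The coefficients, the noise level and the naive model are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
coeffs = np.array([0.5, 0.3])  # illustrative stable AR(2) coefficients
n, num_steps, noise_std = len(coeffs), 5000, 1.0

x = np.zeros(n + num_steps)
x[:n] = rng.standard_normal(n)
oracle_preds = np.zeros_like(x)
for t in range(n, len(x)):
    # Oracle prediction: the noiseless value given the true coefficients
    oracle_preds[t] = np.dot(x[t - n:t][::-1], coeffs)
    x[t] = oracle_preds[t] + noise_std * rng.standard_normal()

targets = x[n:]
bayes_mse = np.mean((targets - oracle_preds[n:]) ** 2)
naive_mse = np.mean((targets - x[n - 1:-1]) ** 2)  # predict x_t with x_{t-1}

print(bayes_mse)  # close to noise_std**2, the irreducible error
print(naive_mse)  # larger: the gap is what a better model could still recover
```

A trained model whose test error is well above `bayes_mse` still has room to improve. One that scores well below it on held-out data deserves a second look.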
Finally, you might be able to interpret the model parameters. For example, if you have a linear model, you can check how the model parameters compare with the AR coefficients.
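For instance, a linear model can be fitted by ordinary least squares on lagged values and its weights compared directly with the true coefficients. This is a minimal sketch using `np.linalg.lstsq`. The coefficients and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_coeffs = np.array([0.5, 0.3])  # illustrative stable AR(2) coefficients
n, num_steps = len(true_coeffs), 5000

x = np.zeros(n + num_steps)
x[:n] = rng.standard_normal(n)
for t in range(n, len(x)):
    x[t] = np.dot(x[t - n:t][::-1], true_coeffs) + rng.standard_normal()

# Row for time t holds (x_{t-1}, ..., x_{t-n}); the target is x_t
features = np.column_stack([x[n - k:len(x) - k] for k in range(1, n + 1)])
targets = x[n:]
fitted, *_ = np.linalg.lstsq(features, targets, rcond=None)

print(true_coeffs)
print(fitted)  # should land close to the true coefficients
```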
I hope this has helped – all the best with your machine learning endeavours!