How to run Facebook-Prophet predict X100 faster

Vectorizing Prophet's uncertainty modeling

Sep 5, 2021

13 min read

01/25/23 edit: FB engineers integrated the solution in this post into the package in version release 1.1.2. Due to some implementation differences it is still slightly slower than the solution in the end of this post, but not significantly.
If you’re not interested in learning about vectorization and Prophet’s inner workings, and you just want to run Prophet faster, you can skip to the tl;dr at the end.

Facebook Prophet

Facebook’s Prophet package for time series forecasting was released in 2017 and has since become one of the most, if not the most, popular forecasting algorithms: According to PyPy, Prophet was downloaded ~20 million times. The stream of downloads continues, despite some criticism of the model’s accuracy and the general trend towards neural networks for many machine learning applications. Prophet’s popularity probably stems from its simple out-of-the-box use, its confidence intervals, clear visualizations, and the fact that it often outperforms classic statistical algorithms (ARIMA, ETS).

However popular, Prophet has one huge disadvantage: it is painfully slow and does not scale to hundreds of thousands of items.

Let’s try to run the following code, creating a random time series:

import pandas as pd
import numpy as np
import datetime
from fbprophet import Prophet

n = 100
some_data = pd.DataFrame({'ds':pd.date_range(datetime.datetime(2020,1,1,), freq='W', periods=n), 'y': np.random.rand(n)})

now let’s test the running time:

%%timeit
prophet = Prophet(interval_width=0.8)
prophet.fit(some_data)

output:

# 91.5 ms ± 1.66 ms per loop

About 0.1 seconds to fit the data. But the real pain comes in the "predict" stage:

%%timeit
prophet.predict(some_data)

output:

1.15 s ± 55.9 ms per loop

It takes more than a full second to get the prediction! This is surprising, since in most ML models, training is expensive, and prediction is cheap.

In this post, we will see why Prophet’s predict function is relatively slow, and how to get it to run 100X faster (note that you will still have to fit the model first, so fit+predict will only see a 5-20X improvement). Along the way, we will learn something about modeling time series uncertainty, statistical distributions, and vectorization.

Let’s first explain Prophet’s modeling of uncertainty, and some of its code, and then show how we can easily vectorize the function.

Finding the time-consuming function

If you profile the "predict" code, you can see where the fault lies.

Prophet’s predict profile. (Image by Author)

About 98% of the time is spent on "predict_uncertainty". This function creates "yhat_upper" and "yhat_lower" in the result DataFrame, which are the edges of an 80% (or any other value given in interval_width) confidence interval – where the actual points are likely to lie.

What does Prophet do when it models uncertainty, that takes so much time?

Prophet’s uncertainty

(If you’re not familiar with Prophet’s additive model, you should probably start with a quick overview)

The Prophet modeling assumes that there are two sources of uncertainty in the data:

Gaussian residuals around the trend line
Changing slope values

When fitting, Prophet finds the best points in the training data for trend-changes to best fit the data – it is assumed that the trend is linear at any given point, but the slope of the trend can change. Naturally, the more trend-changes found, and the greater the delta between adjacent slopes, the greater the uncertainty in the future values, as can be seen in these examples.

left: When there are many change points - Uncertainty grows quickly. right: With few historical changepoints (and low delta between them) - uncertainty depends mostly on historical residuals. (Images by Author) — left: When there are many change points – Uncertainty grows quickly. right: With few historical changepoints (and low delta between them) – uncertainty depends mostly on historical residuals. (Images by Author)

If the trend changes significantly and often in the training data, the range of possible values for the future is large. In contrast, if the trend is relatively constant in the train data – we have more certainty it will continue in a straight path in the future, in which case the only remaining source of error is the residuals that we have already seen in the historic data (which does not grow with time).

We will not discuss how Prophet finds the change-points and their delta, that’s all done in the fit phase.

Predicting uncertainty in a fitted model

The following is a "paraphrase-code" (compressed, with some non-crucial parts removed for clarity) of Prophet’s predict_uncertainty (you can skip to the literal description):

Let’s summarize the process in words:

Here’s what these sampled trends look like:

Randomly sampled future trends with fbprophet. (Image by Author)

The train data had two slope changes. Five future trend lines are presented, note that the number of change-points, the time they occur, and the delta between slopes are different for each line.

Prophet adds Gaussian noise to these trend lines, giving us:

Randomly sampled futures with fbprophet, including random residuals. (Image by Author)

Finally, it finds 10% and 90% quantiles for the values of each time step, giving us our confidence interval.

Optimizing runtime

Our goal is to make this process more efficient by vectorizing it. A detailed explanation of vectorization is beyond the scope of this post but suffice to say that if we remove for-loops and perform all the operations on a NumPy array, the process will run much faster.

Before we vectorize the code, let’s note that one part of this process is redundant: The size of the confidence interval around the mean only depends on the likelihood and size of slope-changes and the past residuals; we, therefore, don’t need to concatenate the slope, intercept, and deltas of the training data and assess their values. The mean of this interval depends on the final trend line, the yhat, which depends on the training data intercept and slopes, but which is computed before predict_uncertainty starts.

We can assume slope=0 and intercept=0, use the likelihood and size of slope changes find the 10%-90% interval for all time steps (e.g. it is +/-8). We then add this value to the yhat (e.g. 17) to get the confidence interval (9-25).

The main forecast is created (dashed-line)

i. The size of the uncertainty is calculated around 0. ii. The interval is added to the main forecast. (Images by Author)

Note that as time increases the width of the confidence interval increases, (which intuitively makes sense). This is because the trend-changes could have a greater impact the more we forecast into the future.

Now let’s vectorize

Remember our goal is to create a matrix where each row is a sampled trend, and each column is some date in the future. E.g.

But with 1000+ rows.

First, we need to sample the number and location of slope-changes. This is where Prophet samples from a Poisson distribution for each row, and then randomly assigns the number of changepoints. We’re going to get the same outcome without Poisson. The likelihood of a changepoint occurring at any time step is determined by the number of changepoint observed in the training data, divided by the length of the train data. Assuming we have a trained prophet_obj and a forecast_df with the future dates to forecast, the likelihood is:

prophet_obj.changepoints_t is a list of times where a changepoint occurred in the training data. What is single_diff? It is a consequence of how Prophet models time progression.

Prophet’s time progression

In both the train and predict functions, Prophet converts the time ("ds") column in the Pandas DataFrame to an array called t which signifies time progression. For all kinds of reasons, it is convenient to have t in the range of 0-1 during training. So, in training, t is created by setting the interval between any two elements as 1/training_length. E.g., if there are 50 training data points, t is equal to [0, 0.02, 0.04, …, 0.98, 1.]. For the forecast, it will continue with the same intervals [1.02, 1.04 …] up to the length of forecast_df.

Since the t array represents the progression of time, it can be used to calculate progress over time. For example, some_coeft creates a linear trend, where some_coef represents the slope. Thus, a slope of 4 means that in each time step the series adds 4single_diff (4*0.02 in this example).

values (y-axis) with slope=4 over time (x-axis). Image by Author

Back to vectorizing slope changes

We now have the likelihood of a slope-change at any given point. Let’s create a matrix of slope-changes:

where k is the number of samples (rows). The result is a Boolean matrix:

Matrix of random boolean change-points. (Image by Author)

And this is where the magic happens. Take the sum of each row, plot the distribution, and you will get:

The rows-sum of a random boolean matrix is a Poisson Distribution. (Image by Author)

*A Poisson distribution with a mean of likelihood len(future_t_time)**! Exactly what Prophet samples from to set the n_changes for each row!

This derives straight from the definition of Poisson – if you have q independent draws where each has a likelihood of l being positive, you get a Poisson distribution with a mean of q*l.

We replaced Prophet’s for-loop sampling of the number of changepoints in a row – and then sampling their placement in the time-steps – and replaced it with a single line for an entire matrix!

But we don’t need a Boolean value of whether a change occurs – we need the delta of the change. This part is simple:

We create a new matrix of samples, this one from a Laplace distribution with a scale of the mean absolute training data deltas. Why Laplace? I don’t know if there’s a theoretical justification, but that’s what Prophet does, so we follow it (except that Prophet performs the sampling separately for each row, and we sample a matrix). Take the product of the two matrices to get:

Matrix of slope-change values. (Image by Author)

The change in slope for each sampled trend, in each time step (remember row=sampled trend, column=time step).

We are almost done! Now we need to transition from slope changes to actual forecast values.

Slope-change to actual values

Imagine we have the following time series of deltas of slope (the change in slope values between adjacent steps): [0,1,0,0,-4,0,2]. Assuming we start with an actual slope of 0, the slope at each step will be [0,1,1,1,-3,-3,-1], as for each time step we take the sum of deltas up to it, or in other words: "cumsum".

Deltas to actual values example. (Image by Author)

But this only gives us the slope at each point (instead of the slope-delta). We want the actual values! A linear slope means we add the slope value at each iteration. If we have a slope of 2 for 3 time-steps, the actual values will be [2,4,6]; it should be obvious that this is another cumsum. Hence, to transition from a matrix of slope-deltas to the actual values of the time series:

Remember that a slope of 4 means that 4*single_diff is the actual value, so we take the product of this new matrix with single_diff.

Adding Gaussian residuals

Now we have a matrix of sample trends, we only need to add the Gaussian noise:

Where sigma is the standard deviation around the trend in the training data.

We finished vectorizing the creation of a matrix of samples. If we want the confidence interval between 10%-90%, we set

Simply adding the lower and upper edges of the interval to the yhat forecast. All done! Except for a few small issues (feel free to skip the next 3 sections).

Mid-step trend change

There’s a tiny modeling difference between the vectorized code and Prophet. Prophet allows the slope-changes to occur between time-steps. E.g if t = [1.02, 1.04, 1.06…], a slope change can be set at 1.028. Or in other words: if your data represents weekly values on Mondays, Prophet allows for slope changes Saturdays or any other day.

I’m not sure this modeling always makes sense for all datasets. Also, from my experience, the difference is negligible.

However, for the sake of completion: Let’s consider a trend change occurring before the first actual step (in 1.0–1.02). If it is close to 1.0 (Previous Monday), by the time-step 1.02 we will get a full step-worth of change (the 4*single_diff). But if it is close to 1.02 (Sunday), the slope change will not make a big difference for that time step. The changes can occur anywhere on that spectrum, making the best correction the mean of the two extremes: each time step with the one before it (with 0 for the first one).

Another approach is to create a matrix with k times more columns, and a chengepoint likelihood of 1/k of the previous likelihood. This matrix will represent smaller intervals than the ones in the time series. After the cumsum.cumsum we could extract every k-th column. But again – the difference is probably insignificant.

Logistic growth

By default, Prophet assumes the growth is linear, but you can set it to logistic, in which case the creation of the slope-changes remains identical, but the transition from the matrix to the sampled trends is completely different (non-linear). Prophet’s code for that transition is monstrous, but it is trivial to convert the code to a vectorized version. I will not explain it in this post, but the vectorized code is given below for your use.

Training data

One final issue – all the above applies to forecasting into the future, where the trend values are not known and therefore need to be sampled. However, sometimes we call the predict function on the training data. The uncertainty there only depends on the Gaussian noise around the trend, so we can set the sample_trends to be zeros.

Time and accuracy comparison

How much faster is the vectorized version?

%%timeit
add_prophet_uncertainty(p, forecast_df)

output:

2.13 ms ± 139 µs per loop

Compared with 1s+ for the original, non-vectorized, version of Prophet. This is a 500x improvement.

And does it return the same results as Prophet? Pretty close.

Why are the results not identical? Remember, this is a random process, the 1000 samples created by Prophet and by the vectorized code won’t have the exact same values. If the standard deviation is large – 1000 samples are not enough for the law of large numbers to take effect. Run it again and you will get a slightly different result. In fact, since the vectorized version is so much faster, I set k samples to be 10000, which gives a more consistent result, this is why it is smoother than Prophet’s forecast in the examples above. Even with 10000 samples, it still only takes ~13ms.

tl;dr

Start by defining these functions (If you never use logistic growth – you can remove prophet_logistic_uncertainty and its uses, which is more than half of the code):

Given some training_df, if you require the uncertainty interval for both train and forecast, run the following code:

How to run Facebook-Prophet predict X100 faster

Facebook Prophet

Finding the time-consuming function

Prophet’s uncertainty

Predicting uncertainty in a fitted model

Optimizing runtime

Now let’s vectorize

Prophet’s time progression

Back to vectorizing slope changes

Slope-change to actual values

Adding Gaussian residuals

Mid-step trend change

Logistic growth

Training data

Time and accuracy comparison

tl;dr

Related Articles

What Do Large Language Models “Understand”?

How to Forecast Hierarchical Time Series

Optimizing Marketing Campaigns with Budgeted Multi-Armed Bandits

Deep Dive into LSTMs & xLSTMs by Hand ✍️

Methods for Modelling Customer Lifetime Value: The Good Stuff and the Gotchas

Done is Better Than Perfect

Latest picks: Data Quality for Everyday Analysis