Forecasting demand 101
Link to source code: Colab Notebook
1. Forecasting Tutorial Context: Python and Prophet
Forecasting the future has always been an intriguing aspect of data analysis, whether it's predicting stock market trends, weather patterns, or sales figures. Time series forecasting is a powerful tool in the arsenal of data scientists and analysts.
One of the leading methods for making time-bound predictions is autoregressive modeling, which leverages past data points to project future values. This method is particularly useful for time series data that shows some form of linear dependency over time.
[autoregressive model, simple diagram]
In this forecasting tutorial, we'll go back to basics and explore how to forecast a series from its own history using Python and Facebook's Prophet library, a robust tool designed to handle the complexity of time series data. Strictly speaking, Prophet fits an additive model of trend, seasonality, and holiday effects rather than a classical autoregressive model on lagged values, so "autoregressive" is used loosely here to mean that only the past of the series serves as input. Prophet can also incorporate external regressors, that is, additional data sources that may impact the forecast, but this post focuses solely on forecasting from the series itself.
What is Autoregressive Modeling?
Autoregressive modeling is a statistical approach that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.
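To make the definition concrete, here is a minimal sketch of the simplest case, an AR(1) model, where the next value is predicted from the previous one only. The function names and the toy series are assumptions made up for illustration; this is not how Prophet works internally.

```python
# Minimal AR(1) sketch: y[t] = c + phi * y[t-1], fitted by least squares.
# The series below is a made-up toy example, not the flights data.

def fit_ar1(series):
    """Estimate the intercept c and coefficient phi by ordinary least squares."""
    x = series[:-1]  # values at time t-1
    y = series[1:]   # values at time t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi

def predict_next(series, c, phi):
    """One-step-ahead prediction from the last observed value."""
    return c + phi * series[-1]

toy_series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
c, phi = fit_ar1(toy_series)
next_value = predict_next(toy_series, c, phi)
print(c, phi, next_value)
```

Higher-order models follow the same idea with more lagged values as inputs to the regression.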
2. Setting the environment
Now let's put this into practice. Before diving into the code, make sure you have Python installed on your machine along with the necessary packages.
You can install Prophet using pip:
pip install prophet
Or, if you're using Anaconda, you can use conda to install it:
conda install -c conda-forge prophet
Aside from Prophet, we will use pandas for data handling and seaborn for its ready-made sample datasets. Specifically, we will use seaborn's "flights" dataset, which contains the monthly number of airline passengers from 1949 to 1960.
import pandas as pd
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import seaborn as sns
3. Loading the data and formatting it
First, we load the flights data from seaborn:
# Load sample data
flights_data = sns.load_dataset("flights")
Prophet requires its input in a specific format with at least two columns: a column of datetime values for the timestamps, labeled 'ds', and a column of numeric values for the time series, labeled 'y'.
Since the flights dataset comes with 3 columns (year, month, and passengers), we will combine year and month into a single date labeled "ds" and relabel passengers as "y". That's what we will do next:
# Create a new column 'ds' by combining 'year' and 'month'
# ('%b' expects abbreviated month names such as 'Jan', as loaded by recent seaborn versions)
flights_data['ds'] = pd.to_datetime(
    flights_data['year'].astype(str) + ' ' + flights_data['month'].astype(str),
    format='%Y %b',
)
# Drop the 'year' and 'month' columns
flights_data = flights_data.drop(['year', 'month'], axis=1)
# Rename 'passengers' column to 'y'
flights_data = flights_data.rename(columns={'passengers': 'y'})
The resulting dataframe should contain 2 columns and look as follows. Note that the order of the columns does not matter for Prophet.
[resulting dataframe]
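Before handing a dataframe to Prophet, it can help to verify that it has the expected layout. The sketch below uses a hypothetical helper, check_prophet_input, which is not part of Prophet, and two small hand-written frames. It is deliberately stricter than necessary in requiring 'ds' to already be a datetime column.

```python
import pandas as pd

def check_prophet_input(df):
    """Hypothetical helper: list problems with the 'ds'/'y' layout."""
    problems = []
    for column in ("ds", "y"):
        if column not in df.columns:
            problems.append(f"missing column '{column}'")
    if "ds" in df.columns and not pd.api.types.is_datetime64_any_dtype(df["ds"]):
        problems.append("'ds' is not a datetime column")
    if "y" in df.columns and not pd.api.types.is_numeric_dtype(df["y"]):
        problems.append("'y' is not numeric")
    return problems

# Hand-written frame already in the target layout
good = pd.DataFrame({
    "ds": pd.to_datetime(["1949-01-01", "1949-02-01"]),
    "y": [112, 118],
})

# Hand-written frame that still needs renaming and date parsing
bad = pd.DataFrame({"ds": ["1949-01-01"], "passengers": [112]})

print(check_prophet_input(good))
print(check_prophet_input(bad))
```

An empty list means the frame passes these basic checks; anything else points at the column that still needs attention.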
4. Creating and training the model
With the data in the right format, we can now create and train the model. To do so, we instantiate a new Prophet object and call its fit method with a DataFrame containing the historical data.
# Instantiate a Prophet object
model = Prophet()
# Fit the model with our flights_data dataframe
model.fit(flights_data)
After running this code, we should see a log line with a timestamp when training begins and a similar line with the final timestamp once it's done. In the absence of error messages, our model training has been successful! In our case, the image below shows that training took less than a second.
[logs for prophet model training]
5. Predicting future values
We now need to generate a new dataframe containing dates for the forecast period. Luckily, the Prophet model comes equipped with the `make_future_dataframe` method, which simplifies this process.
We will pass it two parameters:
- `periods`: the number of future periods we want to create.
- `freq`: the frequency of those periods, which defaults to daily if omitted.
When specifying the frequency, choose the one that matches the frequency of your time data and the frequency at which you want to make future predictions. For example, if you have daily data and want predictions for the next month on a daily basis, you would use `freq='D'`. The method accepts the standard pandas frequency aliases, and you can specify multiples of them, such as '2D' for every two days. Note that pandas 2.2 and later deprecate several of the older spellings in favor of new ones (for example, ME instead of M and h instead of H), so check which aliases your installed version accepts. Common values include:
- D: Daily frequency
- W: Weekly frequency (anchored on Sunday by default, equivalent to W-SUN; use W-MON and so on for other days)
- M: Month end frequency
- MS: Month start frequency
- Q: Quarter end frequency
- QS: Quarter start frequency
- A or Y: Year end frequency
- AS or YS: Year start frequency
- H: Hourly frequency
- T or min: Minute frequency
- S: Second frequency
- B: Business day frequency
- BM: Business month end frequency
- BMS: Business month start frequency
- BH: Business hour frequency
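To see how a frequency alias turns into concrete dates, the sketch below builds the future dates with pandas alone. The helper sketch_future_dates is a hypothetical stand-in written for this example; it only approximates the future part of what `make_future_dataframe` returns, which by default also includes the historical dates.

```python
import pandas as pd

def sketch_future_dates(last_date, periods, freq):
    """Hypothetical stand-in: dates that follow the last observed date."""
    # Generate one extra step so the last historical date can be dropped
    dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)
    return dates[dates > last_date][:periods]

# The flights data ends in December 1960
last_observation = pd.Timestamp("1960-12-01")

# 36 month-start dates, i.e. three years of monthly steps
monthly = sketch_future_dates(last_observation, periods=36, freq="MS")

# A multiple of a base alias: one date every two days
every_two_days = sketch_future_dates(last_observation, periods=3, freq="2D")

print(monthly[0].date(), monthly[-1].date(), len(monthly))
print([d.date().isoformat() for d in every_two_days])
```

With month-start data such as the flights series, "MS" keeps the future dates aligned with the historical ones, whereas a month-end alias would shift them to the last day of each month.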
In our case, we will forecast the next 3 years at a monthly level, covering a total of 36 periods. With the future DataFrame created, we can proceed to generate the forecast using the predict method.
# Create a dataframe with future dates for forecasting the next 3 years
future = model.make_future_dataframe(periods=36, freq='MS')
# Predict the future values
forecast = model.predict(future)
6. Extracting the forecast
If we inspect the forecast DataFrame returned by the predict method, we will find a large number of columns, presented in the following format:
[prophet model output]
Among all this data, the two main columns of interest for now will be "ds," which contains the time steps, and "yhat," which contains the predicted values.
If we want to isolate these values to work on them independently, we can copy them to a new dataframe.
forecast_values = forecast[['ds', 'yhat']].copy()
[clean prophet model output]
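Because the future dataframe includes the historical dates by default, the forecast contains fitted values for the past as well as predictions for the future. The sketch below shows one way to keep only the future rows. It uses a small stand-in frame with made-up numbers so that it runs without a trained model; only the column names match Prophet's output.

```python
import pandas as pd

# Stand-in for Prophet's output: made-up numbers, real column names
forecast_stub = pd.DataFrame({
    "ds": pd.to_datetime(["1960-11-01", "1960-12-01", "1961-01-01", "1961-02-01"]),
    "yhat": [400.0, 430.0, 450.0, 440.0],
    "trend": [410.0, 415.0, 420.0, 425.0],
})

# Last date present in the historical data
last_observed = pd.Timestamp("1960-12-01")

# Keep the two columns of interest, then only the rows after the history ends
forecast_values = forecast_stub[["ds", "yhat"]].copy()
future_only = forecast_values[forecast_values["ds"] > last_observed].reset_index(drop=True)
print(future_only)
```

With the real objects from this tutorial, the same filter would compare forecast['ds'] against flights_data['ds'].max().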
7. Analyzing the forecast
If we want more insight into the forecast, we can dig deeper into Prophet's full output. Its columns have the following meanings:
- ds: This column represents the date stamp for each prediction or observation.
- trend: This column shows the underlying trend of the time series data, which is the model's estimation of the direction in which the data is moving, without accounting for seasonal fluctuations or holidays.
- yhat_lower: This is the lower bound of the forecast's uncertainty interval for the predicted value (yhat). Uncertainty interval is Prophet's term for what this list calls a confidence interval, and it is 80% wide by default. It represents the lower end of the range within which the actual observed value is expected to fall.
- yhat_upper: Similar to yhat_lower, this is the upper bound of the same interval for the predicted value.
- trend_lower: This is the lower bound of the confidence interval for the trend component of the forecast.
- trend_upper: This is the upper bound of the confidence interval for the trend component.
- additive_terms: These are components of the forecast that are added to the trend to get the final prediction. They can include weekly, yearly, and holiday effects. They represent how much these components will increase or decrease the forecast.
- additive_terms_lower: The lower bound of the confidence interval for the additive terms.
- additive_terms_upper: The upper bound of the confidence interval for the additive terms.
- yearly: This is the yearly seasonality component of the model. It captures annual patterns in the data, such as increased sales during the holiday season.
- yearly_lower: The lower bound of the confidence interval for the yearly seasonal component.
- yearly_upper: The upper bound of the confidence interval for the yearly seasonal component.
- multiplicative_terms: If the model is specified to have multiplicative seasonality, this represents the percentage change due to seasonality and holidays. If the model is additive, this will be zero.
- multiplicative_terms_lower: The lower bound of the confidence interval for the multiplicative terms.
- multiplicative_terms_upper: The upper bound of the confidence interval for the multiplicative terms.
- yhat: This is the final forecasted value. Prophet computes it as trend * (1 + multiplicative_terms) + additive_terms, so in a purely additive model it is simply the trend plus the additive terms.
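The relationship between these columns can be checked by hand. The sketch below recomputes yhat from its components as trend * (1 + multiplicative_terms) + additive_terms, which is how Prophet combines them. The function recompose_yhat and the numbers are made up for illustration.

```python
def recompose_yhat(trend, additive_terms, multiplicative_terms=0.0):
    """Hypothetical helper: rebuild yhat from one row of forecast components."""
    return trend * (1.0 + multiplicative_terms) + additive_terms

# Purely additive model: multiplicative_terms is zero
additive_case = recompose_yhat(trend=200.0, additive_terms=-10.0)

# Purely multiplicative seasonality: a 25% uplift on the trend
multiplicative_case = recompose_yhat(
    trend=200.0, additive_terms=0.0, multiplicative_terms=0.25
)

print(additive_case, multiplicative_case)
```

Applied to a real forecast, the same formula over the trend, additive_terms, and multiplicative_terms columns should reproduce the yhat column up to floating-point error.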
Given this data, we can use the previously imported plotting functions. On the one hand, plot_plotly provides a visual overview of the historical values and the forecasted ones.
# Visualize the forecast
fig_forecast = plot_plotly(model, forecast)
fig_forecast.show()
[prophet forecast visualization]
On the other hand, we can use plot_components_plotly to visualize the underlying trend of the data and the seasonal patterns.
# Visualize forecast components
fig_components = plot_components_plotly(model, forecast)
fig_components.show()
[prophet forecast components visualization]
8. Conclusions
Prophet offers a flexible and robust framework for time series forecasting, streamlining many of the laborious tasks associated with fitting an autoregressive model. While it performs well on data exhibiting clear seasonal patterns and trends, Prophet may encounter challenges with more intricate time series that feature multiple seasonalities or irregular spikes and dips.
In this post, we have delved into the fundamentals of using Prophet in Python. However, aspects like model accuracy and the incorporation of external regressors play a crucial role in forecasting, and these will be explored in future posts.
As data continues to proliferate across industries, more advanced techniques may prove more adept at handling temporal dynamics. Thanks to tools like Brokyl, all of this complexity can be simplified into a no-code solution, compressing the entire end-to-end forecasting process.