Sales Prediction: A Deep Learning Approach
A Kaggle competition attempt to use deep learning on sales data
Tuesday May 8, 2018
First, a GitHub link for those who prefer reading code. I used the fastai library, which wraps PyTorch quite nicely for this problem domain.
Given that I'm between internships, I've dedicated this lull in work towards self-learning. Among others, Jeremy Howard's fast.ai deep learning lectures have been an absolute pleasure. One topic of many that captivated me was using deep learning for tabular data through embeddings. While I was aware of gradient boosting for these problems, Jeremy (Howard, not me) suggests that deep architectures can do the job just as well. Before we begin, here are some definitions.
The winning architecture of Kaggle's NYC Taxi Duration Competition
An embedding is a way of representing categorical variables numerically. Categorical variables could include non-numeric concepts like season or even low-cardinality numbers such as month. Each category is mapped to an ID, and each ID is associated with a vector of learnable weights. This isn't so different from a one-hot encoding: looking up a vector by ID gives the same result as multiplying the one-hot vector by a weight matrix. These vectors are fed through the neural network alongside all the numerical variables, and their weights are updated as the model learns. The implication is that as the neural net trains, categories with similar traits end up with vectors that are close in Euclidean space.
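To make the definition concrete, here is a minimal sketch in plain Python rather than the fastai code used in this project. The `seasons` variable, the embedding size of 3, and the random weights are all assumptions made for illustration. It shows that looking up a row by ID matches multiplying a one-hot vector by the embedding matrix.

```python
import random

# Hypothetical categorical variable; the names and sizes are illustrative only.
seasons = ["winter", "spring", "summer", "fall"]
season_to_id = {name: idx for idx, name in enumerate(seasons)}

# One row of weights per category. A real model would update these rows
# during training; here they simply stay at their random starting values.
rng = random.Random(0)
emb_dim = 3
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(emb_dim)] for _ in seasons]


def embed(name):
    """Look up the vector for a category by its ID."""
    return embedding_matrix[season_to_id[name]]


def one_hot(name):
    """Classic one-hot encoding of the same category."""
    vec = [0.0] * len(seasons)
    vec[season_to_id[name]] = 1.0
    return vec


def one_hot_times_matrix(name):
    """Multiply the one-hot row vector by the embedding matrix."""
    encoded = one_hot(name)
    return [
        sum(encoded[i] * embedding_matrix[i][j] for i in range(len(seasons)))
        for j in range(emb_dim)
    ]


lookup = embed("summer")
product = one_hot_times_matrix("summer")
```

Both `lookup` and `product` hold the same three numbers, which is why an embedding layer can be read as a one-hot encoding followed by a learned linear layer, just computed more cheaply.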
Tabular data is data organized in rows and columns, the kind you would expect in CSV format. In particular, we're focusing on time-series data, which involves data points recorded chronologically. For these problems, the goal is typically to predict the outcome for a future date.
Creating a Tabular Data Model: Predicting Sales
A straightforward example of learning with tabular data is sales prediction from past trends. Luckily, there are tons of Kaggle competitions on this, so I arbitrarily picked Predict Future Sales. It's also a worthy candidate because almost everyone else is using a gradient boosting or similar decision tree approach.
The goal is to predict the sales of each item that a Russian store chain offers for the month after the training data ends. To ensure a worthy comparison between gradient boosting and my approach, I used an existing public kernel as a baseline. It provides a clear benchmark: a root mean square error (RMSE) of 1.0428.
Data engineering isn't the emphasis for this writeup. In short, we transform files like this…
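| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Гарнитуры/Наушники | 0 |
| 1 | Аксессуары - PS2 | 1 |
| 2 | Аксессуары - PS3 | 2 |

| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full... | 1 | 76 |
| 2 | ***В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |

| | shop_name | shop_id |
|---|---|---|
| 0 | !Якутск Орджоникидзе, 56 фран | 0 |
| 1 | !Якутск ТЦ "Центральный" фран | 1 |
| 2 | Адыгея ТЦ "Мега" | 2 |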
…into data that looks like this (many columns omitted for brevity):
The general approach is to introduce lag features. The target is how much was actually sold in the given date_block_num, and the lags correspond to the target 1, 2, 3, and 12 months ago for a given (item_id, shop_id, date_block_num) index. This allows the model to learn how current and past months affect future trends.
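The lag construction can be sketched in a few lines of plain Python. This is an illustration and not the project's actual preprocessing code: the `add_lag_features` helper, the `target_lag_N` column names, and the choice to fill missing history with 0 are all assumptions.

```python
def add_lag_features(rows, lags=(1, 2, 3, 12), fill_value=0.0):
    """Return copies of `rows` with one extra column per lag.

    Each row is a dict with item_id, shop_id, date_block_num and target.
    """
    # Index every known target by its (item_id, shop_id, date_block_num) key.
    target_by_key = {
        (r["item_id"], r["shop_id"], r["date_block_num"]): r["target"] for r in rows
    }
    lagged_rows = []
    for r in rows:
        new_row = dict(r)
        for lag in lags:
            past_key = (r["item_id"], r["shop_id"], r["date_block_num"] - lag)
            # Months with no history fall back to fill_value (an assumption).
            new_row[f"target_lag_{lag}"] = target_by_key.get(past_key, fill_value)
        lagged_rows.append(new_row)
    return lagged_rows


# Toy history: one item in one shop, 13 months, selling month + 1 units.
history = [
    {"item_id": 1, "shop_id": 1, "date_block_num": month, "target": float(month + 1)}
    for month in range(13)
]
lagged = add_lag_features(history)
latest = lagged[-1]  # the row for date_block_num == 12
```

For the final month, `latest` carries the sales from 1, 2, 3, and 12 months earlier as extra columns, so a single row describes both the current month and its recent history.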
Now here's the fun part. Our dataset has a mixture of continuous variables, which feed cleanly into a neural net as-is, and categorical variables, which go through the embedding matrices. The combined inputs are fed through two hidden linear layers of size 1000 and 500. Finally, a sigmoid is applied to the single-node output layer. Architecturally, this is very similar to the winning taxi ride solution above.
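To show how the data flows, here is a framework-free sketch of the forward pass. It is not the fastai model itself. The ReLU activations, the embedding size of 4, the two hypothetical categorical columns with 60 and 12 categories, and the four continuous inputs are assumptions. Only the 1000 and 500 hidden sizes and the final sigmoid come from the description above. Dropout and L2 regularization are left out for brevity.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Random weights (one row per output unit) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply one dense layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def build_model(cardinalities, emb_dim, n_cont, hidden=(1000, 500), seed=0):
    """Create one embedding table per categorical column plus the dense layers."""
    rng = random.Random(seed)
    embeddings = [
        [[rng.uniform(-1.0, 1.0) for _ in range(emb_dim)] for _ in range(n)]
        for n in cardinalities
    ]
    sizes = [len(cardinalities) * emb_dim + n_cont, *hidden, 1]
    layers = [init_linear(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    return embeddings, layers


def forward(model, cat_ids, cont_values):
    """Embed the categories, append the continuous values, run the layers."""
    embeddings, layers = model
    x = []
    for table, idx in zip(embeddings, cat_ids):
        x.extend(table[idx])  # embedding lookup for each categorical column
    x.extend(cont_values)
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in linear(x, layer)]  # ReLU is an assumption
    (logit,) = linear(x, layers[-1])  # single-node output layer
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


model = build_model(cardinalities=[60, 12], emb_dim=4, n_cont=4)
prediction = forward(model, cat_ids=[5, 10], cont_values=[0.0, 1.0, 2.0, 3.0])
```

Because of the sigmoid, `prediction` always lands between 0 and 1, so in practice the output would need to be rescaled to the range of the target before computing a loss.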
To reduce overfitting, I introduced a substantial amount of dropout and L2 regularization. Optimizing the root mean squared logarithmic error (RMSLE) instead of the competition's RMSE seemed to make the loss easier to stabilize.
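For reference, the two metrics can be written as below. This is a small sketch using the common log(1 + x) form of RMSLE, which is an assumption about the exact variant used here. The toy numbers are made up to show how a single large miss dominates RMSE far more than RMSLE.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values: two exact predictions and one badly underestimated spike.
actual = [1.0, 2.0, 1000.0]
predicted = [1.0, 2.0, 10.0]
plain_error = rmse(actual, predicted)
log_error = rmsle(actual, predicted)
```

Here `log_error` is far smaller than `plain_error`, because the logarithm compresses the one large miss. That compression is one plausible reason the log-based loss was easier to train against.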
Training the model for 3 epochs
I ran this for 3 epochs, and… it worked! The RMSE on the validation set was …, and the RMSE on the public leaderboard was 0.9652, despite some troubles with overfitting.
Not only did I outperform the original kernel's 1.0428, I placed inside the top 10% of the competition using features that generated only top 25% results using gradient boosting. And I didn't spend any time engineering more features.
Should We Always Use Deep Learning?
Although I outperformed the kernel I borrowed, there are a few tree-boosting models that outperformed mine. So what was the cause? One recently published kernel did a lot more feature engineering than I did, which I suppose is one of the takeaways. While deep learning increased my performance for the limited set of features I used, I spent multiple hours tuning hyperparameters and retraining. XGBoost gave fast feedback and was comparatively easier to configure.
In the end, the training data is just as relevant to success as the model. There were only 212,400 rows of data to train on, so adding features would likely have propelled my model even further and perhaps eliminated the slight overfitting.
Is it worth the effort? In production: why not if it really helps? In a Kaggle competition: maybe if the problem is incredibly cool and doesn't have an award of "Kudos."