A Pythonic exploration of diverse neural-network autoencoders to reduce the dimensionality of Bitcoin price time series
Stock market data is high-dimensional and, as such, algorithms that try to exploit potential patterns or structure in price formation can suffer from the so-called “curse of dimensionality”. In this short article, we will explore the potential of 4 different types of autoencoders to capture the dynamic information of stock market prices in a lower-dimensional, more tractable space. To do so, we will use the Python programming language and, as an example, we will apply these algorithms to the compression of Bitcoin price time series. The code to build the neural network models (using the Keras library) and the full Jupyter notebook used are available at the end of the article.
The basics of an autoencoder
An autoencoder is a type of neural network in which the input and the output data are the same. As such, it belongs to so-called unsupervised or self-supervised learning because, unlike supervised learning, it requires no human intervention such as data labeling. The architecture of an autoencoder may vary, as we will see, but generally speaking it includes an encoder, which transforms the input into a lower-dimensional representation, and a decoder, which tries to reconstruct the original input from that lower-dimensional representation. Therefore, these models have some sort of “bottleneck” in the middle that forces the network to learn how to compress the data into a lower-dimensional space. When training these algorithms, the objective is to reconstruct the original input with the minimum amount of information loss. Once the model is trained, we can compress data at will by using only the encoder component of the autoencoder.
The data and the objective
The data we are going to use is the Bitcoin time series consisting of 1-hour candlestick close prices of the Coindesk Bitcoin Price Index starting from 01/01/2015 until today. Specifically, we will use the first 93% of the data as a training dataset and the final 7% as a test dataset. The Bitcoin prices will be transformed to log returns (i.e. the difference between the log of price x+1 and the log of price x, or equivalently the log of their ratio) and windows of 10 consecutive returns will be produced. Each of these windows of consecutive returns will be normalized with a MinMaxScaler to the range [0,1].
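The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal sketch and not the notebook's code: the price series is a made-up random walk, the windows are assumed to overlap with a stride of 1, and the per-window min-max scaling is written by hand as a stand-in for the MinMaxScaler mentioned above. The name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(prices, window=10, train_frac=0.93):
    """Turn a 1-D price series into scaled return windows and split them."""
    prices = np.asarray(prices, dtype=float)
    # Log return: log(p[t+1] / p[t]) = log(p[t+1]) - log(p[t])
    log_returns = np.diff(np.log(prices))
    # Overlapping windows of `window` consecutive returns, shape (n, window)
    windows = np.lib.stride_tricks.sliding_window_view(log_returns, window)
    # Min-max scale each window independently to [0, 1]
    lo = windows.min(axis=1, keepdims=True)
    hi = windows.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against flat windows
    scaled = (windows - lo) / span
    # Chronological split: first 93% for training, last 7% for testing
    n_train = int(len(scaled) * train_frac)
    return scaled[:n_train], scaled[n_train:]

# Hypothetical stand-in for the hourly close prices (a geometric random walk)
rng = np.random.default_rng(0)
prices = 300.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=2000)))
x_train, x_test = make_windows(prices)
print(x_train.shape, x_test.shape)
```

Because every window is scaled on its own, each one spans the full [0,1] range, so the models only ever see the shape of a window and never its absolute volatility.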
The objective for the different autoencoder models is to be able to compress the input which is 10-dimensional to a 3-dimensional latent space. This constitutes a reduction factor of 3.3, which should be attainable with reasonably good accuracy.
For each model tried we will show a summary of the model, the loss for the training and test datasets at each stage of the training epochs and, finally, the input and output of the autoencoder for 10 randomly selected price return windows extracted from the test dataset (i.e. the model has not seen these data points). The selected test windows will intentionally remain the same across all models to be able to compare which kind of features each model may be learning.
1st model: a simple multi-layer perceptron (MLP) autoencoder
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 10)                0
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 33
_________________________________________________________________
dense_4 (Dense)              (None, 10)                40
=================================================================
Total params: 73
Trainable params: 73
Non-trainable params: 0
The model used is super simple, but the comparison between the input and the output reveals the ability of the network to abstract a few important features such as peaks and lows. Interestingly, we can see that some of the outputs are almost identical to each other even though the inputs are reasonably different.
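For reference, a model with the same layer shapes as the summary above can be sketched with the Keras functional API. The summary does not show activations, loss or optimizer, so the ReLU, sigmoid, MSE and Adam choices below are assumptions, and the training data is random placeholder data instead of the real return windows.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window, latent_dim = 10, 3

inputs = keras.Input(shape=(window,))
# Encoder: 10 -> 3 (the bottleneck); ReLU is an assumed activation
encoded = layers.Dense(latent_dim, activation="relu")(inputs)
# Decoder: 3 -> 10; sigmoid keeps outputs in [0, 1] like the scaled inputs
decoded = layers.Dense(window, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)  # shares its weights with the autoencoder
autoencoder.compile(optimizer="adam", loss="mse")

# Random placeholder data standing in for the scaled return windows
x = np.random.default_rng(0).random((256, window)).astype("float32")
autoencoder.fit(x, x, epochs=2, batch_size=32, verbose=0)  # the input is also the target

# Once trained, the encoder alone compresses each window to 3 values
codes = encoder.predict(x, verbose=0)
print(autoencoder.count_params(), codes.shape)
```

The two Dense layers account for 10*3 + 3 = 33 and 3*10 + 10 = 40 parameters, which matches the 73 reported in the summary.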
2nd model: deep autoencoder
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         (None, 10)                0
_________________________________________________________________
dense_5 (Dense)              (None, 6)                 66
_________________________________________________________________
batch_normalization_1 (Batch (None, 6)                 24
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 21
_________________________________________________________________
dense_7 (Dense)              (None, 6)                 24
_________________________________________________________________
batch_normalization_2 (Batch (None, 6)                 24
_________________________________________________________________
dense_8 (Dense)              (None, 10)                70
=================================================================
Total params: 229
Trainable params: 205
Non-trainable params: 24
Despite having more parameters (229 versus 73), we seem to reach a similar accuracy when looking at the train/test loss. However, the input/output examples show a different type of plot: the majority of them contain one single low or high peak, unlike the previous results, which varied much more in the middle range.
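The parameter counts in the summary above can be checked by hand, which also explains where the 24 non-trainable parameters come from. A Dense layer has one weight per input-output pair plus one bias per output, and a Keras batch normalization layer keeps four values per feature, of which only the scale and offset are trained. The helper names below are hypothetical.

```python
def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    # gamma and beta are trained; the moving mean and variance are not
    trainable = 2 * n_features
    non_trainable = 2 * n_features
    return trainable, non_trainable

# Dense layers of the deep autoencoder: 10 -> 6 -> 3 -> 6 -> 10
dense = [dense_params(10, 6), dense_params(6, 3),
         dense_params(3, 6), dense_params(6, 10)]
bn_trainable, bn_frozen = batch_norm_params(6)

# The model has two batch normalization layers, each over 6 features
trainable = sum(dense) + 2 * bn_trainable
non_trainable = 2 * bn_frozen
print(dense, trainable, non_trainable, trainable + non_trainable)
# [66, 21, 24, 70] 205 24 229
```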
3rd model: 1D convolutional autoencoder
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_4 (InputLayer)         (None, 10, 1)             0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 10, 16)            64
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 16)             0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 5, 1)              49
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 3, 1)              0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 3, 1)              4
_________________________________________________________________
up_sampling1d_1 (UpSampling1 (None, 6, 1)              0
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 5, 16)             48
_________________________________________________________________
up_sampling1d_2 (UpSampling1 (None, 10, 16)            0
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 10, 1)             49
=================================================================
Total params: 214
Trainable params: 214
Non-trainable params: 0
This third model gets kind of interesting. In this model we are using convolutions with a kernel size of 3, and the idea is that these convolutions should look at patterns occurring in groups of 3 returns. The results were surprising to me. In most of them, we can see the main “event” very well represented, while the overall reconstruction is very smooth, as if we had applied a moving average to the returns.
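The moving-average comparison is not a coincidence: a size-3 convolution whose three weights are equal is exactly a 3-point moving average. The sketch below illustrates this with plain NumPy on a made-up window. The trained kernels in the model are learned and will not be uniform, so this only shows why such a layer can behave like a smoother.

```python
import numpy as np

# Hypothetical scaled return window with one sharp "event" at position 4
window = np.array([0.5, 0.45, 0.55, 0.5, 1.0, 0.4, 0.5, 0.55, 0.45, 0.5])

# A size-3 kernel with equal weights is exactly a 3-point moving average
kernel = np.ones(3) / 3.0
smoothed = np.convolve(window, kernel, mode="valid")  # 10 - 3 + 1 = 8 values

print(np.round(smoothed, 3))
```

The spike is still visible in the smoothed series, but it is lower and spread over the three positions whose kernel covers it, which is the kind of output the reconstructions show.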
4th model: LSTM autoencoder
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_5 (InputLayer)         (None, 10, 1)             0
_________________________________________________________________
lstm_1 (LSTM)                (None, 3)                 60
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 10, 3)             0
_________________________________________________________________
lstm_2 (LSTM)                (None, 10, 1)             20
=================================================================
Total params: 80
Trainable params: 80
Non-trainable params: 0
While recurrent neural networks such as Long Short-Term Memory (LSTM) models are particularly suitable for tackling time series, we can see that their performance as autoencoders is very poor. Goodfellow et al. explain it succinctly in their book “Deep Learning”:
When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h(t) [the compressed representation] as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary length sequence to a fixed length vector h(t).
The most demanding situation is when we ask h(t) to be rich enough to allow one to approximately recover the sequence, as in autoencoder frameworks.
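The layer shapes in the LSTM summary above correspond to the sketch below: an LSTM that keeps only its final 3-unit state, a RepeatVector that copies that state to all 10 time steps, and a second LSTM that emits one value per step. The loss and optimizer are assumptions, the layers use Keras defaults, and the input is random placeholder data.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, latent_dim = 10, 3

inputs = keras.Input(shape=(timesteps, 1))
# Encoder: read the 10 returns and keep only the final 3-unit state
encoded = layers.LSTM(latent_dim)(inputs)
# Feed the same 3-dimensional code to the decoder at each of the 10 steps
repeated = layers.RepeatVector(timesteps)(encoded)
# Decoder: emit one value per time step
decoded = layers.LSTM(1, return_sequences=True)(repeated)

lstm_autoencoder = keras.Model(inputs, decoded)
lstm_autoencoder.compile(optimizer="adam", loss="mse")

# Random placeholder data standing in for the scaled return windows
x = np.random.default_rng(0).random((64, timesteps, 1)).astype("float32")
reconstruction = lstm_autoencoder.predict(x, verbose=0)
print(lstm_autoencoder.count_params(), reconstruction.shape)
```

The decoder receives the identical 3-dimensional vector at every step, so the whole window has to be recovered from that single fixed-length summary, which is the demanding case the quote describes.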
Bonus: deep autoencoder + synthetic data
The idea behind an autoencoder is to reduce an original high dimensionality to a lower dimensionality. In our case, the values of this high-dimensional space are continuous between 0 and 1 thanks to the normalization scheme that we are applying. However, if we discretize this 0 to 1 range into, say, 10 bins, all of a sudden we are reducing the whole 0–1 spectrum to 10 simple categories. Now, if our windows are 10 returns long, using this “discretized” continuous space one could easily generate any of the 10 to the power of 10 possible combinations or “discretized time series”. This “synthetic” dataset could be used as a training dataset to enrich our model and teach it to understand parts of the price space that were undersampled in the Bitcoin time series. What a nice idea, isn’t it?
Or… is it? The results are bittersweet. Some of the progressions improve a lot (compare for instance the first column to previous models) but others are really bad (for instance the 6th column). This leads me to think that sampling the whole space of possibilities with equal probability is perhaps not an optimal idea. By sampling the whole space equally we are forcing the network to learn to compress the whole space equally, regardless of whether that part of the space is actually relevant for representing Bitcoin or stock prices. We must keep in mind that an autoencoder, like all neural networks, is a function approximator and as such it tries to globally approximate all the data points we use in training. This global optimization inherently means that in order to approximate some values better it will necessarily have to lose performance in approximating others. This suggests that for this idea to work, we should find smarter ways to sample only the relevant space so that the network performs at its best only on relevant compressions.
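The equiprobable sampling discussed above can be sketched as follows. Storing all 10^10 windows of 10 values would take roughly 800 GB as 64-bit floats, so this sketch draws random samples from the grid instead of enumerating it. Both that choice and the use of 10 evenly spaced levels between 0 and 1 are assumptions, and `sample_synthetic_windows` is a hypothetical name.

```python
import numpy as np

def sample_synthetic_windows(n_samples, window=10, n_bins=10, seed=0):
    """Draw windows uniformly from the n_bins ** window discretized grid."""
    rng = np.random.default_rng(seed)
    levels = np.linspace(0.0, 1.0, n_bins)  # the 10 allowed values in [0, 1]
    # Every position picks a bin independently, so all combinations are equiprobable
    bin_ids = rng.integers(0, n_bins, size=(n_samples, window))
    return levels[bin_ids]

synthetic = sample_synthetic_windows(50_000)
print(synthetic.shape, 10 ** 10)
```

Sampling only the relevant part of the space, as suggested above, would mean replacing the uniform `rng.integers` draw with a distribution estimated from real return windows.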
We have seen that autoencoders can be useful to compress the time series of stock returns. If the objective were solely to compress the data, it would be interesting to try other classic dimensionality reduction algorithms such as PCA, which may well prove better at this specific task.
However, the advantage of using autoencoders is that some of their components, such as the encoder, can be separately trained on several independent stock market returns and then re-used in other end-to-end neural networks while still keeping the potential to be globally optimized by back-propagation.
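For the PCA comparison mentioned above, a baseline with the same 10-to-3 compression can be sketched with a plain SVD. This was not part of the experiments in this article. The data is random placeholder data, `pca_compress` is a hypothetical helper, and for brevity the components are fitted and evaluated on the same array, whereas a fair comparison would fit them on the training windows and measure the error on the test windows.

```python
import numpy as np

def pca_compress(x, n_components=3):
    """Project rows of x onto their top principal components and reconstruct."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    codes = centered @ components.T             # (n, 3) compressed representation
    reconstruction = codes @ components + mean  # back to (n, 10)
    return codes, reconstruction

# Random placeholder data standing in for the scaled return windows
x = np.random.default_rng(0).random((500, 10))
codes, x_hat = pca_compress(x)
mse = np.mean((x - x_hat) ** 2)
print(codes.shape, x_hat.shape, round(float(mse), 4))
```

The resulting mean squared error can be compared directly with the autoencoder losses, provided both are computed on the same scaled windows.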
Some of the ideas I’d like to try next are:
- Train the same autoencoders described in this article with data from very different stock market instruments (e.g. non-cryptocurrency assets or other cryptocurrencies) and different time spans to see if the performance increases at exactly the same task evaluated here (Bitcoin data compression).
- Plug a trained encoder directly into other neural networks specialized in future price return prediction or x+1 UP/DOWN price prediction, and compare their performance to that of a network that directly takes in the raw price returns instead of the compressed representation.
As usual, here’s the Jupyter notebook to reproduce my work:
A final call
This work is part of TheMoneyPrintingMachine, a fun project that applies neural networks and state-of-the-art algorithms to finance. I’m looking for other developers who share similar ideals and passion to embark on this journey together. If you have any ideas, please email me at germarros at gmail dot com.
Autoencoders for the compression of stock market data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.