Extended Long Short Term Memory
The GDG HK ML paper club discussed this paper earlier this week, and if time series analysis is something that interests you, you might want to take a look.
In short, the authors have found interesting ways to improve on the idea of Long Short Term Memory (LSTM), which is built on Recurrent Neural Networks (RNNs) and is quite a bit different from the transformer architecture. They replaced the sigmoid gating functions at the core of the memory process with exponential gating (which required adjusting quite a bit of the math under the hood as well), and in doing so were able to represent much more nuanced memory states that capture more complex dependency patterns and are continuously updated during training. This updated version is called xLSTM, for Extended Long Short Term Memory.
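To give a rough feel for the gating change, here is a minimal NumPy sketch contrasting a classic sigmoid gate with an exponential one. The function names and the stabilization shown are my own simplification for illustration; the paper's actual formulation stabilizes the input and forget gates jointly, which is omitted here.

```python
import numpy as np

def sigmoid_gate(x):
    """Classic LSTM gating: squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def exponential_gate(x, m_prev):
    """Exponential gating, sketched: the gate value is exp(x), which is
    unbounded, so a running-max state m is carried along and used to
    rescale the exponent for numerical stability."""
    m = max(m_prev, x)       # stabilizer state carried across time steps
    gate = np.exp(x - m)     # stand-in for exp(x), kept in a safe range
    return gate, m
```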
This updating mechanism is referred to as 'accumulated covariance' and is loosely analogous to attention in transformers, though it works quite differently. In xLSTM, successive updates to the memory matrix (you can think of it as a continuous buildup) compile a comprehensive summary of key-value relationships, creating a 'rich historical record' of these interactions that captures underlying patterns in very nuanced ways.
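As a minimal sketch of what such a covariance-style update looks like (gates reduced to plain scalars and the exponential gating left out, so this illustrates the mechanism rather than reproducing the paper's full matrix-memory cell; function names and dimensions are just for illustration):

```python
import numpy as np

d = 8  # hypothetical key/value dimension

def memory_update(C, n, k, v, i_gate=1.0, f_gate=1.0):
    """One covariance-style update: decay the old memory with the forget
    gate, then add the new key-value outer product scaled by the input gate."""
    C = f_gate * C + i_gate * np.outer(v, k)   # matrix memory accumulates pairs
    n = f_gate * n + i_gate * k                # normalizer state
    return C, n

def memory_recall(C, n, q, eps=1e-6):
    """Read out a value for query q from the accumulated memory matrix."""
    return (C @ q) / max(abs(n @ q), eps)

# Usage: fold one key-value pair into an (initially empty) memory.
C, n = np.zeros((d, d)), np.zeros(d)
C, n = memory_update(C, n, k=np.ones(d), v=np.ones(d))
```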
Some key differences to transformers, which illustrate why this is interesting:
  1. The covariance updates operate on a single key-value pair at a time, whereas the transformer attention mechanism operates on a sequence of key-value pairs. Therefore, the memory recall in xLSTM is more 'precise'.
  2. The covariance update rule uses additive updates to store key-value pairs, while the attention mechanism in transformers relies on weighted sums to retrieve values. Averaging mechanisms are great, but you always lose some detail around the edges, which can be antithetical to precision (see the comparison sketch after this list).
  3. Where xLSTM uses a separate memory matrix to store key-value pairs, transformers use the input sequence itself as the memory. As such, it's easy to see how the output of a transformer-based model is heavily influenced by the input, whereas with xLSTM you can design specific memory recalls that are not shaped by the user's query itself.
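To make the structural difference in these points concrete, here is a small, hypothetical NumPy comparison (gates fixed to one, dimensions arbitrary); it only illustrates the shape of the two mechanisms, not either architecture in full:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16                      # hypothetical feature size and sequence length
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
q = rng.normal(size=d)            # a single query

# xLSTM-style: fold the sequence into a fixed-size memory matrix,
# one additive key-value update per step.
C, n = np.zeros((d, d)), np.zeros(d)
for k, v in zip(keys, values):
    C += np.outer(v, k)
    n += k
xlstm_readout = (C @ q) / max(abs(n @ q), 1e-6)

# Transformer-style: keep the whole sequence around and retrieve with a
# softmax-weighted sum over all values for each query.
scores = keys @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attention_readout = weights @ values
```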
All of this lends itself very well to representing temporal data, capturing the underlying trends and patterns that are common in time-series data more effectively.