OpenTSLM models are designed exactly to capture these subtle signals; that was one of the original motivations. The model integrates the raw time series via cross-attention, using representations learned by a dedicated raw time series encoder.
Can you explain how? If I'm understanding the paper right, the time series encoding is a Conv1D and the cross-attention layer is constrained to output into the token space of a pre-trained LLM. My naive expectation is that these constraints would make the model less expressive / fine-tunable, and so less able to pick up on these kinds of subtle signals.
But obviously ML is an empirical field, so if you found that a constrained architecture worked well in practice, that's an interesting result in its own right.
Sure! There's more after the 1D conv: another transformer that encodes further features of the time series. The LLM can then essentially query this encoder for information via cross-attention, which also lets it capture more subtle patterns. In a way it's similar to how some vision-language models work.
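To make that concrete, here's a minimal PyTorch sketch of the pipeline as I understand it from this thread. All class names, layer sizes, and depths are made up for illustration; the paper's actual implementation may differ:

```python
import torch
import torch.nn as nn

class TimeSeriesEncoder(nn.Module):
    """Sketch: Conv1D patching followed by a small Transformer encoder.
    Dimensions here are illustrative, not the paper's."""
    def __init__(self, d_model=768, patch=16):
        super().__init__()
        # Conv1D turns the raw 1-channel series into patch embeddings
        self.conv = nn.Conv1d(1, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, x):              # x: (batch, length)
        z = self.conv(x.unsqueeze(1))  # (batch, d_model, n_patches)
        z = z.transpose(1, 2)          # (batch, n_patches, d_model)
        return self.transformer(z)     # contextualized time series features

class CrossAttentionBridge(nn.Module):
    """Sketch: the LLM's hidden states act as queries, the time series
    features as keys/values, so the LLM can 'query' the encoder."""
    def __init__(self, d_model=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, llm_hidden, ts_features):
        out, _ = self.attn(query=llm_hidden, key=ts_features, value=ts_features)
        return llm_hidden + out        # residual back into the LLM stream
```

The residual connection is the key to the "constrained output space" point above: the cross-attention only adds a delta to the LLM's existing hidden states, so the pre-trained model keeps working even before the bridge is trained, much like the adapter layers in some vision-language models.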