Sure! There's more after the 1D conv: a separate transformer that encodes further features of the time series. The LLM can then basically query this encoder for information, which lets it capture more subtle patterns. In a way it's similar to how some vision-language models work.
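
To make the "querying" part a bit more concrete, here's a rough sketch. It assumes the LLM reads from the encoder through cross-attention, which is one common way to do it but may not be exactly what this model uses. The names (`cross_attention`, `encoder_out`, `llm_states`) and the toy numbers are made up for illustration, and a real layer would also have learned projection matrices for the queries, keys, and values.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Each query (an LLM hidden state) gets a weighted average of the
    # values (encoder outputs). The weights come from the scaled
    # dot-product similarity between the query and each key.
    # Assumption: no learned projections, single head.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy encoder output: 3 time-series positions, 2 feature dimensions.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One LLM token whose hidden state points strongly along feature 0.
llm_states = [[10.0, 0.0]]

attended = cross_attention(llm_states, encoder_out, encoder_out)
print(attended)
```

The token ends up mostly attending to the encoder positions that have a large first feature, so what it reads back is dominated by those positions. That's the sense in which the LLM "asks" the encoder for whatever is relevant to it.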