How is this technically done? How does it split the query and aggregates the results?
From the readme:
More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.
The maximum number of nodes is equal to the number of KV heads in the model #70.
I found this[1] article nice for an overview of the parallelism modes.
[1]: https://medium.com/@chenhao511132/parallelism-in-llm-inferen...
From the readme:
More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.
The maximum number of nodes is equal to the number of KV heads in the model #70.
I found this[1] article nice for an overview of the parallelism modes.
[1]: https://medium.com/@chenhao511132/parallelism-in-llm-inferen...