can someone help me understand how the following can be true:
1. TPU's are a serious competitor to nvidia chips.
2. Chip makers with the best chips are valued at 1-3.5T.
3. Google's market cap is 2T.
4. It is correct for google to not sell TPU's.
i have heard the whole, its better to rent them thing, but if they're actually good selling them is almost as good a business as every other part of the company.
I believe Broadcom is also very involved in the making of the TPU's and networking infrastructure and they are valued at 1.2T currently. Maybe consider the combined value of Broadcom and Google.
If they think they’ve got a competitive advantage vs. GPUs which benefits one of their core products, it would make sense to retain that competitive advantage for the long term, no?
Selling them and supporting that in the field requires quite some infrastructure you'd have to build. Why go through all that trouble if you already make higher margins renting them out?
Also, if they are so good, it's best to not level the playing field by sharing that with your competitors.
Also "chip makers with the best chips" == Nvidia, there aren't many others. And Alphabet does more than just produce TPUs.
Google is saving a ton of money by making TPUs, which will pay off in the future when AI is better monetized, but so far no one is directly making a massive profit from foundation models. It's a long term play.
Can you suggest a good reference for understanding which algorithms map well onto the regular grid systolic arrays used by TPUs? The fine article says dese matmul and convolution are good, but is there anything else? Eigendecomposition? SVD? matrix exponential? Solving Ax = b or AX = B? Cholesky?
SVD/eigendecomposition will often boil down to making many matmul (e.g. when using Krylov-based methods, e.g. Arnoldi, Krylov-schur, etc.), so I would expect TPU to work well there. GMRES, one method to solve Ax = b is also based on Arnoldi decomp.
I think https://jax-ml.github.io/scaling-book/ is one of the best references to go through. It details how single device and distributed computations map to TPU hardware features. The emphasis is on mapping the transformer computations, both forwards and backwards, so requires some familiarity with how transformer networks are structured.
LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it gets better results (creativity) and helps avoid loops, but you can turn that off (temp zero for simple samplers).
They don't affect determinism of the results but different architectures have different determinism guarantees with respect to performance, as a result of scheduling and other things.
TPUs share a similar lineage to the Groq TPU accelerators (disclaimer: I work at Groq) which are actually fully deterministic which means not only do you get deterministic output, you get it in a deterministic number of cycles.
There is a trade off though, making the hardware deterministic means you give up HW level scheduling and other sources of non-determinism. This makes the architecture highly dependent on a "sufficiently smart compiler". TPUs and processors like them are generally considered VLIW and are all similarly dependent on the compiler doing all the smart scheduling decisions upfront to ensure good compute/IO overlap and eliminating pipeline bubbles etc.
GPUs on the other hand have very sophisticated scheduling systems on the chips themselves along with stuff like kernel swapping etc that make them much more flexible, less dependent on the compiler and generally easier to reach a fairly high utilisation of the processor without too much work.
TLDR:
TPUs MAY have deterministic cycle guarantees.
GPUs (of the current generation/architectures) cannot because they use non-deterministic scheduling and memory access patterns.
Both still produce deterministic output for deterministic programs.
> In essence, caches allow hardware to be flexible and adapt to a wide range of applications. This is a large reason why GPUs are very flexible hardware (note: compared to TPUs).
this is correct but mis-stated - it's not the caches themselves that cost energy but MMUs that automatically load/fetch/store to cache on "page faults". TPUs don't have MMUs and furthermore are a push architecture (as opposed to pull).
Everything thats in the blog post is basically well known already. Google publishes papers and gives talks about their TPUs. Many details are lacking though, and require some assumptions/best guesses. Jax and XLA are (partially) open source and give clues about how TPUs work under the hood as well.
This is not the only way though. TPUs are available to companies operating on GCP as an alternative to GPUs with a different price/performance point. That is another way to get hands-on experience with TPUs.
can someone help me understand how the following can be true:
1. TPU's are a serious competitor to nvidia chips.
2. Chip makers with the best chips are valued at 1-3.5T.
3. Google's market cap is 2T.
4. It is correct for google to not sell TPU's.
i have heard the whole, its better to rent them thing, but if they're actually good selling them is almost as good a business as every other part of the company.
I believe Broadcom is also very involved in the making of the TPU's and networking infrastructure and they are valued at 1.2T currently. Maybe consider the combined value of Broadcom and Google.
If they think they’ve got a competitive advantage vs. GPUs which benefits one of their core products, it would make sense to retain that competitive advantage for the long term, no?
Aren't Google's TPUs a bit like a research project with practical applications as a nice side effect?
Selling them and supporting that in the field requires quite some infrastructure you'd have to build. Why go through all that trouble if you already make higher margins renting them out?
Also, if they are so good, it's best to not level the playing field by sharing that with your competitors.
Also "chip makers with the best chips" == Nvidia, there aren't many others. And Alphabet does more than just produce TPUs.
Nvidia is selling a ton of chips on hype.
Google is saving a ton of money by making TPUs, which will pay off in the future when AI is better monetized, but so far no one is directly making a massive profit from foundation models. It's a long term play.
Also, I'd argue Nvidia is massively overvalued.
Can you suggest a good reference for understanding which algorithms map well onto the regular grid systolic arrays used by TPUs? The fine article says dese matmul and convolution are good, but is there anything else? Eigendecomposition? SVD? matrix exponential? Solving Ax = b or AX = B? Cholesky?
SVD/eigendecomposition will often boil down to making many matmul (e.g. when using Krylov-based methods, e.g. Arnoldi, Krylov-schur, etc.), so I would expect TPU to work well there. GMRES, one method to solve Ax = b is also based on Arnoldi decomp.
I think https://jax-ml.github.io/scaling-book/ is one of the best references to go through. It details how single device and distributed computations map to TPU hardware features. The emphasis is on mapping the transformer computations, both forwards and backwards, so requires some familiarity with how transformer networks are structured.
Anything that you can express as 128x128 (but ideally much larger) dense matrix multiplication and nothing else
does that cooling channel have a NEMA stepper on it as a pump or metering valve?[0]
If so, wild. That seems like overkill.
[0]: https://henryhmko.github.io/posts/tpu/images/tpu_tray.png
definitely closed-loop, might even be a servo
ELI5: how (specifically) do GPU and TPU optimisations effect determinism in LLMs? Or is this all a myth?
LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it gets better results (creativity) and helps avoid loops, but you can turn that off (temp zero for simple samplers).
+ can also just pin the seed instead, right?
They don't affect determinism of the results but different architectures have different determinism guarantees with respect to performance, as a result of scheduling and other things.
TPUs share a similar lineage to the Groq TPU accelerators (disclaimer: I work at Groq) which are actually fully deterministic which means not only do you get deterministic output, you get it in a deterministic number of cycles.
There is a trade off though, making the hardware deterministic means you give up HW level scheduling and other sources of non-determinism. This makes the architecture highly dependent on a "sufficiently smart compiler". TPUs and processors like them are generally considered VLIW and are all similarly dependent on the compiler doing all the smart scheduling decisions upfront to ensure good compute/IO overlap and eliminating pipeline bubbles etc.
GPUs on the other hand have very sophisticated scheduling systems on the chips themselves along with stuff like kernel swapping etc that make them much more flexible, less dependent on the compiler and generally easier to reach a fairly high utilisation of the processor without too much work.
TLDR: TPUs MAY have deterministic cycle guarantees. GPUs (of the current generation/architectures) cannot because they use non-deterministic scheduling and memory access patterns. Both still produce deterministic output for deterministic programs.
> In essence, caches allow hardware to be flexible and adapt to a wide range of applications. This is a large reason why GPUs are very flexible hardware (note: compared to TPUs).
this is correct but mis-stated - it's not the caches themselves that cost energy but MMUs that automatically load/fetch/store to cache on "page faults". TPUs don't have MMUs and furthermore are a push architecture (as opposed to pull).
How can someone have this level of knowledge about TPUs without working at Google?
Everything thats in the blog post is basically well known already. Google publishes papers and gives talks about their TPUs. Many details are lacking though, and require some assumptions/best guesses. Jax and XLA are (partially) open source and give clues about how TPUs work under the hood as well.
https://arxiv.org/abs/2304.01433
https://jax-ml.github.io/scaling-book/
From the acknowledgment at the end, I guess the author has access to TPUs through https://sites.research.google/trc/about/
This is not the only way though. TPUs are available to companies operating on GCP as an alternative to GPUs with a different price/performance point. That is another way to get hands-on experience with TPUs.
A quick free way to access TPUs is through https://colab.research.google.com, Runtime / Change Runtime Type / v2-8 TPU
Cool article!
[flagged]