Having clocks synchronized between your servers is extremely useful. For example, a guarantee that a packet's arrival timestamp (measured by the clock on the destination) is ALWAYS later than the timestamp recorded by the sender is a huge win, especially for things like database scaling.
For this, though, you need to go beyond NTP to PTP, which is still usually based on GPS time and atomic clocks.
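A minimal sketch of why the sync error matters for that guarantee (hypothetical numbers, not from any particular system): if every clock is within ±eps of true time, the receive timestamp only provably exceeds the send timestamp when the one-way delay is larger than 2*eps.

    # Toy check: can a packet ever appear to arrive "before" it was sent?
    # Assumes every clock is within +/- eps seconds of true time.

    def arrival_always_later(one_way_delay_s: float, eps_s: float) -> bool:
        # Worst case: sender's clock runs eps fast, receiver's clock eps slow,
        # so the observed difference is true_delay - 2*eps.
        return one_way_delay_s - 2 * eps_s > 0

    # A 200 us datacenter hop with NTP-grade error (~1 ms) can look reversed;
    # with PTP-grade error (~1 us) it cannot.
    print(arrival_always_later(200e-6, 1e-3))  # False
    print(arrival_always_later(200e-6, 1e-6))  # True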
It's actually interesting to think about what UTC really means; there seems to be no absolute source of truth [0]. I guess the worry is not so much about the NTP servers (for which people should configure failovers anyway) but the clocks themselves.
[0] https://www.septentrio.com/en/learn-more/insights/how-gps-br...
Could you define an absolute source of truth based on extrinsic features? Something like taking an intrinsic time from atomic sources, pegged to an astronomical or celestial event, then using a predicted astronomical event to reconcile time in the future.
It might be difficult to find measurable events with enough resolution that we can predict them accurately enough? Like, I'm guessing the start of a transit or alignment event? Maybe something like predicting the time at which a laser pulse will return from a lunar reflector -- if we can make that prediction accurately enough, then we can re-establish time back to the current fixed scale.
I think I'm addressing an event that won't ever happen (all precise and accurate time sources are lost/perturbed), and if it does it won't be important to re-sync in this way. But you know...
Spanner depends on having a time source with bounded error to maintain consistency. Google accomplishes this by having GPS and atomic clocks in several datacenters.
https://static.googleusercontent.com/media/research.google.c...
https://static.googleusercontent.com/media/research.google.c...
And more importantly, the tighter the time bound, the higher the performance, so more accurate clocks easily pay for themselves through the infrastructure costs saved serving the same number of users.
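The link between the bound and performance is roughly the commit-wait idea from the Spanner/TrueTime papers: a writer sits out the clock uncertainty before acknowledging, so every microsecond shaved off the bound is shaved off every write. A rough sketch of the concept (my own simplification with made-up numbers, not Google's code):

    import time

    # eps_s is the current bound on clock error; TrueTime exposes "now" as an
    # interval [earliest, latest] that is guaranteed to contain true time.
    def now_interval(eps_s: float) -> tuple[float, float]:
        t = time.time()
        return (t - eps_s, t + eps_s)

    def commit(eps_s: float) -> float:
        commit_ts = now_interval(eps_s)[1]   # pick the latest possible time
        # Commit-wait: don't acknowledge until commit_ts is definitely in the
        # past on every clock, which takes roughly 2 * eps_s.
        while now_interval(eps_s)[0] <= commit_ts:
            time.sleep(eps_s / 10)
        return commit_ts

    # With eps of a few milliseconds every write stalls for several ms while
    # holding locks; drop eps to microseconds and that stall all but vanishes.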
There's a lot of focus in this thread on the atomic clocks but in most datacenters, they're not actually that important and I'm dubious that the hyperscalers actually maintain a "fleet" of them, in the sense that there are hundreds or thousands of these clocks in their datacenters.
The ultimate goal is usually to have a bunch of computers all around the world run synchronised to one clock, within some very small error bound. This enables fancy things like [0].
Usually, this is achieved by having some master clock(s) for each datacenter, which distribute time to other servers using something like NTP or PTP. These clocks, like any other clock, need two things to be useful: an oscillator, to provide ticks, and something by which to set the clock.
In standard off-the-shelf hardware, like the Intel E810 network card, you'll have an OCXO, like [1], with a GPS module. The OCXO provides the ticks; the GPS module provides a timestamp to set the clock with and a pulse for when to set it.
As long as you have GPS reception, even this hardware is extremely accurate. The GPS module provides a new timestamp, potentially accurate to within single-digit nanoseconds ([2] datasheet), every second. These timestamps can be used to adjust the oscillator and/or how its ticks are interpreted, such that you maintain accuracy between the timestamps from GPS.
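As a concrete (assumed, simplified) picture of that adjustment, something like the toy servo below: each second, compare the local clock against the GPS timestamp at the PPS edge, step the clock onto it, and trim the frequency so the next offset is smaller.

    # Toy GPS-disciplined oscillator (illustration only, not the E810 driver).
    true_time = 0.0
    local_time = 0.0
    freq_error = 50e-9   # oscillator running 50 ppb fast (made-up figure)
    steer = 0.0          # accumulated frequency correction

    for second in range(10):
        true_time += 1.0
        local_time += 1.0 * (1 + freq_error - steer)
        offset = local_time - true_time   # revealed by the per-second GPS fix
        local_time -= offset              # step the clock onto GPS time
        steer += 0.5 * offset             # trim the frequency (P-only servo)
        print(f"s={second}: offset {offset * 1e9:6.1f} ns, steer {steer * 1e9:5.1f} ppb")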
The problem comes when you lose GPS. Once this happens, you become dependent on the accuracy of the oscillator. An OCXO like [1] can hold to within 1µs accuracy over 4 hours without any corrections, but if you need better than that (either more time below 1µs, or better than 1µs over the same period), you need a better oscillator.
The best oscillators are atomic oscillators. [3], for example, can maintain better than 200ns accuracy over 24h.
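Those holdover figures are easy to sanity-check: the accumulated time error is roughly the average fractional frequency error times the holdover interval (ignoring temperature and aging curvature).

    # Back-of-the-envelope holdover arithmetic.
    def holdover_error_ns(frac_freq_error: float, hours: float) -> float:
        return frac_freq_error * hours * 3600 * 1e9

    print(1e-6 / (4 * 3600))             # 1 us over 4 h   -> avg error ~7e-11
    print(200e-9 / (24 * 3600))          # 200 ns over 24 h -> ~2.3e-12, atomic territory
    print(holdover_error_ns(7e-11, 4))   # ~1000 ns, i.e. about 1 us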
So for a datacenter application, I think the main reason for an atomic clock is simply retaining extreme accuracy in the event of an outage. For quite reasonable accuracy, a more affordable OCXO works perfectly well.
[0]: https://docs.cloud.google.com/spanner/docs/true-time-externa...
[1]: https://www.microchip.com/en-us/product/OX-221
[2]: https://www.u-blox.com/en/product/zed-f9t-module
[3]: https://www.microchip.com/en-us/products/clock-and-timing/co...
I don't know about all hyperscalers, but I know of one that has a large enough fleet of atomic frequency standards to warrant dedicated engineering: several dozen frequency standards at least, possibly low hundreds. Definitely not one per machine, but also not just one per datacenter.
As you say, the goal is to keep the system clocks on the server fleet tightly aligned, to enable things like TrueTime. But also to have sufficient redundancy and long enough holdover in the absence of GNSS (usually due to hardware or firmware failure on the GNSS receivers) that the likelihood of violating the SLA on global time uncertainty is vanishingly small.
The "global" part is what pushes towards having higher end frequency standards, they want to be able to freewheel for O(days) while maintaining low global uncertainty. Drifting a little from external timescales in that scenario is fine, as long as all their machines drift together as an ensemble.
The deployment I know of was originally rubidium frequency standards disciplined by GNSS, but later that got upgraded to cesium standards to increase accuracy and holdover performance. Likely using an "industrial grade" cesium standard that's fairly readily available, very good but not in the same league as the stuff NIST operates.
GPS satellites have their own atomic clocks. They're synchronized to clocks at the GPS master control station at Schriever Space Force Base, Colorado, formerly Falcon AFB. GPS time is in turn steered to UTC as kept by the US Naval Observatory. GPS has a lot of ground infrastructure checking on the satellites, plus backup control centers, so it should continue to work fine even if it accumulates some absolute error against that reference timescale. Unless there have been layoffs.
> There's a lot of focus in this thread on the atomic clocks but in most datacenters, they're not actually that important and I'm dubious that the hyperscalers actually maintain a "fleet" of them, in the sense that there are hundreds or thousands of these clocks in their datacenters.
I mean, fleets come in all sizes; but if you put one atomic reference in each AZ of each region, there's a fleet. Maybe the references aren't great at distributing time, so you add a few NTP distributors per datacenter too and your fleet is a little bigger. Google's got 42 regions in GCP, so they've got a case for hundreds of machines for time (plus they've invested in Spanner, which has some pretty strict needs); other clouds are likely similar.
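Roughly, with made-up per-zone numbers just to show how quickly that reaches the hundreds:

    regions = 42                # GCP region count cited above
    zones_per_region = 3        # typical, assumed
    refs_per_zone = 1           # one GNSS/atomic reference per zone, assumed
    distributors_per_zone = 2   # a couple of NTP/PTP distributors, assumed

    print(regions * zones_per_region * (refs_per_zone + distributors_per_zone))  # 378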