The Network group at $JOB (which I’m part of) is responsible for maintaining time. By that I mean we run the systems that keep time accurate across the campus. Here’s how we do it:
First, we have three devices on our network that pull time from an external reference: currently, the two that are running use the CDMA signal from two different cellular providers. These are considered Stratum 1 devices.
Next, we have two servers that query those two devices, which makes the servers Stratum 2. These two servers are the ones available for time queries on campus; they are where every other computer gets its time.
Several days ago I heard some of the security/data center folk discussing time, and that it seemed to be a bit off. I didn’t think much about it at the time, until yesterday, when I got visited by some of the DCI folk. It turns out that time was way off in several cases, and they indicated that they saw that our two time servers were at least seven seconds apart from one another.
We do monitor the servers, but we’re not notified of a problem unless their clocks are a full minute apart.
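A seven-second gap sails right under a one-minute alert. Here’s a rough sketch of a tighter check, assuming you already have each server’s offset (in seconds) from some trusted reference. The function name, the server names, and the 0.5-second threshold are all made up for illustration; this is not what our monitoring actually runs.

```python
from itertools import combinations


def disagreeing_pairs(offsets_s, threshold_s):
    """Return (name_a, name_b, gap_s) for every pair of servers whose
    clocks differ by more than threshold_s seconds."""
    flagged = []
    for (name_a, off_a), (name_b, off_b) in combinations(sorted(offsets_s.items()), 2):
        gap = abs(off_a - off_b)
        if gap > threshold_s:
            flagged.append((name_a, name_b, gap))
    return flagged


# Hypothetical readings: each server's offset from a trusted reference, in seconds.
readings = {"ntp1": 0.0, "ntp2": 7.0}

print(disagreeing_pairs(readings, threshold_s=60.0))  # one-minute threshold: []
print(disagreeing_pairs(readings, threshold_s=0.5))   # tighter: [('ntp1', 'ntp2', 7.0)]
```

The right threshold depends on what your clients can tolerate; the point is only that it should be well below the drift you actually care about.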
So I took a close look at the two of them. Not being totally familiar with NTP beyond initially configuring ntpd to query servers, I did a little digging.
There’s a handy command called ntpq. Running “ntpq -p” asks the local ntpd to list the peers it’s using for time sync. You’ll get something like this (delay, offset, and jitter are in milliseconds; when and poll are in seconds):
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+(timekeeperDNS) .CDMA.           1 u   81  128  377    0.646   -0.138   0.292
*(timekeeprDNS2) .CDMA.           1 u    9  128  377    0.610   -0.486   0.203
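If you’d rather check this from a script than by eye, output in this shape is easy to pick apart. Below is a minimal sketch, assuming the usual ten whitespace-separated columns with a one-character status marker in front of each peer line. The Peer class and parse_ntpq_peers function are my own names, not part of any NTP tooling.

```python
from dataclasses import dataclass

SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+(timekeeperDNS) .CDMA.           1 u   81  128  377    0.646   -0.138   0.292
*(timekeeprDNS2) .CDMA.           1 u    9  128  377    0.610   -0.486   0.203
"""


@dataclass
class Peer:
    tally: str        # status marker in the first column, e.g. "*", "+", "x"
    remote: str
    refid: str
    stratum: int
    delay_ms: float
    offset_ms: float
    jitter_ms: float


def parse_ntpq_peers(text):
    """Parse the peer lines that follow the '====' ruler in ntpq -p output."""
    peers = []
    seen_ruler = False
    for line in text.splitlines():
        if line.startswith("="):
            seen_ruler = True
            continue
        if not seen_ruler or not line.strip():
            continue
        fields = line[1:].split()  # everything after the one-character marker
        if len(fields) != 10:
            continue  # not a peer line in the shape this sketch expects
        remote, refid, st, _type, _when, _poll, _reach, delay, offset, jitter = fields
        peers.append(Peer(line[0], remote, refid, int(st),
                          float(delay), float(offset), float(jitter)))
    return peers


for peer in parse_ntpq_peers(SAMPLE):
    print(peer.tally, peer.remote, peer.stratum, peer.offset_ms)
```

A monitoring check could then alert when no peer carries the “*” marker, which is exactly the state our two servers ended up in.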
This output is somewhat sanitized, but you get the idea: “*” marks the source currently being used to keep time, and “+” marks a viable fallback if that one fails. The catch comes when both sources report the same “st” (stratum) value, as they do here. If you are syncing from two sources at the same stratum, their offsets have to be close enough to each other for both to be considered valid. If the offsets differ too much, and the “delay” (the round-trip time of the client’s query to that source) can’t account for the difference, both sources are marked with an “x” (a falseticker), and neither is used to sync the client’s clock.
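To make the “delay can’t account for the difference” part concrete, here is a deliberately simplified sketch. It treats each source as claiming the true time lies within half the round-trip delay of its offset, and says two sources agree if those ranges overlap. That is my simplification: as far as I understand it, the real ntpd selection logic also folds in dispersion and works across all configured sources, so treat this as an illustration of the idea rather than what ntpd does.

```python
def correctness_interval(offset_ms, delay_ms):
    """Range the true offset could fall in, assuming the error is at most
    half the round-trip delay (a simplification of what ntpd uses)."""
    half = delay_ms / 2.0
    return (offset_ms - half, offset_ms + half)


def sources_agree(a, b):
    """a and b are (offset_ms, delay_ms) pairs; True if their ranges overlap."""
    lo_a, hi_a = correctness_interval(*a)
    lo_b, hi_b = correctness_interval(*b)
    return max(lo_a, lo_b) <= min(hi_a, hi_b)


# The two sources from the ntpq output above: offsets well under a millisecond apart.
print(sources_agree((-0.138, 0.646), (-0.486, 0.610)))   # True

# Hypothetical: one source is a full second (1000 ms) off, e.g. a missed leap second.
print(sources_agree((-0.138, 0.646), (1000.0, 0.610)))   # False
```

With only two sources there is no third opinion to break the tie, which is why a disagreement like the second case leaves nothing to sync to.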
In our case, this left both “time servers” with no source they trusted, so each started keeping its own time. Eventually, their clocks drifted far enough apart that everyone else using them as sources stopped considering them valid.
It turns out that one of our time devices apparently failed to account for the leap second on June 30.
Keep a closer eye on your time sources if they’re the same stratum. Otherwise, either use one device, or devices on different strata.
Oh … one other thing: Some systems don’t react well to sudden jumps in time. Others may not auto-adjust properly if the time on the client is off too far from the source.
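For a feel of what “too far” means, here is a small sketch based on ntpd’s default behavior as I understand it: offsets up to 128 ms are corrected gradually (slewed), larger ones are corrected with a jump (stepped), and anything beyond 1000 seconds makes ntpd give up rather than adjust. Both thresholds can be changed in configuration, other NTP clients behave differently, and the function name is mine.

```python
# Assumed ntpd defaults; check your own configuration before relying on these.
STEP_THRESHOLD_S = 0.128
PANIC_THRESHOLD_S = 1000.0


def correction_for(offset_s):
    """Classify how a client with ntpd-like defaults would handle an offset."""
    size = abs(offset_s)
    if size > PANIC_THRESHOLD_S:
        return "refuse"  # too far off: the daemon exits instead of adjusting
    if size > STEP_THRESHOLD_S:
        return "step"    # clock jumps to the new time
    return "slew"        # clock is nudged gradually


for offset in (0.05, 7.0, 3600.0):
    print(offset, correction_for(offset))
```

By this reckoning, a client that was seven seconds off would have its clock stepped, which is the kind of sudden jump some systems don’t handle gracefully.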