Friday, April 24, 2015

The earth is slowing down, we need another leap second!

The earth’s rotation is slowing down. So we need to introduce a leap second in the world’s standard time this June. Are you and your company prepared?

I think we have enough anecdotal evidence that when this happened in 2012 it caused some havoc. Several huge global systems went down, reportedly including many GPS satellite systems.

At the company I work for, one of my responsibilities is keeping system time correct at all times on our 13,000 workstations and 2,500 servers. As a huge Doctor Who fan, I enjoy this. Within health care, accurate time is essential. It's not just about having an accurate legal record for births and deaths; different computer systems have to be able to exchange information fluidly.

If system ‘A’ thinks the time is 12:04:58 pm and it receives information from system ‘B’ dated 12:04:59 pm, then system ‘A’ interprets the information as being dated one second in the future. Many systems will not handle this scenario well.
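As a rough illustration of that scenario, here is a minimal Python sketch of the check a receiving system might make. The function name `is_from_future`, the tolerance parameter and the calendar date are my own assumptions for the example, not taken from any particular product.

```python
from datetime import datetime, timedelta, timezone


def is_from_future(local_now: datetime, message_time: datetime,
                   tolerance: timedelta = timedelta(0)) -> bool:
    """Return True if the message is stamped later than our own clock allows."""
    return message_time - local_now > tolerance


# System 'A' believes it is 12:04:58; system 'B' stamped its message 12:04:59.
# The date is arbitrary and only there to build valid datetime objects.
a_now = datetime(2015, 6, 30, 12, 4, 58, tzinfo=timezone.utc)
b_stamp = datetime(2015, 6, 30, 12, 4, 59, tzinfo=timezone.utc)

strict = is_from_future(a_now, b_stamp)
lenient = is_from_future(a_now, b_stamp, tolerance=timedelta(seconds=2))
print(strict, lenient)
```

A system with no tolerance treats the message as coming from the future, while one that allows a couple of seconds of skew accepts it. How a real application reacts depends entirely on how it was written.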

I am recommending that we survive this episode using Google’s smear method.

Here’s the science bit
Usually when a leap second is almost due, NTP says a server must indicate this to its clients by setting the “Leap Indicator” (LI) field in its response. This indicates that the last minute of that day will have 61 seconds, or 59 seconds. (Leap seconds can, in theory, be used to shorten a day too, although that hasn’t happened to date, since the earth's rotation has been slowing over the long term.) Rather than doing this, Google applied a patch to the NTP server software on their internal Stratum 2 NTP servers to not set LI and instead tell a small “lie” about the time, modulating this “lie” over a time window w before midnight.
What this did was make sure that the “lie” they were telling their servers about the time wouldn’t trigger any undesirable behavior in the NTP clients, such as causing them to suspect the time servers to be wrong and applying local corrections themselves.
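To make the smear a little more concrete, here is a small Python sketch. It assumes the cosine curve from Google's 2011 write-up as I remember it, lie(t) = (1.0 - cos(pi * t / w)) / 2.0, where t is the number of seconds into the window. The function name `smear_lie` and the 24-hour window are my own choices for illustration, and Google's actual server patch is not public in this form.

```python
import math


def smear_lie(t: float, w: float) -> float:
    """Fraction of the leap second (0.0 to 1.0) already hidden at t seconds
    into a smear window that is w seconds long."""
    if t <= 0:
        return 0.0
    if t >= w:
        return 1.0
    return (1.0 - math.cos(math.pi * t / w)) / 2.0


WINDOW = 24 * 60 * 60  # assumed window length in seconds

for hours in (0, 6, 12, 18, 24):
    print(hours, round(smear_lie(hours * 3600, WINDOW), 4))
```

The curve starts and ends gently and is steepest in the middle of the window, which is the point: clients see a clock that drifts slightly rather than one that jumps.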

One thing I think we should consider is that pool.ntp.org (where our Active Directory forest gets its time) is a round-robin of NTP servers (thousands of them), and therefore it is hard to know if you are going to hit a server that correctly supports the leap flag in the NTP protocol. This flag is set 24 hours before the leap second is due to be implemented. If your NTP system supports leap flags (our domain controllers do), then seeing the flag prepares it to handle the event and not freak out.
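For anyone who wants to check what a given server is actually sending, the Leap Indicator lives in the top two bits of the first byte of an NTP packet, followed by a 3-bit version number and a 3-bit mode. The Python sketch below only decodes that byte; it does not query a server, and the names `decode_first_byte` and `LEAP_MEANING` are mine.

```python
from typing import Tuple

# Meaning of the 2-bit Leap Indicator field.
LEAP_MEANING = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock not synchronized)",
}


def decode_first_byte(first_byte: int) -> Tuple[int, int, int]:
    """Split the first byte of an NTP packet into (leap indicator, version, mode)."""
    leap = (first_byte >> 6) & 0b11
    version = (first_byte >> 3) & 0b111
    mode = first_byte & 0b111
    return leap, version, mode


# 0x64 is 01 100 100 in binary: LI=1, version 4, mode 4 (server).
leap, version, mode = decode_first_byte(0x64)
print(leap, version, mode, LEAP_MEANING[leap])
```

A reply whose first byte decodes to LI=0 on June 30th would be the "never saw the flag" case described above.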

So the worst-case scenario is that we receive time from an NTP server that does not pass us the leap flag and, even though we support the flag, we freak out because we never saw the flag and didn't expect the change. Also, even if Microsoft Windows handles the unexpected change, if our domain services do not receive the flag, then we won't pass it on to downstream UNIX systems that might not handle the change. The implications of this would ripple through the enterprise to all workstations, servers and devices that look to the domain for time. That would be a shame, since the incident would be caused by systems beyond our control.

My advice (after testing) is that we shift our time source from pool.ntp.org to Google's time servers (there are 4 of them) and allow them to take us through the 48 hours surrounding the leap second using their smear. The smear will adjust the time 43.2 million times over 24 hours (about 23 nanoseconds per adjustment). We regularly adjust time more aggressively than that. After this period, once the world has settled down, we can switch back to pool.ntp.org.
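The numbers above are easy to sanity-check. This short Python sketch takes the 43.2 million adjustments and the 24-hour window from the paragraph above and assumes, purely to get an average, that the second is spread evenly across them. The variable names are mine.

```python
LEAP_SECOND_NS = 1_000_000_000   # one second, in nanoseconds
ADJUSTMENTS = 43_200_000         # figure quoted above
WINDOW_S = 24 * 60 * 60          # 24 hours, in seconds

step_ns = LEAP_SECOND_NS / ADJUSTMENTS     # average size of one adjustment
rate_per_s = ADJUSTMENTS / WINDOW_S        # adjustments per second
slew_ppm = 1 / WINDOW_S * 1_000_000        # average clock slew, parts per million

print(f"{step_ns:.2f} ns per adjustment")
print(f"{rate_per_s:.0f} adjustments per second")
print(f"{slew_ppm:.2f} ppm average slew")
```

That works out to one adjustment every 2 milliseconds and an average slew of under 12 parts per million, which is why I am comfortable saying we already adjust time more aggressively than this.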

I feel this will protect our domain time services. It then comes down to all those outlying systems that do not get time from the domain.

Assuming we get through this without incident, this will be a great win for the department, especially if other systems across the world choke.

I will update this blog as this situation develops.

Update #1 4/27
Cisco have advised that the majority of their appliances may have issues, including our core routers, Global Site Selector hardware and UCS blades. They recommend removing time sources prior to the change and putting them back afterwards. We are putting a team together to ensure all systems are pointing to domain time (like they should be). That way, if we use the Google smear, we should be OK. The team will include engineers representing clinical engineering, telecom, networks etc.

Update #2 5/14
I probably should have been clearer: this event is at midnight UTC (also known as GMT), so be sure to calculate the local time for yourself. I am in Mountain Daylight Time right now, so for me the event will be at 17:59:59 on June 30th.
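If you want to work out your own local time, a few lines of Python will do it. This sketch uses a fixed UTC-6 offset for Mountain Daylight Time rather than a time zone database, so swap in your own offset. It converts 23:59:59 UTC, the last ordinary second before the leap second, because Python's `datetime` cannot represent 23:59:60.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Mountain Daylight Time (UTC-6); change this for your zone.
MDT = timezone(timedelta(hours=-6), "MDT")

# Last ordinary second before the leap second is inserted.
last_second_utc = datetime(2015, 6, 30, 23, 59, 59, tzinfo=timezone.utc)

local = last_second_utc.astimezone(MDT)
print(local.isoformat())  # 2015-06-30T17:59:59-06:00
```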

