The earth’s rotation is
slowing down. So we need to introduce a leap second in the world’s standard
time this June. Are you and your company prepared?
I think we have enough
anecdotal evidence that when this happened in 2012 it caused some havoc:
several huge global systems went down, reportedly including many GPS satellite systems.
At the company I work
for, one of my responsibilities is keeping system time correct at all times
on our 13,000 workstations and 2,500 servers. As a huge Doctor Who fan, I enjoy
this. Within health care, accurate time is essential. It's not just about having
an accurate legal record for births and deaths; different computer systems have
to be able to exchange information fluidly.
If system ‘A’ thinks the
time is 12:04:58 pm and it receives information from system ‘B’ dated 12:04:59
pm, then system ‘A’ interprets the information as being dated one second in the
future. Many systems will not handle this scenario well.
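To make the system ‘A’ / system ‘B’ scenario concrete, here is a minimal Python sketch of the kind of check a receiving system might perform. The function name, the tolerance parameter and the date are my own inventions for illustration; no real product is being quoted.

```python
from datetime import datetime, timedelta

# Hypothetical default: tolerate no clock skew at all.
NO_TOLERANCE = timedelta(0)

def is_from_future(local_now, message_time, tolerance=NO_TOLERANCE):
    """Return True when a message is stamped later than our own clock allows."""
    return message_time - local_now > tolerance

# System 'A' believes it is 12:04:58 pm; system 'B' stamped its data 12:04:59 pm.
system_a_now = datetime(2015, 6, 30, 12, 4, 58)
system_b_stamp = datetime(2015, 6, 30, 12, 4, 59)

strict = is_from_future(system_a_now, system_b_stamp)
lenient = is_from_future(system_a_now, system_b_stamp, timedelta(seconds=2))
print(strict, lenient)
```

With no tolerance the message is flagged as coming from the future; with a two-second tolerance it is accepted. Whether a given system is strict or lenient is exactly what we cannot assume.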
I am recommending that
we survive this episode using Google’s smear method.
Here’s the science bit
Usually when a leap
second is almost due, NTP says a server must indicate this to its
clients by setting the “Leap Indicator” (LI) field in its response. This
indicates that the last minute of that day will have 61 seconds, or 59 seconds.
(Leap seconds can, in theory, be used to shorten a day too, although that
hasn’t happened to date, since the earth's rotation has been slowing overall.)
Rather than doing this, Google applied a patch to their NTP server
software on their internal Stratum 2 NTP servers to not set LI, and tell a
small “lie” about the time, modulating this “lie” over a time window w before
midnight.
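As I understand Google's published description, the “lie” followed a cosine-shaped ramp, lie(t) = (1 - cos(pi * t / w)) / 2, where t is the time elapsed inside the window. The Python sketch below is my own rendering of that formula; the 20-hour window is an assumed value for illustration, not a figure I can confirm for this year's event.

```python
import math

def lie(t, w):
    """Fraction of the extra second already smeared in, t seconds into a window of w seconds."""
    if t <= 0:
        return 0.0
    if t >= w:
        return 1.0
    return (1.0 - math.cos(math.pi * t / w)) / 2.0

# Assumed window length, for illustration only.
window = 20 * 3600

for hours in (0, 5, 10, 15, 20):
    print(hours, round(lie(hours * 3600, window), 4))
```

The ramp starts and ends gently, so the rate at which the clock is being nudged never changes abruptly.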
What this did was make
sure that the “lie” they were telling their servers about the time wouldn’t
trigger any undesirable behavior in the NTP clients, such as causing them to
suspect the time servers to be wrong and applying local corrections themselves.
One thing I think we
should consider is that pool.ntp.org (where our Active Directory forest gets
its time) is a round-robin of NTP servers (thousands of them), and therefore it
is hard to know whether you are going to hit a server that correctly supports the
leap flag in the NTP protocol. This flag is set 24 hours before the leap second
is due to be implemented. If your NTP clients support the leap flag (our domain
controllers do), the flag prepares them to handle the event and not freak out.
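For anyone who wants to check what a given server is actually sending, the leap flag lives in the top two bits of the first byte of an NTP packet, followed by a 3-bit version and a 3-bit mode. Here is a minimal Python sketch that decodes that byte; the sample packet is hand-built for illustration and was not captured from any real server.

```python
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "clock not synchronized",
}

def parse_first_byte(packet):
    """Split the first byte of an NTP packet into (leap indicator, version, mode)."""
    first = packet[0]
    leap = (first >> 6) & 0x3
    version = (first >> 3) & 0x7
    mode = first & 0x7
    return leap, version, mode

# Hand-built 48-byte packet: LI=1, version 4, mode 4 (server). Remaining bytes are zero.
sample = bytes([(1 << 6) | (4 << 3) | 4]) + bytes(47)

leap, version, mode = parse_first_byte(sample)
print(leap, version, mode, LEAP_MEANINGS[leap])
```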
So the possible worst-case
scenario is that we receive time from an NTP server that does not pass us
the leap flag and, even though we support the flag, we freak out because we
never saw the flag and didn't expect the change. Also, even if Microsoft
Windows handles the unexpected change, if our domain services do not receive the
flag, then we won't pass it on to downstream UNIX systems that might not handle
the change. The implications of this would ripple through the enterprise to
all workstations, servers and devices that look to the domain for time. That would
be a shame since the incident would be caused by systems beyond our control.
My advice (after
testing) is that we shift our time source from pool.ntp.org to Google's time
servers (there are 4 of them) and allow them to take us through the 48 hours surrounding
the leap second using their smear, which will adjust the time 43.2 million
times over 24 hours (23 nanoseconds per adjustment). We regularly adjust time more
aggressively than that. After this period, once the world has settled down, we
can switch back to pool.ntp.org.
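The arithmetic behind those figures is easy to check. This short Python sketch assumes the one extra second is split into equally sized steps, which is my simplification; the variable names are mine.

```python
adjustments = 43_200_000          # number of adjustments quoted above
window_seconds = 24 * 60 * 60     # 24 hours expressed in seconds

# One extra second (1e9 nanoseconds) spread evenly across every adjustment.
step_ns = 1e9 / adjustments

# How often an adjustment would happen during the window.
adjustments_per_second = adjustments / window_seconds

print(round(step_ns, 2), adjustments_per_second)
```

That works out to a little over 23 nanoseconds per step, applied 500 times every second.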
I feel this will protect
our domain time services. It then comes down to all those outlying systems that
do not get time from the domain.
Assuming we get through
this without incident, this will be a great win for the department, especially
if other systems across the world choke.
I will update this blog
as this situation develops.
Update #1 4/27
Cisco have advised that the majority of their appliances may have issues, including our core routers, Global Site Selector hardware and UCS blades. They have advised that time sources should be removed prior to the change and put back afterwards. We are putting a team together to ensure all systems are pointing to domain time (like they should be). That way, if we use the Google smear, we should be OK. The team will include engineers representing clinical engineering, telecom, networks, etc.
Update #2 5/14
I probably should have been clearer: this event is at midnight UTC (also known as GMT), so be sure to calculate this out for yourself. I am in Mountain Daylight Time right now, so for me the extra second will be inserted right after 17:59:59 on June 30th.
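If you would rather not do the time zone math in your head, here is a minimal Python sketch of the conversion. It assumes Mountain Daylight Time is a fixed UTC-6 offset and uses the last ordinary second of the day (23:59:59 UTC), because Python's datetime cannot represent a second numbered 60.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Mountain Daylight Time modelled as a fixed UTC-6 offset.
MDT = timezone(timedelta(hours=-6), "MDT")

# Last ordinary second before the leap second is inserted.
last_normal_second_utc = datetime(2015, 6, 30, 23, 59, 59, tzinfo=timezone.utc)

local = last_normal_second_utc.astimezone(MDT)
print(local.isoformat())  # 2015-06-30T17:59:59-06:00
```

Swap in your own offset to find the equivalent moment where you are.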