For most companies (ours included), IPv6 is not something we have embraced. It's hard to ask a network team that has spent decades becoming experts in TCP/IP to relearn a whole new technology and implement it. It's just not high on our list (I digress), so...
Perfect Storm Component (1): We don't even route IPv6 on our network.
Our Windows 2008 servers all have IPv6 installed and bound to the NICs (the default configuration). Microsoft say this is best practice since they have declined to retro test any of the current or future technologies against a platform that has IPv4 enabled and IPv6 disabled.
Perfect Storm Component (2): We have IPv6 enabled on all our servers.
When an IPv6 stack comes online it sends out a router solicitation (strictly a multicast, since IPv6 has no broadcast), and typically in our environment that falls on deaf ears, so the only IPv6 interface that is online is a loopback interface. Enter UAG. Since the UAG product uses IPv6, it needs a way of letting IPv6 clients on the Internet talk to IPv4 servers. It does this by installing an ISATAP router, which effectively carries IPv6 traffic across an IPv4 network by gluing IPv4 headers onto IPv6 packets. The moment that comes online, the IPv6 solicitations are answered by ISATAP and it starts handing out addresses (only loosely analogous to DHCP). Two things then start to occur: (a) all our servers begin registering AAAA records in DNS, and (b) the nodes of our production clusters start to perform their health checks over IPv6.
Perfect Storm Component (3): Servers, including clusters, start using IPv6 for DNS and cluster health checks.
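To make the "handing out addresses" part a little more concrete, here is a minimal sketch of how an ISATAP address is put together: the router advertises a /64 prefix, and the host fills in the last 64 bits with the marker 0:5efe (or 200:5efe for a public IPv4 address) followed by its own IPv4 address. The prefix and IPv4 addresses used here are made up for illustration and are not from our network, and using `is_private` to pick the marker is a simplification on my part.

```python
import ipaddress
from typing import Optional

# Upper 32 bits of the ISATAP interface identifier (RFC 5214).
ISATAP_PRIVATE_MARKER = 0x00005EFE  # used with private IPv4 addresses
ISATAP_GLOBAL_MARKER = 0x02005EFE   # same marker with the "universal" bit set


def isatap_address(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with a host's IPv4 address."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("ISATAP expects a /64 prefix")
    v4 = ipaddress.IPv4Address(ipv4)
    # Simplification: treat anything Python calls "private" as non-global.
    marker = ISATAP_PRIVATE_MARKER if v4.is_private else ISATAP_GLOBAL_MARKER
    interface_id = (marker << 32) | int(v4)
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


def embedded_ipv4(addr: str) -> Optional[ipaddress.IPv4Address]:
    """Return the IPv4 address inside an ISATAP address, or None."""
    value = int(ipaddress.IPv6Address(addr))
    # Look at bits 32..63 and ignore the universal/local bit.
    if (value >> 32) & 0xFDFFFFFF != ISATAP_PRIVATE_MARKER:
        return None
    return ipaddress.IPv4Address(value & 0xFFFFFFFF)


if __name__ == "__main__":
    # Hypothetical documentation prefix and server address.
    print(isatap_address("2001:db8:0:1::/64", "10.1.2.3"))
```

The practical upshot is that an AAAA record containing 5efe in the sixth group is a strong hint that the address came from ISATAP, and `embedded_ipv4` will tell you which IPv4 host it belongs to.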
At this point I am pissed off, or at least I would have been if I had known what was going on. The introduction of the UAG product and consequently ISATAP had made a change to our entire enterprise at a profound level and all without our knowledge.
And then it happens...
Perfect Storm Component (4): The UAG product breaks.
Now the time bomb has started to tick. With UAG down, servers can no longer communicate via IPv6, so they are unable to renew their IP addresses in DNS. The countdown to their DNS scavenge time has begun, and 7 days later...
Now for many companies, an erroneous cluster failover is not a big deal, but suppose, just suppose, the cluster is responsible for a medical imaging system in a hospital, and suppose the cluster failover of an SQL database caused the application to crash, and suppose a doctor was treating a patient having a stroke, and suppose he needs to look at the MRI of the patient's brain, and he can't. Nuff said.