Time travel and Active Directory replication - a tale from 2012

When your Active Directory Domain Controllers experience time travel, replication goes up the creek. This is a tale of something that happened back in 2012.

Despite being an entirely man-made concept, time is an incredibly important thing.  Having an accurate time source is crucial for authentication, for matching logs to real events and even for real-life things like meeting a friend for coffee.  Unfortunately, back in 2012 a bug caused several computer systems around the globe to experience time travel.

I was working for a local IT support company when a ticket landed from one of the schools I'd installed a VMware environment for some months earlier.  The reported fault was that none of their users could log in, with some computers complaining about clock skew.  Sadly the on-site IT staff hadn't been able to log in to any of the workstations to determine what was going on.

I managed to remote in to their systems (boy, was I glad not all their authentication was AD integrated!) and connected to one of their virtual domain controllers (DCs) using the VMware console.  This was a Windows Server 2003 DC; it was showing the correct time and date and seemed to be running fine.  The next step was to determine whether there was a problem with replication, so I ran repadmin /replsum and, to my horror, the DC had last replicated with its partners back in the year 2000 - over 12 years ago.

Now, bear in mind this was a Windows Server 2003 DC, and Wikipedia tells us Windows Server 2003 was released on 24 April 2003.  It didn't exist back in the year 2000, so clearly something was wrong.  I called the client, who confirmed they had previously had Windows 2000 Server DCs but they were long gone.  I explained I was still looking into it but, for some reason, replication was claiming to have last happened 12 years ago - which clearly wasn't true.

For those not familiar with Active Directory, you can't just replicate two servers that haven't replicated in a long time - there's a maximum gap, the tombstone lifetime, and it certainly doesn't stretch to 12 years (the default, from memory, is about 60 days).  Attempting to replicate gave me an error I really didn't want to see:

It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

Some of you will remember the fear around the Y2K bug, and the suggestion that all the data created "in the future" (after the clocks thought it was 1900) might get deleted.  In the same way, I was trying to work out what would happen to the AD objects created in the last 12 years.  At best I might be able to get enough of the environment operational to restore from backup, but that really didn't sound like fun and I was trying to avoid invoking full disaster recovery for this school.

I started to do the maths for the number of jumps I'd need to make, in 59-day intervals, and how I was going to do that with three domain controllers at a time.  Firstly I'd have to send them all back in time 12 years (again), replicate, then start moving them forwards.  Part way through planning that (and discussions with similarly mystified senior engineers - I was largely left to myself on this one) I received an email from the customer.  A big, shouty (CAPITALS) email in red text.  He was asking why he hadn't heard anything, shouting that the whole school was dead in the water from an ICT perspective and that it was affecting student learning, deadlines, etc.  All very true, but it was not as if I wasn't working on it.  I decided this needed a phone call, which went something like this:

Hi Pete? It's Jonathan. Sorry I've not been in touch, but I've been trying to fix the problem, that and working out how to tell you "your domain is dead". By the way, there was no need for the red capital letters.

I didn't raise my voice, didn't need to.  Both points hit home and the reply was along the lines of a flustered "ah, w...w...what can we do?".
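As an aside, the jump maths mentioned above is easy to sketch.  The snippet below is a minimal illustration in Python, not what was actually run on the day.  Both dates are assumptions picked to represent a 12-year gap (the exact dates aren't recorded in this post), and plan_jumps is a hypothetical helper name.

```python
from datetime import date, timedelta

# Assumed values for illustration only; the real dates are not recorded here.
WRONG_DATE = date(2000, 11, 19)  # hypothetical "last replicated" date
REAL_DATE = date(2012, 11, 19)   # hypothetical real date of the incident
JUMP_DAYS = 59                   # stay just under the roughly 60-day limit


def plan_jumps(start, end, step_days):
    """Return the clock dates to step through so no gap exceeds step_days."""
    stops = []
    current = start
    while current < end:
        # Move forward by one step, but never past the real date.
        current = min(current + timedelta(days=step_days), end)
        stops.append(current)
    return stops


jumps = plan_jumps(WRONG_DATE, REAL_DATE, JUMP_DAYS)
print(f"{(REAL_DATE - WRONG_DATE).days} days to cover in {len(jumps)} jumps")
```

With those assumed dates that is 4383 days, or 75 clock changes on every domain controller, with a replication between each one.  Not a plan anyone would look forward to.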

The company was a Microsoft Gold Partner at the time, so it got a handful of free Microsoft Product Support Services (PSS) incidents a year.  I let "Pete" (not his real name) know I was off to get permission to ask PSS for help and he left me to it.  A brief discussion with management later and I had the green light to call PSS.

It was around the time of that conversation that a colleague heard me mention "time travel" and "year 2000" - not unusual things for our office (plenty of sci-fi geeks) but, importantly, he directed me to a story on Slashdot ("news for nerds, stuff that matters" was its tagline back then).  That, in turn, led to the Internet Storm Center's post on the issue:

A few people have written in within the past 18 hours about their NTP server/clients getting set to the year 2000. The cause of this behavior is that an NTP server at the US Naval Observatory (pretty much the authoritative time source in the US) was rebooted and somehow reverted to the year 2000. This, then, propogated out for a limited time and downstream time sources also got this value.

Given this school was in Kent, England, I wouldn't expect them to be looking at the US Naval Observatory but, you guessed it, they were.

Microsoft PSS called me back and I spoke to a chap who had clearly already fixed this several times that day.  There's a key in the registry that you can change to allow replication with outdated partners and that was to be our fix.  After applying that registry key to all the DCs we could force a replication and, hey presto!  Replication showed as last happening seconds ago, rather than 12 years.  I made sure he knew I was very grateful.

A call to the school, after a change to their time server configuration, and they were operational again.  "Pete" was very glad.  I did a fair amount of work for that school in the end, including completing an Active Directory domain rename process someone at the school had started three years previously but not finalised - incomplete renames leave very interesting warnings in the logs.

Regrettably I can't find my notes from that day, even though I made sure I recorded the details of that registry key.  If you're having replication issues I suggest you start with this article.

Banner image, "Push Back Time", from OpenClipart.org, by