Saturday 23 March 2013

Murphy Strikes

For those not familiar with Murphy's Law, it states "if anything can go wrong, it will".  Nowhere is this more true than with IT!

One day a week (mostly) I assist in the archives of a local museum.  In late 2010, we spent time migrating the archive database to ModesXML - lots of planning and test runs, finally completing the task in early 2011.  Ideally we should have dedicated a server to the database; its resource demands will grow as data and user access grow.  But money is tight, so we took the opportunity to put the database onto a lightly-used PC - which also happened to be the newest so it was quite well specified.  Worth a mention here is that it runs 32-bit Windows 7 for legacy reasons.

All was well after the initial migration.  Subsequent customisation took place, along with creation of templates, lists of object terms, many new entries, and so on.  It was gratifying to see the system settle in and mould to our needs.

During the early weeks of 2012, some odd behaviour had been noted on the backup drive - a Samsung 1.5TB USB Story Station.  Some days it would fail to start, requiring manual switch off, a pause and then on again.  Then it started throwing errors during backups.  Running CHKDSK would usually fix it for a few days, then it reverted.  This went on for many weeks as investigation was hampered by more serious issues: broadband internet faults, viruses in other PCs on site - oh, and an incapacitating leg injury!



When the hunt finally resumed, things had worsened:
  • the power supply for the backup drive was faulty
  • the C drive was also throwing errors
  • many and various errors were recorded in the event log
  • the system was randomly restarting.
Murphy had struck with a vengeance!  But he wasn't yet done...

A new power supply was ordered, but failed the second time it was switched on.  Another was procured - success!  The Samsung drive was recovered using CHKDSK /R on another machine, then soak-tested until it was given a clean bill of health.  Miraculously, almost all the backup files had survived.

By now it was summer 2012!  The system was seriously unstable: weird errors, random restarts, the occasional BSOD, and key Windows processes not running.  Was this a virus?  Had the backup drive failure corrupted something else?  Hardware?

So tests were run: disk scans, virus scans, lots of diagnostics - all with the system still in service, as the database was needed all the time.  But then we bit the bullet and took the box off line to run a full diagnostic test on the hardware.

In short order, errors were detected in one memory bank - maybe obvious in hindsight - so the offending stick was removed, leaving Windows 7 to run (more slowly) on just 1GB.  New memory was ordered, but when installed, it didn't work at all!  The PC wouldn't boot, just 3 beeps from the BIOS POST instead.  As we had no other PC that used DDR3 memory, we couldn't test it in a known good machine, so it was returned as dead-on-arrival.

More new memory was obtained via the PC manufacturer - this worked!  After soak tests, the system was brought back online.

Now on reflection (hindsight is wonderful), it would have been wise to repair or even format and reload Windows at this stage - which would have saved time in the long run.  But there is always a need to keep the database available so that volunteers' scarce time is not wasted.  So only essential integrity checks were performed: chkdsk; sfc, which made several repairs (missing files were copied from other machines); and the system troubleshooter, which detected print spooler problems.  A missing UMBus was also reinstated.

So by mid-October, the system was again stable.  But a niggling thought remained: what other non-system files (applications, the database itself, backups, other data) might have been corrupted?  There was no easy way to find out.  And Murphy was still waiting...

The next three weeks were largely uneventful, mostly catching up with routine data entry and creating new templates.  An LCD monitor on another PC decided to 'go green' - literally!  A fairly recent acquisition by way of donation, it had given good service hooked up to an old Windows XP PC (also donated).  But the display gradually went greener and greener until users started feeling unwell.  Sadly it was beyond repair, so it went to be recycled.  A kind volunteer has lent a replacement.

Almost by chance, another oddity was noted when checking file permissions on the backup drive.  Naturally enough, previous backups included user files along with their folders and data.  Each user folder contains the hidden folder \AppData\, where applications hold their working data.  But unexpectedly, each of these folders modified before 2011 also had its own nested \AppData\, and so on, many, many levels deep.  Navigating right down these structures would eventually cause an error pop-up to be displayed, and notably an error event to be logged.
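For anyone wanting to check their own backups for the same symptom, here is a minimal sketch in Python (not something we used at the time) that reports how many times a folder name repeats along any one path.  The function name max_repeat_depth is made up for illustration, and on a badly damaged drive the walk may simply skip folders it cannot open.

```python
import os

def max_repeat_depth(root, name="AppData"):
    """Return the largest number of times `name` occurs as a folder
    in any single directory path below `root` (hypothetical helper)."""
    deepest = 0
    wanted = name.lower()
    # os.walk silently skips directories it cannot list, so the result
    # is a lower bound if parts of the tree are unreadable.
    for dirpath, _dirnames, _filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        count = sum(1 for part in parts if part.lower() == wanted)
        deepest = max(deepest, count)
    return deepest
```

A healthy user folder should give 1; anything much higher suggests the nesting described above.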

Before 2011, the backup drive had been used on a Windows Vista PC, and various backups had been driven by command scripts, mostly running robocopy.  After a script had run a few times it would throw errors, but nobody had discovered why.  For some reason, the old backup scripts had been copying files one level deeper each time, until the file system complained about the massive nested folder depth and threw an error.  Mystery solved!

Except that all these deeply recursive folders had probably compromised the file system, so they had to be fixed; attempts to delete them just threw another error.  However, they could be moved.  So the very repetitive procedure for each folder tree was:
  • move a folder from somewhere fairly well down the tree up to a much higher level
  • delete as much of the tree as possible before file permissions prevented further progress
  • take ownership of and/or modify permissions to obtain full control of the offending folder and/or file
  • delete remaining folders and files in the tree.
Plodding through this lot took quite a few weeks.  But after re-running chkdsk, the incidence of errors from the backup drive was certainly reduced.
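With hindsight, the move-and-delete part of that routine could probably have been scripted.  The following is only a sketch in Python, assuming the trouble is purely one of depth: it lifts a folder from a few levels down up beside the top of the tree, deals with the lifted piece first, then deletes what is left.  The names remove_deep_tree, find_dir_at_depth and the "_lifted_N" staging folders are invented for illustration, and the ownership and permissions step is not covered at all (on Windows that would still need tools such as takeown or icacls).

```python
import os
import shutil

def find_dir_at_depth(root, depth):
    """Return some directory exactly `depth` levels below root, or None."""
    if depth == 0:
        return root
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                found = find_dir_at_depth(entry.path, depth - 1)
                if found is not None:
                    return found
    return None

def remove_deep_tree(top, chunk=5):
    """Delete `top`, never working more than `chunk` levels down at once."""
    top = os.path.abspath(top)
    staging = os.path.dirname(top)   # lifted folders are parked beside `top`
    pending = [top]
    serial = 0
    while pending:
        current = pending[-1]
        deep = find_dir_at_depth(current, chunk)
        if deep is None:
            # Nothing is `chunk` levels down any more: shallow enough to delete.
            shutil.rmtree(current)
            pending.pop()
        else:
            # Lift the deep folder up to a much higher level and handle it first.
            serial += 1
            lifted = os.path.join(staging, "_lifted_%d" % serial)
            shutil.move(deep, lifted)
            pending.append(lifted)
```

Something like remove_deep_tree(r"E:\Backups\olduser", chunk=5) would be the idea, with that path being made up; try it on a scratch copy first, since it deletes without asking.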

And just when it seemed safe, the archive office flooded!  Extremely heavy rainfall over Christmas 2012 caused a backup through the floor.  The database system had been installed under a desk - but escaped damage by a whisker with water only lapping at its base.  The system is now a few feet higher.

So what was our takeaway from all this?  Never ignore unusual events; be watchful for early clues about impending malfunctions; multiple failures do happen; there are no short-cuts when it comes to protecting data.

Clearly Murphy is omnipresent, pan-dimensional and multi-threaded...
