OT - backup supplements
Fairlight
fairlite at fairlite.com
Sat Feb 26 15:34:00 PST 2005
On Sat, Feb 26, 2005 at 05:18:46PM -0500, Leefp1 at aol.com may or may not have
proven themselves an utter git by pronouncing:
> Just curious...
>
> Do others find that the need for a backup is rarely the result of hardware
> failure, but rather user error or misunderstanding and/or (dare I say it)
> programmer failure?
99.9% hardware failure, 0.1% administrative error. I've had a few
incidents (years ago) back when I had the bad habit of typing 'ls' over
and over while pondering a problem, and before I knew it, I was looking
at an already-entered `rm -rf *` in the development tree's PFDIR. I broke
out of it very quickly but still needed backups. Another time I just typed
too fast and got a space before the * in a glob used for a forced remove.
The rest of the time it's been hardware failure or migration assistance
when there was no networking in place.
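For anyone who hasn't been bitten by that second one, a single stray
space turns a scoped glob into an unbounded one (filenames here are
hypothetical, just to illustrate):

    $ ls
    project.dat  scratch  scratch.old
    $ rm -rf scratch*     # removes scratch and scratch.old -- what was meant
    $ rm -rf scratch *    # removes scratch, then * matches EVERYTHING left

One character of whitespace is the difference between cleanup and carnage.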
> My observation is that by far the majority of the time I have needed to
> restore something from a backup it was because the user irretrievably
> messed something up; or I programmed badly and therefore inadvertently
> allowed a bad, unintended consequence to result.
I bullet-proof my code before any user even sees it. I'll try every
possible combination of data entry (or transmission, in my
behind-the-scenes software) to make sure all errors are handled gracefully.
I've had a few instances in the past where someone could do something
slightly odd, but nothing that took more than 10-15 minutes to re-code
and 20 seconds to fix the data.
The trick is to stress-test your software before deployment. Imagine every
possible permutation of what the user can do, and make sure your logic
flow can handle them all gracefully. I stress permutation here because you
can have one error-handling routine for condition 'x' not being met and
another for condition 'y' not being met, but if they're not written
correctly, or if you put in the wrong logical "glue" tying your error
handling together, you can trigger a failure you never anticipated. That's
why, on a 10-field form where, say, six fields are required, I'll go
through every permutation of correct and incorrect data across all the
fields--not just test each one individually. I'm not going to calculate
offhand how many tests that creates, but I do perform them all. I'm not
satisfied that the code is robust until I have--and until then I won't
deploy it.
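For a rough count: ten fields, each fed either valid or invalid data, is
2^10 = 1024 combinations. A trivial wrapper can grind through them
mechanically. Here's a minimal sketch in plain sh, where 'testentry'
stands in for a hypothetical driver that pushes the ten values through
your edits:

    #!/bin/sh
    # Enumerate every valid/invalid combination of N fields (2^N cases).
    N=10
    total=$((1 << N))
    i=0
    while [ "$i" -lt "$total" ]; do
        args=""
        j=0
        while [ "$j" -lt "$N" ]; do
            # Bit j of i decides whether field j gets valid or invalid data.
            if [ $(( (i >> j) & 1 )) -eq 1 ]; then
                args="$args valid"
            else
                args="$args invalid"
            fi
            j=$((j + 1))
        done
        ./testentry $args || echo "case $i failed"
        i=$((i + 1))
    done

The point isn't the script; it's that "every permutation" is a finite,
enumerable list, so there's no excuse for skipping any of it.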
> Further, would you not agree that restoring from backup media is tedious
> and sometimes frustrating, especially when you don't have control of the
> media, having left that to the user?
I think restoring from backup being tedious (and time-consuming) is not
necessarily a bad thing. It gives one enough time to pause and reflect
on one's errors. Consider it a window of opportunity to re-evaluate what
you're doing and how you're doing it. By the time I'm done kicking
myself--quite firmly, and repeatedly--I can be fairly sure I will never
make the same mistake twice, and I've had time to savour the learning
experience.
It also gives me time to take some TUMS, Pepcid, and have a cup of tea. :)
If it were fast and easy, I daresay developers would be even sloppier than
they already are, would stop being as careful about QA, and would rely on
backups to "handle" things--which is entirely the wrong way to do it. If
the technology were there, I know a good number of people I can be pretty
sure would just say, "oh well...we have a backup, no biggie." It -is- a
biggie -any- time you corrupt data.

Personally, I'm glad data restoration is never 1) 100% complete in a busy
environment, or 2) fast and easy. 100% restoration capability would be
the holy grail of any backup scenario, but as soon as you hit it, someone
lazy will start leaning on it instead of writing solid code. You want it
there as a safety net, not as a matter of course. It should be an
insurance policy, not an integral element of operations.
> So... I have for some time now been adding another phase to my backup
> routines, namely copying filepro key files and other data files to
> other names in other directories on the hard drive. This allows me a
[snip]
> Any comments?
Yeah. It still leaves you with a single point of failure. I'd rather
use rsync to another system entirely, so the copy is isolated from the
production machine. If the HD dives, the copy survives and is still
convenient to restore from.
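Something along these lines in a crontab would do it (the host and paths
here are hypothetical):

    # Nightly at 2am: mirror the filePro tree to another box.
    # -a preserves ownership/permissions/times, -z compresses in transit,
    # --delete keeps the mirror from accumulating stale files.
    0 2 * * * rsync -az --delete /appl/fp/ backupbox:/backups/fp/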
Or, if someone finds out how you keep backups, I can all-too-easily imagine
someone not even bothering to tell you that there's a problem with the
main database--they might just reset their variables and start using
the redundant files as live files "because it 'works'", not realising
what a disservice they're doing themselves. I can also think of about
5 developers in this forum alone who don't credit their users with
enough knowledge to pull that off. Never underestimate the knowledge,
resourcefulness, desperation, or ignorance of your users; they generally
possess all four in near-equal measure. All it takes is one person in an
organisation with only -half- a clue to bring the entire house of cards
crashing down because they thought they had the whole clue.
I'd personally want it on at least a different machine for both reasons.
It's obviously no replacement for offsite, protected backups, but if one
is going to make any effort, one may as well make it count for as much as
possible.
mark->
--
Bring the web-enabling power of OneGate to -your- filePro applications today!
Try the live filePro-based, OneGate-enabled demo at the following URL:
http://www2.onnik.com/~fairlite/flfssindex.html