Blobs (was Re: Windows filepro splash screen)

Fairlight fairlite at fairlite.com
Thu Mar 13 10:57:11 PDT 2008


Y'all catch dis heeyah?  Kenneth Brody been jivin' 'bout like:
> Quoting Fairlight (Wed, 12 Mar 2008 20:55:12 -0400):
> [...]
> > Apparently at least a few customers know of some issues.  Maybe I should
> > get the specifics from my client and make them public.  That would be
> > rather embarrassing, especially if they're accurate and legitimate issues
> > that haven't been addressed.
> 
> Unless, of course, they were never reported, as appears to be quite common.

Point taken.

> Not quite.  The routines used by blobfix are loaded with "sanity checks" at
> every step along the way.  If an "insane" value appears, such as a negative
> offset, or a block number past EOF, it stops reading that particular object.

The question is, how did the code let those insane values get there in the
first place, and why?  A fix program is a bandaid, not a vaccine.
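The kind of per-object sanity checking Kenneth describes might look something like this; a minimal sketch in Python, with a made-up blob layout, since filePro's real on-disk format isn't public:

```python
def read_blob(data: bytes, offset: int, length: int, file_size: int):
    """Return a blob's payload, or None if an 'insane' value is seen.

    Hypothetical layout: each object is addressed by a byte offset and
    length within the blob file.  None of this reflects filePro's actual
    internals -- it just illustrates the checks blobfix reportedly makes.
    """
    # Sanity check: a negative offset or length can never be valid.
    if offset < 0 or length < 0:
        return None
    # Sanity check: the object must lie entirely before EOF.
    if offset + length > file_size:
        return None
    return data[offset:offset + length]
```

Stopping at the first insane value, rather than crashing, is exactly the SEGV-avoidance benefit mentioned below.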

> Could the routines used by filePro do the same thing?  Yes, of course.  But,
> that would obviously slow things down, as you mentioned.  Also, what should
> filePro do at runtime if it came across such corruption?  Finally, note that
> the only likely benefit the user would see is that filePro wouldn't crash
> with a SEGV or the like should it hit a corrupt BLOB.

The question of what filePro should do at runtime is obviously a good point.

> > Now...if there
> > are -no- known issues, why is there a blobfix program at all?
> 
> Why is there an fsck?

Several reasons that lead me to draw a distinction between the two:

1) Hardware failure.  I suppose it could be argued that blobfix can
recover a blob file after a hardware failure, but my point is that in
every case I've ever seen, the hardware and filesystem have checked out
fine, and only the file has been corrupt.

2) It's an open system.  Other programs could, in theory, modify the
filesystem incorrectly, given sufficient permissions.  filePro is pretty
much a closed system: if only the fP binaries ever touch the data, and the
filesystem and hardware all check out fine, the corruption can probably
safely be assumed to originate within fP.  Again, we're back to bandaid
vs vaccine.

> > "Fix" implies "issues" in my book, and probably most people's.
> 
> I guess it also depends on your definition of "known issue".  If the issue
> is "they have 50 people banging away at it 12 hours a day, 6 days a week,
> and after a couple of weeks they notice corruption in the memos", I suppose
> you could say there are "known issues".  Of course, the issue is probably
> best described as "some people, with some files, doing some sequence of
> events, will, after some time, have a problem somewhere in the blob file".
> The problem is in getting further details.

If there's a known sequence of events that would cause corruption, it
should be made available as a "known limitations" note somewhere.  If not,
the problem(s) should be squashed.  Heavy use is really not a good excuse
for data corruption, IMHO.

The problem -is- definitely in getting further details.  Nobody's sitting
there watching 20+ operators keying data and seeing exactly what is done
until a corruption is encountered.  They hit a corruption and then there's
next to no way to discern if it's caused by some sort of data-driven bug.
It's like doing a post-mortem where you have a few cells on a slide to go
by instead of a full body.  Until there's a way to enable a "debug mode"
that logs operations, even taking a performance hit, there's always going
to be a problem in getting additional and accurate information.
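A "debug mode" like that could be as simple as an append-only operation log that can be replayed after a corruption turns up.  A hypothetical sketch (none of these names are real filePro facilities):

```python
import time

class OpLog:
    """Append-only log of blob operations, so a later corruption can be
    traced back to the sequence of writes that preceded it."""

    def __init__(self):
        self.entries = []

    def record(self, op: str, record_id: int, detail: str = ""):
        # Each entry carries a timestamp so an operator session can be
        # reconstructed in order, even across 20+ concurrent users.
        self.entries.append((time.time(), op, record_id, detail))

    def replay(self):
        # Yield the operations in the order they happened.
        for ts, op, record_id, detail in self.entries:
            yield f"{op} rec={record_id} {detail}".strip()
```

The performance hit is the appends themselves, which is the trade-off acknowledged above; the payoff is a full body for the post-mortem instead of a few cells on a slide.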

mark->
-- 
"Moral cowardice will surely be written as the cause on the death
certificate of what used to be Western Civilization." --James P. Hogan

