OT: file systems was Re: Can filepro do drill downs like acess?

Fri Mar 26 14:28:09 PST 2004

You can just hit N now if you don't want to read further.

On Fri, Mar 26, 2004 at 12:54:19PM -0500, Jay R. Ashworth thus spoke:
> On Fri, Mar 26, 2004 at 10:19:48AM -0500, Bill Vermillion wrote:
> > That's one reason they were developed. When these were almost
> > standard for serious operations - sales for example - drives were
> > not that reliable and hardware wasn't that stable - even the best
> > of the HW.  These were all on Unix systems BTW.  Hard drives were
> > about $4/megabyte if you shopped, or if vendor supplied were
> > in the $8/megabyte range. [A far cry from the $1/gigabyte
> > we see today for server-strength IDE drives, or 50 cents or less for
> > consumer/desktop drives.]  The high-end high-capacity SCSI drives are in
> > the $6+ per GB range.

> Seagate's Cheetah-164 is about $800 right now. That is,
> actually, $.50 a gig.

I saw your later correction; my figure was based on the $1200 price
I found when researching a 16k Cheetah with a Fibre Channel interface.

> > So on boot a snapshot is made of the drive in its current state.
> > Multiple parallel fscks are started in the background and the
> > system is brought up for use.  Everything not yet fsck'ed in the
> > background is essentially read-only at this moment.

> > But if a user program needs access the blocks needed are
> > essentially fsck'ed at that point.  IOW the fsck is prioritized to
> > fsck blocks as needed. It will update the snapshot made at boot
> > so these are now excluded from the background fsck.

> > In tests an fsck which might take 15-30 minutes if done at boot -
> > even with parallel fscks running - when run in the background it
> > could take 5+ hours.  But the system is available immediately for
> > use.

> Actually, most of that is being done with journalling
> filesystems, these days, which are effectively the application
> of transactions to filesystem updates: the filesystem
> structure's sanity is preferred at the expense of the data, in
> the event of a crash.

Actually, the way I understand it, journalling is going to protect
the data more, but at the expense of a coherent file system in case
of a crash.  And to be sure of stability you need to run a
journalling system in synchronous mode, which then sacrifices
performance.

This is why McKusick [one of the principal architects of BSD for
years and one of the authors of the 4.2, 4.3, and 4.4 series of
books explaining BSD before it became unencumbered] developed
softupdates.  It sets file system integrity as its highest
priority and does meta-data updates in a synchronous mode,
carefully planned.  There is a paper of about 40 pages on that at
one of the university sites.

Then, after the meta-data is written, the rest of the file system is
handled in lazy mode.  It's interesting to perform an rm -r on a
source tree of about 350 MB and have the prompt return in a
matter of seconds, but then watch the blocks free up slowly -
sometimes taking as long as 2 minutes.   Freeing blocks is not a
high priority but ensuring the FS is stable is.

This is part of the heart of the snapshot mode which runs fsck in
the BG:

A highly edited excerpt from one of the papers:
---------------------------

Running "fsck" in the Background

   Marshall Kirk McKusick, Author and Consultant

Abstract

   Traditionally, recovery of a BSD fast filesystem after an uncontrolled
   system crash such as a power failure or a system panic required the
   use of the filesystem checking program, "fsck". Because the filesystem
   cannot be used while it is being checked by "fsck", a large server may
   experience unacceptably long periods of unavailability after a crash.

....

2. Creating a Filesystem Snapshot

   A filesystem snapshot is a frozen image of a filesystem at a given
   instant in time. Implementing snapshots in the BSD fast filesystem has
   proven to be straightforward. Taking a snapshot entails the following
   steps:

    1. A snapshot file is created to track later changes to the
       filesystem; a snapshot file is shown in Fig. 1. This snapshot file
       is initialized to the size of the filesystem's partition, and its
       file block pointers are marked as zero which means "not copied."
       ....

    2. A preliminary pass is made over each of the cylinder groups to
       copy it to its preallocated backing block.  ....

    3. The filesystem is marked as "wanting to suspend." In this state,
       processes that wish to invoke system calls that will modify the
       filesystem are blocked from running, while processes that are
       already in progress on such system calls are permitted to finish
       them. These actions are enforced by inserting a gate at the top of
       every system call that can write to a filesystem. The set of gated
       system calls includes "write", "open" (when creating or
       truncating), "fhopen" (when creating or truncating), "mknod",
       "mkfifo", "link", "symlink", "unlink", "chflags", "fchflags",
       "chmod", "lchmod", "fchmod", "chown", "lchown", "fchown",
       "utimes", "lutimes", "futimes", "truncate", "ftruncate", "rename",
       "mkdir", "rmdir", "fsync", "sync", "unmount", "undelete",
       "quotactl", "revoke", and "extattrctl". In addition gates must be
       added to "pageout", "ktrace", local domain socket creation, and
       core dump creation. The gate tracks activity within a system call
       for each mounted filesystem. A gate has two purposes. The first is
       to suspend processes that want to enter the gated system call
       during periods that the filesystem that the process wants to
       modify is suspended. The second is to keep track of the number of
       processes that are running inside the gated system call for each
       mounted filesystem. When a process enters a gated system call, a
       counter in the mount structure for the filesystem that it wants to
       modify is incremented. When the process exits a gated system call,
       the counter is decremented.

    4. The filesystem's status is changed from "wanting to suspend" to
       "fully suspended." ....

    5. The filesystem is synchronized to disk as if it were about to be
       unmounted.

    6. Any cylinder groups that were modified after they were copied in
       step two are recopied to their preallocated backing block.

       ....

    7. With the snapshot file in place, activity on the filesystem
       resumes. Any processes that were blocked at a gate are awakened
       and allowed to proceed with their system call.

    8. Blocks that had been claimed by any snapshots that existed at the
       time that the current snapshot was taken are expunged from the new
       snapshot for reasons described below.

   During steps three through six, all write activity on the filesystem
   is suspended. Steps three and four complete in at most a few
   milliseconds. The time for step five is a function of the number of
   dirty pages in the kernel. It is bounded by the amount of memory that
   is dedicated to storing file pages. It is typically less than a second
   and is independent of the size of the filesystem. Typically step six
   needs to recopy only a few cylinder groups, so it also completes in
   less than a second.

....

5. Implementation of Background "fsck"

   Background "fsck" runs by taking a snapshot then running its
   traditional algorithms over the snapshot. Because the snapshot is
   taken of a completely quiescent filesystem, all of whose dirty blocks
   have been written to disk, the snapshot appears to "fsck" to be
   exactly like an unmounted raw disk partition.  ...

     _________________________________________________________________

    This paper was originally published in the Proceedings of the BSDCon
    '02 Conference on File and Storage Technologies, February 11-14,
    2002, Cathedral Hill Hotel, San Francisco, California, USA.
    Last changed: 28 Dec. 2001 ml

-----------------------
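
The gate in step 3 is easier to follow as code.  What follows is only
a toy sketch in Python, not the kernel implementation: the class and
method names (WriteGate, enter, leave, suspend, resume) are made up
for illustration, and threads stand in for processes.  It shows the
two jobs the paper gives the gate: holding new writers at the door
while the filesystem is suspending, and counting the writers already
inside so the suspender knows when they have all finished.

```python
import threading


class WriteGate:
    """Toy model of the per-filesystem gate from step 3 (hypothetical names)."""

    def __init__(self):
        self._cond = threading.Condition()
        self.active = 0          # callers currently inside a gated call
        self.suspending = False  # True from "wanting to suspend" until resume

    def enter(self):
        # Top of a gated system call: wait while the filesystem is
        # suspending, then count ourselves as inside.
        with self._cond:
            while self.suspending:
                self._cond.wait()
            self.active += 1

    def leave(self):
        # Exit from a gated system call: decrement the counter and wake
        # anyone waiting for the count to reach zero.
        with self._cond:
            self.active -= 1
            self._cond.notify_all()

    def suspend(self):
        # Mark "wanting to suspend": new callers now block in enter().
        # Calls already in progress are allowed to finish; once the
        # counter drains to zero we are "fully suspended".
        with self._cond:
            self.suspending = True
            while self.active > 0:
                self._cond.wait()

    def resume(self):
        # Step 7: reopen the gate and wake every blocked caller.
        with self._cond:
            self.suspending = False
            self._cond.notify_all()
```

A writer would wrap its work in enter()/leave(), and the snapshot code
would call suspend(), do the sync and recopy of steps five and six,
then call resume().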

I hope you found that interesting.  I've waited a long time for
relatively small FS'es to have an fsck performed, and I'd really not
like to wait while a 250 GB file system [as an example] completes
its fsck.  And users waiting for the system to come back up don't
like it either.
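
For anyone wondering how fsck can check a frozen image while the
live filesystem keeps changing, here is a toy sketch of the snapshot
file from step 1 of the excerpt.  The excerpt only says the block
pointers start out as "not copied"; the copy-before-overwrite step in
write() is my assumption about the parts I edited out, and ToyDisk and
its methods are invented names, not anything from the FFS code.

```python
class ToyDisk:
    """Toy block device with one snapshot (hypothetical, for illustration)."""

    def __init__(self, nblocks):
        self.blocks = [b"" for _ in range(nblocks)]
        self.snapshot = None  # no snapshot taken yet

    def take_snapshot(self):
        # One pointer per block of the partition.  None plays the role
        # of the zero pointer that the paper says means "not copied".
        self.snapshot = [None] * len(self.blocks)

    def write(self, n, data):
        # Assumed behaviour: before the first overwrite of a block after
        # the snapshot, save the old contents in the snapshot.
        if self.snapshot is not None and self.snapshot[n] is None:
            self.snapshot[n] = self.blocks[n]
        self.blocks[n] = data

    def read_snapshot(self, n):
        # A block that was never copied has not changed since the
        # snapshot, so the live copy is still the frozen image.
        saved = self.snapshot[n]
        return self.blocks[n] if saved is None else saved
```

Reading through read_snapshot() always gives the contents as of the
snapshot, no matter how many times the live block is rewritten, which
is what lets the traditional fsck algorithms run over it unchanged.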

Bill
-- 
Bill Vermillion - bv @ wjv . com