OT: Unix stutter

Fairlight fairlite at fairlite.com
Fri Sep 10 07:04:18 PDT 2004


On Fri, Sep 10, 2004 at 09:10:01AM -0400, Leefp1 at aol.com, the prominent
pundit, witicized:

> Several times per day this Unix box "stops" for 5-10 seconds for no
> apparent reason.  I use the word stop as opposed to "lock-up" because it
> is different than what I have ever seen before.  When it stops, no key
> strokes are accepted from the console or any terminal.  You can't even
> "Alt-Fn" to another screen.  The system just stops.  While stopped any
> key strokes are NOT stored, i.e. when it "starts" the key strokes entered
> during the stop are NOT processed.  It starts again on its own with no
> action taken by a user.  It is not a fatal error but, obviously, very
> annoying in a busy office environment.

Very.

Sounds like the CPU is being sucked dry--literally.  The equivalent of a
load average of about 70 on an old Unisys 7000/40.  

Immediately after it recovers, what does 'uptime' say about the load average?
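
If catching it by hand right after a stall is awkward, something can record
the load average for you so the spike is there to read afterwards.  A
minimal sketch, assuming Python is installed on the box and that
os.getloadavg() works on your platform (it raises OSError where it does
not).  The function names and the log path are made up for illustration:

```python
import os
import time


def format_load_line(now, loads):
    """Return one log line: a local timestamp, then the 1/5/15-minute loads."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load %.2f %.2f %.2f" % ((stamp,) + tuple(loads))


def log_load(path, interval=5.0, samples=12):
    """Append `samples` load-average readings to `path`, `interval` seconds apart."""
    with open(path, "a") as log:
        for _ in range(samples):
            log.write(format_load_line(time.time(), os.getloadavg()) + "\n")
            log.flush()  # hand the line to the OS before the next sleep
            time.sleep(interval)
```

Calling log_load("/tmp/load.log") (a hypothetical path) gives a minute of
readings with the defaults; raise `samples` to cover a working day.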

> I asked this question a few months ago when it first started and
> did not get much response.  A few weeks ago this machine had a hard drive

Probably someone like JPR will say that's because this isn't the "correct"
forum for it.  I think I just saved him the trouble, even though I don't
feel that way personally.

> failure and I thought perhaps when I replaced the hard drive the problem
> might go away (not sure why I thought that... just hoping).  But it
> didn't, and the users are beginning to doubt my "guruness" since I can't
> fix this.

The only way I can see disk doing this is if an entire filesystem got into
a state where every process is thrown into a disk wait.  Say you were
running RAID and something really massive slowed the subsystem down until
all the buffers were full, possibly with swap living on the same hardware:
that could potentially do it.

> Recently I have been suspecting a memory or processor problem, at least
> something hardware related... my hardware builder thinks maybe a cache
> problem.  I'm stumped.

I wouldn't jump to either of those conclusions, seeing as it recovers.

It sounds like a bottleneck somewhere.  Caching could do that, but it would
be prudent to look at all potential factors, including the load on the
server.  I don't suppose this coincides with a recently added cron job or
the like?

I can't see caching just falling on its arse and then recovering in a few
seconds.  Well, I could -maybe- if the system was overheating.

> Am I not thinking about some Unix setting?  Has anyone ever experienced
> such a problem?  Any ideas where to start looking for the problem?  TIA.

The problem is not being able to look at the process table -when- it's
happening.  If one could do that, it might narrow the field a little.  If
it's a sudden spike though, uptime should show it.
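
One way around that is to let a small watcher notice the stall and grab the
process table the moment the box comes back.  A rough sketch, assuming
Python is available, that the clock keeps running during the stall (if it
doesn't, the gap will not show up), and that 'ps -ef' is the right
invocation on your Unix.  The names and thresholds are illustrative only:

```python
import subprocess
import time


def find_stalls(ticks, interval, slack):
    """Return (start, gap) for each pair of ticks more than interval + slack apart."""
    stalls = []
    for before, after in zip(ticks, ticks[1:]):
        gap = after - before
        if gap > interval + slack:
            stalls.append((before, gap))
    return stalls


def watch(interval=1.0, slack=2.0, rounds=3600):
    """Sleep in short steps; when a step overruns, print the gap and a ps snapshot."""
    last = time.time()
    for _ in range(rounds):
        time.sleep(interval)
        now = time.time()
        if find_stalls([last, now], interval, slack):
            print("stall of %.1fs ending %s" % (now - last, time.ctime(now)))
            # 'ps -ef' is the System V form; adjust for your Unix if needed.
            subprocess.call(["ps", "-ef"])
        last = now
```

Run watch() in the background with its output redirected to a file.  The
snapshot is taken just after the stall rather than during it, but whatever
caused it will often still be near the top of the list.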

What brand and type of CPU are you using?  Are you using multiple CPUs?

I can think of some conditions that could cause it, dependent on the
answers to those two questions.  Overheating with sudden throttle-down
constricting the CPU slices on top of a heavy load could do it.  Spinlocks
in SMP/MPX could do it too.

This really wants someone more hardware-minded than myself, since you're
not going to get much out of the OS proper, but we have no shortage of
those here.  Hey "Dad"? (*pokes Bill Vermillion*) Whatcha think?

mark->
-- 
Bring the web-enabling power of OneGate to -your- filePro applications today!

Try the live filePro-based, OneGate-enabled demo at the following URL:
               http://www2.onnik.com/~fairlite/flfssindex.html

