OT: Unix stutter

Sun Sep 12 06:00:38 PDT 2004

Bill Vermillion wrote:
> On Fri, Sep 10 23:29 , while denying his reply is spam, John Esak
> prattled on endlessly saying:
>
>
>>> In that case it could be just swapping and flushing cache. There
>>> have been instances where someone tunes something thinking it will
>>> make it faster but it causes huge pauses.  This will be allocating
>>> large cache and then when it really has to be flushed the system
>>> will spend the time doing that - and basically seem to be dead
>>> during that time.
>
>> This is the first time I've heard of this.
>
> Jeff Lieberman and I had a discussion on this a few years back.
> This was also a problem in many Linux systems.  They made the cache
> very large for fast performance, but when it had to be flushed it
> would bog down.  It has probably been fixed.
>
>> You're right, I never mentioned the system (this time). It is an
>> SMP box with two 3GHz CPUs and 1GB of RAM. We run about 50 users, but
>> at night when this problem happens there are only about 7 or 8, and
>> only 1 of them is ever doing anything at a time (meaning at no time
>> are two or three people doing different things). Just one big filePro
>> app running with 6 - 8 users... and the backup.
>
> I don't have anyone running SMP but ISTR some discussion on SMP
> problems and there was a patch - but I may be confused on this.
> Perhaps JPR has more definite information.
>
>> Do you think a system this large and fast could have this cache
>> clearing freeze up ... because I _do_ have the cache (nbufs and
>> so on) set as high as they possibly can be. Should I _lower_
>> these, perhaps?
>
> You can try that.  But I'd first run sar [see below] to see
> what is happening.  If you see huge amounts of disk i/o or
> waits the huge cache could be it.  I'm not one for changing things
> until I find out what is causing it, that's why I suggested sar
> first.
>
> Are the drives on a caching controller?   If so you could have
> OS cache flushing to controller cache flushing to disk.  I'm not
> saying this is the problem, just that it could be.
>
>>> You have never mentioned that you have run sar or given any info
>>> about that.  Have you run it?
>
>>> Bill
>
>> No, I never have run sar. It is so intermittent. I'll set up
>> sar to run, though during a midnight backup and see what it
>> produces. It is just that Tune-Up has not really shown anything
>> in the way of huge system hogs other than Edge itself which
>> pretty much always wins. I don't know how to track spikes with
>> Tune-Up... maybe I'll call and ask them.
>
> If by 'sar is so intermittent' you mean the default that runs
> hourly overnight and every 20 minutes during the day, that is
> only going to give hourly and 20 minute averages and might not
> point out a thing.
>
> But you can run sar from the command line, and when I have a
> problem that I need to see I'll set it up to run for 10 or more
> iterations at intervals of 10 or more seconds.  If you run it more
> often than every 10 seconds sar itself can skew the results.
>
> From the command line just do this:
>
> sar -o /tmp/sar 10 10
>
> The first 10 is how many seconds to wait before you run it again.
> The second 10 is how many times you wish it to run.  The output
> goes into the file 'sar' in tmp.
>
> This is good to catch those things that happen for a minute or so
> and go away.
>
> You could make this into a small shell script so someone could run
> it if things get slow without having to call you.  They could
> then mail the results of the file to you.
>
> Then to see what happened just do this:
>
> sar -A -f /tmp/sar
>
> You can pipe that to a printer, less, or whatever.
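
A minimal sketch of the "small shell script" idea above, so that whoever notices the slowdown can capture a sar sample without calling anyone. The function name capture_sar, the SAR override variable and the timestamped file name are assumptions made for illustration; only the "sar -o file interval count" form comes from the thread.

```sh
#!/bin/sh
# capture_sar: hypothetical helper name, not from the original thread.
# Records sar samples into a timestamped file and prints the file name.
capture_sar() {
    interval=${1:-10}    # seconds between samples (default 10)
    count=${2:-10}       # number of samples (default 10)
    outdir=${3:-/tmp}    # directory for the binary sar file
    out="$outdir/sar.$(date +%Y%m%d%H%M%S)"

    # SAR may be set to another command for testing; defaults to sar.
    "${SAR:-sar}" -o "$out" "$interval" "$count" >/dev/null || return 1
    echo "$out"
}
```

Calling "capture_sar 10 10" prints the name of the file it wrote, which can then be read back with "sar -A -f" on that file or mailed on as described above.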

also he could install "hog" iohog/cpuhog/memhog and top.
I happen to know he already has my handy dandy vol downloader/installer
script so here is the exact command:

/setup_gnu hog-1.1 top-3.5beta5

after that, start up 3 telnet sessions
in each one run one of these commands:
/usr/local/bin/iohog, memhog, cpuhog

let them run all day and watch various processes appear and disappear from
the displays as they momentarily consume one or more of the three implied
resources. If a program appears in, say, the iohog screen, and doesn't go
away, that is a suspect. rreport, *tar/backupedge, htepi_daemon and bdflush
will commonly appear, but fleetingly, as io hogs. bdflush is the thing
that flushes writes older than NAUTOUP seconds from the NBUF x 4Kbyte disk
cache every BDFLUSHR seconds. htepi_daemon performs the metadata and journal
("transaction intent") updating for htfs filesystems.
ref: http://osr5doc.sco.com:1997/PERFORM/PERFORM_Glossary.html
I think htepi_daemon is not a concern and does only a little work but does
it constantly, whereas bdflush might have to do a lot of work, and
repeatedly.

other notes:

* install u386mon
http://www.caldera.com/skunkware/faq.html#u386

* if you have less than osr 5.0.7 with UP3 (that's UP3 not MP3, which you
should also have, but MP3 is free and UP3 must be purchased) then disable
hyperthreading in the bios. WARNING: If you determine that HT was enabled
and that you should disable it, then be advised that some licenced
commercial software might become unlicensed after doing it. The actual
example from my experience was ctar. They relicenced it promptly without
requiring any jumping through hoops when I explained what happened so it was
no trouble. The reason I make a big deal is, backup software can be broken
all day long and no one cares. If it had been, say, facetwin that broke, and
you are presumably doing this after hours since it requires rebooting, and
facetcorp is not available for re-licencing until several hours after your
users start needing to work.... could be a problem! I don't remember if
facetwin broke or not but I don't think so.

* if you have 5.0.6-anything, then there are a few patches relating to cpu:
ftp://ftp.sco.com/pub/openserver5/oss657a.ltr
ftp://ftp.sco.com/pub/openserver5/oss651a.ltr
ftp://ftp.sco.com/pub/openserver5/oss648a.ltr
I can't believe you have 5.0.5 so I didn't post 5.0.5 related patches but
there was one for SMP.
and 5.0.6 definitely qualifies as "less than 5.0.7 with up3" also.

* All this talk of sar brings to mind sarcheck, which reads sar for you and
would have diagnosed a cache/buffer size vs flush-frequency issue and made a
recommendation to alleviate it (as long as the hardware was just plain up to
the work that needs to get done, which it sounds like it is). And as soon as
I say that, I have to say, I believe that John has olympus tune-up, which
should be doing more or less the same thing as sarcheck only maybe better
and of course it just goes ahead and implements changes rather than
generating a report advising you to do them.

* I think the jury is still out on the goodness of this, but it has been
discussed and at least Bela Lubkin and myself have tried it and have not had
bad results, but not especially good either (on a very heavily multi-user
loaded machine in my case: a lot of clerk & report procs for over 100 users).
I am sure John has a caching raid card. Given that, you could try setting
NAUTOUP and BDFLUSHR both to 0 (cd /etc/conf/cf.d ; ./configure, option 1)
which totally disables the OS disk cache, which is fine because the raid
card has its own cache and two or more caches in series with each other is
rarely a benefit. Conversely, you could disable the controller's cache in the
controller's bios. The tamer actions would be to try lowering nbuf & friends
(reduce the wad of data that needs to be flushed every 30 seconds) or
lowering bdflushr & nautoup (flush more often than every 30 seconds so that
the cache can be very large yet the actual changed data that needs to flush
is usually small). But it's real easy to make combinations that perform very
badly when trying out different values and playing those two properties
against each other. And making the change requires rebooting, which is hard
to do often on a busy production machine, so you could be stuck with bad
settings for a whole day each, and the testing cycle is a whole day.

* might want to check netstat -i , and netstat -m for suspiciously high
numbers of errors and streams usage
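
To put rough numbers on the nbuf vs bdflushr/nautoup trade-off described in the note above, here is a back-of-the-envelope sketch. It assumes a steady write rate and simply multiplies it by the flush interval, capped at the cache size; the helper name flush_kb and the sample figures (500 KB/s, a 400000 KB cache) are made up for illustration and are not measurements from John's system or anything SCO ships.

```sh
# flush_kb: hypothetical helper, not an SCO tool.
# Estimates KB written per flush pass as write rate * interval,
# capped at the cache size. All three inputs are assumed values.
flush_kb() {
    rate_kb_s=$1     # sustained write rate in KB/s (assumed)
    interval_s=$2    # seconds between flush passes
    cache_kb=$3      # total buffer cache size in KB
    kb=$((rate_kb_s * interval_s))
    if [ "$kb" -gt "$cache_kb" ]; then
        kb=$cache_kb
    fi
    echo "$kb"
}

flush_kb 500 30 400000    # flush every 30 seconds
flush_kb 500 5 400000     # flush every 5 seconds
```

With these made-up inputs the two calls print 15000 and 2500: the same large cache, but a much smaller wad to push out on each pass when flushing more often. Real behaviour will differ, since this ignores how the age threshold and the raid card's own cache interact.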

Brian K. White  --  brian at aljex.com  --  http://www.aljex.com/bkw/
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro BBx  Linux SCO  Prosper/FACTS AutoCAD  #callahans Satriani