Help! BSOD in Unix

Brian K. White brian at aljex.com
Wed May 14 23:20:33 PDT 2008


----- Original Message ----- 
From: "Barry Wiseman" <bwiseman at optonline.net>
To: "Filepro_List" <filepro-list at lists.celestial.com>
Sent: Wednesday, May 14, 2008 4:35 PM
Subject: Help! BSOD in Unix


> SunOS
> filepro 5.0.14
> 
> These people win the cake for most weird problems I'm seeing for the 
> first time in 30 years of filepro.
> 
> Today, they suddenly stopped being able to load *clerk or *report.  I 
> launch the program from the shell, no arguments, it never gets to Enter 
> Filename.  Redraws the screen blank in filepro's background color, and 
> hangs with the cursor at 0,0.  Does not respond to BREAK or Ctrl-\. 
> Only a "kill -9" from another terminal will recover the session.
> 
> Creation programs, dxmaint, ddir all run fine.  Only these four.  The 
> *clerk and *report binaries are checksum identical with those on the 
> test server, which run fine there.  Permissions and ownerships all look 
> correct, both in $PFPROG and $PFDIR trees.
> 
> Ideas?

No ideas, but an observation, really two observations in one:

We have seen this (or something that sounds almost exactly like it) very rarely, and not recently, on Linux.

Sometime over a year ago, maybe over 2 years ago now, we had a problem on 2 Linux boxes shortly after switching our largest single subscription user server from SCO OpenServer 5.0.6 to SUSE Linux 10.0.

The code was the same as, or very similar to, what had been on the SCO box, but it had been copied to a new server as well as ported to a new OS. Maybe 5 or so times in total, spread over about 6 months, on 2 identical SUSE Linux servers (same fp binaries, same application code base, same OS version, same hardware, same RAID & fs config), with no apparent pattern or cause, all *clerk and *report would suddenly hang on startup just as you describe.

It always happened at a time when I could not spare any time for debugging or diagnosing. I tried a few obvious fast things, got nowhere, and had to give up and reboot because of the pressure of so many paying users being down right in their busy time. We never saw the problem before then, not even on the SCO boxes that had all the same users running all the same application code just a few weeks earlier, and we have not seen it for at least a year now, though I don't know what, if anything, we changed that might have either caused or corrected it. Maybe an OS update, maybe a filepro update, maybe application code changes, maybe nothing more than usage pattern?

We did reduce the total number of users on any single box, just on general principle, since it was up in the over-200 range and the fp licenses were getting close to the wire anyway. As far as I could tell it was not any obvious limit like the fp license count or OS settings like NOFILES etc. One of those limits might have been exceeded, but it didn't appear so to me at those times. The boxes were completely fine in all other observed ways while the clerk/report processes were failing: not excessively loaded in any obvious resource (RAM, disk I/O, concurrent ttys, total number of processes, etc.), all well below previously observed levels where the box ran fine.

Killing off existing clerk/report processes didn't help either (perhaps not surprising, since I think the graceful signals, 15, 1, etc., didn't work for us either). Even after killing off every user process as gracefully as possible, new clerks & reports still hung; only a reboot cleared it. I never got as far as digging into shared memory and maybe clearing allocations manually, but if I had had the time, or if it happened again, that would be the next thing I'd look into. It hasn't happened again. By now I don't remember whether I looked for locked records (lslk/lsof/showlock/fuser) or whether I tried using strace -e full -o clerk.txt rclerk ...
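
For anyone who catches the hang while it is happening, here is a minimal sketch of how the state could be captured before rebooting. It is not something that was run at the time. OUTDIR, save_output and snapshot_hang_state are made-up names, and whether the hang has anything to do with shared memory is only a guess.

```shell
#!/bin/sh
# Sketch only: snapshot basic system state while *clerk / *report hang.
# OUTDIR and the function names are hypothetical, not part of filepro.

OUTDIR=${OUTDIR:-/tmp/fp-hang.$$}

# save_output NAME CMD [ARGS...]
# Run CMD if it is installed and save its output as $OUTDIR/NAME.txt;
# either way, note what happened in $OUTDIR/summary.txt.
save_output() {
    name=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUTDIR/$name.txt" 2>&1
        echo "ran (exit $?): $*" >>"$OUTDIR/summary.txt"
    else
        echo "skipped, not installed: $1" >>"$OUTDIR/summary.txt"
    fi
}

# snapshot_hang_state [DIR]
# Collect the snapshots into DIR (default: $OUTDIR).
snapshot_hang_state() {
    [ -n "$1" ] && OUTDIR=$1
    mkdir -p "$OUTDIR" || return 1
    save_output ipcs ipcs -a    # System V shared memory, semaphores, queues
    save_output ps   ps -ef     # every process, to spot stuck clerks/reports
    echo "snapshot saved in $OUTDIR"
}

# Commands to try by hand afterwards (left as comments on purpose):
#   strace -f -o clerk.txt rclerk    # Linux: log system calls of a hanging start
#   truss -f -o clerk.txt rclerk     # SunOS counterpart of the strace line
#   ipcrm -m SHMID                   # remove one shared memory segment listed
#                                    # by ipcs; SHMID is a placeholder
```

Source the file and run snapshot_hang_state /some/dir from a second terminal while a clerk is stuck. Removing a segment with ipcrm can break whatever is still attached to it, so that step only makes sense once every user process is already gone.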

So the two observations are:
1) I think your phenomenon has been seen before, on a different OS.
2) The most obvious common elements are the use of filepro, with only filepro binaries being affected, and the fact that the application-level code in use was Aljex. The code you are working on is much older than what we were using, and has been modified a lot by the customer over the course of years, yet that still leaves a _lot_ of common code between the two systems that have exhibited the problem.

I'd ask Howie if he remembers doing anything special around the time of, or in reaction to, our problem.
I know he did a lot of housecleaning and reduced the amount of work happening in auto processing in several heavily used files. There are a lot of differences between our current code and what that customer has, and we no longer seem to have the problem. It could be almost anything of course, but _maybe_ something done in that time will stand out as a possible fix.

-- 
Brian K. White    brian at aljex.com    http://www.myspace.com/KEYofR
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro  BBx    Linux  SCO  FreeBSD    #callahans  Satriani  Filk!


