fpCGI failure

Thu Mar 13 13:20:23 PDT 2008

----- Original Message ----- 
From: "Tyler Style" <tyler at healthyhabitsweb.com>
To: "Ron Kracht" <rkracht at filegate.net>
Cc: <filepro-list at lists.celestial.com>
Sent: Thursday, March 13, 2008 2:04 PM
Subject: Re: Re: fpCGI failure

> On Thu, Mar 13, 2008 at 10:37 AM, Ron Kracht <rkracht at filegate.net> wrote:
> 
>     Tyler wrote:
>     > fpCGI, both v1 and 2, bombs out on our Apache2 web server running
>     > fp5.0.14 on SCO OS 6.  It simply stops outputting HTML files for any
>     > rreports run.  It requires rebooting the entire machine to get it
>     > going again.  Running the commands thru fpcgi that it was stalling on
>     > works perfectly fine after the reboot.
>     >
>     > Via our reseller via Ray at fptech, Ron at fptech says that it looks
>     > like an application error, and that we should just set
>     > Field_nohtmlfound to some other page.  This is obviously unsatisfactory.
>     >
>     > Can anyone shed any light on this, or how to avoid it?
>     >
>     > Tyler
>     >
>     I did not say it was an application error. I said that from the logs it
>     looked like there was a flaw in the fpcgi logic that you were
>     occasionally hitting when report does not produce an html document and
>     that this flaw could be avoided by setting Field_nohtmlfound.  I do not
>     see why this is "obviously unsatisfactory".  You should always have some
>     sort of html document ready to report to the user that no output was
>     produced - that is the purpose of Field_nohtmlfound.  As I said in my
>     response to support, if you already have Field_nohtmlfound set to point
>     to such a document then my analysis was incorrect and I'd need to get more
>     information.
> 
>     Ron
> 
> Hmmm.  What you wrote, "it looks like they may have hit upon a flaw in 
> the programming logic", is ambiguous.  Since my frame of reference is 
> that we're talking about fpCGI, I assumed that the 'programming logic' 
> being referred to here was fpCGI's, not the filepro report being run.
> 
> Unfortunately, this is not the case.  We were running a destruct test on 
> a test box for fpcgi v2 to see if it would solve this issue.  The same 
> data was being submitted to the same processing over and over again.  If 
> it was a filepro processing table issue the problem would surface every 
> time it was submitted, and it would not prevent all subsequent runs of 
> rreport by fpCGI (regardless of the processing table being run or the 
> data being submitted) from outputting files.  It certainly wouldn't 
> require rebooting our server to get it going again, which is what we 
> have to do now.
> 
> It would be very odd for one report bombing out to block every 
> subsequent report from running; if that was the case, it should bring 
> our whole system to a halt right away as no one would be able to run 
> reports at all.  I think this isolates it pretty much to the actual 
> fpcgi binary.
> 
> This problem is severe enough that we have a cron job script running 
> every 10min against fpcgi that will email us if it doesn't get a 
> response back from fpcgi:
> 
> response=`curl -s -d Field_ddir=/usr/local/apache/htdocs/ \
>     -d Field_base=fpcgts \
>     -d Field_cmd=rreport+kinocontrol+-fp+fpcgitest+-a+-n+-u \
>     http://www.kinotox.net/cgi-bin/fpcgi | grep running`;
> 
> The filepro processing is:
> ::html :cr @pw{".htm"
> ::html :tx "running"
> ::html :cr-
> ::exit
> 
> When this script sends us emails saying that this processing isn't 
> producing output files anymore, we know it's time to reboot (or 
> customers/employees call complaining that they can't use our web tools).
> 
> I find it very unlikely that this processing has "a flaw in the 
> programming logic" that is causing fpCGI to die :D  So back to the 
> drawing board, alas :(
> 
> Tyler
> 
> PS: None of our forms have "Field_nohtmlfound" - I've never heard of it. 
> It isn't in my fp dev reference book, nor can I find it when I do a 
> search in the online fp manual.  Is this some new feature in v2?  If so, 
> we wouldn't be using it as our production server is still v1.  Or has it 
> been around and just never been documented?
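
For anyone wanting to set up the same kind of check, the cron test described above could be sketched roughly as below. This is only an illustration built around the quoted curl command: the function names, the ALERT_TO address, the 60-second timeout and the use of mail(1) are all assumptions, not taken from Tyler's actual script.

```shell
#!/bin/sh
# Sketch of a cron watchdog around the curl check quoted above.
# ALERT_TO, the timeout and the function names are hypothetical.
FPCGI_URL="http://www.kinotox.net/cgi-bin/fpcgi"
ALERT_TO="admin@example.com"

# Succeeds only if the response body contains the marker text "running",
# which the test processing writes into its HTML output.
response_ok() {
    printf '%s\n' "$1" | grep -q running
}

# Posts the same fields as the original one-liner. --max-time keeps a
# hung fpcgi from hanging the watchdog itself.
fetch_response() {
    curl -s --max-time 60 \
        -d Field_ddir=/usr/local/apache/htdocs/ \
        -d Field_base=fpcgts \
        -d Field_cmd=rreport+kinocontrol+-fp+fpcgitest+-a+-n+-u \
        "$FPCGI_URL"
}

# Runs one check and mails an alert if fpcgi did not answer as expected.
watchdog() {
    if ! response_ok "$(fetch_response)"; then
        echo "fpcgi produced no output at $(date)" |
            mail -s "fpcgi check failed" "$ALERT_TO"
        return 1
    fi
}
```

To use it, add a final line calling `watchdog` and schedule the script from cron, for example with a minute field of `0,10,20,30,40,50` for the every-10-minutes interval mentioned above. The timeout matters here because the failure being watched for may show up as a stall as well as an empty response.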

We have hit a problem on rare occasions where all clerk & report processes stop working (any qualifier, any fppath variables, d* and r*, and I'm pretty sure I tried running other copies of the binaries).
As far as we could tell it wasn't a lockfile or record locking issue. We'd manually kill every fp binary process (clerk/report/runmenu/anything from fp) and then manually delete all lockfiles, and launching a new clerk process would still freeze right on startup, before drawing the initial screen or menu.

Rebooting the server was the only thing that cleared this condition, and we can't reproduce it at will. Using the same app, same hardware, same users, etc., it happened once, then once again a month later, then twice in one week, and it hasn't happened for over a year since then, with the same servers up continuously the whole time. I don't think it happened 10 times in total on all servers combined. Maybe not even 5 times. It did happen on both a SCO Open Server 5.0.6 box and an openSuSE 10.0 box.

The data and users have changed since then only by growing. The processing code has changed continuously, daily, but is always the same on all boxes that ever had the problem. Because of that, my guess by now is that some processing of ours somewhere caused it, and that the processing has since changed and no longer causes it. Whether that processing was actually valid/legal according to the rules of fp (making it an fp or underlying OS bug) or was invalid processing (making it an aljex bug) is impossible to say.

Maybe I should have tried to capture the state of the machine just before rebooting (copy /dev/kmem to a file?). I had no luxury to spend any time trying to diagnose or debug the problem, since it always happened on the servers with the most users during the busiest times, so I had to reboot and get it back online immediately.
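
A cheaper alternative to dumping /dev/kmem would be a quick snapshot script that can be run in a few seconds before the reboot. The following is only a sketch under assumptions: the function name, the output location and the choice of ps and ipcs output are illustrative, and nothing here is known to reveal the actual cause of the hang.

```shell
#!/bin/sh
# Hypothetical helper: save a snapshot of system state into a timestamped
# directory so a hang can be examined after the reboot.
snapshot_state() {
    outdir="${1:-/tmp}/hang-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$outdir" || return 1

    # Full process list, to see which fp binaries exist and their parents.
    ps -ef > "$outdir/ps.txt" 2>&1

    # System V IPC state (shared memory, semaphores, message queues),
    # recorded only if ipcs is installed on this box.
    if command -v ipcs >/dev/null 2>&1; then
        ipcs -a > "$outdir/ipcs.txt" 2>&1
    fi

    # Print the directory name so the caller knows where to look later.
    echo "$outdir"
}
```

Calling `snapshot_state /var/tmp` right before the reboot leaves a directory such as /var/tmp/hang-YYYYMMDD-HHMMSS behind, which can be compared against a snapshot taken while the system is healthy.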

-- 
Brian K. White    brian at aljex.com    http://www.myspace.com/KEYofR
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro  BBx    Linux  SCO  FreeBSD    #callahans  Satriani  Filk!