Critical uptime question (Was "Looking for some upgrade advice")
Brian K. White
brian at aljex.com
Mon May 22 09:59:35 PDT 2006
----- Original Message -----
From: "Boaz Bezborodko" <boaz at mirrotek.com>
To: <filepro-list at lists.celestial.com>; "John Esak" <john at valar.com>
Sent: Sunday, May 21, 2006 10:16 PM
Subject: Critical uptime question (Was "Looking for some upgrade advice")
>
>
> John Esak wrote:
>
>>Date: Fri, 19 May 2006 07:37:16 -0400
>>From: "John Esak" <john at valar.com>
>>Subject: OT: RE: Looking for some upgrade advice
>>To: "Fplist (E-mail)" <filepro-list at seaslug.org>
>>Cc: Rick Walsh <rick at nexusplastics.com>
>>Message-ID: <JIECJPPMJGMIINMGGNGAAEHJPBAA.john at valar.com>
>>Content-Type: text/plain; charset="us-ascii"
>>
>>
>>
>>>I suspect the only reason I haven't seen comparable uptimes on my linux
>>>systems is because the kernel updates require a reboot. I talked
>>>directly
>>>to the 2nd in charge of the kernel, as well as some of the other kernel
>>>devs, and the consensus was that if I wanted a hot-swappable kernel, I
>>>could go and write the hot-swap code myself. They didn't consider it a
>>>priority, or even desirable.
>>>
>>>
>>
>>As you know, the *last* thing in the world I want to do is start a Linux
>>thread here. :-)
>>
>>BUT... this is something I hadn't considered in our upcoming major move to
>>SuSE Linux. We have a situation where the main *nix server (currently SCO
>>OpenServer 5.6) can NOT go down at all. Literally, it is used to produce
>>various things, mostly bar code labels, 365/24/7... with absolutely NO down
>>time at all except for two week-long vacations during the year and some
>>other extremely special circumstances... hardly would I call these
>>"planned maintenance"... more like get in whatever we can because the
>>system
>>went down for some unforeseen reason! :-) Very occasionally, and I mean
>>very occasionally, we can stop the constant transactional postings (and
>>label printing) for a few minutes... really, just a few. Otherwise, it
>>becomes much like the "I Love Lucy" chocolate factory conveyor belt scene.
>>
>>What, seriously, are we going to do in this situation? I was kind of hoping
>>we could find a *stable* Linux... meaning a kernel that does not need that
>>much or *any* patching. Are you talking about real security problems, or
>>feature upgrades? We simply can not bring the machine down for either
>>reason... at least not on *any* kind of ongoing basis... how in the world
>>does *anyone* cope with such a situation?
>>
>>Yes, yes, I'm constantly considering and devising possible methods to
>>de-reference our main databases and CPUs from this immediate hardware
>>interface... but to date, I have not come up with anything that would work
>>well enough to meet the need. Our systems are currently up-to-the-minute
>>and
>>pretty much *have* to stay that way.
>>
>>Suggestions?
>>
>>John Esak
>>
>>
>>
> John,
>
> I was thinking about this over the weekend. It seems to me that you
> could give yourself a whole lot of flexibility if you could somehow
> duplicate the database you're working with. I think I could do this
> if the database were not stored on the same machine as the one
> executing the filePro code.
>
> Here is how I see it working:
> Run two different servers, each with its own copy of the database files.
> One is directly accessed by the users, while the other gets updated with
> all of the transactions. Whenever the first server processes a
> transaction, it writes a record of that transaction as a separate file
> for the second database to read. The second server runs a process that
> looks for these transaction files and applies them to its own copy of
> the database.
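>
> A rough sketch of the applier side in Python (the spool directories and
> the record handling are made up, just to show the shape of the idea):
>
>     import os, shutil, time
>
>     SPOOL = "/fp_spool/pending"   # transaction files from the live server
>     DONE  = "/fp_spool/applied"   # moved here once replayed locally
>
>     def replay(path):
>         # Placeholder: parse one transaction record and post it to this
>         # machine's copy of the files, however your application does
>         # its updates.
>         pass
>
>     while True:
>         for name in sorted(os.listdir(SPOOL)):  # sortable names: oldest first
>             src = os.path.join(SPOOL, name)
>             replay(src)
>             shutil.move(src, os.path.join(DONE, name))
>         time.sleep(5)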
>
> You could set up a controlled switch from the server running the first
> database to the one running the second. At the end of each transaction
> executed against the primary database, you can have code that checks a
> status flag describing the condition of that database's server. If the
> flag says to switch to the secondary database files, the process forces
> the user to exit the application. Once the user exits, you change the
> client's configuration to work from the secondary database on
> re-execution. All sessions will eventually move over to the second
> database server until every process has transferred, leaving the first
> free for any changes or updates. In the meantime the secondary database
> has started acting as the primary and is building up a list of
> transactions that the original database will have to apply to bring it
> to the same condition as the first once it is started up again. (Or you
> might be able to just copy files.)
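>
> The end-of-transaction check itself can be tiny; in Python it would be
> something like this (the flag path is made up):
>
>     import os, sys
>
>     SWITCH_FLAG = "/fp_control/use_secondary"  # hypothetical flag file
>
>     def end_of_transaction_check():
>         # If the flag is up, push the user out so the client can be
>         # re-pointed at the secondary server before re-execution.
>         if os.path.exists(SWITCH_FLAG):
>             print("Server is switching over -- please log back in.")
>             sys.exit(0)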
>
> I don't know much about Linux, but I could see how an application
> running on the same computer as the data would have a much harder time
> detecting and adjusting for a switch to a different server. But if all
> you're doing is pointing at a virtual drive, similar to how I do it
> now--by running the processes on separate Windows machines while they
> look to mapped network drives for the data--it is a much easier process
> to have a script that exits out of the process, changes the mapped
> network drive to point to the new server, and then re-executes.
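>
> On the Windows side that script could be as simple as this (the drive
> letter, share name, and startup command are all made up):
>
>     import subprocess
>
>     def remap_and_restart(new_server):
>         # Drop the old mapping and point the same letter at the new box.
>         subprocess.run(["net", "use", "P:", "/delete", "/y"], check=False)
>         subprocess.run(["net", "use", "P:", rf"\\{new_server}\fp"],
>                        check=True)
>         # Relaunch the application from the re-mapped drive.
>         subprocess.run([r"P:\fp\start_fp.bat"], check=True)
>
>     remap_and_restart("server2")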
>
> It seems like this will work well enough, but not knowing the actual
> application I don't know if this is a good solution for you.
>
> Boaz
>
You could do all of that with a little networking smoke & mirrors, rsync,
a few minutes of downtime, and no client-side futzing.
Have both machines on the LAN,
set everything up the same on both, and
rsync the fp data to the secondary machine periodically, at least twice a
day. You might even be able to do it every hour or more often, given two
Linux boxes with good disk hardware on a gigabit LAN between them. rsync
right while running live; don't worry that the remote copy is slightly
inconsistent because the source was actively changing while the copy was
being made. All that matters is that the secondary copy is kept 99.999%
similar to the live copy.
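
A minimal version of that sync, scripted in Python so it can be run from
cron (the paths here are examples; point SRC at wherever your filePro
data actually lives):

    #!/usr/bin/env python3
    import subprocess, sys

    SRC  = "/u/appl/"           # live fp data (example path)
    DEST = "standby:/u/appl/"   # secondary box, over ssh

    # -a preserves perms/times/ownership, --delete drops files that no
    # longer exist on the live side, -z compresses over the wire.
    rc = subprocess.run(
        ["rsync", "-az", "--delete", "-e", "ssh", SRC, DEST]
    ).returncode

    # rsync exits 24 when source files vanish mid-copy; on a busy live
    # system that's expected, so treat it the same as success.
    sys.exit(0 if rc in (0, 24) else rc)

Drop that in cron every hour or so; the same script, run once more during
the lockout in step 4 below, becomes the final catch-up pass.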
Now comes kernel update day:
1) perform kernel/hardware/whatever maintenance on secondary box
---downtime starts---
2) set a user lockout flag on both boxes
3) get everyone/everything out gracefully
4) rsync one more time
5) pull network switcheroo
6) remove lockout flag from secondary (now primary) box only
7) let users back in.
---downtime ends---
8) perform maintenance on the former primary box and leave it in place as
the new secondary box
(2) Something that blocks new sessions from starting: /etc/nologin, or
better yet, something you do to the fp binaries themselves so that
cron/cgi/edi jobs are halted as well as logins.
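
For example, a lockout wrapper (a sketch; renaming the real binary to
dclerk.real is a made-up convention, and the flag path is an example):

    #!/usr/bin/env python3
    # Installed in place of the real binary so that cron/cgi/edi jobs
    # hit the same wall as interactive logins.
    import os, sys

    FLAG = "/etc/fp_lockout"   # touch this file to start the lockout

    if os.path.exists(FLAG):
        sys.stderr.write("filePro locked for maintenance, try later\n")
        sys.exit(1)

    real = sys.argv[0] + ".real"
    os.execv(real, [real] + sys.argv[1:])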
(3) Watch ps for any reports/clerks that were started in ways other than
user logins, until they end gracefully.
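
A watcher for that can be as dumb as this (the name substrings are
examples; match whatever your fp binaries are called):

    #!/usr/bin/env python3
    import subprocess, time

    NAMES = ("clerk", "report")   # substrings of process names to watch

    def stragglers():
        out = subprocess.run(["ps", "-e", "-o", "comm="],
                             capture_output=True, text=True).stdout
        return [c for c in out.splitlines() if any(n in c for n in NAMES)]

    while True:
        left = stragglers()
        if not left:
            break
        print("still running:", ", ".join(sorted(set(left))))
        time.sleep(10)
    print("all fp processes ended")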
(4) Now the remote copy is 100% good, and you can see why all that
mattered before was keeping it 99% good: the rsync that happens during
the downtime only has to transfer the differences, so it goes very,
very fast.
(5) Swap IPs, and/or DNS names, and/or NAT rules, and/or physical cables
around so that the secondary machine appears where the live machine used to.
There are several possible ways to do that. The easiest/fastest that doesn't
involve rebooting the secondary machine is probably to host your own DNS
and/or WINS, update a DNS record that the clients all use, and take the
live machine physically offline, off the LAN, or change its IP, so that
clients can't cache its IP and have to look it up fresh from DNS or WINS.
You could internally NAT the servers so that you just make a change in an
internal NAT router; that is truly transparent and instant for everyone,
and fast & easy to do without making many (or any) changes on either the
servers or the clients, but it adds at least one new machine into the
critical path and thus makes the system as a whole more risky.
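
For the swap-the-IPs flavor, the moving parts on the secondary box come
down to roughly this (a sketch; the address, interface, and MAC are all
examples):

    #!/usr/bin/env python3
    # Run on the secondary box once the old primary is off the wire.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    IFACE = "eth0"
    SERVICE_IP = "192.168.1.10"   # the address the clients connect to

    # To assume the old box's MAC as well (most drivers want the link
    # down for this):
    #   run("ip", "link", "set", IFACE, "down")
    #   run("ip", "link", "set", IFACE, "address", "00:11:22:33:44:55")
    #   run("ip", "link", "set", IFACE, "up")

    run("ip", "addr", "add", SERVICE_IP + "/24", "dev", IFACE)
    # Gratuitous ARP so switches and clients with stale caches learn
    # the address's new home right away.
    run("arping", "-U", "-c", "3", "-I", IFACE, SERVICE_IP)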
The downtime is:
* only a few minutes; steps 2-7 can take about 5-10 minutes total,
maybe even under 2 if you script everything up well. Everything but the
rsync can be almost instantaneous, and the rsync could be as little as a
minute or two even for a large number of files, thanks to 3 things: fast
CPUs & RAM, a fast network, and fast Linux filesystems.
* does not have to include waiting for any reboots
* ...or incurring the risk of not coming up from a reboot
* does not have to require special client-side setup or user education,
not even a secondary/alternate login icon, nor special allowances in any
non-interactive automated scripts or processes, since the secondary machine
takes the place of the primary and assumes its network identity, right down
to software setting the MAC addresses of the NICs in both boxes if
necessary.
The constant rsyncs also serve as a great backup in case the main machine
dies ungracefully. rsyncing a live db does not produce a 100% consistent
copy, but in reality it's pretty close. If you suffer a crash of your main
server while running, you should simply expect a few bad records and tell
everyone to verify the last few minutes of work they did. You're lucky if
that's all that's wrong or lost.
The tape backup that took more than an hour to run is both several hours
older and far more inconsistent.
In cases where that's not good enough, more money solves it. Use an external
RAID array and RAID cards that allow multiple servers to mount the same
array.
The fact that the array is RAID provides redundancy for the array itself,
and the two servers provide redundancy for each other.
Then the only non-redundant part is the enclosure the RAID disks are plugged
into.
Get an extra enclosure and mount it in the rack right above or below the
live one, and it only takes a minute to swap the cables and move the disks
if that one fails.
Brian K. White -- brian at aljex.com -- http://www.aljex.com/bkw/
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro BBx Linux SCO FreeBSD #callahans Satriani Filk!