Filepro-list Digest, Vol 28, Issue 47
Boaz Bezborodko
boaz at mirrotek.com
Mon May 22 11:16:25 PDT 2006
>Date: Mon, 22 May 2006 12:59:35 -0400
>From: "Brian K. White" <brian at aljex.com>
>Subject: Re: Critical uptime question (Was "Looking for some upgrade
> advice")
>To: <filepro-list at lists.celestial.com>
>Message-ID: <03d101c67dc1$2a1ef5b0$6500000a at venti>
>Content-Type: text/plain; format=flowed; charset="iso-8859-1";
> reply-type=original
>
>
>----- Original Message -----
>From: "Boaz Bezborodko" <boaz at mirrotek.com>
>To: <filepro-list at lists.celestial.com>; "John Esak" <john at valar.com>
>Sent: Sunday, May 21, 2006 10:16 PM
>Subject: Critical uptime question (Was "Looking for some upgrade advice")
>
>
>
>
>You could do that all with a little networking smoke & mirrors and rsync and
>a few minutes downtime and no client side futzing.
>
>Have both machines on the LAN and set everything up the same on both.
>Rsync the fp data to the secondary machine periodically, at least twice a
>day; you might even be able to do it every hour or even more often given two
>Linux boxes with good disk hardware on a gigabit LAN with each other. Rsync
>right while running live; don't worry about the fact that the remote copy is
>slightly inconsistent due to the source actively changing while the copy was
>being made. All that matters is that the secondary copy is kept 99.999%
>similar to the live copy.
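>
>For example, the periodic sync can be as small as one cron entry (just a
>sketch, untested; /appl/filepro and "standby" are placeholder path/host
>names, adjust them to your setup):
>
>  # /etc/cron.d/fpsync -- push the live filePro data to the standby box
>  # -a        recurse and preserve owners, perms and timestamps
>  # --delete  drop files on the standby that no longer exist on the live box
>  0 * * * * root rsync -a --delete /appl/filepro/ standby:/appl/filepro/
>
>Run it over ssh with key-based auth so cron never needs a password, and time
>a few passes by hand first to see how long one really takes on your data.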
>
>
>now comes kernel update day
>
>1) perform kernel/hardware/whatever maintenance on secondary box
>---downtime starts---
>2) set a user lockout flag on both boxes
>3) get everyone/everything out gracefully
>4) rsync one more time
>5) pull network switcheroo
>6) remove lockout flag from secondary (now primary) box only
>7) let users back in.
>---downtime ends---
>8) perform maintenance on former primary box and leave it in place as the
>new secondary box
>
>
>(2) Something that blocks new sessions from starting: /etc/nologin, or
>better yet something you do to the fp binaries so that cron/cgi/EDI jobs are
>halted as well as logins.
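>
>For (2), one crude but effective way to do the fp-binary trick (sketch only,
>untested; it assumes the binaries live in /appl/fp and that dclerk, dreport,
>rclerk and rreport are the ones your jobs actually use):
>
>  #!/bin/sh
>  # lockout.sh -- block new filePro sessions; run as root on both boxes
>  touch /etc/nologin                       # stop new interactive logins
>  for prog in dclerk dreport rclerk rreport; do
>      mv /appl/fp/$prog /appl/fp/$prog.real
>      printf '#!/bin/sh\necho "maintenance in progress" >&2\nexit 1\n' \
>          > /appl/fp/$prog
>      chmod 755 /appl/fp/$prog             # cron/cgi/edi jobs now fail fast
>  done
>
>Undoing it is just rm /etc/nologin and mv'ing the .real binaries back.
>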
>(3) Watch ps for any reports/clerks that were started some way other than a
>user login, until they end gracefully.
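>
>Something along these lines will tell you when the stragglers are gone
>(sketch; match on whatever binary names your background jobs actually run):
>
>  # wait until no clerk/report processes are left, then say so
>  while ps -ef | grep -E '[dr](clerk|report)' > /dev/null; do
>      sleep 10
>  done
>  echo "no filePro processes left, safe to do the final rsync"
>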
>(4) Now the remote copy is 100% good, and you see why all that mattered was
>that it was 99% good before: this rsync, which happens during the downtime,
>goes very, very fast.
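>
>The downtime pass is just the same rsync again, run by hand this time (same
>placeholder names as the cron sketch above):
>
>  # final pass while everyone is locked out; only the blocks that changed
>  # since the last periodic run have to move, so this is quick
>  time rsync -a --delete /appl/filepro/ standby:/appl/filepro/
>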
>(5) Swap IPs, and/or DNS names, and/or NAT rules, and/or physical cables
>around so that the secondary machine appears where the live machine used to.
>There are several possible ways to do that. The easiest/fastest that doesn't
>involve rebooting the secondary machine is probably to host your own DNS
>and/or WINS and update a DNS record that the clients all use, then take the
>live machine physically off line, off the LAN, or change its IP so that
>clients can't cache its IP and have to look it up from DNS or WINS fresh.
>You could internally NAT the servers so that you just make a change in an
>internal NAT router and it's really transparent and instant for everyone,
>and fast & easy to do without making much/any changes on either the servers
>or the clients, but this adds at least one new machine into the critical
>path and thus makes the system as a whole more risky.
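>
>If you go the bare-IP route rather than DNS or NAT, the actual swap is only
>a couple of commands (sketch; eth0 and 10.0.0.101 are placeholders for your
>interface and the live machine's address):
>
>  # on the old primary: drop the live address so nothing keeps talking to it
>  ip addr del 10.0.0.101/24 dev eth0
>
>  # on the secondary: take over the live address and announce it so the
>  # clients' ARP caches get refreshed right away
>  ip addr add 10.0.0.101/24 dev eth0
>  arping -U -I eth0 -c 3 10.0.0.101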
>
>The downtime:
> * is only a few minutes; steps 2-7 can all take about 5-10 minutes total,
>maybe even under 2 if you script everything up well. Everything but the
>rsync can be almost instantaneous, and the rsync could be as little as a
>minute or two even for a large number of files, thanks to 3 things: fast
>CPUs & RAM, a fast network, and fast Linux filesystems.
> * does not have to include waiting for any reboots
> * ...or incurring the risk of not coming up from a reboot
> * does not have to require special client-side setup or user education,
>not even a secondary/alternate login icon, nor special allowances in any
>non-interactive automated scripts or processes, since the secondary machine
>takes the place of the primary and assumes its network identity, right down
>to software setting the MAC addresses of the NICs in both boxes if necessary
>(a one-liner on Linux, sketched below).
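>
>The MAC trick, if you need it, is equally small (sketch; the address shown
>is made up, and most drivers want the interface down before changing it):
>
>  # on the secondary, clone the old primary's MAC before taking over its IP
>  ip link set dev eth0 down
>  ip link set dev eth0 address 00:16:3e:12:34:56
>  ip link set dev eth0 up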
>
>The constant rsyncs also serve as a great backup in case the main machine
>dies ungracefully. Rsyncing a live db does not produce a 100% consistent
>copy, but in reality it's pretty close. If you suffer a crash of your main
>server while running, you should simply expect a few bad records and tell
>everyone to verify the last few minutes of work they did. You're lucky that's
>all that's wrong or lost.
>The tape backup that took more than an hour to run is both several hours
>older and far more inconsistent.
>In cases where that's not good enough, more money solves it. Use an external
>raid array and raid cards that allow for multiple servers to mount the same
>array.
>The fact that the array is raid provides redundancy for the array itself, and
>the 2 servers provide redundancy for each other.
>Then the only non redundant part is the enclosure the raid disks are plugged
>into.
>Get an extra enclosure and mount it in the rack right above/below the
>live one and it only takes a minute to swap the cables and move the disks if
>that fails.
>
>Brian K. White -- brian at aljex.com -- http://www.aljex.com/bkw/
>+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
>filePro BBx Linux SCO FreeBSD #callahans Satriani Filk!
>
>
>
>
Brian,
Step #3 ("get everyone/everything out gracefully") reminded me of this
cartoon: http://photos1.blogger.com/blogger/6965/604/1600/miracle.jpg .
;-)
My problem with your suggestion of how to do it is that it is not
necessarily that graceful. The system I propose would keep even recent
data very up to date so that people can move from one server to the other
over a period of time without trying to jam everything in within a
particular time frame. Once they are all off the server you can then go
about your business of updating it and not have to worry about moving
everyone at once.
The advantage your system has is that it is processing-agnostic. You
don't have to worry about developing a program for each iteration of
each file. But you do have to worry about getting everything done
within a certain time frame, which is precisely the time John said he
didn't have. My approach would require a bunch of new coding for each
process to transfer the data and incorporate it into the second server's
database. But you could then set the transfer flag and wait as the
active processes are backed out and re-initiated on the second server.
The users would see a minor and short inconvenience as they backed out
and started up again on the second server, but they would still be
moving along with their operations.
Boaz