Special cases, odd stuff, and Misc trouble-shooting stuff
Cluster? (Not sure whether or not to include an entire dedicated section here.)
Power up sequence is:
dab10 (home directories, bootleg-YP)
dab11, dab12, dab13, dab16, shoveler
spitting-duck, roast-duck, duck-breath
flake, stout
fuster-cluck
fuster-cluck-nodes
Power down order is reverse of power-up.
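Since power-down is just the power-up list read backwards, a small helper can print it so nobody has to reverse it in their head. This is only a sketch: the group list is copied from the sequence above, the function name is made up, and "fuster-cluck-nodes" is treated as a group label rather than a real hostname.

```shell
#!/bin/sh
# Power-up groups, one group per line, in the order listed above.
# "fuster-cluck-nodes" stands for the cluster nodes as a group (assumption).
POWER_UP_ORDER="dab10
dab11 dab12 dab13 dab16 shoveler
spitting-duck roast-duck duck-breath
flake stout
fuster-cluck
fuster-cluck-nodes"

# Print the groups in reverse order, i.e. the power-down order.
power_down_order() {
    reversed=""
    # Read one group per line and prepend it to the result.
    while IFS= read -r group; do
        if [ -z "$reversed" ]; then
            reversed="$group"
        else
            reversed="$group
$reversed"
        fi
    done <<EOF
$POWER_UP_ORDER
EOF
    printf '%s\n' "$reversed"
}
```

Running power_down_order prints one group per line, starting with the cluster nodes and ending with dab10.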
Best way to reboot a system is on the console with
% reboot
Best way to shutdown a system is on the console with
% shutdown -h now
Reset-Button: As a last resort and only when the system is confirmed fully wedged
Check that KVM is still connected
Make sure you can't ssh to the system from elsewhere first
Confirm that system is unresponsive to keyboard (ESC, CTRL-q, CTRL-d, CTRL-z, CTRL-c, CTRL-Alt-Del)
Lots of stuff can break w/ a reset/power cycle.
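The "can't ssh from elsewhere" check above can be scripted from another host. A rough sketch, assuming OpenSSH and key-based logins; the function names and the SSH_CMD override are invented for illustration. A failed ssh only means ssh failed (it could be auth or network trouble), so the KVM and keyboard checks still apply.

```shell
#!/bin/sh
# Return 0 if the host still answers over ssh, non-zero otherwise.
# BatchMode avoids hanging on a password prompt; ConnectTimeout bounds the wait.
# SSH_CMD is a hypothetical override, handy for trying this without a real host.
ssh_reachable() {
    host="$1"
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1
}

# Return 0 only when ssh gets no answer, i.e. the reset button is still on the table.
ok_to_reset() {
    host="$1"
    if ssh_reachable "$host"; then
        echo "$host still answers ssh: shut it down cleanly instead of resetting"
        return 1
    fi
    echo "$host does not answer ssh: check the KVM and keyboard before resetting"
    return 0
}
```

For example, ok_to_reset dab12 run from another machine.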
Critical systems/software/pieces/stuff
dab10: This is a bad system to have trouble with.
/home/sysadm is accessed by almost all the servers. I may do a nightly mirror of it to dab11 or the new S-D.
Might make sense to have scripts be local and have an "if" statement to gracefully fail if /home/sysadm isn't reachable.
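The "if" statement idea might look something like this at the top of a locally installed script. It is a sketch only: the function name is made up, and a hung NFS mount can still block the directory test itself, so this covers the "not mounted" case rather than every failure.

```shell
#!/bin/sh
# Return 0 if the shared directory exists and is readable.
# Defaults to /home/sysadm; another path can be passed for testing.
sysadm_reachable() {
    dir="${1:-/home/sysadm}"
    [ -d "$dir" ] && [ -r "$dir" ]
}

# Example guard for a local script (left commented out here):
# if ! sysadm_reachable; then
#     echo "/home/sysadm is not reachable, skipping this run" >&2
#     exit 0
# fi
```

Exiting 0 in the guard keeps cron from mailing an error every night while dab10 is down; exit non-zero instead if you want the noise.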
Systems look for new version of stuff under /home/sysadm/configs/etc/ typically stuff like passwd, group, auto_home...
Typically called from a crontab entry similar to this: 52 23 * * * /home/sysadm/bin/fc3pull-configs 1 >/var/log/update-configs.out 2>&1
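The "pull it if the shared copy is newer" behavior can be sketched as below. This is not the contents of fc3pull-configs, just an illustration of the idea; the function name is hypothetical, and the -nt test is a common shell extension (bash, dash, ksh) rather than strict POSIX.

```shell
#!/bin/sh
# Copy src over dst only when src is newer than dst, or dst does not exist yet.
# Returns 1 if the shared copy cannot be seen (e.g. /home/sysadm not mounted).
pull_if_newer() {
    src="$1"
    dst="$2"
    [ -f "$src" ] || return 1
    if [ ! -e "$dst" ] || [ "$src" -nt "$dst" ]; then
        # -p keeps the timestamp so the next run sees the copies as in sync.
        cp -p "$src" "$dst"
    fi
}
```

A call would look like pull_if_newer /home/sysadm/configs/etc/group /etc/group, where the /etc destination is an assumption about where the local copy lives.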
dab10 and dab11 are identical systems so you can start swapping hardware if things are really bad.
If you suspect hardware, power down both systems and swap all hard drives (OS: top two left, RAID: bottom 3 rows) and power back up, watching for behavior.
License and software services:
RSI (IDL/Envi):
The cluster currently gets licenses from dab11 and shoveler. /etc/init.d/sys5_idl_lmgrd is the startup script but has to be able to see /home/rsi/licenses in order to fire up.
128.111.110.80 (new sd) is running a license server for the newer version of IDL/Envi but is not public yet.
The farming out occurs when people use /home/rsi/idl/bin/idl; I altered that script just a little.
ESRI (ArcGIS):
The license server for ICESS and Bren is running on esrilic.icess.ucsb.edu (hutch.icess.ucsb.edu).
The wad sits on /home/data35/esri_license and the startup script is in /sbin/rc3.d/S56arclicense-[8,9], gotta be ROOT to muck w/ this.
Matlab:
Individual installations w/ standalone license files. Install wad is under /home/software/Mathworks_Matlab/...
WWW: roast-duck.bren.ucsb.edu, /usr and /var/log mirrored nightly to /home/sysadm/systems/roast-duck/
swiki: sd.bren.ucsb.edu, /usr/local mirrored nightly to /home/sysadm/systems/sitting-duck
smb: sd.bren.ucsb.edu, /etc/samba mirrored nightly to /home/sysadm/systems/sitting-duck/etc/
MediaWiki: currently zonbi, will be new sd.bren.ucsb.edu
SSH: sd.bren.ucsb.edu
DB services: Mostly on duck-breath... Peter probably knows more than I do on this topic.
Cluster: fuster-cluck.bren.ucsb.edu is the head node running ROCKS. It is pretty much stand-alone and must be managed/updated separately w/ user-adds, auto.home, etc.
Backups:
...are run over to ICESS on the BUBs there and are called from root crontab entries on each BUB pointing at /home/sysadm/bin/Rsyncerz/rsync-bub##
There used to be a 1-1 mapping of DAB to BUB but money got tight so Colee got creative.
You can typically count on there being a full mirror of everything on one of the bubs, good way to figure out where:
grep user69 /home/sysadm/bin/Rsyncerz/rsync-bub will point you at where user69 is being mirrored to. Commented entries are old locations.
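To skip the commented-out (old) locations automatically, the grep can be wrapped like this. A sketch only: the function name is made up, and it assumes the scripts are named rsync-bub followed by a number and that old entries start with "#".

```shell
#!/bin/sh
# Show live (uncommented) lines mentioning a user in the rsync-bub scripts.
# The second argument overrides the script directory, mainly for testing.
find_mirror() {
    user="$1"
    dir="${2:-/home/sysadm/bin/Rsyncerz}"
    # Adding /dev/null forces grep to print file names even for a single match.
    grep -- "$user" "$dir"/rsync-bub* /dev/null 2>/dev/null |
        grep -v '^[^:]*:[[:space:]]*#'
}
```

find_mirror user69 then prints file:line pairs, and the file name tells you which BUB holds the mirror.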
If a system is not going to be down for a while I try not to point at the mirror simply cuz re-syncing tends to be a pain.
If the system is still up and just the file-system is having trouble you can just create a link to the mirror:
ln -s /home/bub13/user69 /ed10/ ##/ed10/ should just be an empty dir where the filesystem is normally mounted.
Otherwise you'll need to re-set the auto.home tables at ICESS and on the Colee systems.
You'll need Jason or one of the ICESS admins to do the ICESS part of the dance. The Colee systems are updated by updating dab10, which has a crontab entry that updates /home/sysadm/configs/etc/; most Colee hosts have crontab entries that pull from there when it has a newer version than they do locally.
Remember to comment out the backup of that chunk if you're pointing at the mirror/BUB, because otherwise the backup ends up pointing at itself.
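For the symlink approach above, a slightly more careful version checks that the mount point really is an empty directory before linking into it. This is a sketch with a made-up function name; the paths in the usage line are the ones from the ln example.

```shell
#!/bin/sh
# Link a mirror directory into an (empty) mount point directory.
# Refuses to do anything if the mount point still has content in it.
link_to_mirror() {
    mirror="$1"
    mountpoint="$2"
    [ -d "$mirror" ] || { echo "mirror not found: $mirror" >&2; return 1; }
    [ -d "$mountpoint" ] || { echo "no such dir: $mountpoint" >&2; return 1; }
    if [ -n "$(ls -A "$mountpoint")" ]; then
        echo "$mountpoint is not empty, is the filesystem still mounted?" >&2
        return 1
    fi
    # Same effect as: ln -s /home/bub13/user69 /ed10/
    ln -s "$mirror" "$mountpoint"/
}
```

Usage would be link_to_mirror /home/bub13/user69 /ed10.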
BUB13 is a frankenstein box, and it looks like one. If it's powered off you'll need to power it up, wait for the RAID to choke (not all the hard drives spin up before it tries to mount them), and then just hit the "RESET" button; the system should come up fine after that.
PCz are also backed up in a similar fashion from /home/sysadm/bin/Rsyncerz/rsync-pcz, currently to bub13.icess.ucsb.edu
This is a somewhat complicated process: you need to share the desired chunk to the "cifs" user and then make an entry in /etc/auto.home for the appropriate PC being backed up.
Exclusions: The rsync scripts all accept exclusion strings, called from the "rsyncer" chunk and generally included in /home/sysadm/bin/Rsyncerz/exclude-, handy for blocking stuff you don't want to back up.
Special cases, Odd stuff, and Misc crap I can't find a better place to put:
/usr/bin/vib: a simple kludge for making a backup of a file in its directory/.bk/filename.YYYYMMHHMM
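The backup half of that idea can be sketched as follows. This is not the actual vib script: the function name is made up, and the timestamp here is year-month-day-hour-minute, which is an assumption about what the real suffix looks like.

```shell
#!/bin/sh
# Copy a file into a .bk directory next to it, with a timestamp suffix.
backup_file() {
    file="$1"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    dir=$(dirname "$file")
    name=$(basename "$file")
    # Assumed stamp layout: YYYYMMDDHHMM
    stamp=$(date +%Y%m%d%H%M)
    mkdir -p "$dir/.bk" || return 1
    # -p preserves the original mode and modification time on the copy.
    cp -p "$file" "$dir/.bk/$name.$stamp"
}
```

Calling backup_file /etc/auto.home before editing leaves a copy under /etc/.bk/.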
Replacing a failed RAID disk:
Pull the hotswap cartridge
Hope you've got a similar sized hard drive lying around to replace the buggered one (Colee has a drawer of drives in his office, top left hand cabinet).
Swap the drives in the tray w/ a Phillips screwdriver
Re-insert the hotswap cartridge w/ the new drive.
Go to the 3Ware management page... (finish this bit)
Systems last edited on 18 July 2007 at 3:40 pm by dab10