SYSMGR

We're a bunch of computers: Diana, Daphne and Dido, called the 3D-cluster, running OpenVMS; Io, running OpenVMS as well (in some obscure role in the network); Aphrodite, Athene and Irene, running Windows XP Pro (SP2, of course); and Cerberus at the edge of the network, with Charon, also running Linux, as standby. SYSMGR takes care of us.

Wednesday, May 25

24-May-2005

More new hardware,...
The fax/scanner/printer has been installed: the connector used proved to be bad; after it was replaced, the machine could be configured - and used.
A new DVD burner has been installed in Aphrodite - the one that was inherited from Venus could neither read nor write CDs - DVDs were no problem, though. Still have to see whether the new one will; in case it doesn't - well, leave it in: it's able to record double-layer DVDs (8.5 GB!) and, with the right brand of disc, it can write the label directly onto the disc. Looks VERY professional.
... a new router,...
Charon - the Internet router - is an old PC (486 SX @ 4Mhz, 20 MB of memory and a 170 MB disk) that does its job very well, but since it's a PC, it could well be replaced by a modern, low-power, dedicated router. So in comes Cerberus, a micro-router built by a Cisco subsidiary, that will take Charon's place. It has been set up - basically, and, of course, with some errors - but that is a minor issue: easy to correct before adding it to the network. Charon will be kept as a backup.
... and a change of configuration
With this, the DHCP configuration on Diana has been changed to accommodate clustering of DNS, DHCP, WEB, SMTP and other protocols. All services are now defined as residing on the cluster address, 192.168.0.200, whichever machine holds that cluster IP address. Normally, it will be Diana, but when needed, Io, Daphne or another VMS box will take over (there's still that NT-Alpha waiting....). Cerberus will pass all traffic to the cluster - so it's all set up now. Just have to get it in place.

Monday, May 23

23-May-2005, afternoon

No changes but hardware additions
Last weekend I did some hardware additions to the machine park. The place on the desk that was taken by Io and Daphne - even stacked upon each other - was desperately needed for this, so they have been disconnected for a while. They will be stacked on top of, or under, Charon, which is about the same size (old-fashioned desktop). That may require a re-shuffle of the whole location, because the Epson inkjet can no longer be placed there. Once the space was cleared, a fax/print/scanner machine - AKA all-in-one office machine - was installed; it is to be connected to Aphrodite, so reshuffling the desk space would be a good idea anyway. It does copy and fax, but cannot yet be accessed by Aphrodite, probably because the cabling has a problem (connector seems not to fit...)
Since Aphrodite's motherboard incorporates an on-board 7.1 sound system, a speaker system to make use of it has been installed and configured. Again, there is actually too little room on the system table. Again, a good reason for re-shuffling the machinery.
Finally, the on-board video of Aphrodite constitutes a problem when playing games. Even Tomb Raider doesn't run properly. So added a new video card - pretty straightforward - and installed the accompanying software - including a game. Well, not all is fine, since Aphrodite may freeze all of a sudden, mouse and keyboard dead, so that only a hard reset (or cycling power) frees the machine. Yet another thing to find out!
Published a new site; it has been in DNS for months but now it can be accessed: genealogy.grootersnet.nl. For privacy reasons, it has been split into a public, more restricted site (no data on living persons unless explicitly allowed) and a researcher's site (containing all data in the database). Neither of them is mentioned in robots.txt so the good bots won't scan them.
The layout doesn't match standards but that's a limitation of the package used. Could be redone manually to match the standards, but it's a lot of pages. The old pages on www.grootersnet.nl/genealogy are maintained for the time being; need to link both together.
Well, time to create a database and software on the VMS boxes. That is: Diana, though Io and Daphne could run RDB as well; there is just enough space available. But that requires investigation on where and how - since disk space is limited on those two. (Ok, Daphne contains two 1 GB disks and one 2 GB disk, and Io will require the pizzabox, but whether that is sufficient for the user environment (mail!), the webs, AND a database is to be investigated).

That's about all. Diana found nothing particular but a single DUMP attempt on anonymous FTP (failed, of course) and several spam attempts (both relaying and dumping into the domain) that failed again. Time to run dig on all the spammer data, as well as analyzing the access on the webserver and router. (Need to automate it all and publish the results - on the public OpenVMS page, of course.)

Thursday, May 19

18-May-2005

It works!
Set the VOTES and EXPECTED_VOTES parameters as determined: EXPECTED_VOTES = 5 on all three boxes, VOTES = 3 on Diana and VOTES = 1 on both Io and Daphne, and rebooted all machines after autogen. Now Diana can run without Io and Daphne.
Autogen on Diana was a bit troublesome since it was the first autogen after I removed the 256Mb of memory months ago, so it was not just the cluster parameters that had to be adjusted. Though, in the end, it works!

Did some adjustments on the genealogy website - still some work to do there, but fellow researchers are now added (and could use Diana as a mailserver, but that has not yet been communicated).

Found out not all is 100% there but it's of minor concern.

There is a program that was requested to be migrated from Alpha to Itanium - not yet done, and it's on the new freeware CD. Something to do; there is another porting workshop next week. Good exercise before the bootcamp! Maybe on the company RX2600?

Wednesday, May 18

17-May-2005

Calculating VOTES
Just an idea....
When both Io and Daphne leave the cluster, Diana will recognize "quorum loss" and therefore suspend all processing (even signalling the fact...). So, to prevent this, quorum loss is to be prevented. But how...
Well, if Diana had so many votes that quorum is not lost! It's the algorithm used by VMS that gives the answer:
Quorum is calculated as the maximum of the following calculations:
On boot: Estimated quorum = (EXPECTED_VOTES + 2)/2 Rounded down
On entering, or leaving the cluster: QUORUM = (total of all VOTES + 2)/2 Rounded down
The last calculated quorum (when a node has left the cluster) is observed as well.

If the current number of cluster votes drops below the quorum value (because of computers leaving the cluster), the system will hang.
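The rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as described here, not of what VMS does internally; the function names are made up for this example, and the three-node cluster with one vote each is a hypothetical configuration.

```python
def estimated_quorum(expected_votes):
    """Quorum estimate at boot: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def quorum_from_votes(member_votes):
    """Quorum on a membership change: (total of all VOTES + 2) / 2, rounded down."""
    return (sum(member_votes) + 2) // 2


def cluster_quorum(expected_votes, member_votes, last_quorum=0):
    """The maximum of the two calculations and the last calculated quorum."""
    return max(
        estimated_quorum(expected_votes),
        quorum_from_votes(member_votes),
        last_quorum,
    )


def is_hung(present_votes, quorum):
    """The system hangs when the votes still present drop below quorum."""
    return sum(present_votes) < quorum


# Hypothetical cluster: three nodes, one vote each, EXPECTED_VOTES = 3.
quorum = cluster_quorum(3, [1, 1, 1])
print(quorum)                   # quorum for this hypothetical cluster
print(is_hung([1, 1], quorum))  # one node gone
print(is_hung([1], quorum))     # two nodes gone
```

In this hypothetical set-up the quorum is 2, so one node may leave, but the last remaining node hangs once the second one has gone.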

So: If Diana gets VOTES = Sum of the votes of all other members + 1, would that be good?

DIANA gets VOTES = 3, the others get VOTES = 1. EXPECTED_VOTES would be 3 + 1 + 1 = 5 on all.
So quorum, on boot, would be (5 + 2)/2 = 3.5, rounded down to 3, for all. Diana would run, and neither Io nor Daphne would, unless Diana is running (so they can join the cluster). But Diana would still have quorum if both Io and Daphne left the cluster. Just what is requested.
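To double-check the idea, a small Python sketch can list every combination of running machines that would still hold quorum under this scheme. It is a simplification: it assumes the quorum stays fixed at the value derived from EXPECTED_VOTES and ignores any dynamic recalculation VMS may do; the function name is made up for this example.

```python
from itertools import combinations

# Vote scheme from the text: Diana 3, the others 1 each.
VOTES = {"DIANA": 3, "IO": 1, "DAPHNE": 1}
EXPECTED_VOTES = sum(VOTES.values())   # 3 + 1 + 1 = 5
QUORUM = (EXPECTED_VOTES + 2) // 2     # (5 + 2) / 2, rounded down = 3


def surviving_sets(votes, quorum):
    """Return every set of running nodes whose combined votes reach quorum."""
    names = sorted(votes)
    result = []
    for size in range(1, len(names) + 1):
        for members in combinations(names, size):
            if sum(votes[name] for name in members) >= quorum:
                result.append(members)
    return result


for members in surviving_sets(VOTES, QUORUM):
    print(" + ".join(members))
```

Every combination printed contains DIANA, and DIANA on its own is among them: Diana keeps quorum alone, while Io and Daphne together (2 votes) do not reach it.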

To have Io and Daphne run - stand-alone, or in a cluster of themselves, would require a second scheme. No problem: define [SYS1] and configure this in there...

Will try this....

Tuesday, May 17

16-May-2005

More cluster issues
Io and Daphne still cause some trouble - Diana hangs if they shut down; it has to do with the VOTES and EXPECTED_VOTES parameters. Tried to set both Io and Daphne to non-voting members (VOTES := 0) but that implies they boot from Diana. But Diana's [SYS0] directory is not accessible, so startup crashes. Non-clustered startup (hooked off the network) causes no problems, it just takes a long time. Quite obvious since it will wait for connection (normally quick), or it will time out (and that waiting period is set too high).
So restarted Io stand-alone, practicing some patience, and set VOTES back to 1. Autogen'd and rebooted. On the next boot into the cluster, the connection to Diana succeeded - and the Io startup crashed again since its system disk is still known by Diana. Diana accepted Io's request - and is now hung because Io's startup failed.....AHHH. Had to cycle power on Diana because ^P doesn't work (no console - it's a workstation WITH graphics...). Afterwards, rebooted Io and Daphne and all is well. Shut down Daphne, kept Io running to prevent Diana hanging again....

Why Io running and not Daphne? Well, Daphne gets VERY, VERY hot (burnt my fingers on its disks when trying to re-jumper them....) and Io seems to stay pretty cool.

Now the problem is: how to keep Diana running when the other two can be switched off...

Monday, May 9

09-May-2005

Yesterday's problem shows in log
As it turned out this morning when examining DIANA's operator log, yesterday's problems were indeed a cluster hang - but it only shows in the operator log after an attempt to do something.
DAPHNE and IO were shut down from a terminal session from HERA:

%%%%%%%%%%% OPCOM 8-MAY-2005 07:51:27.73 %%%%%%%%%%% (from node DAPHNE at 8-MAY-2005 07:51:50.79)
Message from user TCPIP TELNET on DAPHNE
TELNET Logout Request from Remote Host: hera.intra.grootersnet.nl Port: 1704

%%%%%%%%%%% OPCOM 8-MAY-2005 07:51:45.96 %%%%%%%%%%%
Message from user INTERnet on DIANA
TELNET Login from Host: hera.intra.grootersnet.nl Port: 1705

%%%%%%%%%%% OPCOM 8-MAY-2005 07:52:46.08 %%%%%%%%%%%
Message from user TCPIP TELNET on DIANA
TELNET Logout Request from Remote Host: hera.intra.grootersnet.nl Port: 1705

At this point, I found that DIANA did not respond and started IO. (Note that the next line immediately follows the previous one! There is NOTHING in between - not even a time stamp)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
OPCOM on DIANA recognizes node IO, csid 00010011, system 65532
Attempting to establish communications, placing node in STARTING state
.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
OPCOM on DIANA is deactivating DAPHNE, csid 0001000C, system 65533
Node is no longer with us, placing node in DEPARTED state.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
07:58:30.58 Node DIANA (csid 00010008) lost connection to node DAPHNE

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
07:58:35.61 Node DIANA (csid 00010008) lost quorum, blocking activity

yea, right!
Note that these cluster messages carry the time at which the event occurred (since these are CLUSTER messages), but they only show up once activity is resumed! The system had been hung for more than 11 hours.....

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) timed-out lost connection to node DAPHNE

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) proposed reconfiguration of the VMScluster

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DAPHNE (csid 0001000C) has been removed from the VMScluster

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) completed VMScluster state transition

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA0: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA100: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $2$DKA0: (IO)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA400: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
19:36:41.84 Node DIANA (csid 00010008) received VMScluster membership request from node IO

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
19:36:41.84 Node DIANA (csid 00010008) proposed addition of node IO

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
19:36:44.43 Node DIANA (csid 00010008) completed VMScluster state transition

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
19:36:44.49 Node DIANA (csid 00010008) regained quorum, proceeding

And now we're back in business!
SMTP starts getting all its postponed messages (this is the first of them)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
Message from user INTERnet on DIANA
INTERnet ACP SMTP Accept Request from Host: ....

And further on - in parallel, so to say - IO is waking up:

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:41.45 %%%%%%%%%%%
OPCOM on DIANA is trying again to talk to IO, csid 00010011, system 65532

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:41.48 %%%%%%%%%%%
OPCOM on DIANA is activating IO, csid 00010011, system 65532
Have established communications, placing node in ACTIVE state.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:48.95 %%%%%%%%%%% (from node IO at 8-MAY-2005 19:37:48.80)
Message from user SYSTEM on IO
%ACME-I-SERVERSTART, ACME_SERVER starting
...

Time to find out what settings are appropriate for VOTES and EXPECTED_VOTES....
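How long the hang lasted can be read from the two quorum messages in the log above. A minimal Python sketch, assuming both messages fall on the same day (as they do here) and that the lines look exactly like the ones quoted; the function name and the regular expression are made up for this example:

```python
import re
from datetime import datetime, timedelta

# The two cluster messages, copied from the operator log above.
LOG = """\
07:58:35.61 Node DIANA (csid 00010008) lost quorum, blocking activity
19:36:44.49 Node DIANA (csid 00010008) regained quorum, proceeding
"""

# Event time, node name, and whether quorum was lost or regained.
PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{2}) Node (\w+) \(csid [0-9A-F]+\) "
    r"(lost|regained) quorum"
)


def quorum_outage(log_text, node):
    """Return the time between 'lost quorum' and 'regained quorum' for node.

    Assumes both events happen on the same calendar day; returns None
    when either message is missing.
    """
    lost = regained = None
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if not match or match.group(2) != node:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if match.group(3) == "lost":
            lost = stamp
        else:
            regained = stamp
    if lost is None or regained is None:
        return None
    return regained - lost


outage = quorum_outage(LOG, "DIANA")
print(outage)                      # 11:38:08.880000
print(outage > timedelta(hours=11))
```

So DIANA was blocked from 07:58:35 until 19:36:44, a little over eleven and a half hours.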

08-May-2005

All is well - no, not entirely....
Did IO fully yesterday, had both new systems up and running last night but shut them down this morning at about 8:00 - after I found that the tape unit is not seen by VMS. Funny - to be investigated later.
Later I was working on the laptop - wirelessly connected - but DHCP didn't work; DNS did, but TELNET didn't either. PING revealed Diana was up and running.... But the machine had apparently hung - not able to resume a still-running session on a terminal! So IO was started - it _could_ be a cluster hang again, as before. And yes: once IO was up and connected to form a clustered system with DIANA, those problems were solved. Login (that is: session resume after entering password), DHCP, TELNET et al. did work again. Why DNS didn't stop - could have been the cache on Charon....
Dismounted all external devices from IO and let it run so DIANA wouldn't hang again, but with minimal power consumption.
Typically a cluster problem with the VOTES and EXPECTED_VOTES sysgen parameters. To be looked at in the manuals....There is no problem here with split clusters so DIANA must be set up to be a one-node cluster - I don't care for the other machines. Either one can be running alongside DIANA - or none. At least, for the moment.
Tape problems on IO (and probably DAPHNE as well) to be examined. Could be an 8.2 issue, since it has worked using 7.3-2....There is a backup as proof.

Saturday, May 7

07-May-2005

Finally - all done
Today, I took the AlphaStation200 that runs VMS and re-jumpered the disks so it fits my setup.
Tried the last written CD of 8.2; it looks good but fails to boot - again. Took the one 1.05 disk from the pizzabox to hold 8.2 and installed from there. Ok, it takes some time - but this machine - named DAPHNE - is now a VMS 8.2 machine in the cluster!
Set the whole thing up - that is, startup files copied and adapted from DIANA - and that works smoothly as well. Did the major setup of TCPIP, but not all of it, since it still needs to be determined how to set up DNS, DHCP, POP et al. in the cluster.
Did IO as well - the same thing, without a problem. That is: VMS 8.2 installed, put it in cluster but setup needs to be finished still.

Friday, May 6

06-May-2005

Getting VMS 8.2
Now I know where to get it (since we were a field test site, we did have an account - and it is still valid!) I downloaded VMS 8.2, the layered products and OpenSource tools. Placed them on empty disks on DIANA, unzipped them there - gave me .BCK-files. On another disk I created two containers (Logical Disk) - one for VMS and one for the layered products - and restored the savesets on these disks. First, tried to copy the container to a PC and burn it as an image, but VMS fails to boot - problem with the disk label, for instance. Next, used the DFY$VMSCD program to create an image of a bootable CD-ROM. Time's up - see tomorrow if it will boot.

Thursday, May 5

05-May-2005

Building the cluster
Today the DEC3000 was set up as a cluster member, using VMS 7.3-2 - a clean install, no upgrade, on the smaller disk (1.05 Gb). Took defaults, installed outside the network; the machine is named IO. Comes up nicely after reboot. Defined it to be a cluster member, shut down, connected to the network and booted.
But ALAS.
Since I used the default label for the system disk, the system tried to connect to the cluster but found the same label already mentioned elsewhere: on Diana (the only other one at the moment), causing IO to bugcheck and Diana to hang - and that required a power-cycle to get it up and running again...
To be honest - I had the same trouble when building the cluster in the course last year....Beginner's mistake. So I relabelled IO's system disk and booted. HURRAY! I GOT MY CLUSTER!

Sunday, May 1

30-Apr-2005

New Old Stuff???
Today finally some time to look at the new old Alphas: two AlphaStation 200 4/166, one DEC 3000 M300LX, one BA353 (AKA Pizzabox), a few disks (SCSI, all 10 Mb/s SCSI) of which one cannot be read, and a genuine VT320 - to be used as console!
Started examining; had to break in, since the set-up involved changes of SYSUAF on startup - copied from another disk.....
In the end, found that one AlphaStation 200 runs NT (!), the others run VMS - 7.1-2 has been installed on both. Firmware is the latest - so that's no problem, but disks need to be rejumpered. No problem - since VMS will be re-installed. 7.3-2 to test clustering, and probably 8.2 as final system...