SYSMGR

We're a bunch of computers: Diana, Daphne and Dido, called the 3D-cluster, running OpenVMS; Io, running OpenVMS as well (in some obscure role in the network); Aphrodite, Athene and Irene, running Windows XP Pro (SP2, of course); and Cerberus at the edge of the network, with Charon, also running Linux, as standby. SYSMGR takes care of us.

Monday, May 9

09-May-2005

Yesterday's problem shows in log
Examining DIANA's operator log this morning confirmed that yesterday's problems were indeed a cluster hang, but the hang only shows up in the operator log after I attempted to do something.
DAPHNE and IO were shut down from a terminal session from HERA:

%%%%%%%%%%% OPCOM 8-MAY-2005 07:51:27.73 %%%%%%%%%%% (from node DAPHNE at 8-MAY-2005 07:51:50.79)
Message from user TCPIP TELNET on DAPHNE
TELNET Logout Request from Remote Host: hera.intra.grootersnet.nl Port: 1704

%%%%%%%%%%% OPCOM 8-MAY-2005 07:51:45.96 %%%%%%%%%%%
Message from user INTERnet on DIANA
TELNET Login from Host: hera.intra.grootersnet.nl Port: 1705

%%%%%%%%%%% OPCOM 8-MAY-2005 07:52:46.08 %%%%%%%%%%%
Message from user TCPIP TELNET on DIANA
TELNET Logout Request from Remote Host: hera.intra.grootersnet.nl Port: 1705

At this point, I found that DIANA did not respond and started IO. (Note that the next line immediately follows the previous one! There is NOTHING in between, not even a time stamp.)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
OPCOM on DIANA recognizes node IO, csid 00010011, system 65532
Attempting to establish communications, placing node in STARTING state.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
OPCOM on DIANA is deactivating DAPHNE, csid 0001000C, system 65533
Node is no longer with us, placing node in DEPARTED state.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
07:58:30.58 Node DIANA (csid 00010008) lost connection to node DAPHNE

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
07:58:35.61 Node DIANA (csid 00010008) lost quorum, blocking activity

Yeah, right!
Note that these cluster messages carry the time the event occurred (since these are CLUSTER messages), but they only show up in the log once activity is resumed! The system had been hung for more than 11 hours...
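The difference between the two time stamps in such an entry can be worked out mechanically. What follows is only a minimal sketch in Python, not a tool that was used here: the function name `logging_delay` and the regular expressions are my own, and it assumes that the event happened on the same calendar day it was logged, or the day before.

```python
import re
from datetime import datetime, timedelta

# Two-line OPCOM entry copied from the operator log above.
ENTRY = """\
%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.56 %%%%%%%%%%%
07:58:35.61 Node DIANA (csid 00010008) lost quorum, blocking activity
"""

MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

# Time at which OPCOM wrote the entry (header line).
HEADER_RE = re.compile(
    r"OPCOM\s+(\d{1,2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{2})")
# Time at which the cluster event occurred (start of the message line).
EVENT_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{2}) Node ", re.MULTILINE)


def logging_delay(entry: str) -> timedelta:
    """Return how long after the cluster event the entry was logged."""
    day, mon, year, hh, mm, ss, cc = HEADER_RE.search(entry).groups()
    logged_at = datetime(int(year), MONTHS[mon], int(day),
                         int(hh), int(mm), int(ss), int(cc) * 10_000)

    eh, em, es, ec = (int(x) for x in EVENT_RE.search(entry).groups())
    occurred_at = logged_at.replace(hour=eh, minute=em, second=es,
                                    microsecond=ec * 10_000)
    # Assumption: the message carries no date, so an event time later than
    # the logging time is taken to belong to the previous day.
    if occurred_at > logged_at:
        occurred_at -= timedelta(days=1)
    return logged_at - occurred_at


print(logging_delay(ENTRY))  # prints 11:38:08.950000
```

For the "lost quorum" entry this gives a delay of a little over eleven and a half hours between the event and its appearance in the log.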

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) timed-out lost connection to node DAPHNE

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) proposed reconfiguration of the VMScluster

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DAPHNE (csid 0001000C) has been removed from the VMScluster

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
07:58:51.10 Node DIANA (csid 00010008) completed VMScluster state transition

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA0: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA100: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $2$DKA0: (IO)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
Mount verification has aborted for device $3$DKA400: (DAPHNE)

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
19:36:41.84 Node DIANA (csid 00010008) received VMScluster membership request from node IO

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.65 %%%%%%%%%%%
19:36:41.84 Node DIANA (csid 00010008) proposed addition of node IO

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
19:36:44.43 Node DIANA (csid 00010008) completed VMScluster state transition

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
19:36:44.49 Node DIANA (csid 00010008) regained quorum, proceeding

And now we're back in business!
SMTP starts receiving all its postponed messages (this is the first of them):

%%%%%%%%%%% OPCOM 8-MAY-2005 19:36:44.69 %%%%%%%%%%%
Message from user INTERnet on DIANA
INTERnet ACP SMTP Accept Request from Host: ....

And further on, in parallel so to speak, IO is waking up:

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:41.45 %%%%%%%%%%%
OPCOM on DIANA is trying again to talk to IO, csid 00010011, system 65532

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:41.48 %%%%%%%%%%%
OPCOM on DIANA is activating IO, csid 00010011, system 65532
Have established communications, placing node in ACTIVE state.

%%%%%%%%%%% OPCOM 8-MAY-2005 19:37:48.95 %%%%%%%%%%% (from node IO at 8-MAY-2005 19:37:48.80)
Message from user SYSTEM on IO
%ACME-I-SERVERSTART, ACME_SERVER starting
...

Time to find out what settings are appropriate for VOTES and EXPECTED_VOTES...
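As a starting point, OpenVMS derives the quorum from EXPECTED_VOTES as (EXPECTED_VOTES + 2) / 2, rounded down, and a node blocks activity when the votes of the members it can see fall below that value. The Python sketch below only illustrates that arithmetic. The vote assignments in it are hypothetical, not the actual settings of these nodes, and it ignores quorum disks and the way a running cluster adjusts its quorum dynamically.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """True when the visible members contribute enough votes to proceed."""
    return votes_present >= quorum(expected_votes)


# Hypothetical setup: three voting members with VOTES = 1 each,
# so EXPECTED_VOTES = 3 (NOT taken from the real configuration).
EXPECTED_VOTES = 3

for members in (["DIANA"], ["DIANA", "IO"]):
    votes = len(members)  # one vote per member in this hypothetical setup
    state = "proceeding" if has_quorum(votes, EXPECTED_VOTES) else "blocking activity"
    print(f"{'+'.join(members)}: {votes} of quorum {quorum(EXPECTED_VOTES)} -> {state}")
```

Under these assumed values a single remaining node blocks, and a second voting node joining is enough to proceed, which is the same pattern the log shows when IO came back.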
