This page documents the possible single points of failure of T2 and T3, along with relevant notes.
T2 Headnode
We host a number of services on this server. Most of them have no redundancy, but in principle we should be able to keep operations functional for about 6h even without it. The only potential pitfall is that the old nodes that still rely on DHCP will lose their IP leases; by now they should be so few that we shouldn't have a problem for days.
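One way to reduce the impact on the remaining DHCP clients is to hand out long leases, so nodes keep their addresses even while the server is down. A minimal sketch of what that could look like in an ISC dhcpd.conf (the lease times and subnet below are illustrative assumptions, not our actual values):

    # Long leases so clients survive a multi-day dhcpd outage
    # (values and subnet are illustrative, not the production config)
    default-lease-time 604800;    # 7 days
    max-lease-time     1209600;   # 14 days

    subnet 10.0.0.0 netmask 255.255.0.0 {
        range 10.0.1.1 10.0.1.254;
        option routers 10.0.0.1;
    }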
The complete list of services is:
- Puppet master + associated stuff
- Foreman server (also used by T3)
- NFS for /share/apps
- DHCP
- Private network DNS
- TFTP
- Ganglia
It's very likely that when /share/apps goes down, most of the monitoring (Zabbix) checks will stop working, since most of the scripts live there. Beyond that, I don't think any system actually depends on the headnode to keep working.
GUMS
GUMS2 lives in the same rack as phedex-devel, in one of the Twin servers. If it goes down, pretty much all Grid authentication will fail: no new jobs will arrive, and all SRM/GridFTP transfers will fail as well. In principle there are ways to make GUMS highly available if you also set up HA for the MySQL database under it. I suspect we need a multi-master setup, since a read-only replica wouldn't work for allocating new DNs.
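As a sketch of what the multi-master piece could look like (hostnames, server ids and offsets below are illustrative assumptions, not an existing setup), each of the two MySQL servers would write its own binary log and use offset auto-increment values so that inserts on either side never collide:

    # my.cnf fragment on the first GUMS database server
    # (the second server would use server-id = 2 and auto_increment_offset = 2)
    [mysqld]
    server-id                = 1
    log-bin                  = mysql-bin
    log-slave-updates        = 1
    auto_increment_increment = 2    # two masters in total
    auto_increment_offset    = 1    # unique per master

Each server is then pointed at the other with CHANGE MASTER TO, and GUMS can fail over to whichever node survives.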
Hadoop NN
This one is pretty obvious: if the NameNode goes down, the whole DFS stops working, which usually has consequences everywhere. The good news is that the version of HDFS we're running right now is the first one to support an HA NameNode setup. Testing and commissioning it properly is not a quick task, and the switchover involves downtime, so it is probably best to reserve a big chunk of time to do things properly here. Getting rid of this single point of failure will be a game changer for how we operate the cluster.
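For reference, the HA NameNode setup in the HDFS 2.x line is driven by hdfs-site.xml roughly along these lines (the nameservice name, hostnames and JournalNode quorum below are assumptions for illustration, not our real values):

    <!-- Illustrative hdfs-site.xml fragment for a two-NameNode HA pair -->
    <property><name>dfs.nameservices</name><value>t2cluster</value></property>
    <property><name>dfs.ha.namenodes.t2cluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.t2cluster.nn1</name><value>namenode1:8020</value></property>
    <property><name>dfs.namenode.rpc-address.t2cluster.nn2</name><value>namenode2:8020</value></property>
    <!-- shared edit log kept on a JournalNode quorum -->
    <property><name>dfs.namenode.shared.edits.dir</name>
              <value>qjournal://jn1:8485;jn2:8485;jn3:8485/t2cluster</value></property>
    <property><name>dfs.client.failover.proxy.provider.t2cluster</name>
              <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
    <!-- automatic failover additionally needs a ZooKeeper quorum (ha.zookeeper.quorum in core-site.xml) -->
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

Clients then address the filesystem as hdfs://t2cluster instead of a single NameNode host, which is what makes the failover transparent to them.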
-- Main.samir - 2015-04-27