This page documents the possible single points of failure of T2 and T3, along with relevant notes.
T2 Headnode
We host a number of services on this server. Most of them have no redundancy, but in principle we should be able to keep operations functional for about 6h even without it. The only potential pitfall is that the old nodes that still rely on DHCP will lose their IP leases; by now they should be so few that we shouldn't have a problem for days.
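One way to reduce the impact on the remaining DHCP clients is to hand out long leases, so nodes keep their addresses even while the server is down. A minimal sketch of what that could look like in an ISC dhcpd.conf (the lease times and subnet below are illustrative assumptions, not our actual values):

    # Long leases so clients survive a multi-day dhcpd outage
    # (values and subnet are illustrative, not the production config)
    default-lease-time 604800;    # 7 days
    max-lease-time     1209600;   # 14 days

    subnet 10.0.0.0 netmask 255.255.0.0 {
        range 10.0.1.1 10.0.1.254;
        option routers 10.0.0.1;
    }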
The complete list of services is:
- Puppet master + associated stuff
- Foreman server (also used by T3)
- NFS for /share/apps
- DHCP
- Private network DNS
- TFTP
- Ganglia
It's very likely that when /share/apps goes down, most of the monitoring (Zabbix) checks will stop working, since most of the scripts live there. Beyond that, I don't think any system actually depends on the headnode to keep working.
GUMS
GUMS2 lives in the same rack as phedex-devel, in one of the Twin servers. If it goes down, pretty much all Grid authentication will fail: no new jobs will arrive, and all SRM/GridFTP transfers will fail as well. In principle there are ways to make GUMS highly available if you also set up HA for the MySQL database under it. I suspect we need a multi-master setup, since a read-only replica wouldn't work for allocating new DNs.
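As a sketch of what the multi-master piece could look like (hostnames, server ids and offsets below are illustrative assumptions, not an existing setup), each of the two MySQL servers would write its own binary log and use offset auto-increment values so that inserts on either side never collide:

    # my.cnf fragment on the first GUMS database server
    # (the second server would use server-id = 2 and auto_increment_offset = 2)
    [mysqld]
    server-id                = 1
    log-bin                  = mysql-bin
    log-slave-updates        = 1
    auto_increment_increment = 2    # two masters in total
    auto_increment_offset    = 1    # unique per master

Each server is then pointed at the other with CHANGE MASTER TO, and GUMS can fail over to whichever node survives.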
Hadoop NN
This one is pretty obvious: if the NameNode goes down, the whole DFS stops working, which usually has consequences everywhere. The good news is that the version of HDFS we're running right now is the first one to support an HA NameNode setup. Testing and commissioning it properly is not a quick task, and the switchover involves downtime, so it is probably best to reserve a big chunk of time to do things properly here. Getting rid of this single point of failure will be a game changer for how we operate the cluster.
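For reference, the HA NameNode setup in the HDFS 2.x line is driven by hdfs-site.xml roughly along these lines (the nameservice name, hostnames and JournalNode quorum below are assumptions for illustration, not our real values):

    <!-- Illustrative hdfs-site.xml fragment for a two-NameNode HA pair -->
    <property><name>dfs.nameservices</name><value>t2cluster</value></property>
    <property><name>dfs.ha.namenodes.t2cluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.t2cluster.nn1</name><value>namenode1:8020</value></property>
    <property><name>dfs.namenode.rpc-address.t2cluster.nn2</name><value>namenode2:8020</value></property>
    <!-- shared edit log kept on a JournalNode quorum -->
    <property><name>dfs.namenode.shared.edits.dir</name>
              <value>qjournal://jn1:8485;jn2:8485;jn3:8485/t2cluster</value></property>
    <property><name>dfs.client.failover.proxy.provider.t2cluster</name>
              <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
    <!-- automatic failover additionally needs a ZooKeeper quorum (ha.zookeeper.quorum in core-site.xml) -->
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

Clients then address the filesystem as hdfs://t2cluster instead of a single NameNode host, which is what makes the failover transparent to them.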
-- Main.samir - 2015-04-27