Check RAID status on servers
- SRM
- CEs
- GridFTPs
- GUMS
  - Quick check (as root on t2-headnode-new, from the lists directory): pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks
- PhEDEx
- T3 Headnode
- T3 Login node
- T3 JBOD
  - Query with: storcli64 /c0/v0 show
- Newman
  - Query with: areca_cli64, then vsf info at its prompt
- LDAP2
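
For the software-RAID hosts, the status line that the pssh one-liner extracts ends in a member map such as [UU]; an underscore in that map (for example [U_]) marks a missing or failed member. The following is a minimal sketch of a filter that prints only degraded arrays, assuming the usual /proc/mdstat layout. The function name mdstat_degraded is illustrative and not part of any existing tool.

```shell
# Print "<array> <member map>" for every md array whose member map
# contains an underscore (a missing or failed member).
# Reads /proc/mdstat-style text on stdin.
mdstat_degraded() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[^ ]* :/ { name = $1 }
        # Status line, e.g. "104320 blocks super 1.2 [2/2] [UU]"
        / blocks / && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) print name, map
        }
    '
}
```

On a single host this can be run as mdstat_degraded < /proc/mdstat; no output means every array has all of its members.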

Check that the important servers still have a working backup
Check node, core, and storage counts against the inventory (once the inventory is updated and the totals are defined).

Monitoring requirements

RSV
- Cert exists
- Cert validity
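
The two RSV checks above can be sketched with openssl's -checkend option, which exits non-zero when a certificate expires within the given number of seconds. This is a sketch only: the function name check_cert, the 7-day default, and the example path in the usage note are assumptions, not taken from the site configuration.

```shell
# Usage: check_cert <certificate file> [minimum remaining validity in seconds]
# Returns 0 if the certificate is present and valid long enough,
# 1 if it is expiring (or cannot be parsed), 2 if the file is missing or empty.
check_cert() {
    cert=$1
    min_seconds=${2:-604800}   # assumed default: 7 days
    if [ ! -s "$cert" ]; then
        echo "MISSING: $cert"
        return 2
    fi
    # -checkend N exits non-zero if the certificate expires within N seconds;
    # an unreadable or malformed file also makes openssl exit non-zero.
    if ! openssl x509 -checkend "$min_seconds" -noout -in "$cert" >/dev/null 2>&1; then
        echo "EXPIRING_OR_INVALID: $cert"
        return 1
    fi
    echo "OK: $cert"
}
```

For example, check_cert /etc/grid-security/rsv/rsvcert.pem 1209600 would warn two weeks before expiry, assuming the RSV service certificate lives at that path on this site.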

Nodes
- GLEXEC
- CVMFS
- SWAP Trigger
- IOPS on HDFS
- IOPS on /
- Proc / User
- Open Files / User
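
Two of the node checks above, the swap trigger and processes per user, can be sketched as small filters over standard Linux interfaces. The function names and any thresholds are hypothetical; the limits that actually apply would have to come from the site's own policy.

```shell
# Percentage of swap in use, computed from /proc/meminfo-style text on stdin.
swap_used_pct() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print 0   # no swap configured
        }
    '
}

# Print "<user> <count>" for users with more than <max> processes.
# Input: one user name per line, one line per process.
users_over_proc_limit() {
    max=$1
    sort | uniq -c | awk -v max="$max" '$1 > max { print $2, $1 }'
}
```

Typical use would be swap_used_pct < /proc/meminfo and ps -eo user= | users_over_proc_limit 500, where 500 is only a placeholder limit.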

Servers
- RAID states

NameNode/HDFS
- Health
- Checkpoints
- NN: alert when usage on any filesystem exceeds its maximum threshold
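
The filesystem usage alert for the NameNode host can be sketched as a filter over df -P output, which prints one line per filesystem with the usage percentage in the fifth column. This assumes the alert concerns local filesystems on the NameNode host and that mount points contain no spaces; the function name fs_over_threshold and the 90% figure in the usage note are illustrative.

```shell
# Print "<mount point> <used%>" for every filesystem at or above <limit> percent.
# Input: output of "df -P" on stdin (first line is the header).
fs_over_threshold() {
    limit=$1
    awk -v limit="$limit" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)           # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

Running df -P | fs_over_threshold 90 lists every filesystem that is at least 90% full; empty output means nothing has reached the limit.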

External
- PhEDEx data on transfers
- Data from SAM
- Data from RSV
- Data from DashBoard - Blackhole nodes

-- Main.samir - 2014-04-01

Topic revision: r24 - 2016-03-15 - dkcira
