CIT HEP Computing
This page centralizes all computing-related topics.
Workplan guidelines
Our priorities, in order:
- Uptime
- User support (blocker issues)
- Resource utilization -- making sure that all hardware is being used
- User support (non-blocker important issues)
- Backups
- Monitoring
- User support (potential non-issues)
- Security
- R&D - This probably needs extra manpower. We have ideas but no time.
Extremely handy links:
MonaLisa HepSpec table
USCMS T2 HepSpec table
Sites pledges
Related links
Upgrades (software)
USCMS Upgrades twiki
Pledges, etc
Monitoring links
We need a page that aggregates these links, plus selected plots from the DashBoard, PhEDEx, and central CMS monitoring tools.
Central CMS
OSG
Local pages/systems
Documentation
Monitoring shifts
Daily
Weekly
- Check RAID status on servers
  - SRM
  - CEs
  - GridFTPs
  - GUMS
  - Do this fast by: [root@t2-headnode-new lists]# pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks
  - PhEDEx
  - T3 Headnode
  - T3 Login node
  - T3 JBOD
    - Query with: storcli64 /c0/v0 show
  - Newman
    - Query with: areca_cli64 ; vsf info
  - LDAP2
- Check that the important servers still have a working backup
- Check node, core, and storage counts, once we have updated the inventory and defined the totals.
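The pssh one-liner in the RAID check prints the status line of every md array, so degraded arrays still have to be spotted by eye. As a rough sketch, the helper below filters /proc/mdstat output down to arrays whose member-status field (for example [U_]) contains an underscore, which marks a missing or failed member. The function name degraded_arrays is hypothetical and not part of any existing tooling on these hosts; it only assumes a POSIX shell and awk.

```shell
# Hypothetical helper (not an existing site script): read /proc/mdstat text on
# stdin and print "<array> <status>" for every array that has a missing member.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[^ ]* :/ { name = $1 }
        # Status line, e.g. "1048512 blocks super 1.2 [2/1] [U_]"
        /blocks/ && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name, status
        }
    '
}

# Possible usage on a single host:
#   degraded_arrays < /proc/mdstat
```

Because it reads standard input, the same function could also be applied to the per-host output collected by pssh, at the cost of losing the host name unless the output is split per host first.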
Monitoring requirements
- RSV
  - Cert exists
  - Cert validity
- Nodes
  - GLEXEC
  - CVMFS
  - SWAP Trigger
  - IOPS on HDFS
  - IOPS on /
  - Proc / User
  - Open Files / User
- NameNode/HDFS
  - Health
  - Checkpoints
  - NN - All filesystems usage maximum alert
- External
  - PhEDEx data on transfers
  - Data from SAM
  - Data from RSV
  - Data from DashBoard - Blackhole nodes
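Several of the requirements above (cert exists, cert validity, SWAP trigger, Proc / User) can be prototyped with standard tools before being wired into a monitoring system. The sketch below is illustrative only: the function names, the thresholds in the usage comments, and the idea of running these as standalone shell checks are all assumptions, not a description of the existing setup. It relies on openssl x509 -checkend, ps -eo user=, and the SwapTotal/SwapFree fields of /proc/meminfo.

```shell
# Hypothetical check: succeed only if the certificate file is readable and
# remains valid for at least the given number of seconds.
#   cert_ok /path/to/hostcert.pem 604800   # path and 7-day margin are placeholders
cert_ok() {
    [ -r "$1" ] && openssl x509 -checkend "$2" -noout -in "$1" >/dev/null
}

# Hypothetical check: read one user name per line and print "<user> <count>"
# for every user with more than the given number of entries.
#   ps -eo user= | users_over_limit 500    # 500 is a placeholder threshold
users_over_limit() {
    sort | uniq -c | awk -v limit="$1" '$1 > limit { print $2, $1 }'
}

# Hypothetical check: read /proc/meminfo-formatted text and print the
# percentage of swap in use as an integer (0 when no swap is configured).
#   swap_used_pct < /proc/meminfo
swap_used_pct() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { pct = 0
               if (total > 0) pct = (total - free) * 100 / total
               printf "%d\n", pct }'
}
```

Each function reads its data from arguments or standard input rather than probing the system itself, so it can be tried against saved output from a node before being deployed anywhere.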
-- Main.samir - 2014-04-01
Topic revision: r24 - 2016-03-15 - dkcira