Tier 2 Monitoring
Taken from here:
https://twiki.cern.ch/twiki/bin/view/Sandbox/SamirCurySandboxCaltech
Extremely handy links:
- MonaLisa HepSpec table
- USCMS T2 HepSpec table
- Site pledges
Upgrades (software):
- USCMS Upgrades twiki
- Pledges, etc.
Monitoring links
We need a page to aggregate these, plus plots from the Dashboard, PhEDEx, and the central CMS monitoring tools.
- Central CMS
- OSG
- Local pages/systems
Documentation
Workplan 02/2013
T2
- Basic improvements to the T2 cluster management scheme -- these will improve cluster stability and ease maintenance procedures
- Improvements to and integration of the monitoring:
  - Automate the MonaLisa installation on all T2 nodes and servers
  - Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
  - Clean up the current Nagios dashboard and reduce alarms to an actionable frequency
  - (Optional) Implement SMS alerts for the most critical Nagios alarms (a sketch follows this list)
- Integrate all new servers into the Nagios/backup scheme (not yet done)
- Review the network/firewall/security settings -- more of a "learning review"
- Add a few rules so that cluster management from CERN is easier
- (Optional) Explore alternative SSH authentication methods (GSI or Kerberos) for passwordless but still secure SSH access (a sketch follows this list)
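For the SMS alerts, one way this could look in the Nagios object configuration -- assuming a command-line SMS gateway tool (here a hypothetical /usr/local/bin/send-sms) is available on the monitoring host; contact details and names are placeholders:

    # Hypothetical notification command; the send-sms tool and its flags are assumed.
    define command {
        command_name    notify-by-sms
        command_line    /usr/local/bin/send-sms --to "$CONTACTPAGER$" --text "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$"
    }

    # Contact that is paged only for critical problems and their recoveries.
    define contact {
        contact_name                    t2-oncall
        alias                           T2 on-call
        pager                           +15551234567    ; placeholder number
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,r
        service_notification_options    c,r
        host_notification_commands      notify-by-sms
        service_notification_commands   notify-by-sms
    }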
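On the Kerberos side, GSSAPI-based SSH is mostly two sshd options plus valid host/user principals; a minimal sketch, assuming a Kerberos realm is already configured in /etc/krb5.conf and host keytabs are in place (hostnames are placeholders):

    # /etc/ssh/sshd_config on the cluster nodes
    GSSAPIAuthentication yes          # accept Kerberos tickets via GSSAPI
    GSSAPICleanupCredentials yes      # destroy delegated credentials at logout

    # ~/.ssh/config on the admin's machine
    Host *.tier2.example.edu          # placeholder cluster domain
        GSSAPIAuthentication yes
        GSSAPIDelegateCredentials yes # forward the ticket for onward hops

After a single kinit, ssh to any node in the domain needs no password, and access can still be revoked centrally at the KDC.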
T3 (Local & Remote)
- Automate node deployment and profile maintenance with Puppet+Foreman, either local (dedicated headnode) or remote (at the T2); a sketch follows this list
  - This will let us move away from Rocks on all clusters and bring more uniformity to the deployment/configuration procedures
- Commission Condor as the batch system; better monitoring comes for free (examples after this list)
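As an illustration of the Puppet direction, a worker-node profile could be as small as the following sketch; the class, package, and file names are invented for the example, not taken from our setup:

    # modules/t3_worker/manifests/init.pp -- hypothetical profile class,
    # assigned to hosts through Foreman's host groups.
    class t3_worker {
      # Baseline packages for a worker node (names are placeholders).
      package { ['condor', 'osg-wn-client']:
        ensure => installed,
      }

      # Keep the local Condor config under Puppet so nodes cannot drift.
      file { '/etc/condor/condor_config.local':
        ensure => file,
        source => 'puppet:///modules/t3_worker/condor_config.local',
        notify => Service['condor'],
      }

      service { 'condor':
        ensure  => running,
        enable  => true,
        require => Package['condor'],
      }
    }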
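The "monitoring for free" point refers to Condor's built-in query tools; once the pool is commissioned, for example:

    condor_status                  # per-slot state, activity and load across the pool
    condor_status -submitters      # idle/running job totals per submitter
    condor_q -analyze <job_id>     # why a given job is idle / failing to match
    condor_userprio                # current fair-share priorities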
Global (related to computing resources in general):
- Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster (Condor integration, Condor fork; a sketch follows this list)
- Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode; our local resources would be protected by quotas, and management would get simpler
- Explore dynamic Condor job scheduling based on resource utilization and job requirements: instead of serving a fixed number of slots, Condor would schedule jobs as long as a node's CPU usage is below 90% (configurable) and enough memory is available. In general this should provide slightly more job slots than a node has cores -- motivation: most jobs run at a CPU efficiency below 70% (a policy sketch follows this list)
- Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers (a sketch follows this list)
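For the T3-idle-CPU item, the standard Condor mechanism would be flocking: the T2 schedd overflows idle jobs to the T3 pool whenever the T3 has free slots. A sketch with placeholder hostnames:

    # On the T2 submit node(s): overflow idle jobs to the T3 pool.
    FLOCK_TO = t3-cm.tier3.example.edu        # placeholder T3 central manager

    # On the T3 central manager and workers: accept jobs from the T2 schedd.
    FLOCK_FROM = t2-submit.tier2.example.edu  # placeholder T2 submit host
    # The usual ALLOW_WRITE / security settings must also admit the T2 hosts.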
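The dynamic-scheduling idea maps naturally onto the startd's START policy; a minimal sketch (the attributes are real Condor ClassAds, while the thresholds and slot count are the configurable assumptions):

    # condor_config.local on a worker node (values illustrative).
    NUM_CPUS = 10    # advertise slightly more slots than the physical 8 cores
    # Match a new job only while the whole node is below 90% CPU load
    # and the slot still has the memory the job asks for.
    START = (TotalLoadAvg < 0.9 * TotalCpus) && (Memory >= TARGET.RequestMemory)

Since most jobs run below 70% CPU efficiency, the extra slots tend to stay usable while the load term keeps a node from saturating.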
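For the BeStMan selection item, the "light development" could be as simple as replacing round-robin with a choice weighted by link speed; a Python sketch (hostnames and weights are made up):

    import random

    # Hypothetical GridFTP doors: (hostname, link speed in Gb/s) used as weight.
    DOORS = [
        ("gridftp-10g-1.example.edu", 10),
        ("gridftp-10g-2.example.edu", 10),
        ("gridftp-1g-1.example.edu", 1),
        ("gridftp-1g-2.example.edu", 1),
    ]

    def pick_door():
        """Pick a door with probability proportional to its link speed,
        so a 10 GE server gets roughly 10x the transfers of a 1 GE one."""
        total = sum(speed for _, speed in DOORS)
        r = random.uniform(0, total)
        for host, speed in DOORS:
            r -= speed
            if r <= 0:
                return host
        return DOORS[-1][0]  # guard against floating-point edge cases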