Tier 2 Monitoring

Taken from here: https://twiki.cern.ch/twiki/bin/view/Sandbox/SamirCurySandboxCaltech


Extremely handy links:

MonaLisa HepSpec table

USCMS T2 HepSpec table

Sites pledges

Upgrades (software)

USCMS Upgrades twiki

Pledges, etc

Monitoring links

We need a page that aggregates these links, plus selected plots from Dashboard, PhEDEx, and the central CMS monitoring tools.

Central CMS

OSG

Local pages/systems

Documentation

Workplan 02/2013

T2

  • Basic improvements to the T2 cluster management scheme -- these will improve cluster stability and ease maintenance procedures
  • Monitoring improvements and integration
    • Automate MonaLisa installation on all T2 nodes and servers
    • Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard and reduce alarms to an appropriate frequency
      • (Optional) Implement SMS alerts for the most critical Nagios alarms.
  • Integrate all new servers into the Nagios/backup scheme (could not be done yet)
  • Review network, firewall, and security settings -- more of a "learning review"
    • Add a few rules to make cluster management from CERN easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) for passwordless but still secure SSH access

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with a local (dedicated headnode) or remote (T2) Puppet + Foreman setup
    • This will let us move away from Rocks in all clusters and bring more uniformity to deployment and configuration procedures.
  • Commission Condor as the batch system; better monitoring comes for free

Global (related to computing resources in general):

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster (Condor integration, Condor fork).

  • Explore the pros and cons of storing T2 and T3-higgs data under the same Hadoop namenode: our local resources would be protected by quotas, and management would become simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed X slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and the node has enough available memory. In general this should provide slightly more job slots than the node has cores. Motivation: most jobs have CPU efficiency < 70%.

  • Improve the BeStMan GridFTP selection algorithm (light development needed). It is currently round-robin, which is very sub-optimal for a setup like ours, with both fast (10 GE) and slow (1 GE) servers.
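
The dynamic Condor scheduling item above boils down to a simple admission rule plus a bit of arithmetic. The following is a minimal Python sketch of that rule, not Condor configuration; the names NodeState, Job, can_accept, and estimated_slots are hypothetical and chosen only for illustration. It assumes CPU usage is expressed as a fraction of the whole node and that each job uses, on average, its CPU efficiency times one core.

```python
import math
from dataclasses import dataclass

CPU_USAGE_LIMIT = 0.90  # the configurable 90% threshold from the workplan


@dataclass
class NodeState:
    """Hypothetical snapshot of a worker node (illustration only)."""
    cores: int
    cpu_usage: float       # fraction of total node CPU in use, 0.0 to 1.0
    free_memory_mb: int


@dataclass
class Job:
    """Hypothetical job description (illustration only)."""
    requested_memory_mb: int


def can_accept(node: NodeState, job: Job, cpu_limit: float = CPU_USAGE_LIMIT) -> bool:
    """Accept a job only if the node is below the CPU limit and has enough free memory."""
    return node.cpu_usage < cpu_limit and node.free_memory_mb >= job.requested_memory_mb


def estimated_slots(cores: int, job_cpu_efficiency: float,
                    cpu_limit: float = CPU_USAGE_LIMIT) -> int:
    """Rough count of concurrent jobs that keeps CPU usage near cpu_limit, ignoring memory.

    Assumes each job uses job_cpu_efficiency of one core on average,
    so N jobs load the node to N * efficiency / cores.
    """
    if not 0.0 < job_cpu_efficiency <= 1.0:
        raise ValueError("job_cpu_efficiency must be in (0, 1]")
    return math.floor(cores * cpu_limit / job_cpu_efficiency)


if __name__ == "__main__":
    node = NodeState(cores=8, cpu_usage=0.85, free_memory_mb=4096)
    print(can_accept(node, Job(requested_memory_mb=2000)))
    print(estimated_slots(cores=8, job_cpu_efficiency=0.70))
```

Under these assumptions, an 8-core node running jobs at 70% CPU efficiency could hold about 8 × 0.9 / 0.7 ≈ 10 jobs, which matches the "slightly more job slots than cores" expectation. Memory would still cap the real number.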
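
For the GridFTP selection item above, one possible replacement for plain round-robin is a weighted scheme in which each server is picked in proportion to its link speed. The sketch below uses smooth weighted round-robin, a generic technique and not anything taken from BeStMan itself; the class name, the hostnames, and the choice of link speed in Gb/s as the weight are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict


class WeightedRoundRobin:
    """Smooth weighted round-robin over a set of servers (illustrative sketch)."""

    def __init__(self, weights: Dict[str, int]) -> None:
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive integers")
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def next_server(self) -> str:
        # Every server gains its weight; the current leader is chosen
        # and then pushed back by the total weight.
        for name, weight in self._weights.items():
            self._current[name] += weight
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= self._total
        return chosen


if __name__ == "__main__":
    # Hypothetical hostnames; weights are assumed link speeds in Gb/s.
    selector = WeightedRoundRobin({
        "gridftp-fast.example.org": 10,
        "gridftp-slow.example.org": 1,
    })
    picks = Counter(selector.next_server() for _ in range(11))
    print(picks)
```

With weights 10 and 1, the fast server receives 10 of every 11 transfers, instead of the 1 in 2 it would get under plain round-robin. A production version would probably also want to account for current server load, which this sketch ignores.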
Topic revision: r1 - 2017-06-27 - dkcira
 