
Revision 20 - 2014-12-08 - samir


CIT HEP Computing


Task wishlist

T2

  • Base improvements to the T2 cluster management schema -- these will improve cluster stability and ease maintenance procedures
  • Improvements to and integration of the monitoring
    • Automate the MonaLisa install on all T2 nodes and servers
    • Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard and reduce alarms to an applicable frequency
      • (Optional) Implement SMS alerts for the most critical Nagios alarms -- see the notification sketch after this list
  • Integrate all new servers into the Nagios/backup schema (not done yet)
  • Review the network/firewall/security settings -- more of a "learning review"
    • Add a few rules so that cluster management from CERN is easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) -- passwordless but still secure SSH access
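As a rough illustration of the optional SMS-alert item above, here is a minimal sketch of a Nagios notification command written in Python. The gateway URL, recipient numbers and script name are placeholders, not an existing service; the real integration would be wired in through a Nagios notification-command definition that passes the $HOSTNAME$, $SERVICEDESC$ and $SERVICESTATE$ macros as arguments.

#!/usr/bin/env python3
"""Hypothetical Nagios notification command that relays critical alarms via SMS.
The gateway URL and recipient numbers are placeholders, not an existing service."""
import sys
import urllib.parse
import urllib.request

SMS_GATEWAY_URL = "https://sms-gateway.example.org/send"   # placeholder endpoint
RECIPIENTS = ["+10000000000"]                              # placeholder on-call numbers

def send_sms(message):
    """POST the alarm text to the (hypothetical) SMS gateway, one recipient at a time."""
    for number in RECIPIENTS:
        data = urllib.parse.urlencode({"to": number, "text": message}).encode()
        urllib.request.urlopen(SMS_GATEWAY_URL, data=data, timeout=10)

def main():
    # Nagios would pass these through its notification-command macros:
    # $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$
    if len(sys.argv) != 4:
        sys.exit("usage: nagios_sms.py <host> <service> <state>")
    host, service, state = sys.argv[1:4]
    # Only page people for CRITICAL; everything else stays on the dashboard / e-mail.
    if state.upper() == "CRITICAL":
        send_sms("Nagios %s: %s on %s" % (state, service, host))

if __name__ == "__main__":
    main()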

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with a local (dedicated headnode) or remote (T2) Puppet+Foreman setup
    • This will let us move away from Rocks in all clusters and bring more uniformity to the deployment/configuration procedures.
  • Commission Condor as the batch system; better monitoring comes for free -- see the slot-state sketch below
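To give an idea of the "monitoring for free" mentioned above, the sketch below counts batch slots per state straight from condor_status. It assumes HTCondor is installed and condor_status is on the PATH of the machine running the check (e.g. the headnode); anything beyond that is illustrative.

#!/usr/bin/env python3
"""Count batch slots per state straight from condor_status -- the kind of
zero-effort monitoring that comes with commissioning Condor. Assumes HTCondor
is installed and condor_status is on the PATH of the machine running the check."""
import collections
import subprocess

def slot_states():
    """Return a Counter of slot states, e.g. {'Claimed': 120, 'Unclaimed': 40}."""
    out = subprocess.check_output(["condor_status", "-autoformat", "State"],
                                  universal_newlines=True)
    return collections.Counter(line.strip() for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    for state, count in sorted(slot_states().items()):
        print("%-12s %d" % (state, count))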

Global (related to computing resources in general):

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster (Condor integration, Condor fork).

  • Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode, with our local resources protected by quotas. Management gets simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. Motivation: most jobs have CPU efficiency < 70%. See the admission-check sketch after this list.

  • Improve the BeStman GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GbE) and slow (1 GbE) servers. See the selection sketch after this list.
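To make the dynamic-scheduling item above more concrete, here is a minimal sketch of the per-node admission check it implies. The psutil library and the fixed 90% / memory thresholds are assumptions for illustration only; in HTCondor itself this policy would be expressed through partitionable slots and the startd START expression, not a standalone script.

#!/usr/bin/env python3
"""Sketch of the per-node admission test behind the dynamic-scheduling idea:
accept one more job only while CPU usage stays below a threshold and there is
enough free memory for the job's request. psutil is used purely for
illustration; in HTCondor the policy would live in the startd configuration
(partitionable slots / START expression), not in a standalone script."""
import psutil

CPU_THRESHOLD_PERCENT = 90.0   # the configurable ceiling mentioned in the wishlist

def can_accept_job(requested_memory_mb):
    """Return True if this node should accept one more job right now."""
    cpu_usage = psutil.cpu_percent(interval=1.0)                       # sampled over 1 s
    free_memory_mb = psutil.virtual_memory().available / (1024 * 1024)
    return cpu_usage < CPU_THRESHOLD_PERCENT and free_memory_mb >= requested_memory_mb

if __name__ == "__main__":
    # Example: a typical job asking for ~2 GB of RAM.
    print("accept a 2 GB job:", can_accept_job(2048))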
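For the GridFTP selection item, the sketch below contrasts the current round-robin with a bandwidth- and load-aware pick. The server names, link speeds and active-transfer counts are invented for illustration; the actual change would have to go into BeStman's server-selection code. Weighting by nominal bandwidth per active transfer keeps the 1 GbE servers usable while strongly preferring lightly loaded 10 GbE boxes.

#!/usr/bin/env python3
"""Sketch of a bandwidth- and load-aware GridFTP server pick, as opposed to the
current round-robin. Server names, link speeds and active-transfer counts are
invented for illustration; the real change belongs in BeStman's selection code."""
import random

# (hostname, nominal link speed in Gbit/s, currently active transfers) -- made-up numbers
SERVERS = [
    ("gridftp-10g-1", 10, 40),
    ("gridftp-10g-2", 10, 25),
    ("gridftp-1g-1", 1, 5),
    ("gridftp-1g-2", 1, 3),
]

def pick_server(servers):
    """Weight each server by nominal bandwidth per active transfer, so a lightly
    loaded 10 GbE box is strongly preferred over a busy 1 GbE one."""
    weights = [speed / (active + 1.0) for _, speed, active in servers]
    name, _, _ = random.choices(servers, weights=weights, k=1)[0]
    return name

if __name__ == "__main__":
    # Show the resulting selection frequencies over many picks.
    picks = [pick_server(SERVERS) for _ in range(10000)]
    for name, _, _ in SERVERS:
        print("%-14s %5.1f%%" % (name, 100.0 * picks.count(name) / len(picks)))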

Monitoring shifts

Daily

  • Check Readiness
  • Check Site Status Board
  • Check PhEDEx Transfers
  • Check HDFS status and corrupted blocks -- a helper sketch follows this list
  • Check the job failure rate and, if there is a site problem, the reasons for it
  • Check Ganglia plots
  • Check GridFTP Screen on Zabbix
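A small helper for the HDFS item above, assuming the Hadoop client tools are available to the shifter and fsck may be run against the namenode. The exact wording of the fsck summary varies between Hadoop releases, so the parsing is only a sketch.

#!/usr/bin/env python3
"""Daily-shift helper: run 'hdfs fsck /' and report the corrupt-block count.
Assumes the Hadoop client tools are installed and the shifter may run fsck
against the namenode; the summary wording can vary between Hadoop releases,
so treat the parsing as a sketch."""
import re
import subprocess

def corrupt_block_count():
    """Return the 'Corrupt blocks' number from the fsck summary, or None if not found."""
    out = subprocess.run(["hdfs", "fsck", "/"],
                         capture_output=True, text=True, check=False).stdout
    match = re.search(r"Corrupt blocks:\s+(\d+)", out)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    count = corrupt_block_count()
    if count is None:
        print("could not parse fsck output -- check by hand")
    elif count > 0:
        print("WARNING: %d corrupt blocks reported by fsck" % count)
    else:
        print("HDFS healthy: no corrupt blocks reported")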

Weekly

  • Check the RAID status on the servers -- a partial helper sketch follows this list
  • Check that the important servers still have a working backup
  • Node / core / storage counts -- once we update the inventory and have defined totals.
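A partial helper for the RAID item above: it only covers Linux software-RAID (md) arrays visible in /proc/mdstat, which is an assumption about the servers; machines with hardware RAID controllers still need the vendor CLI by hand.

#!/usr/bin/env python3
"""Weekly-shift helper: flag degraded Linux software-RAID (md) arrays.
Only covers md arrays visible in /proc/mdstat; servers with hardware RAID
controllers still need the vendor CLI, so this is a partial check."""
import re

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return names of md arrays whose member status shows a missing disk,
    e.g. '[U_]' instead of '[UU]'."""
    degraded = []
    current = None
    with open(mdstat_path) as handle:
        for line in handle:
            if line.startswith("md"):
                current = line.split()[0]            # e.g. 'md0'
            elif current and re.search(r"\[U*_+U*\]", line):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("Degraded arrays: " + ", ".join(bad))
    else:
        print("All md arrays look healthy")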
 

Monitoring requirements

 