Difference: Computing (1 vs. 24)

Revision 22 - 2015-03-16 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 123 to 123
 

Daily

  • Check Readiness
  • Check Site Status Board
Changed:
<
<
>
>
 
  • HDFS status, corrupted blocks
Changed:
<
<
  • Check Job Failure rate, reasons if there is a site problem
  • Check Ganglia plots
  • Check GridFTP Screen on Zabbix
>
>
 

Weekly

  • Check RAID status on servers

Revision 21 - 2015-01-06 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 135 to 135
 
Added:
>
>
      • Do this quickly with: [root@t2-headnode-new lists]# pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks (see the sketch after this list)
 
    • PhEDEx
    • T3 Headnode
    • T3 Login node
    • T3 JBOD
Added:
>
>
      • Query with : storcli64 /c0/v0 show
    • Newman
      • Query with : areca_cli64 ; vsf info
    • LDAP2
 
  • Check that the important servers still have a working backup
  • Nodes / Cores / Storage counts - once we update the inventory and have defined totals.
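
A possible wrapper for this weekly sweep, reusing the raid-check-list.txt host list above; the hardware-RAID hostnames below are placeholders for the T3 JBOD and Newman, not confirmed names:

    #!/bin/bash
    # Weekly RAID sweep - sketch only, not the production script.
    # Software RAID: flag any md array whose member map ([UU], [U_], ...) contains an underscore.
    degraded=$(pssh -i -h raid-check-list.txt "cat /proc/mdstat" \
                 | grep "blocks super" | grep -v chunks \
                 | grep -E '\[[^]]*_[^]]*\]')
    [ -n "$degraded" ] && { echo "WARNING: degraded md array(s):"; echo "$degraded"; }

    # Hardware RAID controllers (placeholder hostnames):
    ssh t3-jbod "storcli64 /c0/v0 show" | grep -i state   # LSI: state should be Optl
    ssh newman "areca_cli64 vsf info"                     # Areca: volume state should be Normal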

Revision 20 - 2014-12-08 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 119 to 119
 
Changed:
<
<

Task wishlist

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
  • Improvements/integration on the monitoring
    • Automate MonaLisa install into all T2 nodes and servers
    • Clean current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard; reduce alarms to an appropriate frequency
      • (optional) Implement SMS alerts for the most critical Nagios alarms.
  • Integrate all new servers into Nagios/Backup schema (could not be done yet)
  • Review of network/firewalls/security settings -- more a "learning review"
    • Add a few rules so cluster management from CERN is easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) - passwordless but still secure SSH access

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
    • This will let us move away from Rocks in all clusters and bring more uniformity to deployment/configuration procedures.
  • Commission Condor as a batch system; better monitoring comes for free

Global (related to computing resources in general) :

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster. (condor integration, condor fork)

  • Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode, with our local resources protected by quotas. Management gets simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
>
>

Monitoring shifts

Daily

  • Check Readiness
  • Check Site Status Board
  • Check PhEDEx Transfers
  • Check HDFS status and corrupted blocks (see the sketch after this list)
  • Check Job Failure rate, reasons if there is a site problem
  • Check Ganglia plots
  • Check GridFTP Screen on Zabbix
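
A quick way to do the HDFS check from the namenode (or any host with the Hadoop client configured); exact option names vary a bit between Hadoop versions, so treat this as a sketch:

    # Overall health summary: status, corrupt blocks, missing replicas, under-replication.
    hadoop fsck / 2>/dev/null | grep -Ei 'Status|Corrupt blocks|Missing replicas|Under-replicated'

    # List the files that own corrupt or missing blocks:
    hadoop fsck / -list-corruptfileblocks

    # Datanode overview: dead nodes and remaining capacity.
    hadoop dfsadmin -report | grep -Ei 'Datanodes available|dead|DFS Remaining%'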

Weekly

  • Check RAID status on servers
  • Check that the important servers still have a working backup (see the sketch after this list)
  • Nodes / Cores / Storage counts - once we update the inventory and have defined totals.
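
A minimal sketch of the backup check, assuming backups land as files under /backup/<hostname>/ and that important-servers.txt lists one host per line (both the path and the file are placeholders for the real layout):

    # Warn about any important server without a backup newer than 7 days.
    while read -r h; do
        if [ -z "$(find /backup/$h -type f -mtime -7 2>/dev/null | head -1)" ]; then
            echo "WARNING: no backup newer than 7 days for $h"
        fi
    done < important-servers.txt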
 

Monitoring requirements

Revision 10 - 2014-08-07 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 22 to 22
 
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
Changed:
<
<

Monitoring links

>
>
 

Extremely handy links :

Line: 32 to 32
  Sites pledges
Added:
>
>

Related links

 

Upgrades (software)

USCMS Upgrades twiki

Revision 8 - 2014-07-15 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 105 to 105
 
Changed:
<
<

Workplan 02/2013

>
>

Task wishlist

 

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
Line: 135 to 135
 
  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
Added:
>
>

Monitoring requirements

  • RSV
    • Cert exists
    • Cert validity

  • Nodes
    • GLEXEC
    • CVMFS
    • SWAP Trigger
    • IOPS on HDFS
    • IOPS on /
    • Proc / User
    • Open Files / User

  • Servers
    • RAID states

  • NameNode/HDFS
    • Health
    • Checkpoints
    • NN - alert when usage on any of the NameNode's filesystems reaches a maximum threshold

  • External
    • PhEDEx data on transfers
    • Data from SAM
    • Data from RSV
    • Data from DashBoard - Blackhole nodes
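
Most of these will end up as Nagios/Zabbix probes, but a few can be checked by hand with stock tools; a rough sketch, where the thresholds and the certificate path are example assumptions rather than site settings:

    # Cert exists / cert validity: warn if the host certificate expires within 7 days.
    CERT=/etc/grid-security/hostcert.pem
    [ -f "$CERT" ] || echo "CRITICAL: $CERT is missing"
    openssl x509 -in "$CERT" -noout -checkend $((7*24*3600)) \
        || echo "WARNING: host certificate expires within 7 days"

    # SWAP trigger: warn above 20% swap usage (example threshold).
    free | awk '/Swap:/ && $2 > 0 && $3/$2 > 0.20 {print "WARNING: swap usage above 20%"}'

    # Open Files / User: top 5 users by open file descriptors.
    lsof -n 2>/dev/null | tail -n +2 | awk '{print $3}' | sort | uniq -c | sort -rn | head -5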

 -- Main.samir - 2014-04-01

Revision 4 - 2014-05-30 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 21 to 21
 
  • User support (potential non-issues)
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
Added:
>
>

Monitoring links

Extremely handy links :

MonaLisa HepSpec table

USCMS T2 HepSpec table

Sites pledges

Upgrades (software)

USCMS Upgrades twiki

Pledges, etc

Monitoring links

We need a page to aggregate these, plus some plots from DashBoard, PhEDEx, and the central CMS monitoring tools.

Central CMS

OSG

Local pages/systems

Documentation

Workplan 02/2013

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
  • Improvements/integration on the monitoring
    • Automate MonaLisa install into all T2 nodes and servers
    • Clean current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard; reduce alarms to an appropriate frequency
      • (optional) Implement SMS alerts for the most critical Nagios alarms.
  • Integrate all new servers into Nagios/Backup schema (could not be done yet)
  • Review of network/firewalls/security settings -- more a "learning review"
    • Add a few rules so cluster management from CERN is easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) - passwordless but still secure SSH access

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
    • This will let us move away from Rocks in all clusters and bring more uniformity to deployment/configuration procedures.
  • Commission Condor as a batch system; better monitoring comes for free

Global (related to computing resources in general) :

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster. (condor integration, condor fork)

  • Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode, with our local resources protected by quotas. Management gets simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
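
The selection change itself would live inside BeStMan, but the weighting idea is simple enough to sketch outside it, assuming a hypothetical gridftp-servers.txt that maps each GridFTP door to an integer weight (say 10 for 10 GE hosts, 1 for 1 GE hosts):

    # gridftp-servers.txt (hypothetical), one "hostname weight" pair per line:
    #   gridftp-10ge-1  10
    #   gridftp-1ge-1    1
    # Pick a door with probability proportional to its weight instead of plain round-robin.
    awk '{ for (i = 0; i < $2; i++) print $1 }' gridftp-servers.txt | shuf -n 1

A 10 GE door then gets roughly ten times the transfers of a 1 GE door, in proportion to its link capacity.
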
 -- Main.samir - 2014-04-01

Revision 3 - 2014-04-11 - samir

Line: 1 to 1
Added:
>
>

CIT HEP Computing

 This page is dedicated to centralizing all computing-related subjects

Added:
>
>

Workplan guidelines

Our priorities are, in this order:

  • Uptime
  • User support (blocker issues)
  • Resource utilization -- making sure that all hardware is being used
  • User support (non-blocker important issues)
  • Backups
  • Monitoring
  • User support (potential non-issues)
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
 -- Main.samir - 2014-04-01

Revision 2 - 2014-04-01 - samir

Line: 1 to 1
 This page is dedicated to centralizing all computing-related subjects

Added:
>
>
 -- Main.samir - 2014-04-01

Revision 1 - 2014-04-01 - samir

Line: 1 to 1
Added:
>
>
This page is dedicated to centralizing all computing-related subjects

-- Main.samir - 2014-04-01
 