Difference: Computing (1 vs. 24)

Revision 22 - 2015-03-16 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 123 to 123
 

Daily

  • Check Readiness
  • Check Site Status Board
Changed:
<
<
>
>
 
  • HDFS status, corrupted blocks
Changed:
<
<
  • Check Job Failure rate, reasons if there is a site problem
  • Check Ganglia plots
  • Check GridFTP Screen on Zabbix
>
>
 

Weekly

  • Check RAID status on servers

Revision 21 - 2015-01-06 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 135 to 135
 
Added:
>
>
      • Do this quickly with: [root@t2-headnode-new lists]# pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks (see the sketch after this list)
 
    • PhEDEx
    • T3 Headnode
    • T3 Login node
    • T3 JBOD
Added:
>
>
      • Query with : storcli64 /c0/v0 show
    • Newman
      • Query with : areca_cli64 ; vsf info
    • LDAP2
 
  • Check that the important servers still have a working backup
  • Nodes / Cores / Storage counts - once we update the inventory and have defined totals.
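
A possible wrapper for this weekly sweep, reusing the raid-check-list.txt host list above; the hardware-RAID hostnames below are placeholders for the T3 JBOD and Newman, not confirmed names:

    #!/bin/bash
    # Weekly RAID sweep - sketch only, not the production script.
    # Software RAID: flag any md array whose member map ([UU], [U_], ...) contains an underscore.
    degraded=$(pssh -i -h raid-check-list.txt "cat /proc/mdstat" \
                 | grep "blocks super" | grep -v chunks \
                 | grep -E '\[[^]]*_[^]]*\]')
    [ -n "$degraded" ] && { echo "WARNING: degraded md array(s):"; echo "$degraded"; }

    # Hardware RAID controllers (placeholder hostnames):
    ssh t3-jbod "storcli64 /c0/v0 show" | grep -i state   # LSI: state should be Optl
    ssh newman "areca_cli64 vsf info"                     # Areca: volume state should be Normal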

Revision 20 - 2014-12-08 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 119 to 119
 
Changed:
<
<

Task wishlist

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
  • Improvements/integration on the monitoring
    • Automate MonaLisa install into all T2 nodes and servers
    • Clean current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard; reduce alarms to an appropriate frequency
      • (optional) Implement SMS alerts for the most critical Nagios alarms.
  • Integrate all new servers into Nagios/Backup schema (could not be done yet)
  • Review of network/firewalls/security settings -- more a "learning review"
    • Add a few rules so cluster management from CERN is easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) - passwordless but still secure SSH access

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
    • This will let us move away from Rocks in all clusters and bring more uniformity to deployment/configuration procedures.
  • Commission Condor as a batch system; better monitoring comes for free

Global (related to computing resources in general) :

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster. (condor integration, condor fork)

  • Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode, with our local resources protected by quotas. Management gets simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
>
>

Monitoring shifts

Daily

  • Check Readiness
  • Check Site Status Board
  • Check PhEDEx Transfers
  • Check HDFS status and corrupted blocks (see the sketch after this list)
  • Check Job Failure rate, reasons if there is a site problem
  • Check Ganglia plots
  • Check GridFTP Screen on Zabbix
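
A quick way to do the HDFS check from the namenode (or any host with the Hadoop client configured); exact option names vary a bit between Hadoop versions, so treat this as a sketch:

    # Overall health summary: status, corrupt blocks, missing replicas, under-replication.
    hadoop fsck / 2>/dev/null | grep -Ei 'Status|Corrupt blocks|Missing replicas|Under-replicated'

    # List the files that own corrupt or missing blocks:
    hadoop fsck / -list-corruptfileblocks

    # Datanode overview: dead nodes and remaining capacity.
    hadoop dfsadmin -report | grep -Ei 'Datanodes available|dead|DFS Remaining%'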

Weekly

  • Check RAID status on servers
  • Check that the important servers still have a working backup (see the sketch after this list)
  • Nodes / Cores / Storage counts - once we update the inventory and have defined totals.
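
A minimal sketch of the backup check, assuming backups land as files under /backup/<hostname>/ and that important-servers.txt lists one host per line (both the path and the file are placeholders for the real layout):

    # Warn about any important server without a backup newer than 7 days.
    while read -r h; do
        if [ -z "$(find /backup/$h -type f -mtime -7 2>/dev/null | head -1)" ]; then
            echo "WARNING: no backup newer than 7 days for $h"
        fi
    done < important-servers.txt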
 

Monitoring requirements

Revision 10 - 2014-08-07 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 22 to 22
 
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
Changed:
<
<

Monitoring links

>
>
 

Extremely handy links :

Line: 32 to 32
  Sites pledges
Added:
>
>

Related links

 

Upgrades (software)

USCMS Upgrades twiki

Revision 8 - 2014-07-15 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 105 to 105
 
Changed:
<
<

Workplan 02/2013

>
>

Task wishlist

 

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
Line: 135 to 135
 
  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
Added:
>
>

Monitoring requirements

  • RSV
    • Cert exists
    • Cert validity

  • Nodes
    • GLEXEC
    • CVMFS
    • SWAP Trigger
    • IOPS on HDFS
    • IOPS on /
    • Proc / User
    • Open Files / User

  • Servers
    • RAID states

  • NameNode/HDFS
    • Health
    • Checkpoints
    • NN - alert when usage on any of the NameNode's filesystems reaches a maximum threshold

  • External
    • PhEDEx data on transfers
    • Data from SAM
    • Data from RSV
    • Data from DashBoard - Blackhole nodes
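
Most of these will end up as Nagios/Zabbix probes, but a few can be checked by hand with stock tools; a rough sketch, where the thresholds and the certificate path are example assumptions rather than site settings:

    # Cert exists / cert validity: warn if the host certificate expires within 7 days.
    CERT=/etc/grid-security/hostcert.pem
    [ -f "$CERT" ] || echo "CRITICAL: $CERT is missing"
    openssl x509 -in "$CERT" -noout -checkend $((7*24*3600)) \
        || echo "WARNING: host certificate expires within 7 days"

    # SWAP trigger: warn above 20% swap usage (example threshold).
    free | awk '/Swap:/ && $2 > 0 && $3/$2 > 0.20 {print "WARNING: swap usage above 20%"}'

    # Open Files / User: top 5 users by open file descriptors.
    lsof -n 2>/dev/null | tail -n +2 | awk '{print $3}' | sort | uniq -c | sort -rn | head -5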

 -- Main.samir - 2014-04-01

Revision 4 - 2014-05-30 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 21 to 21
 
  • User support (potential non-issues)
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
Added:
>
>

Monitoring links

Extremely handy links :

MonaLisa HepSpec table

USCMS T2 HepSpec table

Sites pledges

Upgrades (software)

USCMS Upgrades twiki

Pledges, etc

Monitoring links

We need a page to aggregate these, plus some plots from DashBoard, PhEDEx, and the central CMS monitoring tools.

Central CMS

OSG

Local pages/systems

Documentation

Workplan 02/2013

T2

  • Base improvements in the T2 cluster management schema -- will improve the cluster stability and ease maintenance procedures
  • Improvements/integration on the monitoring
    • Automate MonaLisa install into all T2 nodes and servers
    • Clean current MonaLisa web dashboard with the help of the MonaLisa team
    • Clean up the current Nagios dashboard; reduce alarms to an appropriate frequency
      • (optional) Implement SMS alerts for the most critical Nagios alarms.
  • Integrate all new servers into Nagios/Backup schema (could not be done yet)
  • Review of network/firewalls/security settings -- more a "learning review"
    • Add a few rules so cluster management from CERN is easier
    • (Optional) Explore different SSH authentication methods (GSI or Kerberos) - passwordless but still secure SSH access

T3 (Local & Remote)

  • Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
    • This will let us move away from Rocks in all clusters and bring more uniformity to deployment/configuration procedures.
  • Commission Condor as a batch system; better monitoring comes for free

Global (related to computing resources in general) :

  • Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster. (condor integration, condor fork)

  • Explore the pros/cons of this: the T2 and T3-higgs nodes could store data under the same Hadoop namenode, with our local resources protected by quotas. Management gets simpler.

  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number of slots, but would schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory on the node. In general this should provide slightly more job slots than the number of cores a node has. -- motivation: most jobs have CPU efficiency < 70%

  • Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
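
The selection change itself would live inside BeStMan, but the weighting idea is simple enough to sketch outside it, assuming a hypothetical gridftp-servers.txt that maps each GridFTP door to an integer weight (say 10 for 10 GE hosts, 1 for 1 GE hosts):

    # gridftp-servers.txt (hypothetical), one "hostname weight" pair per line:
    #   gridftp-10ge-1  10
    #   gridftp-1ge-1    1
    # Pick a door with probability proportional to its weight instead of plain round-robin.
    awk '{ for (i = 0; i < $2; i++) print $1 }' gridftp-servers.txt | shuf -n 1

A 10 GE door then gets roughly ten times the transfers of a 1 GE door, in proportion to its link capacity.
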
 -- Main.samir - 2014-04-01

Revision 3 - 2014-04-11 - samir

Line: 1 to 1
Added:
>
>

CIT HEP Computing

 This page is dedicated to centralizing all computing-related subjects

Added:
>
>

Workplan guidelines

Our priorities are, in this order:

  • Uptime
  • User support (blocker issues)
  • Resource utilization -- making sure that all hardware is being used
  • User support (non-blocker important issues)
  • Backups
  • Monitoring
  • User support (potential non-issues)
  • Security
  • R&D - This probably needs extra manpower. We have ideas but no time.
 -- Main.samir - 2014-04-01

Revision 2 - 2014-04-01 - samir

Line: 1 to 1
 This page is dedicated to centralizing all computing-related subjects

Added:
>
>
 -- Main.samir - 2014-04-01

Revision 1 - 2014-04-01 - samir

Line: 1 to 1
Added:
>
>
This page is dedicated to centralizing all computing-related subjects

-- Main.samir - 2014-04-01
 