Difference: Computing (7 vs. 8)

Revision 8, 2014-07-15 - samir

Line: 1 to 1
 

CIT HEP Computing

Line: 105 to 105
 
Changed:
<
<

Workplan 02/2013

>
>

Task wishlist

 

T2

  • Base improvements in the T2 cluster management schema -- will improve cluster stability and ease maintenance procedures
Line: 135 to 135
 
  • Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor would no longer serve a fixed number (X) of slots per node; instead it would keep scheduling jobs as long as the node's CPU usage is < 90% (configurable) and the node has enough available memory. In general this should provide slightly more job slots than the node has cores. -- motivation: most jobs have CPU efficiency < 70%

  • Improve BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours that mixes fast (10 GE) and slow (1 GE) servers.
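
The two ideas above can be illustrated with a rough Python sketch. It is not Condor or BeStMan configuration, only the decision logic written out. The function names, host names and memory figures are hypothetical. Only the 90% CPU threshold and the 10 GE / 1 GE link speeds come from the wishlist. Weighting by link speed is one possible replacement for round-robin, not a chosen design.

```python
import random


def can_start_job(cpu_usage, free_mem_mb, job_mem_mb, cpu_limit=0.90):
    """Admission rule: accept one more job while the node is below the
    CPU limit and has enough free memory for the job's request.

    cpu_usage and cpu_limit are fractions (0.90 == 90%).
    """
    return cpu_usage < cpu_limit and free_mem_mb >= job_mem_mb


def pick_gridftp_server(servers, rng=random):
    """Pick a server with probability proportional to its link speed,
    instead of plain round-robin.

    servers maps a host name to its link speed in Gbit/s.
    """
    names = list(servers)
    weights = [servers[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical pool: two fast (10 GE) servers and one slow (1 GE) server.
SERVERS = {"gftp-fast-1": 10, "gftp-fast-2": 10, "gftp-slow-1": 1}

if __name__ == "__main__":
    # A node at 65% CPU with 4 GB free can take a job asking for 2 GB.
    print(can_start_job(0.65, 4096, 2048))
    # Most transfers should land on the 10 GE servers.
    print(pick_gridftp_server(SERVERS))
```

With these weights the slow server receives roughly 1 in 21 transfers, whereas round-robin would send it 1 in 3.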
Added:
>
>

Monitoring requirements

  • RSV
    • Cert exists
    • Cert validity

  • Nodes
    • GLEXEC
    • CVMFS
    • SWAP Trigger
    • IOPS on HDFS
    • IOPS on /
    • Proc / User
    • Open Files / User

  • Servers
    • RAID states

  • NameNode/HDFS
    • Health
    • Checkpoints
    • NN - Maximum usage alert for all filesystems

  • External
    • PhEDEx data on transfers
    • Data from SAM
    • Data from RSV
    • Data from DashBoard - Blackhole nodes

 -- Main.samir - 2014-04-01
 