---+ CIT HEP Computing

%TOC%

This page centralizes all computing-related topics.

   * [[ComputingGridTools]]
   * [[ComputingT3HiggsFAQ]]

---++ Workplan guidelines

Our priorities, in order:

   * Uptime
   * User support (blocker issues)
   * Resource utilization -- making sure that all hardware is being used
   * User support (non-blocker important issues)
   * Backups
   * Monitoring
   * User support (potential non-issues)
   * Security
   * R&D - This probably needs extra manpower. We have ideas but no time.

---++ Monitoring links

---+++ Extremely handy links

[[http://alimonitor.cern.ch/hepspec/][MonaLisa HepSpec table]]

[[https://docs.google.com/spreadsheet/ccc?key=0AvE7aiWBwKzWdHl4MVpSZTRBcjktdXBqWlFhcnZrVmc#gid=0][USCMS T2 HepSpec table]]

[[http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=production&highlight=true][Sites pledges]]

---++ Upgrades (software)

[[https://twiki.cern.ch/twiki/bin/viewauth/CMS/USCMSTier2Upgrades][USCMS Upgrades twiki]]

---+++ Pledges, etc

   * [[http://gstat-wlcg.cern.ch/apps/pledges/requirements/][REBUS]]
   * [[https://cmsweb.cern.ch/sitedb/prod/sites/T2_US_Caltech][SiteDB]]

---+++ Monitoring links

We need a page that aggregates these links, plus selected plots from !DashBoard, !PhEDEx and the central CMS monitoring tools.

---++++ Central CMS
   * Overview 
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=1&series=All][Batch system efficiency]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview?site=T2_US_Caltech#currentView=default&highlight=true][Site Status Board]]
      * [[http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_US_Caltech][Site Readiness]]

   * [[http://dashb-ssb-dev.cern.ch/dashboard/templates/sitePendingRunningJobs.html?site=T2_US_Caltech][Running production x Pledge]]
   * [[http://dashb-cms-job-dev.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=Caltech+CMS+T2&sitesSort=10&start=null&end=null&timerange=lastMonth&granularity=Daily&generic=0&sortby=1&series=All&activities%5B%5D=all][Pledges and shares]]
   * Grid jobs 
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=r][Daily running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=p][Daily pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=r][Weekly running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=p][Weekly pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Job efficiency (success/failure)]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=production&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Production jobs efficiency (success/failure)]]

   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-se.ultralight.org&style=detail][Nagios SE]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper.ultralight.org&style=detail][Nagios CE1]]
   * Transfers, *from* Caltech 
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=T2_US_Caltech&dest_filter=&no_mss=true&period=l7d&upto=&.submit=Update][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?src_filter=;period=l7d;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=dest;graph=quantity_rates][Weekly quality]]

   * Transfers, *to* Caltech 
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l7d&upto=][Weekly Rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?graph=quality_all&entity=src&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l48h&upto=][48h quality]]

   * [[http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USCMS][Perfsonar]]

   * [[http://submit-3.t2.ucsd.edu/CSstoragePath/UserPrio/UserPrio-dev.html][Analysis central monitoring per user]]

   * [[http://farmsmon.pi.infn.it/phedex/][Subir phedex page]]

   * [[http://vocms32.cern.ch/gfactory/][GlideIn monitoring page]]

   * [[http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3A%25%3A%3AXrdReport%2F%25%2Fsite&submit=Filter][XrootD DashBoard]]

---++++ OSG

   * [[https://myosg.grid.iu.edu/rgstatushistory/index?downtime_attrs_showpast=&account_type=cumulative_hours&ce_account_type=gip_vo&se_account_type=vo_transfer_volume&bdiitree_type=total_jobs&bdii_object=service&bdii_server=is-osg&start_type=7daysago&start_date=02%2F15%2F2013&end_type=now&end_date=02%2F15%2F2013&facility=on&facility_10006=on&gridtype=on&gridtype_1=on&service_1=on&service_5=on&active=on&active_value=1&disable_value=1][RSV probes]]
   * [[https://t2-monitor.ultralight.org/rsv/][Local RSV]]

---++++ Local pages/systems
   * [[https://t2-monitor.ultralight.org/jobview/][Internal batch system monitoring]]
   * [[http://t2-headnode.ultralight.org:50070/dfshealth.jsp][Tier-2 hadoop monitoring]]
   * [[http://t3-remote.ultralight.org:50070/dfshealth.jsp][Tier-3 remote hadoop monitoring]]
   * [[https://cms-nagios.caltech.edu/nagios/][Nagios]]
   * [[https://bugzilla.hep.caltech.edu/][Ticket system]]
   * [[https://mgmt.hep.caltech.edu/cacti/][Campus cacti]]

---++++ Documentation

   * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit][Changing SITECONF settings]]
   * [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/CRAB3CheatSheet][CRAB3 mantra]]

---+++ Workplan 02/2013

---++++ T2
   * Basic improvements to the T2 cluster management scheme -- these will improve cluster stability and ease maintenance procedures
   * Monitoring improvements and integration
      * Automate the MonaLisa install on all T2 nodes and servers
      * Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
      * Clean up the current Nagios dashboard and reduce alarm frequency to a useful level
         * (Optional) Implement SMS alerts for the most critical Nagios alarms.
   * Integrate all new servers into the Nagios/Backup scheme (could not be done yet)
   * Review network/firewall/security settings -- more of a "learning review"
      * Add a few rules so that cluster management from CERN is easier
      * (Optional) Explore different SSH authentication methods (GSI or Kerberos) for passwordless but still secure SSH access

---++++ T3 (Local & Remote)

   * Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
      * This moves us away from Rocks on all clusters and gives more uniform deployment/configuration procedures.
   * Commission Condor as the batch system; better monitoring comes for free

---++++ Global (related to computing resources in general)
   * [[https://github.com/dmwm/WMCore/wiki/All-in-one-test][Tutorial]] for installing a WMAgent. To be evaluated.

   * Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster (Condor integration, Condor fork).

   * Explore the pros and cons of storing T2 and T3-higgs data under the same hadoop namenode: our local resources would be protected by quotas, and management would get simpler.

   * Explore dynamic Condor job scheduling based on resource utilization and job requirements. Instead of serving a fixed number of slots, Condor would keep scheduling jobs as long as a node's CPU usage is &lt; 90% (configurable) and the node has enough available memory. In general this should provide slightly more job slots than the node has cores. Motivation: most jobs have CPU efficiency &lt; 70%.
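
The admission rule described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Condor configuration: the function name, the node size (8 cores, 32000 MB) and the per-job memory figure (2000 MB) are made-up values for illustration; only the 90% threshold and the 70% job CPU efficiency come from the item above.

```python
def jobs_admitted(cores, job_cpu_eff, node_mem_mb, job_mem_mb, cpu_threshold=0.90):
    """Count how many identical jobs a node accepts under the dynamic rule.

    A new job is started only while the node's CPU usage is below
    cpu_threshold AND there is enough free memory for one more job.
    All parameter names and values are hypothetical.
    """
    running = 0
    while True:
        # Fraction of the node's total CPU currently in use.
        cpu_usage = running * job_cpu_eff / cores
        free_mem_mb = node_mem_mb - running * job_mem_mb
        if cpu_usage < cpu_threshold and free_mem_mb >= job_mem_mb:
            running += 1
        else:
            return running


if __name__ == "__main__":
    # CPU-bound case: 8 cores, jobs at 70% CPU efficiency, plenty of memory.
    print(jobs_admitted(cores=8, job_cpu_eff=0.7, node_mem_mb=32000, job_mem_mb=2000))
    # Memory-bound case: same node with half the memory.
    print(jobs_admitted(cores=8, job_cpu_eff=0.7, node_mem_mb=16000, job_mem_mb=2000))
```

With these assumed numbers the CPU-bound node ends up running more jobs than it has cores, while the memory-bound node stops at the memory limit. A real setup would have to measure usage over time rather than assume a constant efficiency per job.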

   * Improve the BeStMan GridFTP selection algorithm (light development needed). It is currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers.
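
One possible alternative to plain round-robin is a weighted selection in which each GridFTP server is picked in proportion to its link speed. The sketch below uses a smooth weighted round-robin; it is written in plain Python only to show the policy and does not use any BeStMan interface. The server names and the choice of link speed (in GE) as the weight are assumptions.

```python
from collections import Counter
from typing import Dict, List


def weighted_schedule(weights: Dict[str, int], n: int) -> List[str]:
    """Return n server picks using smooth weighted round-robin.

    On every pick each server's running score grows by its weight, the
    server with the highest score is chosen, and the chosen server's
    score is reduced by the sum of all weights.
    """
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    picks = []
    for _ in range(n):
        for name, weight in weights.items():
            current[name] += weight
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks


if __name__ == "__main__":
    # Hypothetical servers, weighted by link speed in GE.
    servers = {"gridftp-fast": 10, "gridftp-slow": 1}
    print(Counter(weighted_schedule(servers, 11)))
```

Over one full cycle (11 picks for weights 10 and 1) the fast server receives ten transfers for every one sent to the slow server, instead of the 50/50 split that round-robin gives. Weighting by current load rather than nominal link speed would be a further refinement.
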
-- Main.samir - 2014-04-01
Topic revision: r4 - 2014-05-30 - samir