---+ CIT HEP Computing

%TOC%

This page centralizes all computing-related subjects:

   * [[ComputingGridTools]]
   * [[ComputingT3HiggsFAQ]]

---++ Workplan guidelines

Our priorities, in this order:

   * Uptime
   * User support (blocker issues)
   * Resource utilization -- making sure that all hardware is being used
   * User support (non-blocker important issues)
   * Backups
   * Monitoring
   * User support (potential non-issues)
   * Security
   * R&D -- this probably needs extra manpower; we have ideas but no time.

---++ Monitoring links

---+++ Extremely handy links

   * [[http://alimonitor.cern.ch/hepspec/][MonaLisa HepSpec table]]
   * [[https://docs.google.com/spreadsheet/ccc?key=0AvE7aiWBwKzWdHl4MVpSZTRBcjktdXBqWlFhcnZrVmc#gid=0][USCMS T2 HepSpec table]]
   * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=production&highlight=true][Sites pledges]]

---++ Upgrades (software)

[[https://twiki.cern.ch/twiki/bin/view/CMSPublic/USCMSTier2Upgrades][USCMS Upgrades twiki]]

---+++ Pledges, etc.

   * [[http://gstat-wlcg.cern.ch/apps/pledges/requirements/][REBUS]]
   * [[https://cmsweb.cern.ch/sitedb/prod/sites/T2_US_Caltech][SiteDB]]

---+++ Monitoring links

We need a page to aggregate these, plus some !DashBoard + !PhEDEx + central CMS monitoring plots.

---++++ Central CMS

   * Overview
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=1&series=All][Batch system efficiency]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview?site=T2_US_Caltech#currentView=default&highlight=true][Site Status Board]]
      * [[http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_US_Caltech][Site Readiness]]
      * [[http://dashb-ssb-dev.cern.ch/dashboard/templates/sitePendingRunningJobs.html?site=T2_US_Caltech][Running production x Pledge]]
      * [[http://dashb-cms-job-dev.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=Caltech+CMS+T2&sitesSort=10&start=null&end=null&timerange=lastMonth&granularity=Daily&generic=0&sortby=1&series=All&activities%5B%5D=all][Pledges and shares]]
   * Grid jobs
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=r][Daily running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=p][Daily pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=r][Weekly running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=p][Weekly pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Job efficiency (success/failure)]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=production&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Production jobs efficiency (success/failure)]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-se.ultralight.org&style=detail][Nagios SE]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper.ultralight.org&style=detail][Nagios CE1]]
   * Transfers, *from* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=T2_US_Caltech&dest_filter=&no_mss=true&period=l7d&upto=&.submit=Update][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?src_filter=;period=l7d;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=dest;graph=quantity_rates][Weekly quality]]
   * Transfers, *to* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l7d&upto=][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?graph=quality_all&entity=src&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l48h&upto=][48h quality]]
   * [[http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USCMS][Perfsonar]]
   * [[http://submit-3.t2.ucsd.edu/CSstoragePath/UserPrio/UserPrio-dev.html][Analysis central monitoring per user]]
   * [[http://farmsmon.pi.infn.it/phedex/][Subir phedex page]]
   * [[http://vocms32.cern.ch/gfactory/][GlideIn monitoring page]]
   * [[http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3A%25%3A%3AXrdReport%2F%25%2Fsite&submit=Filter][XrootD Monitoring]]
   * [[http://dashb-cms-xrootd-transfers.cern.ch/ui/#c.plots=%255B%257B%2522metric%2522%253A%2522outgoing_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A0%257D%252C%257B%2522metric%2522%253A%2522incoming_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A%25220%2522%257D%252C%257B%2522metric%2522%253A%2522outgoing_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%252C%257B%2522metric%2522%253A%2522incoming_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%255D&c.title=Traffic+by+site&c.xAxis=site&c.yAxis=%255B%257B%2522name%2522%253A%2522Bytes%2522%252C%2522position%2522%253A%2522Left%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%252C%257B%2522name%2522%253A%2522Number%2520of%2520transfers%2522%252C%2522position%2522%253A%2522Right%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%255D&ctr.site=(T2_US_Caltech)][XrootD DashBoard]]
   * [[https://www-ftsmon.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/][FTS not-as-good monitoring]]
   * [[http://dashb-fts-transfers.cern.ch/ui/][FTS great new dashboard]]

---++++ OSG

   * [[https://myosg.grid.iu.edu/rgstatushistory/index?downtime_attrs_showpast=&account_type=cumulative_hours&ce_account_type=gip_vo&se_account_type=vo_transfer_volume&bdiitree_type=total_jobs&bdii_object=service&bdii_server=is-osg&start_type=7daysago&start_date=02%2F15%2F2013&end_type=now&end_date=02%2F15%2F2013&facility=on&facility_10006=on&gridtype=on&gridtype_1=on&service_1=on&service_5=on&active=on&active_value=1&disable_value=1][RSV probes]]
   * [[https://t2-monitor.ultralight.org/rsv/][Local RSV]]

---++++ Local pages/systems

   * [[https://t2-monitor.ultralight.org/jobview/][Internal batch system monitoring]]
   * [[http://t2-headnode.ultralight.org:50070/dfshealth.jsp][Tier-2 hadoop monitoring]]
   * [[http://t3-remote.ultralight.org:50070/dfshealth.jsp][Tier-3 remote hadoop monitoring]]
   * [[https://cms-nagios.caltech.edu/nagios/][Nagios]]
   * [[https://bugzilla.hep.caltech.edu/][Ticket system]]
   * [[https://mgmt.hep.caltech.edu/cacti/][Campus cacti]]

---++++ Documentation

   * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit][Changing SITECONF settings]]
   * [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/CRAB3CheatSheet][CRAB3 mantra]]

---+++ Task wishlist

---++++ T2

   * Base improvements to the T2 cluster management scheme -- will improve cluster stability and ease maintenance procedures
   * Improvements to and integration of the monitoring
      * Automate the MonaLisa install on all T2 nodes and servers
      * Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
      * Clean up the current Nagios dashboard, reducing alarms to an applicable frequency
      * (Optional) Implement SMS alerts for the most critical Nagios alarms
   * Integrate all new servers into the Nagios/backup scheme (not done yet)
   * Review of network/firewall/security settings -- more a "learning review"
      * Add a few rules so cluster management from CERN is easier
   * (Optional) Explore different SSH authentication methods (GSI or Kerberos) -- passwordless but still secure SSH access

---++++ T3 (Local & Remote)

   * Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
      * This will let us move away from Rocks on all clusters and make deployment/configuration procedures more uniform
   * Commission Condor as the batch system; better monitoring comes for free

---++++ Global (related to computing resources in general)

   * [[https://github.com/dmwm/WMCore/wiki/All-in-one-test][Tutorial]] to install a WMAgent. Let's see.
   * Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster (Condor integration, Condor fork)
   * Explore the pros/cons of this: the T2 and T3-higgs nodes can store data under the same hadoop namenode, with our local resources protected by quotas. Management gets simpler.
   * Explore dynamic Condor job scheduling based on resource utilization and job requirements: instead of serving a fixed number of slots, Condor would keep scheduling jobs onto a node as long as CPU usage is < 90% (configurable) and enough memory is available. In general this should provide slightly more job slots than the number of cores in a node. Motivation: most jobs have CPU efficiency < 70%. (A rough sketch of the admission logic follows this list.)
   * Improve the BeStMan GridFTP selection algorithm (light development needed) -- currently round-robin, which is very sub-optimal for a setup like ours with both fast (10 GE) and slow (1 GE) servers. (A sketch of a weighted selection policy also follows this list.)
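To make the dynamic-scheduling item above concrete, here is a minimal Python sketch of the node-side admission test, assuming the 90% CPU ceiling from the item and an arbitrary free-memory floor. How it would be wired into Condor (for example as a startd cron hook publishing a custom ClassAd attribute that the START expression requires) is an assumption, not something we run today.

<verbatim>
#!/usr/bin/env python
"""Sketch of the 'dynamic slots' idea: only accept one more job while node CPU
usage is below a ceiling and enough memory is still free.  The thresholds and
the way this would feed Condor (e.g. a startd cron hook publishing a custom
ClassAd attribute) are assumptions, not our current setup."""

import os

CPU_CEILING = 0.90          # "CPU usage < 90%", configurable
MIN_FREE_MEM_MB = 2048      # assumed per-job memory floor, configurable


def cpu_usage():
    """Approximate CPU usage as the 1-minute load average per core."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.sysconf("SC_NPROCESSORS_ONLN") or 1)


def free_mem_mb():
    """Free plus reclaimable memory in MB, read from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as fh:
        for line in fh:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])      # values are reported in kB
    return (info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)) / 1024.0


def accept_one_more_job():
    return cpu_usage() < CPU_CEILING and free_mem_mb() > MIN_FREE_MEM_MB


if __name__ == "__main__":
    # Printed in ClassAd form so a startd cron job could publish it and
    # the node's START expression could require AcceptMoreJobs.
    print("AcceptMoreJobs = %s" % accept_one_more_job())
</verbatim>

The load-average-per-core proxy is only there to keep the sketch to a single sample; a real probe would measure CPU usage directly over an interval.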
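Similarly, for the GridFTP selection item, replacing round-robin essentially boils down to a weighted random pick, with weights proportional to each door's NIC speed. The hostnames and weights below are made up for illustration, and this sketches only the policy, not the BeStMan plugin interface.

<verbatim>
#!/usr/bin/env python
"""Sketch of weighted (instead of round-robin) GridFTP door selection.
Hostnames and weights are illustrative only; weights are meant to be
proportional to the NIC speed of each door (10 GE vs 1 GE)."""

import random

# door -> weight (e.g. Gb/s of its uplink); example values only
GRIDFTP_DOORS = {
    "gridftp-10ge-1.example.org": 10,
    "gridftp-10ge-2.example.org": 10,
    "gridftp-1ge-1.example.org": 1,
    "gridftp-1ge-2.example.org": 1,
}


def pick_door(doors):
    """Pick a door with probability proportional to its weight."""
    threshold = random.uniform(0, sum(doors.values()))
    running = 0.0
    for host, weight in doors.items():
        running += weight
        if threshold <= running:
            return host
    return host  # only reachable via floating-point rounding; last door


if __name__ == "__main__":
    # Over many picks the 10 GE doors should get ~10x the share of 1 GE ones.
    picks = [pick_door(GRIDFTP_DOORS) for _ in range(10000)]
    for host in GRIDFTP_DOORS:
        print("%-30s %5d" % (host, picks.count(host)))
</verbatim>

Over many transfers a 10 GE door then receives roughly ten times the share of a 1 GE door, which matches the bandwidth ratio instead of splitting traffic evenly.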
---+++ Monitoring requirements

A minimal sketch of one of the per-node checks (processes and open files per user) is at the bottom of this page.

   * RSV
      * Cert exists
      * Cert validity
   * Nodes
      * GLEXEC
      * CVMFS
      * SWAP Trigger
      * IOPS on HDFS
      * IOPS on /
      * Proc / User
      * Open Files / User
   * Servers
      * RAID states
   * NameNode/HDFS
      * Health
      * Checkpoints
      * NN - maximum usage alert across all filesystems
   * External
      * PhEDEx data on transfers
      * Data from SAM
      * Data from RSV
      * Data from DashBoard - Blackhole nodes

-- Main.samir - 2014-04-01
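As a starting point for the per-node checks listed under Monitoring requirements (Proc / User, Open Files / User), here is a minimal sketch in the usual Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL). The thresholds are placeholders, not agreed values.

<verbatim>
#!/usr/bin/env python
"""Sketch of a per-user process / open-file-descriptor check for worker nodes,
in Nagios plugin style (exit 0 = OK, 1 = WARNING, 2 = CRITICAL).
Thresholds are placeholders; a real check would take them as options."""

import os
import pwd
import sys
from collections import defaultdict

PROC_WARN, PROC_CRIT = 500, 1000      # processes per user (placeholder values)
FD_WARN, FD_CRIT = 4096, 8192         # open fds per user (placeholder values)


def per_user_counts():
    """Count processes and open file descriptors per user by walking /proc."""
    procs = defaultdict(int)
    fds = defaultdict(int)
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            uid = os.stat("/proc/" + pid).st_uid
            nfd = len(os.listdir("/proc/%s/fd" % pid))
        except OSError:            # process exited, or we lack permission
            continue
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)
        procs[user] += 1
        fds[user] += nfd
    return procs, fds


def main():
    procs, fds = per_user_counts()
    status, messages = 0, []
    for user in procs:
        if procs[user] >= PROC_CRIT or fds[user] >= FD_CRIT:
            status = 2
            messages.append("%s: %d procs, %d fds" % (user, procs[user], fds[user]))
        elif procs[user] >= PROC_WARN or fds[user] >= FD_WARN:
            status = max(status, 1)
            messages.append("%s: %d procs, %d fds" % (user, procs[user], fds[user]))
    print(messages and "; ".join(messages) or "OK: all users under thresholds")
    sys.exit(status)


if __name__ == "__main__":
    main()
</verbatim>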