---+ CIT HEP Computing

%TOC%

This page centralizes all computing-related subjects.

   * [[ComputingGridTools]]
   * [[ComputingT3HiggsFAQ]]

---++ Workplan guidelines

Our priorities, in this order:

   * Uptime
   * User support (blocker issues)
   * Resource utilization -- making sure that all hardware is being used
   * User support (non-blocker important issues)
   * Backups
   * Monitoring
   * User support (potential non-issues)
   * Security
   * R&D - This probably needs extra manpower. We have ideas but no time.

---+++ Extremely handy links

   * [[http://alimonitor.cern.ch/hepspec/][MonaLisa HepSpec table]]
   * [[https://docs.google.com/spreadsheet/ccc?key=0AvE7aiWBwKzWdHl4MVpSZTRBcjktdXBqWlFhcnZrVmc#gid=0][USCMS T2 HepSpec table]]
   * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=production&highlight=true][Site pledges]]

---+++ Related links

   * [[http://www.uslhcnet.org/][USLHCNET]]
   * [[http://supercomputing.caltech.edu/][SC]]

---++ Upgrades (software)

[[https://twiki.cern.ch/twiki/bin/view/CMSPublic/USCMSTier2Upgrades][USCMS Upgrades twiki]]

---+++ Pledges, etc

   * [[http://gstat-wlcg.cern.ch/apps/pledges/requirements/][REBUS]]
   * [[https://cmsweb.cern.ch/sitedb/prod/sites/T2_US_Caltech][SiteDB]]

---+++ Monitoring links

We need a page that aggregates these, plus some plots from !DashBoard, !PhEDEx, and the central CMS monitoring tools.

---++++ Central CMS

   * Overview
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=T2_US_Caltech&sites%5B%5D=T2_US_Florida&sites%5B%5D=T2_US_MIT&sites%5B%5D=T2_US_Nebraska&sites%5B%5D=T2_US_Purdue&sites%5B%5D=T2_US_UCSD&sites%5B%5D=T2_US_Vanderbilt&sites%5B%5D=T2_US_Wisconsin&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=0&series=All][Site Race]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=1&series=All][Batch system efficiency]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview?site=T2_US_Caltech#currentView=default&highlight=true][Site Status Board]]
      * [[http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_US_Caltech][Site Readiness]]
      * [[http://hammercloud.cern.ch/hc/app/cms/][HammerCloud]]
      * [[http://dashb-ssb-dev.cern.ch/dashboard/templates/sitePendingRunningJobs.html?site=T2_US_Caltech][Running production x Pledge]]
      * [[http://dashb-cms-job-dev.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=Caltech+CMS+T2&sitesSort=10&start=null&end=null&timerange=lastMonth&granularity=Daily&generic=0&sortby=1&series=All&activities%5B%5D=all][Pledges and shares]]
   * Grid jobs
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=r][Daily running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=p][Daily pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=r][Weekly running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=p][Weekly pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Job efficiency (success/failure)]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=production&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Production jobs efficiency (success/failure)]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?hostgroup=site-CIT_CMS_T2&style=overview][SAM Overview CIT_CMS_T2]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-se.ultralight.org&style=detail][Nagios SE]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper.ultralight.org&style=detail][Nagios CE1]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper2.ultralight.org&style=detail][Nagios CE2]]
   * Caltech T2 links from T1s
      * [[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=T1_%2A;period=l24h;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=src;graph=quality_all][Last 24 hours]]
   * Transfers, *from* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=T2_US_Caltech&dest_filter=&no_mss=true&period=l7d&upto=&.submit=Update][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?src_filter=;period=l7d;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=dest;graph=quantity_rates][Weekly quality]]
   * Transfers, *to* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l7d&upto=][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?graph=quality_all&entity=src&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l48h&upto=][48h quality]]
   * [[http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USCMS][Perfsonar]]
   * [[http://submit-3.t2.ucsd.edu/CSstoragePath/UserPrio/UserPrio-dev.html][Analysis central monitoring per user]]
   * [[http://farmsmon.pi.infn.it/phedex/][Subir phedex page]]
   * [[http://vocms32.cern.ch/gfactory/][GlideIn monitoring page]]
   * [[http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3A%25%3A%3AXrdReport%2F%25%2Fsite&submit=Filter][XrootD Monitoring]]
   * [[http://dashb-cms-xrootd-transfers.cern.ch/ui/#c.plots=%255B%257B%2522metric%2522%253A%2522outgoing_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A0%257D%252C%257B%2522metric%2522%253A%2522incoming_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A%25220%2522%257D%252C%257B%2522metric%2522%253A%2522outgoing_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%252C%257B%2522metric%2522%253A%2522incoming_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%255D&c.title=Traffic+by+site&c.xAxis=site&c.yAxis=%255B%257B%2522name%2522%253A%2522Bytes%2522%252C%2522position%2522%253A%2522Left%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%252C%257B%2522name%2522%253A%2522Number%2520of%2520transfers%2522%252C%2522position%2522%253A%2522Right%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%255D&ctr.site=(T2_US_Caltech)][XrootD DashBoard]]
   * [[http://dashb-cms-sum.cern.ch/dashboard/request.py/latestresultssmry-sum#profile=CMS_CRITICAL_FULL&group=AllGroups&site%5B%5D=T2_US_Caltech&flavour%5B%5D=All+Service+Flavours&metric%5B%5D=org.cms.WN-xrootd-access&metric%5B%5D=org.cms.WN-xrootd-fallback&status%5B%5D=All+Exit+Status][xrootd fallback]]
   * [[https://www-ftsmon.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/][FTS not-as-good monitoring]]
   * [[http://dashb-fts-transfers.cern.ch/ui/#date.interval=1440&grouping.dst=(site)&grouping.src=(site)&p.grouping=dst&src.site=(CIT)&tab=transfer_plots][FTS great new dashboard]]

---++++ OSG

   * [[https://myosg.grid.iu.edu/rgstatushistory/index?downtime_attrs_showpast=&account_type=cumulative_hours&ce_account_type=gip_vo&se_account_type=vo_transfer_volume&bdiitree_type=total_jobs&bdii_object=service&bdii_server=is-osg&start_type=7daysago&start_date=02%2F15%2F2013&end_type=now&end_date=02%2F15%2F2013&facility=on&facility_10006=on&gridtype=on&gridtype_1=on&service_1=on&service_5=on&active=on&active_value=1&disable_value=1][RSV probes]]
   * [[https://t2-monitor.ultralight.org/rsv/][Local RSV]]

---++++ Local pages/systems

   * [[https://t2-monitor.ultralight.org/jobview/][Internal batch system monitoring]]
   * [[http://t2-headnode.ultralight.org:50070/dfshealth.jsp][Tier-2 hadoop monitoring]]
   * [[http://t3-remote.ultralight.org:50070/dfshealth.jsp][Tier-3 remote hadoop monitoring]]
   * [[https://cms-nagios.caltech.edu/nagios/][Nagios]]
   * [[https://bugzilla.hep.caltech.edu/][Ticket system]]
   * [[https://mgmt.hep.caltech.edu/cacti/][Campus cacti]]
   * [[http://monalisa4.ultralight.org/display?page=sensors/temp&dont_cache=true][MonaLISA Temperature]]
   * [[http://zabbix.hep.caltech.edu/zabbix/overview.php?sid=529298ad22b9043e&form_refresh=1&groupid=13&view_style=0][Zabbix worker node checks]]
   * [[http://zabbix.hep.caltech.edu/zabbix/report6.php?sid=1f3ab99f03ec0616&form_refresh=1&form=1&report_show=show&config=3&report_timesince=20140818000000&report_timetill=20140819000000&sortorder=0&hostids%5B10813%5D=10813&hostids%5B10790%5D=10790&hostids%5B10711%5D=10711&hostids%5B10717%5D=10717&hostids%5B10657%5D=10657&hostids%5B10658%5D=10658&hostids%5B10656%5D=10656&hostids%5B10687%5D=10687&hostids%5B10827%5D=10827&hostids%5B10692%5D=10692&hostids%5B10823%5D=10823&hostids%5B10664%5D=10664&hostids%5B10759%5D=10759&hostids%5B10825%5D=10825&hostids%5B10719%5D=10719&hostids%5B10712%5D=10712&hostids%5B10689%5D=10689&hostids%5B10663%5D=10663&itemid=53145&title=Report+3&xlabel=&ylabel=&groupid=0&report_timesince_day=18&report_timesince_month=08&report_timesince_year=2014&report_timesince_hour=00&report_timesince_minute=00&report_timetill_day=19&report_timetill_month=08&report_timetill_year=2014&report_timetill_hour=00&report_timetill_minute=00&scaletype=2&avgperiod=2&item_name=compute-6-8.tier2%3A+Free+disk+space+on+%2Fcvmfs%2Fgrid.cern.ch&palette=0&palettetype=1&report_show=Show][Zabbix distribution of item for nodes]]

---++++ Documentation

   * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit][Changing SITECONF settings]]
   * [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/CRAB3CheatSheet][CRAB3 mantra]]

---+++ Monitoring shifts

---++++ Daily

   * Check Readiness
   * Check Site Status Board
   * Check [[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=T2_US_Caltech;period=l48h;no_mss=true;dest_filter=;upto=;entity=dest;graph=quality_all][PhEDEx]] Transfers ([[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=;period=l48h;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=src;graph=quality_all][both directions]])
   * Check HDFS status and corrupted blocks
   * Check [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sites%5B%5D=T2_US_Florida&sites%5B%5D=T2_US_MIT&sites%5B%5D=T2_US_Nebraska&sites%5B%5D=T2_US_Purdue&sites%5B%5D=T2_US_UCSD&sites%5B%5D=T2_US_Vanderbilt&sites%5B%5D=T2_US_Wisconsin&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=0&series=All][Job Failure rate]], and the reasons if there is a site problem
   * Check [[http://t2-headnode-new.ultralight.org/ganglia/?c=T2_US_Caltech-puppet&m=load_one&r=month&s=descending&hc=4&mc=2][Ganglia]] plots
   * Check [[http://zabbix.hep.caltech.edu/zabbix/screens.php?ddreset=1&sid=1f3ab99f03ec0616][GridFTP Screen]] on Zabbix

---++++ Weekly

   * Check RAID status on servers
      * SRM
      * CEs
      * GridFTPs
      * GUMS
      * Do this fast with: =[root@t2-headnode-new lists]# pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks=
      * PhEDEx
      * T3 Headnode
      * T3 Login node
      * T3 JBOD
         * Query with: =storcli64 /c0/v0 show=
      * Newman
         * Query with: =areca_cli64 ; vsf info=
      * LDAP2
   * Check that the important servers still have a working backup
   * Nodes / Cores / Storage counts - once we update the inventory and have defined totals.

---+++ Monitoring requirements

   * RSV
      * Cert exists
      * Cert validity
   * Nodes
      * GLEXEC
      * CVMFS
      * SWAP Trigger
      * IOPS on HDFS
      * IOPS on /
      * Proc / User
      * Open Files / User
   * Servers
      * RAID states
   * NameNode/HDFS
      * Health
      * Checkpoints
      * NN - All filesystems usage maximum alert
   * External
      * PhEDEx data on transfers
      * Data from SAM
      * Data from RSV
      * Data from DashBoard - Blackhole nodes

-- Main.samir - 2014-04-01
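The weekly software-RAID check and the "RAID states" monitoring requirement could be automated rather than eyeballed. The following is a minimal sketch, not an existing tool at the site: it takes the text of =/proc/mdstat= (for example as collected per host by the =pssh= command in the weekly list) and returns the arrays that report a missing member. The function name =degraded_arrays= and both regular expressions are illustrative assumptions, and they assume the usual mdstat layout in which each array's status line ends with something like =[2/2] [UU]=.

```python
import re

# Hypothetical helper, not part of any site tooling.
# An array header line looks like: "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# The status line ends like "[2/1] [U_]": [configured/active] and one
# character per member, where "_" marks a missing or failed member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            configured = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            if active < configured or "_" in flags:
                degraded.append(current)
            current = None
    return degraded
```

To use it, pass the contents of =/proc/mdstat= from one host as a string and alert if the returned list is not empty. Hardware controllers (the =storcli64= and =areca_cli64= cases) are not covered by this sketch.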
Topic revision: r22 - 2015-03-16 - samir