---+ CIT HEP Computing

%TOC%

This page centralizes all computing-related subjects.
   * [[ComputingGridTools]]
   * [[ComputingT3HiggsFAQ]]

---++ Workplan guidelines

Our priorities, in order:
   * Uptime
   * User support (blocker issues)
   * Resource utilization -- making sure that all hardware is being used
   * User support (non-blocker important issues)
   * Backups
   * Monitoring
   * User support (potential non-issues)
   * Security
   * R&D - This probably needs extra manpower. We have ideas but no time.

---+++ Extremely handy links
   * [[http://alimonitor.cern.ch/hepspec/][MonaLisa HepSpec table]]
   * [[https://docs.google.com/spreadsheet/ccc?key=0AvE7aiWBwKzWdHl4MVpSZTRBcjktdXBqWlFhcnZrVmc#gid=0][USCMS T2 HepSpec table]]
   * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=production&highlight=true][Sites pledges]]

---+++ Related links
   * [[http://www.uslhcnet.org/][USLHCNET]]
   * [[http://supercomputing.caltech.edu/][SC]]

---++ Upgrades (software)

[[https://twiki.cern.ch/twiki/bin/view/CMSPublic/USCMSTier2Upgrades][USCMS Upgrades twiki]]

---+++ Pledges, etc
   * [[http://gstat-wlcg.cern.ch/apps/pledges/requirements/][REBUS]]
   * [[https://cmsweb.cern.ch/sitedb/prod/sites/T2_US_Caltech][SiteDB]]

---+++ Monitoring links

We need a page that aggregates these links, plus some plots from !DashBoard, !PhEDEx, and the central CMS monitoring tools.
---++++ Central CMS
   * Overview
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=T2_US_Caltech&sites%5B%5D=T2_US_Florida&sites%5B%5D=T2_US_MIT&sites%5B%5D=T2_US_Nebraska&sites%5B%5D=T2_US_Purdue&sites%5B%5D=T2_US_UCSD&sites%5B%5D=T2_US_Vanderbilt&sites%5B%5D=T2_US_Wisconsin&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=0&series=All][Site Race]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=1&series=All][Batch system efficiency]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview?site=T2_US_Caltech#currentView=default&highlight=true][Site Status Board]]
      * [[http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_US_Caltech][Site Readiness]]
      * [[http://hammercloud.cern.ch/hc/app/cms/][HammerCloud]]
      * [[http://dashb-ssb-dev.cern.ch/dashboard/templates/sitePendingRunningJobs.html?site=T2_US_Caltech][Running production vs. pledge]]
      * [[http://dashb-cms-job-dev.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=Caltech+CMS+T2&sitesSort=10&start=null&end=null&timerange=lastMonth&granularity=Daily&generic=0&sortby=1&series=All&activities%5B%5D=all][Pledges and shares]]
   * Grid jobs
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=r][Daily running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=p][Daily pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=r][Weekly running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=p][Weekly pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Job efficiency (success/failure)]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=production&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Production jobs efficiency (success/failure)]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?hostgroup=site-CIT_CMS_T2&style=overview][SAM Overview CIT_CMS_T2]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-se.ultralight.org&style=detail][Nagios SE]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper.ultralight.org&style=detail][Nagios CE1]]
   * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper2.ultralight.org&style=detail][Nagios CE2]]
   * Caltech T2 links from T1s
      * [[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=T1_%2A;period=l24h;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=src;graph=quality_all][Last 24 hours]]
   * Transfers, *from* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=T2_US_Caltech&dest_filter=&no_mss=true&period=l7d&upto=&.submit=Update][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?src_filter=;period=l7d;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=dest;graph=quantity_rates][Weekly quality]]
   * Transfers, *to* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l7d&upto=][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?graph=quality_all&entity=src&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l48h&upto=][48h quality]]
   * [[http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USCMS][Perfsonar]]
   * [[http://submit-3.t2.ucsd.edu/CSstoragePath/UserPrio/UserPrio-dev.html][Analysis central monitoring per user]]
   * [[http://farmsmon.pi.infn.it/phedex/][Subir PhEDEx page]]
   * [[http://vocms32.cern.ch/gfactory/][GlideIn monitoring page]]
   * [[http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3A%25%3A%3AXrdReport%2F%25%2Fsite&submit=Filter][XrootD Monitoring]]
   * [[http://dashb-cms-xrootd-transfers.cern.ch/ui/#c.plots=%255B%257B%2522metric%2522%253A%2522outgoing_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A0%257D%252C%257B%2522metric%2522%253A%2522incoming_bytes%2522%252C%2522type%2522%253A%2522column%2522%252C%2522yAxis%2522%253A%25220%2522%257D%252C%257B%2522metric%2522%253A%2522outgoing_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%252C%257B%2522metric%2522%253A%2522incoming_transfers%2522%252C%2522type%2522%253A%2522scatter%2522%252C%2522yAxis%2522%253A%25221%2522%257D%255D&c.title=Traffic+by+site&c.xAxis=site&c.yAxis=%255B%257B%2522name%2522%253A%2522Bytes%2522%252C%2522position%2522%253A%2522Left%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%252C%257B%2522name%2522%253A%2522Number%2520of%2520transfers%2522%252C%2522position%2522%253A%2522Right%2522%252C%2522type%2522%253A%2522logarithmic%2522%257D%255D&ctr.site=(T2_US_Caltech)][XrootD DashBoard]]
   * [[http://dashb-cms-sum.cern.ch/dashboard/request.py/latestresultssmry-sum#profile=CMS_CRITICAL_FULL&group=AllGroups&site%5B%5D=T2_US_Caltech&flavour%5B%5D=All+Service+Flavours&metric%5B%5D=org.cms.WN-xrootd-access&metric%5B%5D=org.cms.WN-xrootd-fallback&status%5B%5D=All+Exit+Status][xrootd fallback]]
   * [[https://www-ftsmon.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/][FTS not-as-good monitoring]]
   * [[http://dashb-fts-transfers.cern.ch/ui/#date.interval=1440&grouping.dst=(site)&grouping.src=(site)&p.grouping=dst&src.site=(CIT)&tab=transfer_plots][FTS great new dashboard]]

---++++ OSG
   * [[https://myosg.grid.iu.edu/rgstatushistory/index?downtime_attrs_showpast=&account_type=cumulative_hours&ce_account_type=gip_vo&se_account_type=vo_transfer_volume&bdiitree_type=total_jobs&bdii_object=service&bdii_server=is-osg&start_type=7daysago&start_date=02%2F15%2F2013&end_type=now&end_date=02%2F15%2F2013&facility=on&facility_10006=on&gridtype=on&gridtype_1=on&service_1=on&service_5=on&active=on&active_value=1&disable_value=1][RSV probes]]
   * [[https://t2-monitor.ultralight.org/rsv/][Local RSV]]

---++++ Local pages/systems
   * [[https://t2-monitor.ultralight.org/jobview/][Internal batch system monitoring]]
   * [[http://t2-headnode.ultralight.org:50070/dfshealth.jsp][Tier-2 hadoop monitoring]]
   * [[http://t3-remote.ultralight.org:50070/dfshealth.jsp][Tier-3 remote hadoop monitoring]]
   * [[https://cms-nagios.caltech.edu/nagios/][Nagios]]
   * [[https://bugzilla.hep.caltech.edu/][Ticket system]]
   * [[https://mgmt.hep.caltech.edu/cacti/][Campus cacti]]
   * [[http://monalisa4.ultralight.org/display?page=sensors/temp&dont_cache=true][MonaLISA Temperature]]
   * [[http://zabbix.hep.caltech.edu/zabbix/overview.php?sid=529298ad22b9043e&form_refresh=1&groupid=13&view_style=0][Zabbix worker node checks]]
   * [[http://zabbix.hep.caltech.edu/zabbix/report6.php?sid=1f3ab99f03ec0616&form_refresh=1&form=1&report_show=show&config=3&report_timesince=20140818000000&report_timetill=20140819000000&sortorder=0&hostids%5B10813%5D=10813&hostids%5B10790%5D=10790&hostids%5B10711%5D=10711&hostids%5B10717%5D=10717&hostids%5B10657%5D=10657&hostids%5B10658%5D=10658&hostids%5B10656%5D=10656&hostids%5B10687%5D=10687&hostids%5B10827%5D=10827&hostids%5B10692%5D=10692&hostids%5B10823%5D=10823&hostids%5B10664%5D=10664&hostids%5B10759%5D=10759&hostids%5B10825%5D=10825&hostids%5B10719%5D=10719&hostids%5B10712%5D=10712&hostids%5B10689%5D=10689&hostids%5B10663%5D=10663&itemid=53145&title=Report+3&xlabel=&ylabel=&groupid=0&report_timesince_day=18&report_timesince_month=08&report_timesince_year=2014&report_timesince_hour=00&report_timesince_minute=00&report_timetill_day=19&report_timetill_month=08&report_timetill_year=2014&report_timetill_hour=00&report_timetill_minute=00&scaletype=2&avgperiod=2&item_name=compute-6-8.tier2%3A+Free+disk+space+on+%2Fcvmfs%2Fgrid.cern.ch&palette=0&palettetype=1&report_show=Show][Zabbix distribution of item for nodes]]

---++++ Documentation
   * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit][Changing SITECONF settings]]
   * [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/CRAB3CheatSheet][CRAB3 mantra]]

---+++ Monitoring shifts

---++++ Daily
   * Check Readiness
   * Check Site Status Board
   * Check [[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=T2_US_Caltech;period=l48h;no_mss=true;dest_filter=;upto=;entity=dest;graph=quality_all][PhEDEx]] Transfers ([[https://cmsweb.cern.ch/phedex/debug/Activity::QualityPlots?src_filter=;period=l48h;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=src;graph=quality_all][both directions]])
   * Check HDFS status and corrupted blocks
   * Check [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sites%5B%5D=T2_US_Florida&sites%5B%5D=T2_US_MIT&sites%5B%5D=T2_US_Nebraska&sites%5B%5D=T2_US_Purdue&sites%5B%5D=T2_US_UCSD&sites%5B%5D=T2_US_Vanderbilt&sites%5B%5D=T2_US_Wisconsin&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=0&series=All][Job Failure rate]], and the reasons if there is a site problem
   * Check [[http://t2-headnode-new.ultralight.org/ganglia/?c=T2_US_Caltech-puppet&m=load_one&r=month&s=descending&hc=4&mc=2][Ganglia]] plots
   * Check [[http://zabbix.hep.caltech.edu/zabbix/screens.php?ddreset=1&sid=1f3ab99f03ec0616][GridFTP Screen]] on Zabbix

---++++ Weekly
   * Check RAID status on servers
      * SRM
      * CEs
      * GridFTPs
      * GUMS
      * Do this fast with: [root@t2-headnode-new lists]# pssh -i -h raid-check-list.txt "cat /proc/mdstat" | grep "blocks super" | grep -v chunks
      * PhEDEx
      * T3 Headnode
      * T3 Login node
      * T3 JBOD
         * Query with: storcli64 /c0/v0 show
      * Newman
         * Query with: areca_cli64 ; vsf info
      * LDAP2
   * Check that the important servers still have a working backup
   * Nodes / Cores / Storage counts - once we update the inventory and have defined totals.

---+++ Monitoring requirements
   * RSV
      * Cert exists
      * Cert validity
   * Nodes
      * GLEXEC
      * CVMFS
      * SWAP Trigger
      * IOPS on HDFS
      * IOPS on /
      * Proc / User
      * Open Files / User
   * Servers
      * RAID states
   * NameNode/HDFS
      * Health
      * Checkpoints
      * NN - All filesystems usage maximum alert
   * External
      * PhEDEx data on transfers
      * Data from SAM
      * Data from RSV
      * Data from DashBoard - Blackhole nodes

-- Main.samir - 2014-04-01
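The weekly software-RAID check (the =pssh= one-liner over =/proc/mdstat=) still leaves someone reading the =[n/m] [UU]= columns by eye. A minimal sketch of how that reading could be automated is below. It is not part of the site's tooling: the function name =degraded_arrays= and the sample text are hypothetical, and it assumes the usual Linux md layout where the status line follows the =mdN := line.

```python
import re

# Start of an array stanza in /proc/mdstat, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# Status columns on the following line, e.g. "[2/2] [UU]" or "[2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report fewer active devices than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            # "_" marks a missing or failed member device.
            if active < expected or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample, shaped like the output of "cat /proc/mdstat".
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      524224 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sda2[0]
      976106304 blocks super 1.1 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

One possible use is to feed each host's =cat /proc/mdstat= output to this function and report only the hosts where the returned list is non-empty. Hardware RAID controllers (the =storcli64= and =areca_cli64= cases) are not covered by this sketch.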
Topic revision: r22 - 2015-03-16 - samir