---+ CIT HEP Computing
%TOC%

This page centralizes all computing-related subjects.
   * [[ComputingGridTools]]
   * [[ComputingT3HiggsFAQ]]

---++ Workplan guidelines
Our priorities, in this order:
   * Uptime
   * User support (blocker issues)
   * Resource utilization -- making sure that all hardware is being used
   * User support (non-blocker important issues)
   * Backups
   * Monitoring
   * User support (potential non-issues)
   * Security
   * R&D - This probably needs extra manpower. We have ideas but no time.

---++ Monitoring links
---+++ Extremely handy links :
   * [[http://alimonitor.cern.ch/hepspec/][MonaLisa HepSpec table]]
   * [[https://docs.google.com/spreadsheet/ccc?key=0AvE7aiWBwKzWdHl4MVpSZTRBcjktdXBqWlFhcnZrVmc#gid=0][USCMS T2 HepSpec table]]
   * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=production&highlight=true][Sites pledges]]

---++ Upgrades (software)
[[https://twiki.cern.ch/twiki/bin/viewauth/CMS/USCMSTier2Upgrades][USCMS Upgrades twiki]]

---+++ Pledges, etc
   * [[http://gstat-wlcg.cern.ch/apps/pledges/requirements/][REBUS]]
   * [[https://cmsweb.cern.ch/sitedb/prod/sites/T2_US_Caltech][SiteDB]]

---+++ Monitoring links
We need a page to aggregate those, plus some !DashBoard + !PhEDEx + central CMS monitoring tools plots.
---++++ Central CMS
   * Overview
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites%5B%5D=T2_US_Caltech&sitesSort=2&start=null&end=null&timerange=last24&granularity=Hourly&generic=0&sortby=1&series=All][Batch system efficiency]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/siteview?site=T2_US_Caltech#currentView=default&highlight=true][Site Status Board]]
      * [[http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_US_Caltech][Site Readiness]]
      * [[http://dashb-ssb-dev.cern.ch/dashboard/templates/sitePendingRunningJobs.html?site=T2_US_Caltech][Running production x Pledge]]
      * [[http://dashb-cms-job-dev.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=Caltech+CMS+T2&sitesSort=10&start=null&end=null&timerange=lastMonth&granularity=Daily&generic=0&sortby=1&series=All&activities%5B%5D=all][Pledges and shares]]
   * Grid jobs
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=r][Daily running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=p][Daily pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=r][Weekly running]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/jobnumbers_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=0&type=p][Weekly pending]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=all&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Job efficiency (success/failure)]]
      * [[http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T2_US_Caltech&activities=production&sitesSort=2&start=null&end=null&timeRange=last24&granularity=Hourly&generic=0&sortBy=0&type=ea][Production jobs efficiency (success/failure)]]
      * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-se.ultralight.org&style=detail][Nagios SE]]
      * [[https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?host=cit-gatekeeper.ultralight.org&style=detail][Nagios CE1]]
   * Transfers, *from* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=T2_US_Caltech&dest_filter=&no_mss=true&period=l7d&upto=&.submit=Update][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?src_filter=;period=l7d;no_mss=true;dest_filter=T2_US_Caltech;upto=;entity=dest;graph=quantity_rates][Weekly quality]]
   * Transfers, *to* Caltech
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l7d&upto=][Weekly rate]]
      * [[https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots?graph=quality_all&entity=src&src_filter=&dest_filter=T2_US_Caltech&no_mss=true&period=l48h&upto=][48h quality]]
   * [[http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USCMS][Perfsonar]]
   * [[http://submit-3.t2.ucsd.edu/CSstoragePath/UserPrio/UserPrio-dev.html][Analysis central monitoring per user]]
   * [[http://farmsmon.pi.infn.it/phedex/][Subir phedex page]]
   * [[http://vocms32.cern.ch/gfactory/][GlideIn monitoring page]]
   * [[http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3A%25%3A%3AXrdReport%2F%25%2Fsite&submit=Filter][XrootD DashBoard]]

---++++ OSG
   * [[https://myosg.grid.iu.edu/rgstatushistory/index?downtime_attrs_showpast=&account_type=cumulative_hours&ce_account_type=gip_vo&se_account_type=vo_transfer_volume&bdiitree_type=total_jobs&bdii_object=service&bdii_server=is-osg&start_type=7daysago&start_date=02%2F15%2F2013&end_type=now&end_date=02%2F15%2F2013&facility=on&facility_10006=on&gridtype=on&gridtype_1=on&service_1=on&service_5=on&active=on&active_value=1&disable_value=1][RSV probes]]
   * [[https://t2-monitor.ultralight.org/rsv/][Local RSV]]

---++++ Local pages/systems
   * [[https://t2-monitor.ultralight.org/jobview/][Internal batch system monitoring]]
   * [[http://t2-headnode.ultralight.org:50070/dfshealth.jsp][Tier-2 hadoop monitoring]]
   * [[http://t3-remote.ultralight.org:50070/dfshealth.jsp][Tier-3 remote hadoop monitoring]]
   * [[https://cms-nagios.caltech.edu/nagios/][Nagios]]
   * [[https://bugzilla.hep.caltech.edu/][Ticket system]]
   * [[https://mgmt.hep.caltech.edu/cacti/][Campus cacti]]

---++++ Documentation
   * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit][Changing SITECONF settings]]
   * [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/CRAB3CheatSheet][CRAB3 mantra]]

---+++ Workplan 02/2013
---++++ T2
   * Base improvements in the T2 cluster management schema -- will improve cluster stability and ease maintenance procedures
   * Improvements/integration on the monitoring
      * Automate MonaLisa install on all T2 nodes and servers
      * Clean up the current MonaLisa web dashboard with the help of the MonaLisa team
      * Clean up the current Nagios dashboard, reduce to an applicable alarm frequency
      * (Optional) Implement SMS alerts for the most critical Nagios alarms
      * Integrate all new servers into the Nagios/Backup schema (could not be done yet)
   * Review of network/firewalls/security settings -- more a "learning review"
      * Add a few rules so cluster management from CERN is easier
      * (Optional) Explore different SSH authentication methods (GSI or Kerberos) - passwordless but still secure SSH access

---++++ T3 (Local & Remote)
   * Automate node deployment and profile maintenance with local (dedicated headnode) or remote (T2) puppet+foreman
      * This will let us move away from Rocks in all clusters and have more uniformity in deployment/configuration procedures.
   * Commission Condor as a batch system, better monitoring comes for free

---++++ Global (related to computing resources in general) :
   * [[https://github.com/dmwm/WMCore/wiki/All-in-one-test][Tutorial]] to install a WMAgent. Let's see.
   * Explore how the T3s could use their idle CPU time (if any) to process jobs for the T2 cluster. (condor integration, condor fork)
   * Explore pros/cons of this : the T2 and T3-higgs nodes can store data under the same hadoop namenode, our local resources would be protected by quotas. Management gets simpler
   * Explore dynamic Condor job scheduling based on resource utilization and job requirements. Condor will no longer serve X slots, but will schedule jobs as long as CPU usage is < 90% (configurable) and there is enough available memory in a node. In general this should provide slightly more job slots than the number of cores that a node has. -- motivation : most of the jobs have CPU efficiency < 70%
   * Improve BeStman GridFTP selection algorithm (light development needed) -- currently round-robin, very sub-optimal for a setup like ours where we have fast (10 GE) servers and slow (1 GE) servers.

-- Main.samir - 2014-04-01
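
A rough illustration of the dynamic Condor scheduling idea from the Global workplan: admit a job to a node only while CPU usage is below a configurable threshold and enough memory is free. This is a plain Python sketch of the admission rule, not HTCondor configuration; =NodeState=, =can_accept_job=, =fill_node= and all the numbers are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of a worker node (not an HTCondor structure)."""
    cpu_usage: float      # fraction of total CPU in use, 0.0 to 1.0
    free_memory_mb: int   # memory currently available on the node


def can_accept_job(node, job_memory_mb, cpu_threshold=0.90):
    """Admission rule from the workplan: CPU below threshold and memory fits."""
    return node.cpu_usage < cpu_threshold and node.free_memory_mb >= job_memory_mb


def fill_node(cores, total_memory_mb, job_cpu_eff, job_memory_mb, cpu_threshold=0.90):
    """Count how many identical jobs the rule admits on an empty node.

    Assumes each job uses job_cpu_eff of one core (e.g. 0.7 for 70% efficiency).
    The check looks at usage before admission, so the last job may overshoot.
    """
    running = 0
    while True:
        node = NodeState(
            cpu_usage=running * job_cpu_eff / cores,
            free_memory_mb=total_memory_mb - running * job_memory_mb,
        )
        if not can_accept_job(node, job_memory_mb, cpu_threshold):
            return running
        running += 1


if __name__ == "__main__":
    # 8 cores, 32 GB, jobs at 70% CPU efficiency needing 2 GB each
    print(fill_node(cores=8, total_memory_mb=32000, job_cpu_eff=0.7, job_memory_mb=2000))
```

With these assumed numbers the rule admits more jobs than the node has cores, which is the effect the workplan item is after. A real setup would need measured per-node load rather than the fixed per-job efficiency assumed here.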
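
For the GridFTP selection item in the Global workplan, one possible replacement for plain round-robin is a weighted round-robin that hands out transfers in proportion to each server's link speed. The sketch below uses the "smooth" weighted round-robin scheme; it is not BeStman code or its plugin interface, and the server names and weights are made up for illustration.

```python
from itertools import islice


def weighted_round_robin(weights):
    """Yield server names forever, in proportion to their weights.

    weights: dict mapping a server name to a positive weight
    (here assumed to be the link speed in Gb/s).
    Smooth weighted round-robin: every turn each server gains its weight,
    the server with the highest running score is picked, and the picked
    server is charged the sum of all weights.
    """
    total = sum(weights.values())
    score = {name: 0 for name in weights}
    while True:
        for name, weight in weights.items():
            score[name] += weight
        chosen = max(score, key=score.get)
        score[chosen] -= total
        yield chosen


if __name__ == "__main__":
    # Hypothetical servers: one 10 GE and one 1 GE GridFTP door
    servers = {"gridftp-fast": 10, "gridftp-slow": 1}
    picks = list(islice(weighted_round_robin(servers), 11))
    print(picks.count("gridftp-fast"), picks.count("gridftp-slow"))
```

Static weights only capture link capacity. Weighting by current load or active transfer count would be a natural next step, but that needs monitoring data this sketch does not assume.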
Topic revision: r4 - 2014-05-30 - samir