Difference: ComputingT2Transfers (4 vs. 5)

Revision 5 - 2014-08-02 - samir

Line: 1 to 1
 

USCMS T2 Transfers

Changed:
<
<
This twiki is intended to collect all the information needed for the current effort to improve inter-T2 PhEDEx transfers in the context of USCMS.
>
>
This twiki was created to report the latest status on this initiative.
 
Changed:
<
<
The networks between the 8 sites are known to have high capacity and availability. However, there seem to be limitations in the CMS transfer tools, or in how they are configured, that need to be tested and addressed. Removing them could improve overall performance and deliver data faster between sites.
>
>
We will have different sections for each site's notes
 
Changed:
<
<
The general picture on transfers over 20 Gbps and some of these configuration problems are mentioned in Samir's talk at the T2 meeting of 07/29.
>
>

General status

 
Changed:
<
<
So far, the showstopper for most sites has been uplink bandwidth. Since July 2014 this has started to change.
>
>
DashBoard FTS plots
 
Changed:
<
<
Ideally even the 10 Gbps sites would participate, as the current settings may not be optimal for fast transfers. We could tune them until they saturate the 10 Gbps link, and everyone would have practiced improving transfer rates in Debug.
>
>

Site notes

 
Changed:
<
<

Plan for the exercise

>
>

Caltech

 
Changed:
<
<
As discussed in the meeting, we would like to use Caltech as the source site: it managed to do 25/29 Gbps with its setup, which makes it a good source for sites optimizing their configurations. Once everyone else has optimized their download configurations and we have observed which rates we get to which sites, we could start rotating the source site and see what maximum rates we get from each. It is important to have multiple sink sites: even if some sites have limitations, the others will still add to the total rate.
>
>
Had a power outage on 30 nodes at 2 PM PST that degraded transfers, as HDFS was affected. All nodes recovered within 10 minutes, but the optimizer got traumatized and took a while to ramp up again.
 
Changed:
<
<
There are 3 major steps in this exercise; 2 of them will require coordination among sites:
>
>

Nebraska

 
Changed:
<
<
  • Tuning PhEDEx download configurations, so that LoadTest settings correspond better to reality.
  • Observing how transfers behave at the FTS level, and noting whether the Optimizer algorithm is a limiting factor or actually helps to reach the optimal number of active transfers for the bandwidth available at a given moment.
  • Once the logical limitations have been removed, sites can focus on setting their upload rates as they want, optimize their GridFTP setups, and start observing the best rates they can get out of the storage.
When we are done with these, the transfer test framework through PhEDEx will be more responsive and we will be able to run more advanced tests at higher rates. For example, we could coordinate pushing data from many sites to one.
>
>
Had good rates this Friday, when we started ramping up. Problems with their PhEDEx node interrupted transfers at some points during the day. When I last checked, transfers were bursty -- not enough pending.
 
Changed:
<
<

Participation of sites

>
>

GFTP issues

 
Changed:
<
<
In order to contact only the interested sites, please fill out the table below :

| Site | Connectivity | Participating | Notes |
| T2_BR_SPRACE | 10G | N/A | |
| T2_US_Caltech | 100G | DONE | |
| T2_US_Florida | 100G | N/A | |
| T2_US_MIT | 10G | N/A | |
| T2_US_Nebraska | 10G | N/A | Upgrading to 100G soon |
| T2_US_Purdue | 100G | N/A | |
| T2_US_UCSD | 10G | N/A | |
| T2_US_Wisconsin | 10G | N/A | |
| T2_US_Vanderbilt | 10G | N/A | |

FTS Notes

>
>
Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Unable to extend our file-backed buffers; aborting transfer.
 
Changed:
<
<
Currently we have 3 official FTS servers :
>
>
I don't really understand the reason for this, but it might be fixable with different (higher) GridFTP buffer settings?
 
Changed:
<
<
  • cmsfts3.fnal.gov
  • fts3.cern.ch
  • lcgfts3.gridpp.rl.ac.uk
There is an official recommendation, which is the most logical assignment of servers to sites. For this exercise, however, people are encouraged to try other deployments, which may behave differently. For example, 208 parallel transfers were observed on CERN's FTS, but no more than 50 at FNAL (yet).
>
>

Purdue

 
Changed:
<
<
In the long run, US sites should use FNAL. But it might be worth understanding whether other FTS servers have a different optimizer behavior, and why.
>
>
It seems that we have network issues in Kansas, even though we get good rates when the optimizer lets us. Manoj thoroughly analyzed the network paths for the good and bad sites, cross-checking with PerfSONAR, and the best clue is a problem with a route in Los Angeles, which looks good to Nebraska but bad to Purdue.
 
Changed:
<
<

PhEDEx Documentation

>
>
Will follow up with Iperf testing to reveal packet loss and contact Network Support.
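
To show why even a small packet loss on a long path is worth chasing, here is a rough sketch using the well-known Mathis approximation for the steady-state throughput of a single TCP stream, rate ≈ (MSS / RTT) · C / sqrt(p). The MSS, RTT and loss values below are illustrative assumptions, not measurements from the Purdue path, and the model ignores window limits and modern congestion control.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Approximate upper bound on single-stream TCP throughput, in bits/s.

    Mathis approximation: rate ~= (MSS / RTT) * C / sqrt(p), with C ~= 1.22.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed values for illustration only: 1460-byte MSS, 50 ms RTT.
    for loss in (1e-6, 1e-5, 1e-4):
        rate = mathis_throughput_bps(1460, 0.05, loss)
        print(f"loss={loss:.0e}  per-stream bound ~ {rate / 1e6:.1f} Mbps")
```

Under these assumptions a loss rate of 1e-4 caps one stream at a few tens of Mbps, so a lossy route can hurt a link badly even when many streams run in parallel.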
 
Changed:
<
<
We will be exercising mostly the Download agent, therefore the most useful documentation for us is this.
>
>

GFTP Issues :

 
Changed:
<
<
However there is also this if you would like to read more.
>
>
Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Allocated all 1500 file-backed buffers on server cms-g004.rcac.purdue.edu; aborting transfer.
 
Changed:
<
<

PhEDEx configurations

>
>
It probably ran out of memory buffers and started using disk buffers, which in principle shouldn't happen (Brian will know more). A quick workaround would be to raise the file buffer limit in the configuration and restart the service, but ideally we should find the root cause of why it needs so many file buffers.
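
To see which servers hit this limit and how often, the two error messages quoted in these notes can be matched with a couple of regular expressions. This is only a sketch: the function names are made up for illustration, and it assumes the error reasons are already available as plain strings (for example, copied from the FTS monitoring pages).

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Patterns taken from the two error messages quoted in these notes.
EXHAUSTED = re.compile(
    r"Allocated all (?P<limit>\d+) file-backed buffers on server (?P<host>[\w.-]+);"
)
EXTEND_FAILED = re.compile(r"Unable to extend our file-backed buffers")


def classify(reason: str) -> Optional[Dict[str, object]]:
    """Return details for a file-backed-buffer error, or None for other errors."""
    match = EXHAUSTED.search(reason)
    if match:
        return {
            "kind": "exhausted",
            "host": match.group("host"),
            "limit": int(match.group("limit")),
        }
    if EXTEND_FAILED.search(reason):
        # This variant of the message does not name the server or the limit.
        return {"kind": "extend_failed", "host": None, "limit": None}
    return None


def count_by_host(reasons: Iterable[str]) -> Counter:
    """Count buffer errors per (kind, host) pair, ignoring unrelated errors."""
    counts: Counter = Counter()
    for reason in reasons:
        info = classify(reason)
        if info is not None:
            counts[(info["kind"], info["host"])] += 1
    return counts
```

Counting per host should show whether a single GridFTP server is responsible, which would help when looking for the root cause.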
 
Changed:
<
<
One of the limitations is how much the download site's PhEDEx agent submits to FTS. Caltech was asked in the meeting how they control that. We run 2 agents: one for general transfers and another exclusively for US transfers. The -ignore and -accept flags do the separation. Note also that one can throttle the number of active transfers for each site as needed and set a default for the sites not specified. The relevant part of Config.Debug is:
>
>

Florida

 
Changed:
<
<
### AGENT LABEL=download-debug-fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db              ${PHEDEX_DBPARAM}
 -nodes           ${PHEDEX_NODE}
 -delete          ${PHEDEX_CONF}/FileDownloadDelete
 -validate        ${PHEDEX_CONF}/FileDownloadVerify
 -ignore          '%T2_US%'
 -verbose
 -backend         FTS
 -batch-files     50
 -link-pending-files     200
 -max-active-files 700
 -link-active-files   'T1_CH_CERN_Buffer=50'
 -link-active-files   'T1_DE_KIT_Buffer=10'
 -link-active-files   'T1_DE_KIT_Disk=10'
 -link-active-files   'T1_ES_PIC_Buffer=100'
 -link-active-files   'T2_RU_RRC_KI=2'
 -link-active-files   'T1_FR_CCIN2P3_Buffer=100'
 -link-active-files   'T1_FR_CCIN2P3_Disk=100'
 -link-active-files   'T1_IT_CNAF_Buffer=150'
 -link-active-files   'T1_TW_ASGC_Buffer=100'
 -link-active-files   'T1_UK_RAL_Buffer=50'
 -link-active-files   'T1_US_FNAL_Buffer=100'
 -link-active-files   'T2_DE_RWTH=10'
 -link-active-files   'T2_IT_Pisa=20'
 -default-link-active-files 100
 -protocols       srmv2
 -mapfile         ${PHEDEX_FTS_MAP}


### AGENT LABEL=download-debug-t2fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db              ${PHEDEX_DBPARAM}
 -nodes           ${PHEDEX_NODE}
 -delete          ${PHEDEX_CONF}/FileDownloadDelete
 -validate        ${PHEDEX_CONF}/FileDownloadVerify
 -accept          '%T2_US%'
 -verbose
 -backend         FTS
 -batch-files     20
 -link-pending-files     300
 -max-active-files 300
 -protocols       srmv2
 -mapfile         ${PHEDEX_FTS_MAP}
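
As a reading aid for the two agent blocks above, the following sketch parses the option lines and reports the effective per-link limit for a node, along with which agent a node would fall under. It is not PhEDEx code: the helper names are invented for illustration, the sample is a shortened excerpt of the first agent, and it assumes that `%` in -ignore/-accept acts as a multi-character wildcard in the style of SQL LIKE.

```python
import re
import shlex
from typing import Dict, List, Optional, Tuple

# Shortened excerpt of the first agent block, for illustration only.
SAMPLE = """
### AGENT LABEL=download-debug-fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -ignore          '%T2_US%'
 -max-active-files 700
 -link-active-files   'T1_CH_CERN_Buffer=50'
 -link-active-files   'T2_RU_RRC_KI=2'
 -default-link-active-files 100
"""


def parse_agent_options(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (option, value) pairs; '#' lines are treated as comments."""
    options = []
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)
        if not tokens or not tokens[0].startswith("-"):
            continue
        options.append((tokens[0], tokens[1] if len(tokens) > 1 else None))
    return options


def link_limits(options) -> Tuple[Dict[str, int], Optional[int]]:
    """Collect per-link active-file limits and the default limit, if any."""
    per_link: Dict[str, int] = {}
    default: Optional[int] = None
    for name, value in options:
        if name == "-link-active-files" and value:
            node, _, limit = value.partition("=")
            per_link[node] = int(limit)
        elif name == "-default-link-active-files" and value:
            default = int(value)
    return per_link, default


def like_match(pattern: str, node: str) -> bool:
    """Assumed semantics: '%' matches any run of characters ('_' is literal here)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, node) is not None


if __name__ == "__main__":
    per_link, default = link_limits(parse_agent_options(SAMPLE))
    # "T2_XX_Unlisted" is a hypothetical node name used to show the default.
    for node in ("T1_CH_CERN_Buffer", "T2_RU_RRC_KI", "T2_XX_Unlisted"):
        print(node, per_link.get(node, default))
    print(like_match("%T2_US%", "T2_US_Caltech"))
```

The second agent sets no per-link limits or default, so for it the helper returns an empty mapping and `None`; what limit PhEDEx applies in that case is not stated here.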
>
>
Just joined. Thanks, will ramp up transfers.
 
Added:
>
>

Vanderbilt

 
Changed:
<
<
>
>
Joined on the first day. LStore performance varies randomly, anywhere from great to poor. Have seen 600 MBps in the past, but not much right now. Still figuring out the contacts to follow up with there, as they are reorganizing the team.
  -- Main.samir - 2014-07-29
 