Difference: ComputingT2Transfers (4 vs. 5)

Revision 5 2014-08-02 - samir

Line: 1 to 1

USCMS T2 Transfers
General status
Site notes
- Caltech
- Nebraska
- Purdue
- Vanderbilt
- UCSD

USCMS T2 Transfers

Changed:

<
<

This twiki is intended to aggregate all necessary information for the current effort of improving inter-T2 PhEDEx transfers in the context of USCMS.

>
>

This twiki was created to report the latest status of this initiative.

Changed:

<
<

The networks connecting the 8 sites are known to offer high capacity and availability. However, there seem to be limitations in the CMS transfer tools, or in how they are configured, that need to be tested and addressed. Fixing them could improve overall performance and deliver data faster between sites.

>
>

Each site has its own section for notes.

Changed:

<
<

The general picture of transfers above 20 Gbps, and some of these configuration problems, are covered in Samir's talk at the T2 meeting of 07/29.

>
>

General status

Changed:

<
<

Until now, uplink bandwidth was the showstopper for most sites. Since July 2014 this has started to change.

>
>

DashBoard FTS plots

Changed:

<
<

Ideally even the 10 Gbps sites would participate, as it is possible that the current settings are not optimal for fast transfers. We could tune them until transfers saturate the 10 Gbps link, and everyone would have practiced improving transfer rates in Debug.
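To make "saturating the 10 Gbps link" concrete, here is a rough back-of-the-envelope sketch of how many concurrent transfers it would take to fill a link. The per-transfer rates used below are purely illustrative assumptions, not measurements from any site, and the function name is made up for this example.

```python
import math


def transfers_to_saturate(link_gbps: float, per_transfer_mbytes_s: float) -> int:
    """Smallest number of concurrent transfers whose combined rate fills the link.

    Assumes every transfer sustains the same rate and that nothing else
    (storage, optimizer, other traffic) limits the total.
    """
    link_mbytes_s = link_gbps * 1000 / 8  # Gbps -> MB/s, decimal units
    return math.ceil(link_mbytes_s / per_transfer_mbytes_s)


if __name__ == "__main__":
    # Hypothetical per-transfer rates in MB/s, for illustration only.
    for rate in (10, 25, 50):
        print(f"{rate} MB/s per transfer -> {transfers_to_saturate(10, rate)} transfers")
```

If the number of active transfers the agent or FTS allows is well below this estimate, the link cannot be filled regardless of how good the network is.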

>
>

Site notes

Changed:

<
<

Plan for the exercise

>
>

Caltech

Changed:

<
<

As discussed in the meeting, we would like to use Caltech as the source site: it managed 25/29 Gbps with its setup, which makes it a good source for sites optimizing their configurations. Once everyone else has optimized their download configurations and we have observed the rates to each site, we could start rotating the source site and see the maximum rates we get from each one. It is important to have multiple sink sites: even if individual sites have limitations, the others will still add to the total rate.

>
>

A power outage on 30 nodes at 2 PM PST degraded transfers because HDFS was affected. All nodes recovered within 10 minutes, but the optimizer got traumatized and took a while to ramp up again.

Changed:

<
<

There are 3 major steps in this exercise; 2 of them will require coordination among sites:

>
>

Nebraska

Changed:

<
<

Tuning PhEDEx download configurations, so that LoadTest settings correspond better to reality.
Observing how transfers behave at the FTS level, noting whether the Optimizer algorithm is a limiting factor or actually helps reach the optimal number of active transfers for the bandwidth available at a given moment.
With the logical limitations removed, sites can focus on setting their upload rates as they want, optimize their GridFTP setups, and start observing the best rates they can get out of their storage.

When we are done with these, the transfer test framework through PhEDEx will be more responsive and we will actually be able to run more advanced tests at higher rates, for example coordinating pushes of data from many sites to one.

>
>

Had good rates this Friday, when we started ramping up. Problems with their PhEDEx node interrupted transfers at some points in the day. Last time I looked, transfers were bursty: not enough pending.

Changed:

<
<

Participation of sites

>
>

GFTP issues

Changed:

<
<

In order to contact only the interested sites, please fill out the table below :

Site	Connectivity	Participating	Notes
T2_BR_SPRACE	10G	N/A
T2_US_Caltech	100G
T2_US_Florida	100G	N/A
T2_US_MIT	10G	N/A
T2_US_Nebraska	10G	N/A	Upgrading to 100G soon
T2_US_Purdue	100G	N/A
T2_US_UCSD	10G	N/A
T2_US_Wisconsin	10G	N/A
T2_US_Vanderbilt	10G	N/A

FTS Notes

>
>

Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Unable to extend our file-backed buffers; aborting transfer.

Changed:

<
<

Currently we have 3 official FTS servers :

>
>

I don't really understand the reason for this, but it might be fixable with different (higher) GridFTP buffer settings?

Changed:

<
<

cmsfts3.fnal.gov
fts3.cern.ch
lcgfts3.gridpp.rl.ac.uk

There is an official recommendation, which is the most logical distribution of which server to use. However, for this exercise people are encouraged to try other deployments and look for possibly different behaviors. For example, 208 parallel transfers were observed on CERN's FTS, but no more than 50 at FNAL (yet).

>
>

Purdue

Changed:

<
<

In the long run, US sites should use FNAL, but it might be worth understanding whether other FTS servers show different optimizer behavior, and why.

>
>

It seems that we have network issues in Kansas, even though we get good rates when the optimizer lets us. Manoj thoroughly analyzed the network paths for the good and bad sites, cross-checking with PerfSONAR. The best clue is a problem with a route in Los Angeles, which looks good to Nebraska but bad to Purdue.

Changed:

<
<

PhEDEx Documentation

>
>

Will follow up with Iperf testing to reveal packet loss and contact Network Support.
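As background on why even a small packet loss matters on a long path, the sketch below evaluates the well-known Mathis approximation for the throughput ceiling of a single TCP stream, rate <= (MSS / RTT) * (C / sqrt(p)) with C = sqrt(3/2). It is a rough model only. The MSS, round-trip time and loss rates in the example are illustrative assumptions, not values measured on the Purdue path.

```python
import math


def mathis_ceiling_gbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on one TCP stream's throughput, in Gbps.

    Uses the Mathis et al. approximation: rate <= (MSS / RTT) * (C / sqrt(p)),
    with C = sqrt(3/2). Real stacks and congestion-control algorithms differ.
    """
    c = math.sqrt(3 / 2)
    bits_per_rtt = mss_bytes * 8 / rtt_s
    return bits_per_rtt * (c / math.sqrt(loss_rate)) / 1e9


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 50 ms round-trip time.
    for loss in (1e-3, 1e-4, 1e-6):
        print(f"loss={loss:g}: {mathis_ceiling_gbps(1460, 0.050, loss):.4f} Gbps per stream")
```

Under this model the per-stream ceiling scales with 1/sqrt(p), so a lossy route needs many more parallel streams to reach the same aggregate rate as a clean one.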

Changed:

<
<

We will mostly be exercising the Download agent, so the most useful documentation for us is this

>
>

GFTP Issues :

Changed:

<
<

However, there is also this if you would like to read more.

>
>

Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Allocated all 1500 file-backed buffers on server cms-g004.rcac.purdue.edu; aborting transfer.

Changed:

<
<

PhEDEx configurations

>
>

It probably ran out of memory buffers and started using disk buffers, which in principle shouldn't happen (Brian will know more). A quick workaround would be to raise the number of file buffers in the configuration and restart the service, but ideally we find the root cause of why it needs so many file buffers.
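To see how often each server hits this condition, the failure reasons could be bucketed with a small script. The sketch below is only an illustration: the regular expressions are written against the two GridFTP error reasons quoted on this page, the category names are made up, and how the reasons are collected (FTS logs, dashboard export) is left open.

```python
import re
from collections import Counter

# Patterns derived from the two error reasons quoted on this page.
ALLOCATED = re.compile(r"Allocated all (\d+) file-backed buffers on server (\S+?);")
EXTEND = re.compile(r"Unable to extend our file-backed buffers")


def classify(reason: str):
    """Return (category, server, buffer_count) for one FTS error reason."""
    match = ALLOCATED.search(reason)
    if match:
        return ("buffers_exhausted", match.group(2), int(match.group(1)))
    if EXTEND.search(reason):
        return ("buffer_extend_failed", None, None)
    return ("other", None, None)


def summarize(reasons):
    """Count failures per (category, server) pair."""
    return Counter(classify(reason)[:2] for reason in reasons)


if __name__ == "__main__":
    sample = [
        "TRANSFER globus_ftp_client: the server responded with an error 500 "
        "Command failed. : Allocated all 1500 file-backed buffers on server "
        "cms-g004.rcac.purdue.edu; aborting transfer.",
        "TRANSFER globus_ftp_client: the server responded with an error 500 "
        "Command failed. : Unable to extend our file-backed buffers; aborting transfer.",
    ]
    for key, count in summarize(sample).items():
        print(key, count)
```

A per-server count like this would show whether the buffer exhaustion is tied to one GridFTP host or spread across the pool.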

Changed:

<
<

One of the limitations is how much the download site's PhEDEx agent submits to FTS. Caltech was asked in the meeting how they control that. In our case, we have 2 agents: one for general transfers and another exclusively for US transfers. The -ignore and -accept flags do the separation. Note also that one can throttle the number of active transfers for each site as needed and set a default for the sites not specified. The relevant part of Config.Debug is:

>
>

Florida

Changed:

<
<

### AGENT LABEL=download-debug-fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db              ${PHEDEX_DBPARAM}
 -nodes           ${PHEDEX_NODE}
 -delete          ${PHEDEX_CONF}/FileDownloadDelete
 -validate        ${PHEDEX_CONF}/FileDownloadVerify
 -ignore          '%T2_US%'
 -verbose
 -backend         FTS
 -batch-files     50
 -link-pending-files     200
 -max-active-files 700
 -link-active-files   'T1_CH_CERN_Buffer=50'
 -link-active-files   'T1_DE_KIT_Buffer=10'
 -link-active-files   'T1_DE_KIT_Disk=10'
 -link-active-files   'T1_ES_PIC_Buffer=100'
 -link-active-files   'T2_RU_RRC_KI=2'
 -link-active-files   'T1_FR_CCIN2P3_Buffer=100'
 -link-active-files   'T1_FR_CCIN2P3_Disk=100'
 -link-active-files   'T1_IT_CNAF_Buffer=150'
 -link-active-files   'T1_TW_ASGC_Buffer=100'
 -link-active-files   'T1_UK_RAL_Buffer=50'
 -link-active-files   'T1_US_FNAL_Buffer=100'
 -link-active-files   'T2_DE_RWTH=10'
 -link-active-files   'T2_IT_Pisa=20'
 -default-link-active-files 100
 -protocols       srmv2
 -mapfile         ${PHEDEX_FTS_MAP}


### AGENT LABEL=download-debug-t2fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db              ${PHEDEX_DBPARAM}
 -nodes           ${PHEDEX_NODE}
 -delete          ${PHEDEX_CONF}/FileDownloadDelete
 -validate        ${PHEDEX_CONF}/FileDownloadVerify
 -accept          '%T2_US%'
 -verbose
 -backend         FTS
 -batch-files     20
 -link-pending-files     300
 -max-active-files 300
 -protocols       srmv2
 -mapfile         ${PHEDEX_FTS_MAP}
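The per-site throttling described above (an explicit limit for listed source nodes, a default for everything else) can be sanity-checked offline with a small script. This is a minimal sketch that only reads the option text. It is not PhEDEx code, the helper names are made up, the node T2_XX_Unlisted is hypothetical, and the lookup simply mirrors the behavior described in the text rather than the agent's actual implementation.

```python
import shlex


def parse_link_limits(agent_options: str):
    """Extract per-link and default active-file limits from agent option text."""
    tokens = iter(shlex.split(agent_options))  # shlex strips the single quotes
    per_link, default = {}, None
    for token in tokens:
        if token == "-link-active-files":
            node, _, limit = next(tokens).partition("=")
            per_link[node] = int(limit)
        elif token == "-default-link-active-files":
            default = int(next(tokens))
    return per_link, default


def limit_for(node: str, per_link: dict, default):
    """Limit for one source node: the explicit value if listed, else the default."""
    return per_link.get(node, default)


if __name__ == "__main__":
    # A few lines copied from the download-debug-fts agent above.
    options = """
     -link-active-files   'T1_CH_CERN_Buffer=50'
     -link-active-files   'T2_RU_RRC_KI=2'
     -link-active-files   'T2_IT_Pisa=20'
     -default-link-active-files 100
    """
    per_link, default = parse_link_limits(options)
    # T2_XX_Unlisted is a made-up node name standing in for any site not listed.
    for node in ("T1_CH_CERN_Buffer", "T2_IT_Pisa", "T2_XX_Unlisted"):
        print(node, limit_for(node, per_link, default))
```

Printing the resolved limits for the links a site actually uses makes it easy to spot a source that silently falls back to the default.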

>
>

Just joined. Thanks, will ramp up transfers.

Added:

>
>

Vanderbilt

Changed:

<
<

>
>

Joined on the first day. LStore performance varies randomly from great to poor. Have seen 600 MBps in the past, but not much right now. Still figuring out whom to contact for follow-up there, as they are reorganizing the team.

-- Main.samir - 2014-07-29
