USCMS T2 Transfers
Changed:
< < This twiki is intended to aggregate all necessary information for the current effort of improving inter-T2 PhEDEx transfers in the context of USCMS.
> > This twiki was created to report the latest status on this initiative.
Changed:
< < It is known that the networks between the 8 sites are of high capacity and availability. However, there seem to be some limitations in the CMS transfer tools, or in their configurations, that should be addressed and tested; doing so could improve overall performance and, in the end, make these systems deliver data faster between sites.
> > We will have separate sections for each site's notes.
Changed:
< < The general picture on transfers over 20 Gbps, and some of these configuration problems, are mentioned in Samir's talk.
> > General status
Changed:
< < So far, the showstopper has been the uplink bandwidth at most sites. Since July 2014 this has started to change.
> > DashBoard FTS plots
Changed:
< < Ideally, even 10 Gbps sites could participate, as it is possible that the current settings are not optimal for fast transfers. We could tune them until the 10 Gbps link saturates, and everyone would have practiced improving transfer rates in Debug.
> > Site notes
Changed:
< < Plan for the exercise
> > Caltech
Changed:
< < As discussed in the meeting, we would like to use Caltech as the source site, as it managed 25/29 Gbps with its setup, making it a good source for sites optimizing their configurations. Once everyone else has optimized their download configurations and we have observed which rates we get to which sites, we could start rotating the source site and see what maximum rates we get from each. It is important to have multiple sink sites: even if some sites have limitations, the others will add up to the total rate.
> > Had a power outage in 30 nodes at 2 PM PST that degraded transfers, as HDFS was affected. All nodes recovered within 10 minutes, but the optimizer was thrown off and took a while to ramp up again.
Changed:
< < There are 3 major steps in this exercise; 2 of them will require coordination among sites:
> > Nebraska
Changed:
< <
> > Had good rates this Friday, when we started ramping up. Had problems with their PhEDEx node, which interrupted transfers at some points during the day. Last I saw, transfers were bursty -- not enough pending.
Changed:
< < Participation of sites
> > GFTP issues
Changed:
< < In order to contact only the interested sites, please fill out the table below:

FTS Notes
> > Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Unable to extend our file-backed buffers; aborting transfer.
Changed:
< < Currently we have 3 official FTS servers:
> > I don't really understand the reason for this, but it might be fixable with different (higher) GridFTP buffer settings.
Changed:
< <
> > Purdue
Changed:
< < In the long run, US sites should use FNAL. But it might be worth understanding whether other FTS servers have a different optimizer behavior, and why.
> > It seems we have network issues in Kansas, even though we get good rates when the optimizer lets us. Manoj thoroughly analyzed the network paths for sites that performed well and badly, cross-checking with PerfSONAR; the best clue is a problem with a route in Los Angeles, which looks good to Nebraska but bad to Purdue.
Changed:
< < PhEDEx Documentation
> > Will follow up with iperf testing to reveal packet loss, and contact network support.
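The iperf follow-up can be run by hand; a minimal sketch, assuming iperf3 is installed at both ends (the hostname below is a placeholder, not a real node):

```shell
# On the receiving site (e.g. a Purdue gridftp node), start a server:
iperf3 -s

# From the source site: a UDP test at a fixed rate reports jitter and
# packet loss directly in its summary line.
iperf3 -c cms-gXXX.example.edu -u -b 1G -t 30

# A TCP test with parallel streams, closer to what GridFTP actually does:
iperf3 -c cms-gXXX.example.edu -P 8 -t 30
```

Comparing the UDP loss figures in both directions against the PerfSONAR history should confirm or rule out the suspected Los Angeles route.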
Changed:
< < We will be exercising mostly the Download agent, therefore the most useful documentation for us is this.
> > GFTP issues
Changed:
< < However, there is also this.
> > Error reason: TRANSFER globus_ftp_client: the server responded with an error 500 Command failed. : Allocated all 1500 file-backed buffers on server cms-g004.rcac.purdue.edu; aborting transfer.
Changed:
< < PhEDEx configurations
> > It probably ran out of memory buffers and started using disk buffers, which in principle shouldn't happen (Brian will know more). A quick workaround would be to raise the file-buffer limit in the configuration and restart the service, but the ideal is to find the root cause of why it needs so many file buffers.
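This style of file-backed-buffer message comes from the HDFS GridFTP plugin used at these sites. A sketch of the workaround described above, with loudly hypothetical variable names and values — check the installed gridftp-hdfs version for the actual setting names before applying anything:

```shell
# Hypothetical sysconfig fragment for the GridFTP/HDFS service -- the
# variable names and values here are assumptions for illustration only.
# In-memory buffers available before the server spills to file-backed ones:
export GRIDFTP_BUFFER_COUNT=200
# File-backed buffer limit (the 1500 cap hit in the error above):
export GRIDFTP_FILE_BUFFER_COUNT=3000

# Restart the GridFTP service afterwards so the new limits take effect.
```

Raising the limit only buys headroom; the underlying question is why writes to HDFS fall so far behind the network that this many buffers accumulate.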
Changed:
< < One of the limitations is how much the downloading site's PhEDEx agent submits to FTS. Caltech was asked in the meeting how they control that. In their case, there are 2 agents: one for general transfers and another exclusively for US transfers; the -ignore and -accept flags do the separation. Note also that one can throttle the number of active transfers for each link as needed, and set a default for the links not specified. The relevant part of Config.Debug is:
> > Florida
Changed:
< < ### AGENT LABEL=download-debug-fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db ${PHEDEX_DBPARAM}
 -nodes ${PHEDEX_NODE}
 -delete ${PHEDEX_CONF}/FileDownloadDelete
 -validate ${PHEDEX_CONF}/FileDownloadVerify
 -ignore '%T2_US%'
 -verbose
 -backend FTS
 -batch-files 50
 -link-pending-files 200
 -max-active-files 700
 -link-active-files 'T1_CH_CERN_Buffer=50'
 -link-active-files 'T1_DE_KIT_Buffer=10'
 -link-active-files 'T1_DE_KIT_Disk=10'
 -link-active-files 'T1_ES_PIC_Buffer=100'
 -link-active-files 'T2_RU_RRC_KI=2'
 -link-active-files 'T1_FR_CCIN2P3_Buffer=100'
 -link-active-files 'T1_FR_CCIN2P3_Disk=100'
 -link-active-files 'T1_IT_CNAF_Buffer=150'
 -link-active-files 'T1_TW_ASGC_Buffer=100'
 -link-active-files 'T1_UK_RAL_Buffer=50'
 -link-active-files 'T1_US_FNAL_Buffer=100'
 -link-active-files 'T2_DE_RWTH=10'
 -link-active-files 'T2_IT_Pisa=20'
 -default-link-active-files 100
 -protocols srmv2
 -mapfile ${PHEDEX_FTS_MAP}

### AGENT LABEL=download-debug-t2fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db ${PHEDEX_DBPARAM}
 -nodes ${PHEDEX_NODE}
 -delete ${PHEDEX_CONF}/FileDownloadDelete
 -validate ${PHEDEX_CONF}/FileDownloadVerify
 -accept '%T2_US%'
 -verbose
 -backend FTS
 -batch-files 20
 -link-pending-files 300
 -max-active-files 300
 -protocols srmv2
 -mapfile ${PHEDEX_FTS_MAP}
> > Just joined. Thanks, will ramp up transfers.
Added:
> > Vanderbilt
Changed:
< <
> > Joined on the first day. LStore performance ranges randomly from great to poor; we have seen 600 MBps in the past, but not much right now. Still figuring out the contacts to follow up with there, as they reorganize the team.
-- Main.samir - 2014-07-29