Closed Bug 661656 Opened 9 years ago Closed 9 years ago

Determine if we can improve the download times between sjc1 and scl1

Categories

(Release Engineering :: General, defect, P2, critical)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

References

Details

(Whiteboard: [buildfaster:p1] p2p link enabled on Aug. 9th)

Attachments

(1 file)

From doing analysis in bug 661585 I realized that download times of builds, symbols and packaged tests can vary a lot and can sometimes take a very long time.

Before we moved to scl1 we had squid between sjc1 and mtv, which helped improve the download time.

What would you suggest we do to improve this?
Do we have metrics on bandwidth between sjc1 and scl1?
Do we have metrics for stage.m.o? (We download the builds & tests from there).

[4:23pm] armenzg: jabba: could we set up squid once again in between stage and scl1?
[4:23pm] armenzg: or would you suggest something else?
[4:24pm] jabba: we weren't really happy with squid when we used it
[4:24pm] jabba: it seemed to cause more problems than it solved
[4:24pm] jabba: I mean we could probably try it, or set up varnish or something
[4:25pm] jabba: what is the actual problem though? bandwidth or load on stage.m.o ?
[4:27pm] armenzg: jabba: the download from stage to a slave on scl1 takes sometimes a long time
[4:28pm] armenzg: it doesn't have to be squid but anything that could improve the situation
[4:30pm] jabba: armenzg: I'd be curious to know what exactly the problem is, whether it is a bandwidth problem between the data centers, or if it is too much load on the origin server
[4:30pm] armenzg: jabba: how could we figure that out?
[4:30pm] jabba: and a caching solution should be well thought through and done consistently
[4:30pm] jabba: the squid thing in mtv was kind of a hack at the time
[4:31pm] jabba: and I'd prefer to solve whatever the underlying issue is rather than just throw more moving parts at an already sensitive and SPOF-laden ecosystem
[4:31pm] jabba: we should get the netops guys to help determine if we are hitting maximum bandwidth between scl1 and sjc1
[4:33pm] jabba: but in general, I've been quite disconnected with your overall infrastructure, that I don't want to go suggesting to do or not to do something 
[4:33pm] jabba: I'll defer to zandr and arr
[4:34pm] armenzg: OK
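
One rough way to separate "slow link" from "loaded origin" is to compare the time to first byte with the total transfer time for a single download. The sketch below assumes curl is available on the slave; `probe_download` and `ttfb_share` are hypothetical helper names, not existing tools, and the interpretation is only a heuristic.

```shell
#!/bin/sh
# Print connect time, time to first byte (ttfb), total time and average
# rate for one URL. Assumes curl is installed; the URL is the caller's.
probe_download() {
    curl -s -o /dev/null \
        -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s rate=%{speed_download}B/s\n' \
        "$1"
}

# Hypothetical helper: percentage of the total time spent waiting for
# the first byte. Arguments: ttfb seconds, total seconds.
ttfb_share() {
    awk -v ttfb="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * ttfb / total }'
}
```

A large share spent waiting for the first byte would point at the origin (stage or the storage behind it); a small share combined with a low rate would point at the path between the colos.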
Armen- Please provide a specific case of a download that has performance problems (source host, destination host, protocol) and I'll loop in netops to see what we can learn.

Ideally pick a staging machine that we can test from without disturbing production.
I have set aside talos-r3-snow-leopard-002 for any work on this bug.

What I am trying to optimize are the following steps:
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.dmg
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.crashreporter-symbols.zip
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.tests.zip

Similar steps are run for each OS.
These 3 steps are run for each unit test job and talos job.
There are around 8-10 suites for each type so that means that we have 8-10 slaves around the same time requesting the same 3 sets of files.

With squid there was caching happening but once in a while there was corruption and all 8-10 jobs would just go red. I can't recall whether jabba/IT found out if a newer version of squid fixed that.

What would you like me to try? What are we trying to measure?
These are the times for a given Mac build:
> download_build download ( 1 mins, 18 secs ) - 21M
> download_symbols download ( 42 secs ) - 19M
> download tests download ( 48 secs ) - 44M

What we want is to reduce this setup time, as we want to spend as much time as possible on the actual test run.
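
For comparison with later measurements, the three timings above can be turned into average rates. This is a minimal sketch, assuming "M" means megabytes; `rate_mb_s` is a hypothetical helper name.

```shell
#!/bin/sh
# Average transfer rate in MB/s. Arguments: size in MB, duration in seconds.
rate_mb_s() {
    awk -v mb="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", mb / secs }'
}

rate_mb_s 21 78   # download_build:   prints 0.27
rate_mb_s 19 42   # download_symbols: prints 0.45
rate_mb_s 44 48   # download tests:   prints 0.92
```

All three come out below 1 MB/s for this job.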
Another earlier job (Thu Jun 2 13:33:20 2011) took this long:
> download_build download ( 17 secs ) 
> download_symbols download ( 20 secs ) 
> download tests download ( 24 secs ) 
which is much better than the other job.

The previous job was run at Fri Jun 3 05:12:52 2011 which is supposed to be low load but I can't tell.


Could we have talos-r3-leopard-002 measure every 5 mins download speed and email us every night?
Do we have anything to monitor network activity from stage.m.o to scl1?
Armen - this is great, I'll get netops involved.
I have also seen very slow scp transfers from backup.mtv1 to pxe1.scl1. Conversations with netops indicate that this is likely due to the VPN connections between sites.

There is a point-to-point connection available, but it will take a downtime window to cut over to it.

It will be a ~5min loss of connectivity into SCL1, but all TCP sessions will drop.

The netops resources involved are west coast, so a PDT evening window would be ideal. Let's plan for a 30min downtime. Tree closure or not at RelEng's discretion.
Depends on: 662071
I will bring it up at Monday's meeting. I believe tomorrow there is a GO to build so the following evenings will be easier to book for a downtime.
Once bug 662071 is done, how could we monitor the transfer rate between the two colos?
I wouldn't be surprised if we have to look at this again in the future, and it would be very useful to have a page to look at.
What is the scheduled date for bug 662071?
I don't think there are any more releases this week besides 3.6.18, which is going on today.
Armen, are you still using this slave (talos-r3-leopard-002)?
(In reply to comment #10)
> Armen, are you still using this slave (talos-r3-leopard-002)?

Not for this bug but for the tp5 bug.
(In reply to comment #6)
...
> There is a point-to-point connection available, but it will take a downtime
> window to cut over to it.
> 
> It will be a ~5min loss of connectivity into SCL1, but all TCP sessions will
> drop.
> 
> The netops resources involved are west coast, so a PDT evening window would
> be ideal. Let's plan for a 30min downtime. Tree closure or not at RelEng's
> discretion.

Flagging for possible downtime
Flags: needs-treeclosure?
The upcoming downtime window is for tomorrow (June 16), PDT early morning (4am - 8am). Comment 6 says this would be best done in a PDT evening. Please let me know if that is flexible and, if so, whether this will ride along tomorrow.
This will not ride along tomorrow.
Heh, there are a few bugs here.  Bug 662071 is the netops-related details of the change described in comment 6, and bug 664639 is the downtime request.
Depends on: 664639
Flags: needs-treeclosure?
It seems that everything is covered by the other two bugs.

If no one says anything I will close this bug and make the other two block bug 661585.

This current bug seems to not be needed anymore.
No longer depends on: 664639
No longer depends on: 662071
There are no actions left on this bug.

The work of improving the download speed between sjc1 and scl1 is to be done in bug 662071.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 662071
Please bear with me as I have tons of questions and many thanks in advance!

This is how we grab files from stage onto the test slaves in SCL:
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1310679875/firefox-8.0a1.en-US.linux-i686.tar.bz2
Is there a different way we could grab them, instead of the wget/http/stage way?
If we changed the mount points would it change anything?
Is it possible to move a host inside of SCL and make the build machines upload builds and tests there rather than to stage, which is in SJC?
Is there a way we could cache those files? We used to do that between SJC and MTV.
I asked arr to look into seeing if we could have ganglia for surf/stage to see CPU load and network load.

FTR justdave says the tinderbox directories are mounted from cm-ixstore.
FTR dustin says that we are traversing *two* network connections (http and then NFS)

I have set up a cron job on dev-master01 (which is in SCL) that runs every 5 mins, grabs a file and logs the average transfer rate.

As you can see below, the initial data shows that the transfer rates fluctuate a lot.
The file being grabbed is 50MB.
Would you suggest another way to grab the data?

[armenzg@dev-master01 ~]$ tail -F test_stage_scl_speed.log 
09:20:01 2.78 MB/s
09:25:01 4.14 MB/s
09:30:01 817 KB/s
09:35:01 1.19 MB/s
09:40:01 2.79 MB/s
09:45:01 3.45 MB/s
09:50:01 7.87 MB/s
09:55:01 1.58 MB/s

[armenzg@dev-master01 ~]$ crontab -l
MAILTO=armenzg@mozilla.com
0,5,10,15,20,25,30,35,40,45,50,55 * * * * ./test_speed.sh >> test_stage_scl_speed.log 
[armenzg@dev-master01 ~]$ cat test_speed.sh 
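#!/bin/sh
# Download one test package from stage and print "HH:MM:SS <average rate>".
FILE=http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/1310732600/firefox-8.0a1.en-US.win32.tests.zip
# wget reports the average rate in parentheses on its final summary line.
echo "$(date "+%H:%M:%S") $(wget "$FILE" 2>&1 | grep "\([0-9.]\+ [KM]B/s\)" | sed -e "s|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|")"
# Remove the downloaded copy; the pattern is quoted so the shell does not expand it.
find . -type f -name '*zip' -exec rm {} +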
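
A log in this format can be condensed into a short summary, for example for a nightly mail. This is a minimal sketch, assuming lines of the form "HH:MM:SS value unit" and wget's 1024-based KB/s and MB/s units; `summarize_rates` is a hypothetical helper name, and lines without a rate (for instance a failed download) are skipped.

```shell
#!/bin/sh
# Summarize a transfer-rate log read from the given files or from stdin.
# Rates are normalized to MB/s before computing min, max and mean.
summarize_rates() {
    awk '
        NF >= 3 && $3 ~ /^[KM]B\/s$/ {
            rate = $2 + 0
            if ($3 == "KB/s") rate = rate / 1024
            if (n == 0 || rate < min) min = rate
            if (n == 0 || rate > max) max = rate
            sum += rate
            n++
        }
        END {
            if (n > 0)
                printf "samples=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n
        }
    ' "$@"
}
```

Usage would be something like `summarize_rates test_stage_scl_speed.log`, run from a separate nightly cron entry.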
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Since the expected fix (bug 662071) is still not complete, I don't think we need to plan further fixes just yet.  Let's leave this open to measure the speeds before/after the fix, and determine whether further action is worthwhile at that point.
Emergency maintenance is scheduled for Jul 19 at 1800 PDT (0100 UTC) to fix the issues in bug 671366, which impact SCL1. Unfortunately RelEng is blocking any work that would impact SCL1, so until they can grant a window this is essentially a WONTFIX.
Severity: normal → enhancement
Status: REOPENED → NEW
Depends on: 671366
Flags: needs-treeclosure?
Raising priority and assigning to RelEng as a proxy for buildduty.

We need a window to get this work done.
Assignee: server-ops-releng → nobody
Severity: enhancement → critical
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
I believe joduinn got a downtime window for tomorrow:
http://groups.google.com/group/mozilla.dev.planning/browse_thread/thread/8c947d18bc4df813#

Are these comments from before he got the permission?

I see a comment from joduinn 2-3 hours later saying that things got talked out with ravi in bug 662071.
Assignee: nobody → armenzg
Priority: -- → P2
Quick summary of the status, based on dependent bugs:

After the downtime for bug 671366 last Tuesday, releng is the only user of the SRX240's in sjc1, but the scl1-sjc1 link for the build network is still traversing the VPN, which is terminated in those SRX240's.

Bug 662071 is tracking another attempt to switch to the P2P link, which we expect to be faster.  That's not scheduled yet, and I'm going to take myself out of the loop for getting that scheduled -- but it should :)
Clearing the treeclosure flag until we decide on a time for bug 662071; let me know if it's looking like that'll happen this week.
Flags: needs-treeclosure?
Bug 662071 has now been fixed and we have better transfer rates.
The measurements show a 10-15x increase:

17:00:01 1.05 MB/s
17:05:01 2.16 MB/s
17:10:01 1.34 MB/s
17:15:01 1.04 MB/s
17:20:01 1.92 MB/s
17:25:01 1.70 MB/s
17:30:01 27.7 MB/s
17:35:01 21.5 MB/s
17:40:01 16.8 MB/s
17:45:02 18.4 MB/s
17:50:01 17.2 MB/s
17:55:01 25.3 MB/s
18:00:01 29.2 MB/s
18:05:01 20.3 MB/s
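
The increase can be checked from the log itself by averaging the samples before and after the switch. This is a sketch under the assumption that the cut-over happened between the 17:25 and 17:30 samples; `speedup` is a hypothetical helper name and rates use wget's 1024-based units.

```shell
#!/bin/sh
# Usage: speedup HH:MM:SS [logfile...]
# Prints the mean rate before and after the given time and their ratio.
speedup() {
    cutover="$1"
    shift
    awk -v cutover="$cutover" '
        NF >= 3 && $3 ~ /^[KM]B\/s$/ {
            rate = $2 + 0
            if ($3 == "KB/s") rate = rate / 1024
            # Timestamps are compared as strings, which works for HH:MM:SS
            # within a single day.
            if (($1 "") < (cutover "")) { before += rate; nb++ }
            else                        { after += rate; na++ }
        }
        END {
            if (nb > 0 && na > 0 && before > 0)
                printf "before=%.2f after=%.2f speedup=%.1fx\n",
                    before / nb, after / na, (after / na) / (before / nb)
        }
    ' "$@"
}
```

With the samples above and a cut-over of 17:30:00, the means come out at roughly 1.5 MB/s before and 22 MB/s after, which is about 14x.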
Status: NEW → RESOLVED
Closed: 9 years ago → 9 years ago
Depends on: 662071
Resolution: --- → FIXED
This shows that the setup time on the slaves has now become consistent since the download times are now consistent.
Whiteboard: [buildfaster:p1] p2p link enabled on Aug. 9th
Product: mozilla.org → Release Engineering