661656 - Determine if we can improve the download times between sjc1 and scl1

Assignee

Description

•

13 years ago

From doing analysis in bug 661585 I realized that download times of builds, symbols and packaged tests can vary a lot and can sometimes take a very long time.

Before we moved to scl1 we had squid between sjc1 and mtv, which helped improve download times.

What would you suggest we do to improve this?
Do we have metrics on bandwidth between sjc1 and scl1?
Do we have metrics for stage.m.o? (We download the builds & tests from there).

[4:23pm] armenzg: jabba: could we set up squid once again in between stage and scl1?
[4:23pm] armenzg: or would you suggest something else?
[4:24pm] jabba: we weren't really happy with squid when we used it
[4:24pm] jabba: it seemed to cause more problems than it solved
[4:24pm] jabba: I mean we could probably try it, or set up varnish or something
[4:25pm] jabba: what is the actual problem though? bandwidth or load on stage.m.o ?
[4:27pm] armenzg: jabba: the download from stage to a slave on scl1 sometimes takes a long time
[4:28pm] armenzg: it doesn't have to be squid but anything that could improve the situation
[4:30pm] jabba: armenzg: I'd be curious to know what exactly the problem is, whether it is a bandwidth problem between the data centers, or if it is too much load on the origin server
[4:30pm] armenzg: jabba: how could we figure that out?
[4:30pm] jabba: and a caching solution should be well thought through and done consistently
[4:30pm] jabba: the squid thing in mtv was kind of a hack at the time
[4:31pm] jabba: and I'd prefer to solve whatever the underlying issue is rather than just throw more moving parts at an already sensitive and SPOF-laden ecosystem
[4:31pm] jabba: we should get the netops guys to help determine if we are hitting maximum bandwidth between scl1 and sjc1
[4:33pm] jabba: but in general, I've been so disconnected from your overall infrastructure that I don't want to go suggesting what to do or not to do
[4:33pm] jabba: I'll defer to zandr and arr
[4:34pm] armenzg: OK
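
One way to tell apart the two causes jabba mentions is to split a single download into phases: a long wait before the first byte hints at load on the origin server, while a slow transfer phase hints at the link between the data centers. The following is a minimal sketch, assuming `curl` is available on the test host. The `transfer_rate` and `probe` helpers are hypothetical names, and 1 MB is taken as 1048576 bytes.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any existing tooling.

# Throughput of the transfer phase alone, in MB/s.
# $1 = bytes downloaded, $2 = time to first byte (s), $3 = total time (s)
transfer_rate() {
    LC_ALL=C awk -v b="$1" -v ttfb="$2" -v total="$3" 'BEGIN {
        d = total - ttfb
        if (d <= 0) { print "n/a"; exit }
        printf "%.2f\n", b / 1048576 / d
    }'
}

# Download a URL to /dev/null and print the timing breakdown reported by curl.
# $1 = URL
probe() {
    stats=$(curl -s -o /dev/null --max-time 300 \
        -w '%{time_connect} %{time_starttransfer} %{time_total} %{size_download}' \
        "$1") || return 1
    # Split the four space-separated values into positional parameters.
    set -- $stats
    echo "connect=${1}s ttfb=${2}s total=${3}s rate=$(transfer_rate "$4" "$2" "$3") MB/s"
}

# Only hit the network when a URL is given on the command line.
if [ $# -gt 0 ]; then
    probe "$1"
fi
```

Saved under a name such as `probe.sh` (hypothetical), it could be run from a staging slave against one of the build URLs and compared with the same download run from a host in sjc1.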

Zandr Milewski [:zandr]

Comment 1

•

13 years ago

Armen- Please provide a specific case of a download that has performance problems (source host, destination host, protocol) and I'll loop in netops to see what we can learn.

Ideally pick a staging machine that we can test from without disturbing production.

Armen [:armenzg]

Assignee

Comment 2

•

13 years ago

I have set aside talos-r3-snow-leopard-002 for any work on this bug.

What I am trying to optimize are the following steps:
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.dmg
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.crashreporter-symbols.zip
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1307099495/firefox-7.0a1.en-US.mac64.tests.zip

Similar steps are run for each OS.
These 3 steps are run for each unit test job and talos job.
There are around 8-10 suites for each type so that means that we have 8-10 slaves around the same time requesting the same 3 sets of files.

With squid there was caching, but once in a while there was corruption and all 8-10 jobs would go red. I can't recall whether jabba/IT found out if a newer version of squid fixed that.

What would you like me to try? What are we trying to measure?

Armen [:armenzg]

Assignee

Comment 3

•

13 years ago

These are the times for a given Mac build:
> download_build download ( 1 mins, 18 secs ) - 21M
> download_symbols download ( 42 secs ) - 19M
> download tests download ( 48 secs ) - 44M

We want to reduce this setup time so that as much time as possible is spent on the actual test run.
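
For reference, the timings above translate into effective transfer rates as follows. This is a small sketch that assumes the "M" in the listing means megabytes; `rate` is a hypothetical helper name.

```shell
#!/bin/sh
# Effective throughput = size / elapsed time, using the figures quoted above.
# Assumption: sizes are in MB.

# $1 = size in MB, $2 = elapsed seconds; prints MB/s with two decimals
rate() {
    LC_ALL=C awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

echo "build:   $(rate 21 78) MB/s"   # 1 min 18 s = 78 s
echo "symbols: $(rate 19 42) MB/s"
echo "tests:   $(rate 44 48) MB/s"
```

That works out to roughly 0.27, 0.45 and 0.92 MB/s for the three files.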

Armen [:armenzg]

Assignee

Comment 4

•

13 years ago

Another earlier job (Thu Jun 2 13:33:20 2011) took this long:
> download_build download ( 17 secs ) 
> download_symbols download ( 20 secs ) 
> download tests download ( 24 secs ) 
which is much better than the other job.

The job in comment 3 ran at Fri Jun 3 05:12:52 2011, which is supposed to be a low-load period, but I can't tell.
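
To put a number on the difference, the three download steps of each job can be summed. This is a quick sketch; `total_secs` is a hypothetical helper, and the times are the ones quoted in comments 3 and 4.

```shell
#!/bin/sh
# Sum the per-step download times (in seconds) of a job.
total_secs() {
    t=0
    for s in "$@"; do
        t=$((t + s))
    done
    echo "$t"
}

slow=$(total_secs 78 42 48)   # Fri Jun 3 job: build, symbols, tests
fast=$(total_secs 17 20 24)   # Thu Jun 2 job: build, symbols, tests

echo "slow job: ${slow}s, fast job: ${fast}s"
LC_ALL=C awk -v a="$slow" -v b="$fast" 'BEGIN { printf "ratio: %.2fx\n", a / b }'
```

The slower job spent 168 seconds downloading against 61 seconds for the faster one, about 2.75 times as long.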


Could we have talos-r3-leopard-002 measure the download speed every 5 minutes and email us the results every night?
Do we have anything to monitor network activity from stage.m.o to scl1?

Zandr Milewski [:zandr]

Comment 5

•

13 years ago

Armen - this is great, I'll get netops involved.

Zandr Milewski [:zandr]

Comment 6

•

13 years ago

I have also seen very slow scp transfers from backup.mtv1 to pxe1.scl1. Conversations with netops indicate that this is likely due to the VPN connections between sites.

There is a point-to-point connection available, but it will take a downtime window to cut over to it.

It will be a ~5min loss of connectivity into SCL1, but all TCP sessions will drop.

The netops resources involved are west coast, so a PDT evening window would be ideal. Let's plan for a 30min downtime. Tree closure or not at RelEng's discretion.

Armen [:armenzg]

Assignee

Comment 7

•

13 years ago

I will bring it up at Monday's meeting. I believe there is a GO to build tomorrow, so the following evenings will be easier to book for a downtime.

Armen [:armenzg]

Assignee

Comment 8

•

13 years ago

Once bug 662071 is done, how could we monitor the transfer rate between the two colos?
I wouldn't be surprised if we have to look at this again in the future, and it would be very useful to have a page where we can check it.

Armen [:armenzg]

Assignee

Comment 9

•

13 years ago

What is the scheduled date for bug 662071?
I don't think there are any more releases this week besides 3.6.18, which is going out today.

Dustin J. Mitchell [:dustin] (he/him)

Comment 10

•

13 years ago

Armen, are you still using this slave (talos-r3-leopard-002)?

Armen [:armenzg]

Assignee

Comment 11

•

13 years ago

(In reply to comment #10)
> Armen, are you still using this slave (talos-r3-leopard-002)?

Not for this bug but for the tp5 bug.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 12

•

13 years ago

(In reply to comment #6)
...
> There is a point-to-point connection available, but it will take a downtime
> window to cut over to it.
> 
> It will be a ~5min loss of connectivity into SCL1, but all TCP sessions will
> drop.
> 
> The netops resources involved are west coast, so a PDT evening window would
> be ideal. Let's plan for a 30min downtime. Tree closure or not at RelEng's
> discretion.

Flagging for possible downtime

Flags: needs-treeclosure?

Lukas Blakk [:lsblakk] use ?needinfo

Comment 13

•

13 years ago

The upcoming downtime window is tomorrow (June 16), early morning PDT (4am - 8am). Comment 6 says this would be best done in the PDT evening. Please let me know if that is flexible and, if so, whether this will ride along tomorrow.

Zandr Milewski [:zandr]

Comment 14

•

13 years ago

This will not ride along tomorrow.

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

•

13 years ago

Heh, there are a few bugs here.  Bug 662071 is the netops-related details of the change described in comment 6, and bug 664639 is the downtime request.

Flags: needs-treeclosure?

Armen [:armenzg]

Assignee

Comment 16

•

13 years ago

It seems that everything is covered by the other two bugs.

If no one says anything I will close this bug and make the other two block bug 661585.

This current bug seems to not be needed anymore.

Armen [:armenzg]

Assignee

Comment 17

•

13 years ago

There are no actions left on this bug.

The work of improving the download speed between sjc1 and scl1 is to be done in bug 662071.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → DUPLICATE

Armen [:armenzg]

Assignee

Comment 18

•

13 years ago

Please bear with me as I have tons of questions, and many thanks in advance!

This is how we grab files from stage onto the test slaves in SCL:
> wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1310679875/firefox-8.0a1.en-US.linux-i686.tar.bz2
Is there a different way we could grab them, instead of the wget/http/stage way?
If we changed the mount points, would it change anything?
Is it possible to move a host inside SCL and make the build machines upload builds and tests there rather than to stage, which is in SJC?
Is there a way we could cache those files? We used to do that between SJC and MTV.
I asked arr to look into seeing if we could have ganglia for surf/stage to see CPU load and network load.

FTR justdave says the tinderbox directories are mounted from cm-ixstore.
FTR dustin says that we are traversing *two* network connections (http and then NFS)

I have set up a cron job on dev-master01 (which is in SCL) that runs every 5 minutes, tries to grab a file and logs the average transfer rate.

As you can see below, the initial data shows that the transfer rates fluctuate widely.
The file being grabbed is 50MB.
Would you suggest another way to grab the data?

[armenzg@dev-master01 ~]$ tail -F test_stage_scl_speed.log 
09:20:01 2.78 MB/s
09:25:01 4.14 MB/s
09:30:01 817 KB/s
09:35:01 1.19 MB/s
09:40:01 2.79 MB/s
09:45:01 3.45 MB/s
09:50:01 7.87 MB/s
09:55:01 1.58 MB/s

[armenzg@dev-master01 ~]$ crontab -l
MAILTO=armenzg@mozilla.com
0,5,10,15,20,25,30,35,40,45,50,55 * * * * ./test_speed.sh >> test_stage_scl_speed.log 
[armenzg@dev-master01 ~]$ cat test_speed.sh 
#!/bin/sh
FILE=http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/1310732600/firefox-8.0a1.en-US.win32.tests.zip
echo "`date "+%H:%M:%S"` $(wget $FILE 2>&1 | grep "\([0-9.]\+ [KM]B/s\)" | sed -e "s|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|")"
for file in `find . -type f -name *zip`; do rm $file; done
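
A possible variant of the script above, offered only as a sketch: it assumes `curl` is installed on the host, writes the download to /dev/null so that no cleanup step is needed, and reads the average speed from curl's `%{speed_download}` variable (bytes per second) instead of parsing wget's output. The `to_mbps` and `measure` names are hypothetical, and 1 MB is taken as 1048576 bytes.

```shell
#!/bin/sh
# Sketch only: an alternative to test_speed.sh, not the script that produced the log above.

# Convert bytes/second to MB/s with two decimals.
to_mbps() {
    LC_ALL=C awk -v bps="$1" 'BEGIN { printf "%.2f\n", bps / 1048576 }'
}

# Download a URL to /dev/null and print "HH:MM:SS <rate> MB/s".
# $1 = URL
measure() {
    bps=$(curl -s -o /dev/null --max-time 300 -w '%{speed_download}' "$1") || return 1
    echo "$(date '+%H:%M:%S') $(to_mbps "$bps") MB/s"
}

# Only hit the network when a URL is given on the command line.
if [ $# -gt 0 ]; then
    measure "$1"
fi
```

Every line is then logged in MB/s, which avoids the mix of KB/s and MB/s seen in the log. In cron implementations that support step values, the minute field `0,5,10,...,55` can also be written as `*/5`.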

Status: RESOLVED → REOPENED

Resolution: DUPLICATE → ---

Dustin J. Mitchell [:dustin] (he/him)

Comment 19

•

13 years ago

Since the expected fix (bug 662071) is still not complete, I don't think we need to plan further fixes just yet.  Let's leave this open to measure the speeds before/after the fix, and determine whether further action is worthwhile at that point.

Ravi Pina [:ravi]

Comment 20

•

13 years ago

Emergency maintenance is scheduled for Jul 19 at 1800 PDT (0100 UTC) to fix the issues in bug 671366, which impact SCL1. Unfortunately, RelEng is blocking any work that would impact SCL1, so until they can grant a window this is essentially a WONTFIX.

Severity: normal → enhancement

Status: REOPENED → NEW

Flags: needs-treeclosure?

Zandr Milewski [:zandr]

Comment 21

•

13 years ago

Raising priority and assigning to RelEng as a proxy for buildduty.

We need a window to get this work done.

Assignee: server-ops-releng → nobody

Severity: enhancement → critical

Component: Server Operations: RelEng → Release Engineering

QA Contact: zandr → release

Armen [:armenzg]

Assignee

Comment 22

•

13 years ago

I believe joduinn got a downtime window for tomorrow:
http://groups.google.com/group/mozilla.dev.planning/browse_thread/thread/8c947d18bc4df813#

Are these comments from before he got the permission?

I see a comment from joduinn 2-3 hours later saying that things got talked out with ravi in bug 662071.

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Assignee: nobody → armenzg

Priority: -- → P2

Dustin J. Mitchell [:dustin] (he/him)

Comment 23

•

13 years ago

Quick summary of the status, based on dependent bugs:

After the downtime for bug 671366 last Tuesday, releng is the only user of the SRX240s in sjc1, but the scl1-sjc1 link for the build network is still traversing the VPN, which is terminated in those SRX240s.

Bug 662071 is tracking another attempt to switch to the P2P link, which we expect to be faster.  That's not scheduled yet, and I'm going to take myself out of the loop for getting that scheduled -- but it should :)

Aki Sasaki (not active)

Comment 24

•

13 years ago

Clearing the treeclosure flag until we decide on a time for bug 662071; let me know if it's looking like that'll happen this week.

Flags: needs-treeclosure?

Armen [:armenzg]

Assignee

Comment 25

•

13 years ago

Bug 662071 has now been fixed and we have better transfer rates.
The measurements show a 10-15x increase:

17:00:01 1.05 MB/s
17:05:01 2.16 MB/s
17:10:01 1.34 MB/s
17:15:01 1.04 MB/s
17:20:01 1.92 MB/s
17:25:01 1.70 MB/s
17:30:01 27.7 MB/s
17:35:01 21.5 MB/s
17:40:01 16.8 MB/s
17:45:02 18.4 MB/s
17:50:01 17.2 MB/s
17:55:01 25.3 MB/s
18:00:01 29.2 MB/s
18:05:01 20.3 MB/s
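
The size of the improvement can be checked by averaging the log lines on either side of the jump. This is a sketch that assumes the cutover falls between the 17:25 and 17:30 samples, which is where the rates change. `mean_rate` is a hypothetical helper that also normalizes KB/s entries, with 1 MB taken as 1024 KB.

```shell
#!/bin/sh
# Average the rates of "HH:MM:SS <value> <unit>" log lines read from stdin, in MB/s.
mean_rate() {
    LC_ALL=C awk '
        $3 == "KB/s" { $2 = $2 / 1024 }   # normalize KB/s entries to MB/s
        { sum += $2; n++ }
        END { if (n) printf "%.3f\n", sum / n }'
}

before=$(mean_rate <<'EOF'
17:00:01 1.05 MB/s
17:05:01 2.16 MB/s
17:10:01 1.34 MB/s
17:15:01 1.04 MB/s
17:20:01 1.92 MB/s
17:25:01 1.70 MB/s
EOF
)

after=$(mean_rate <<'EOF'
17:30:01 27.7 MB/s
17:35:01 21.5 MB/s
17:40:01 16.8 MB/s
17:45:02 18.4 MB/s
17:50:01 17.2 MB/s
17:55:01 25.3 MB/s
18:00:01 29.2 MB/s
18:05:01 20.3 MB/s
EOF
)

echo "before: $before MB/s, after: $after MB/s"
LC_ALL=C awk -v a="$after" -v b="$before" 'BEGIN { printf "speedup: %.1fx\n", a / b }'
```

With these samples the mean goes from about 1.5 MB/s to about 22 MB/s, a speedup of roughly 14x.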

Status: NEW → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Assignee

Comment 26

•

13 years ago

Attached image: screenshot showing setup improvements on testing slaves

This shows that the setup time on the slaves has now become consistent since the download times are now consistent.

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Whiteboard: [buildfaster:p1] p2p link enabled on Aug. 9th

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering