Status

RESOLVED FIXED
Priority: --
Severity: critical
Opened: 5 years ago
Last modified: 7 months ago

People

(Reporter: emorley, Assigned: jhopkins)

Tracking

Keywords: sheriffing-P1

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
"The connection was reset
The connection to the server was reset while the page was loading..."

Other trees are fine (albeit slow), presuming it's just the sheer number of inbound job types? (a la bug 816934, bug 815556, bug 749081, bug 756532) ... though there should be less of a difference between inbound and the other integration trees nowadays surely..?

Whilst we do have the CLOBBER file now, this is still potentially tree-closing-worthy over anything other than the very short term.
(Reporter)

Comment 1

5 years ago
Guessing inbound has a lot more clobber history that's slowing things down too.
Depends on: 827790

Comment 2

5 years ago
Adding buildduty.

Comment 3

5 years ago
hi Ed;

(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> "The connection was reset
> The connection to the server was reset while the page was loading..."
> 
> Other trees are fine (albeit slow), presuming it's just the sheer number of
> inbound job types? (a la bug 816934, bug 815556, bug 749081, bug 756532) ...
> though there should be less of a difference between inbound and the other
> integration trees nowadays surely..?
The number of job types should be fairly similar across the 3 integration branches, iirc. 

Since we started official sheriff coverage on b2g-inbound and fx-team, we're seeing load *decrease* significantly on mozilla-inbound, as developers now land on the less crowded b2g-inbound/fx-team. For context, last month saw load as follows: try (48.5%), mozilla-inbound (16.3%), b2g-inbound (8.2%), fx-team (6.2%). More details here: http://oduinn.com/blog/2013/09/02/infrastructure-load-for-august-2013/

Are you seeing similar problems with b2g-inbound or fx-team?
Flags: needinfo?(emorley)
(Reporter)

Comment 4

5 years ago
(In reply to John O'Duinn [:joduinn] from comment #3)
> Since we started official sheriff coverage on b2g-inbound and fx-team, we're
> seeing load *decrease* significantly on mozilla-inbound, as developers now
> land on the less crowded b2g-inbound/fx-team. For context, last month saw
> load as follows: try (48.5%), mozilla-inbound (16.3%), b2g-inbound (8.2%),
> fx-team (6.2%). More details here:
> http://oduinn.com/blog/2013/09/02/infrastructure-load-for-august-2013/

Yup, I read your blog posts ;-)

Comment 0 mentions number of job _types_, not number of jobs run per <unit of time> :-)

> Are you seeing similar problems with b2g-inbound or fx-team?

No (per comment 0).

Cheers :-)
Flags: needinfo?(emorley)

Comment 5

5 years ago
FWIW, I can't load this at all at the moment.

Comment 6

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #5)
> FWIW, I can't load this at all at the moment.

Same. It prompted me for my LDAP credentials, but isn't loading anything.

Comment 7

5 years ago
(In reply to Wes Kocher (:KWierso) from comment #6)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #5)
> > FWIW, I can't load this at all at the moment.
> 
> Same. It prompted me for my LDAP credentials, but isn't loading anything.

And as soon as I made that comment, the page finally loaded...
(Reporter)

Updated

5 years ago
Keywords: sheriffing-P1

Comment 8

5 years ago
sheeri, dustin, do you know if the DBs are running any slower?

The web page: https://secure.pub.build.mozilla.org/clobberer/?branch=mozilla-inbound
The code lives in here: http://hg.mozilla.org/build/tools/file/fb9d7d5eeb13/clobberer
Flags: needinfo?(scabral)
Flags: needinfo?(dustin)

Comment 9

5 years ago
There's also some DB performance talk in bug 827790.

Comment 10

5 years ago
The extent of my knowledge here is that the clobberer PHP is particularly inefficient. There was some work underway to fix that, but I don't know how much was completed. I'll leave the DB question to sheeri.
Flags: needinfo?(dustin)

Comment 11

5 years ago
I'm seeing a lot less throughput, with a corresponding dip in memory usage, but nothing that indicates a problem on the server (e.g. no high CPU). There is also a dip in interface packets, which means the throughput is likely due to not getting a lot of requests, not that the DB was slow.

Adding in casey, as there was some scl3 load balancer wonkiness yesterday, and I'm not sure if the timing works out to be the same, but it's something to look at / rule out.
Flags: needinfo?(scabral) → needinfo?(cransom)
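
Purely to illustrate the diagnostic reasoning in comment 11 (a minimal, hypothetical Python sketch, not the monitoring actually used here; the host and credentials are placeholders), sampling MySQL's cumulative Questions counter twice gives a query rate that helps tell "few requests are arriving" apart from "the database is slow":

# Hypothetical sketch: estimate queries/second on the DB by sampling the
# cumulative "Questions" status counter twice. Placeholder host/credentials.
import time

import mysql.connector  # third-party: pip install mysql-connector-python


def questions(cursor):
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
    _, value = cursor.fetchone()
    if isinstance(value, (bytes, bytearray)):
        value = value.decode()
    return int(value)


conn = mysql.connector.connect(host="clobberer-db.example.com",  # placeholder
                               user="monitor", password="placeholder")
cur = conn.cursor()

before = questions(cur)
time.sleep(60)
after = questions(cur)

# A low rate here, combined with normal CPU/memory on the DB host, points at
# traffic not reaching the backend (e.g. a load balancer problem) rather than
# the database itself being slow.
print(f"~{(after - before) / 60.0:.1f} queries/second over the last minute")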

Comment 12

5 years ago
I'm a poor contact for the load balancers as I don't own them; passing to webops.
Flags: needinfo?(cransom) → needinfo?(bburton)

Comment 13

5 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> I'm seeing a lot less throughput, with a corresponding dip in memory usage,
> but nothing that indicates a problem on the server (e.g. no high CPU). There
> is also a dip in interface packets, which means the throughput is likely due
> to not getting a lot of requests, not that the DB was slow.
> 
> Adding in casey, as there was some scl3 load balancer wonkiness yesterday,
> and I'm not sure if the timing works out to be the same, but it's something
> to look at / rule out.

There were brief problems with two of the four public load balancer servers in SCL3 on Monday, 9/9, which were resolved by 9:30AM.

From the times/dates in this bug I do not believe these are related.

I am taking a look at the problem URL to see if I can help assess the cause of the slowness.
Flags: needinfo?(bburton)

Comment 14

5 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> I'm seeing a lot less throughput, with a corresponding dip in memory usage,
> but nothing that indicates a problem on the server (e.g. no high CPU). There
> is also a dip in interface packets, which means the throughput is likely due
> to not getting a lot of requests, not that the DB was slow.
> 
> Adding in casey, as there was some scl3 load balancer wonkiness yesterday,
> and I'm not sure if the timing works out to be the same, but it's something
> to look at / rule out.

Additionally, the Traffic IP Group this site uses is on zlb1.ops.scl3, which was unaffected: https://www.zlb.ops.scl3.mozilla.com:9090/apps/zxtm/index.fcgi?name=releng-zlb.vips.scl3.mozilla.com&section=Traffic%20IP%20Groups%3AEdit :: 63.245.215.57/24 zlb1.ops.scl3.mozilla.com/eth0.5
(Assignee)

Comment 15

5 years ago
Created attachment 803353 [details] [diff] [review]
[tools] fetch clobber times in batches

With this patch, clobberer takes 1/2 as long to produce identical output.
Assignee: nobody → jhopkins
Status: NEW → ASSIGNED
Attachment #803353 - Flags: review?(catlee)
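
To illustrate the idea behind the patch in comment 15 (the real clobberer is PHP and lives in the build/tools repository linked above; the table and column names in this Python/SQLite sketch are hypothetical), a single grouped query can fetch the latest clobber time for every builder on a branch, replacing a query-per-builddir loop:

# Hypothetical sketch of batched clobber-time lookups; schema is illustrative only.
import sqlite3
from collections import defaultdict


def clobber_times_batched(conn, branch):
    """Return {builddir: [(buildername, latest_clobber), ...]} using one query."""
    cur = conn.execute(
        "SELECT builddir, buildername, MAX(lastclobber)"
        " FROM clobber_times WHERE branch = ?"
        " GROUP BY builddir, buildername",
        (branch,),
    )
    result = defaultdict(list)
    for builddir, buildername, lastclobber in cur:
        result[builddir].append((buildername, lastclobber))
    return dict(result)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clobber_times"
        " (branch TEXT, builddir TEXT, buildername TEXT, lastclobber INTEGER)"
    )
    conn.executemany(
        "INSERT INTO clobber_times VALUES (?, ?, ?, ?)",
        [
            ("mozilla-inbound", "mozilla-inbound-linux", "Linux build", 1378700000),
            ("mozilla-inbound", "mozilla-inbound-linux", "Linux build", 1378800000),
            ("mozilla-inbound", "mozilla-inbound-win32", "WINNT 5.2 build", 1378750000),
        ],
    )
    print(clobber_times_batched(conn, "mozilla-inbound"))

Issuing one query like this per page load, instead of one per builddir, is the kind of change that can halve the page generation time while producing identical output.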

Comment 16

5 years ago
So I determined this to be a combination of a couple of load balancer timeout settings that needed to be increased.

With the timeouts bumped from 10 seconds to 120 seconds, the page now loads for me in anywhere between 45s and 80s. See http://bits.inatree.org/images/Mozilla_Buildbot_Clobberer_17E13117.png
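
As a rough client-side sanity check of the new behaviour (a minimal sketch assuming the page is reachable and sits behind HTTP basic auth with LDAP credentials; the credentials here are placeholders), one can time a request with a client timeout comfortably above the new 120-second limit:

# Hypothetical sketch: time the clobberer page with a generous client timeout.
import time

import requests  # third-party: pip install requests

URL = "https://secure.pub.build.mozilla.org/clobberer/?branch=mozilla-inbound"

start = time.monotonic()
try:
    resp = requests.get(URL, timeout=130, auth=("ldap-user", "ldap-password"))
    elapsed = time.monotonic() - start
    print(f"HTTP {resp.status_code} after {elapsed:.1f}s ({len(resp.content)} bytes)")
except requests.exceptions.Timeout:
    print("No response within 130s; something is still timing out")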
Attachment #803353 - Flags: review?(catlee) → review+
(Assignee)

Comment 17

5 years ago
Comment on attachment 803353 [details] [diff] [review]
[tools] fetch clobber times in batches

Landed in https://hg.mozilla.org/build/tools/rev/1ba62dad0efa
Attachment #803353 - Flags: checked-in+

Comment 18

5 years ago
Pushed to prod and stage.
(Assignee)

Comment 19

5 years ago
Right now it's taking about 12 seconds to load https://secure.pub.build.mozilla.org/clobberer/index.php?branch=mozilla-inbound
(Assignee)

Updated

5 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Reporter)

Comment 20

5 years ago
Thank you for fixing this! :-)

Comment 21

5 years ago
(In reply to Brandon Burton [:solarce] from comment #16)
> So I determined this to be a combination of a couple load balancer time out
> settings that needed to be increased.
> 
> With the timeouts being bumped from 10 seconds to 120 seconds the page now
> loads for me, anywhere between 45s and 80s. See
> http://bits.inatree.org/images/Mozilla_Buildbot_Clobberer_17E13117.png

(In reply to John Hopkins (:jhopkins) from comment #19)
> Right now it's taking about 12 seconds to load
> https://secure.pub.build.mozilla.org/clobberer/index.php?branch=mozilla-
> inbound

The clobberer has been noticeably faster to load the last two days since this work was done. Thanks again John and Brandon!

Comment 22

5 years ago
Do we need to decrease the timeouts now, or keep them in for safety's sake?
(Reporter)

Comment 23

5 years ago
I think we should keep them in: the tool is only used by a small number of people, so it shouldn't cause too much upstream load were we to have a regression, and all of them will be pretty vocal if load times suddenly increase :-)

Updated

7 months ago
Product: Release Engineering → Infrastructure & Operations