Closed Bug 1047190: Opened 10 years ago, Closed 10 years ago

Intermittent slowness on download.m.o and bounceradmin.m.c

Categories: Infrastructure & Operations Graveyard :: WebOps: Product Delivery (task)
Hardware/OS: x86 / All
Type: task
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas; Assigned: rbryce
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/671]

We have a nagios check on a few products to make sure they return a redirect:
Thu 19:54:07 PDT [4453] buildbot-master81.srv.releng.scl3.mozilla.com:bouncer is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds
This hits download.m.o, but I don't have any information about which DC it's getting.
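
For reference, the check boils down to roughly the following (a minimal Python sketch of the logic only, not the real plugin configuration; the product string and 30-second timeout are illustrative):

    # Hypothetical stand-in for the nagios check: ask download.m.o for a
    # product and treat anything other than a redirect as CRITICAL.
    import sys
    import requests

    URL = "https://download.mozilla.org/"
    PARAMS = {"product": "firefox-latest", "os": "win", "lang": "en-US"}

    try:
        resp = requests.get(URL, params=PARAMS, allow_redirects=False, timeout=30)
    except requests.RequestException as exc:
        print("CRITICAL: %s" % exc)
        sys.exit(2)  # nagios CRITICAL

    if resp.is_redirect:
        print("OK: %s -> %s" % (resp.status_code, resp.headers.get("Location")))
        sys.exit(0)  # nagios OK

    print("CRITICAL: expected a redirect, got %s" % resp.status_code)
    sys.exit(2)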

The release automation tried to submit some data at 19:11 for a beta release to bounceradmin.m.c (phx1 only), which failed without giving a good reason.

So I tried loading the admin UI, e.g. https://bounceradmin.mozilla.com/admin/, and there are intermittent requests which take > 5s, sometimes 15s, to return CSS files and images. This always seems to be on bouncer4.webapp.phx1.mozilla.com, and the other nodes are OK. Inventory says this is on a SeaMicro chassis, so it may have been rebooted today.

Could we investigate bouncer4.webapp.phx1, or at least drain it from the zlb(s)?
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/671]
On the load balancer www.zlb.ops.phx1.mozilla.com I set bouncer4.webapp.phx1.mozilla.com (10.8.81.155:81) to draining in the pool bouncer01.zlb.phx.mozilla.net:81. This didn't help, so I restored it to active.
(In reply to Nick Thomas [:nthomas] from comment #0)
> We have a nagios check on a few products to make sure they return a redirect:
> Thu 19:54:07 PDT [4453]
> buildbot-master81.srv.releng.scl3.mozilla.com:bouncer is CRITICAL:
> CHECK_NRPE: Socket timeout after 30 seconds
> This hits download.m.o, but I don't have any information about which DC it's
> getting.
> 
> The release automation tried to submit some data at 19:11 for a beta release
> to bounceradmin.m.c (phx1 only), which failed without giving a good reason.
> 
> So I tried loading the admin UI, e.g. https://bounceradmin.mozilla.com/admin/,
> and there are intermittent requests which take > 5s, sometimes 15s, to
> return CSS files and images. This always seems to be on
> bouncer4.webapp.phx1.mozilla.com, and the other nodes are OK. Inventory says
> this is on a SeaMicro chassis, so it may have been rebooted today.
> 
> Could we investigate bouncer4.webapp.phx1, or at least drain it from the
> zlb(s)?

We took some steps in Bug 1045432 just now that should alleviate the traffic issues we're seeing from this node. Worst case, we'll just remove bouncer4 from the cluster altogether. Thanks for the heads up.
I'm still seeing intermittent alerts from nagios for our bouncer checks, and hangs on some assets when loading https://bounceradmin.mozilla.com/admin/. Also, if I hit
   http://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US&${RANDOM}
repeatedly, a small proportion of requests hang.

I get a sensible response from bouncerN.webapp.phx1.mozilla.com for N in 1..3 and 5..10, but never a good response from 4.
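
For anyone else poking at this, the probing above amounts to roughly the following (a rough sketch; the 50 iterations and 5s "slow" threshold are arbitrary choices, not anything our monitoring uses):

    # Hit download.m.o repeatedly with a random cache-busting parameter
    # (the ${RANDOM} above) and flag requests that hang or fail.
    import random
    import time
    import requests

    URL = "http://download.mozilla.org/"
    slow = 0

    for _ in range(50):
        params = {
            "product": "firefox-latest",
            "os": "win",
            "lang": "en-US",
            "cachebust": random.randint(0, 10**9),  # stand-in for ${RANDOM}
        }
        start = time.monotonic()
        try:
            resp = requests.get(URL, params=params, allow_redirects=False, timeout=30)
            elapsed = time.monotonic() - start
            if elapsed > 5:
                slow += 1
                print("slow: %.1fs, status %s" % (elapsed, resp.status_code))
        except requests.RequestException as exc:
            slow += 1
            print("request failed: %s" % exc)

    print("%d/50 requests were slow or hung" % slow)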
We are also seeing intermittent status code 500 responses from download.mozilla.org in Selenium tests that exercise the download button on www.mozilla.org. I'll paste the relevant line from http://selenium.qa.mtv2.mozilla.com:8080/job/mozilla.com.prod.saucelabs/17235/console because it requires VPN access:

E       AssertionError: Expected status code 302.  Lang 'zh-CN' https://download.mozilla.org/?product=firefox-24.7.0esr-SSL&os=win&lang=zh-CN link: status 500
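
The suite drives the page with Selenium, but the failing status check itself amounts to roughly this (a simplified sketch, not the actual test code; the locale list is just an example, not the full matrix the suite covers):

    # For each locale's download link, expect a 302 redirect rather than a 500.
    import requests

    URL = "https://download.mozilla.org/"
    LOCALES = ["en-US", "de", "zh-CN"]  # illustrative subset

    for lang in LOCALES:
        params = {"product": "firefox-24.7.0esr-SSL", "os": "win", "lang": lang}
        resp = requests.get(URL, params=params, allow_redirects=False, timeout=30)
        assert resp.status_code == 302, (
            "Expected status code 302. Lang %r link: status %s"
            % (lang, resp.status_code)
        )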
We're aiming to ship 32.0b3 today, and this is blocking it.
Severity: major → critical
Assignee: server-ops-webops → rbryce
:jakem just drained the bouncer nodes in this chassis from the load balancer. Bouncer1 is starting to be affected by the same issue as bouncer4 on the chassis. I am currently engaged with SeaMicro to fix this. You are good to ship; we have plenty of capacity on the bouncer cluster.
(In reply to Rick Bryce [:rbryce] from comment #6)
> :jakem just drained the bouncer nodes in this chassis from the load
> balancer. Bouncer1 is starting to be affected by the same issue as bouncer4
> on the chassis. I am currently engaged with SeaMicro to fix this. You are
> good to ship; we have plenty of capacity on the bouncer cluster.

Thanks. That seems to have cleared up the symptoms we were seeing.
I think this can be closed? We're not going to put those nodes back in service, so no recurrence is expected. The only thing that might (probably will) happen is to spin up some bouncer capacity on VMs, but that's out of scope of this bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard