Closed Bug 1047190: Opened 10 years ago, Closed 10 years ago

Intermittent slowness on download.m.o and bounceradmin.m.c

Categories: Infrastructure & Operations Graveyard :: WebOps: Product Delivery (task)
Hardware/OS: x86 / All
Type: task
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas; Assigned: rbryce
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/671]

We have a nagios check on a few products to make sure they return a redirect:
Thu 19:54:07 PDT [4453] buildbot-master81.srv.releng.scl3.mozilla.com:bouncer is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds
This hits download.m.o, but I don't have any information about which DC it's getting.
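
For reference, the check boils down to roughly the following (a minimal Python sketch of the logic only, not the real plugin configuration; the product string and 30-second timeout are illustrative):

    # Hypothetical stand-in for the nagios check: ask download.m.o for a
    # product and treat anything other than a redirect as CRITICAL.
    import sys
    import requests

    URL = "https://download.mozilla.org/"
    PARAMS = {"product": "firefox-latest", "os": "win", "lang": "en-US"}

    try:
        resp = requests.get(URL, params=PARAMS, allow_redirects=False, timeout=30)
    except requests.RequestException as exc:
        print("CRITICAL: %s" % exc)
        sys.exit(2)  # nagios CRITICAL

    if resp.is_redirect:
        print("OK: %s -> %s" % (resp.status_code, resp.headers.get("Location")))
        sys.exit(0)  # nagios OK

    print("CRITICAL: expected a redirect, got %s" % resp.status_code)
    sys.exit(2)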

The release automation tried to submit some data at 19:11 for a beta release to bounceradmin.m.c (phx1 only), which failed without giving a good reason.

So I tried loading the admin UI, e.g. https://bounceradmin.mozilla.com/admin/, and there are intermittent requests which take > 5s, sometimes 15s, to return CSS files and images. This always seems to be on bouncer4.webapp.phx1.mozilla.com, and the other nodes are OK. Inventory says this is on a SeaMicro chassis, so it may have been rebooted today.

Could we investigate bouncer4.webapp.phx1, or at least drain it from the zlb(s)?
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/671]
On the load balancer www.zlb.ops.phx1.mozilla.com I set bouncer4.webapp.phx1.mozilla.com (10.8.81.155:81) to draining in the pool bouncer01.zlb.phx.mozilla.net:81. This didn't help, so I restored it to active.
(In reply to Nick Thomas [:nthomas] from comment #0)
> We have a nagios check on a few products to make sure they return a redirect:
> Thu 19:54:07 PDT [4453]
> buildbot-master81.srv.releng.scl3.mozilla.com:bouncer is CRITICAL:
> CHECK_NRPE: Socket timeout after 30 seconds
> This hits download.m.o, but I don't have any information about which DC it's
> getting.
> 
> The release automation tried to submit some data at 19:11 for a beta release
> to bounceradmin.m.c (phx1 only), which failed without giving a good reason.
> 
> So I tried loading the admin UI, e.g. https://bounceradmin.mozilla.com/admin/,
> and there are intermittent requests which take > 5s, sometimes 15s, to
> return CSS files and images. This always seems to be on
> bouncer4.webapp.phx1.mozilla.com, and the other nodes are OK. Inventory says
> this is on a SeaMicro chassis, so it may have been rebooted today.
> 
> Could we investigate bouncer4.webapp.phx1, or at least drain it from the
> zlb(s)?

We took some steps in Bug 1045432 just now that should alleviate the traffic issues we're seeing from this node. Worst case, we'll just remove bouncer4 from the cluster altogether. Thanks for the heads up.
I'm still seeing intermittent alerts from nagios for our bouncer checks, and hangs on some assets when loading https://bounceradmin.mozilla.com/admin/. Also, if I hit
   http://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US&${RANDOM}
repeatedly, a small proportion of requests hang.

I get a sensible response from bouncerN.webapp.phx1.mozilla.com for N in 1..3 and 5..10, but never a good response from 4.
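
For anyone else poking at this, the probing above amounts to roughly the following (a rough sketch; the 50 iterations and 5s "slow" threshold are arbitrary choices, not anything our monitoring uses):

    # Hit download.m.o repeatedly with a random cache-busting parameter
    # (the ${RANDOM} above) and flag requests that hang or fail.
    import random
    import time
    import requests

    URL = "http://download.mozilla.org/"
    slow = 0

    for _ in range(50):
        params = {
            "product": "firefox-latest",
            "os": "win",
            "lang": "en-US",
            "cachebust": random.randint(0, 10**9),  # stand-in for ${RANDOM}
        }
        start = time.monotonic()
        try:
            resp = requests.get(URL, params=params, allow_redirects=False, timeout=30)
            elapsed = time.monotonic() - start
            if elapsed > 5:
                slow += 1
                print("slow: %.1fs, status %s" % (elapsed, resp.status_code))
        except requests.RequestException as exc:
            slow += 1
            print("request failed: %s" % exc)

    print("%d/50 requests were slow or hung" % slow)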
We are also seeing intermittent status code 500 responses from download.mozilla.org in Selenium tests that exercise the download button on www.mozilla.org. I'll paste the relevant line from http://selenium.qa.mtv2.mozilla.com:8080/job/mozilla.com.prod.saucelabs/17235/console because it requires VPN access:

E       AssertionError: Expected status code 302.  Lang 'zh-CN' https://download.mozilla.org/?product=firefox-24.7.0esr-SSL&os=win&lang=zh-CN link: status 500
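
The suite drives the page with Selenium, but the failing status check itself amounts to roughly this (a simplified sketch, not the actual test code; the locale list is just an example, not the full matrix the suite covers):

    # For each locale's download link, expect a 302 redirect rather than a 500.
    import requests

    URL = "https://download.mozilla.org/"
    LOCALES = ["en-US", "de", "zh-CN"]  # illustrative subset

    for lang in LOCALES:
        params = {"product": "firefox-24.7.0esr-SSL", "os": "win", "lang": lang}
        resp = requests.get(URL, params=params, allow_redirects=False, timeout=30)
        assert resp.status_code == 302, (
            "Expected status code 302. Lang %r link: status %s"
            % (lang, resp.status_code)
        )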
We're aiming to ship 32.0b3 today, and this is blocking it.
Severity: major → critical
Assignee: server-ops-webops → rbryce
:jakem just drained the bouncer nodes in this chassis from the load balancer. Bouncer1 is starting to be affected by the same issue as bouncer4 on the chassis. I am currently engaged with SeaMicro to fix this. You are good to ship; we have plenty of capacity on the bouncer cluster.
(In reply to Rick Bryce [:rbryce] from comment #6)
> :jakem just drained the bouncer nodes in this chassis from the load
> balancer. Bouncer1 is starting to be affected by the same issue as bouncer4
> on the chassis. I am currently engaged with SeaMicro to fix this. You are
> good to ship; we have plenty of capacity on the bouncer cluster.

Thanks. That seems to have cleared up the symptoms we were seeing.
I think this can be closed? We're not going to put those nodes back in service, so no recurrence is expected. The only thing that might (probably will) happen is to spin up some bouncer capacity on VMs, but that's out of scope of this bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard