Bug 883018 (closed: RESOLVED FIXED)

intermittent long response times from bugzilla.mozilla.org

Product/Component: bugzilla.mozilla.org :: Infrastructure (defect)
Severity: normal
Reporter: glob
Assignee: fox2mike
Attachments: 6 files
Attached image response times
on irc there's been a lot of reports of "slow bugzilla"-ness.

in the logs i'm seeing intermittent long response times, for any actions.  i can't see any pattern to it.

there's nothing of note in the graphite graphs, for the webheads or the databases.  sheeri has looked at the state of the databases and found nothing of interest.

examples:

time     webhead        size  secs request
20:58:30 1            -  27.0 GET show_bug.cgi?id=915
20:58:54 4        41.8k   7.0 GET show_bug.cgi?id=528460
20:58:57 3       320.9k   9.3 GET show_bug.cgi?id=202468
20:59:02 5            -  16.4 GET show_bug.cgi?id=17754
20:59:05 1 u     249.9k  15.9 GET show_bug.cgi?id=870320
20:58:55 1 u     145.9k  46.5 POST attachment.cgi
21:02:53 5   Gbt  91.2k  24.3 GET buglist.cgi?bug_status=REOPENED..

this wasn't caused by the latest push; it started happening just prior to that (see attached image - the pushes are the yellow dots).
looks like "something" happened on june 12th at around 10am.
i don't know if it's related, but there have been two mysql "duplicate entry for key" errors since the 12th:

web3: [Thu Jun 13 11:17:45 2013] [error] DBD::mysql::db do failed: Duplicate entry '880605-308534' for key 'cc_bug_id_idx' [for Statement "INSERT INTO cc (bug_id, who) VALUES (?,?)"] at /data/www/bugzilla.mozilla.org/Bugzilla/Bug.pm line 842. [#1944bdfd1e1848e2b7217d1b2c6aa5f9]

web5: [Wed Jun 12 18:42:52 2013] [error] DBD::mysql::db do failed: Duplicate entry 'rae.barton@gemalto.com' for key 'profiles_login_name_idx' [for Statement "INSERT INTO profiles (is_enabled, realname, comment_count, cryptpassword, creation_ts, login_name) VALUES (?,?,?,?,?,?)"] at /data/www/bugzilla.mozilla.org/Bugzilla/Object.pm line 606. [#74afc9b706284cf987b917300c7fb407]
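
For anyone unfamiliar with this class of error, here's a minimal sketch (not Bugzilla code) of how two requests racing to add the same cc row trip a unique index. sqlite3 stands in for MySQL so the example is self-contained; the table and index names are copied from the log line above purely for illustration.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cc (bug_id INTEGER NOT NULL, who INTEGER NOT NULL)")
db.execute("CREATE UNIQUE INDEX cc_bug_id_idx ON cc (bug_id, who)")

def add_cc(bug_id, who):
    # both racing requests reach this INSERT; only the first can succeed
    db.execute("INSERT INTO cc (bug_id, who) VALUES (?, ?)", (bug_id, who))

add_cc(880605, 308534)        # first request wins
try:
    add_cc(880605, 308534)    # the racing duplicate hits the unique index
except sqlite3.IntegrityError as err:
    # DBD::mysql reports the same condition as
    # "Duplicate entry '880605-308534' for key 'cc_bug_id_idx'"
    print("duplicate entry:", err)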


i just had a simple bug update take 28 seconds:

web4: 10.22.82.208 - 13647 [13/Jun/2013:23:19:11 -0700] "POST /process_bug.cgi HTTP/1.1" 200 116376 "https://bugzilla.mozilla.org/show_bug.cgi?id=873568" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20130613 Firefox/24.0" 28 bugzilla.mozilla.org 28372915
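
As a rough illustration of how response times can be pulled out of lines like the one above: the two trailing numeric fields (28 and 28372915) look like Apache's %T (whole seconds) and %D (microseconds), though the exact LogFormat on the webheads is an assumption here. A minimal parsing sketch under that assumption:

import re

# Assumed layout, inferred from the sample line above; not the exact format
# string the bmo webheads use.
LOG_RE = re.compile(
    r'^(?P<host>\S+): (?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+) '
    r'"[^"]*" "[^"]*" (?P<secs>\d+) (?P<vhost>\S+) (?P<usecs>\d+)$'
)

def response_time(line):
    """Return (timestamp, request, seconds) for one access-log line, or None."""
    m = LOG_RE.match(line.strip())
    if not m:
        return None
    return m.group("ts"), m.group("request"), int(m.group("usecs")) / 1e6

sample = ('web4: 10.22.82.208 - 13647 [13/Jun/2013:23:19:11 -0700] '
          '"POST /process_bug.cgi HTTP/1.1" 200 116376 '
          '"https://bugzilla.mozilla.org/show_bug.cgi?id=873568" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) '
          'Gecko/20130613 Firefox/24.0" 28 bugzilla.mozilla.org 28372915')
print(response_time(sample))  # -> ('13/Jun/2013:23:19:11 -0700', 'POST /process_bug.cgi HTTP/1.1', 28.372915)
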
given this started between code pushes, this looks like an infrastructure issue; moving.
Severity: critical → major
Component: General → WebOps: Bugzilla
Product: bugzilla.mozilla.org → Infrastructure & Operations
QA Contact: nmaul
Version: Production → other
cc'ing Shyam in case he knows of anything that happened June 12th around 10 am to cause this slowness.
Filed bug 883432 to have Zeus log timing information about bugzilla requests.
(In reply to Sheeri Cabral [:sheeri] from comment #4)
> cc'ing Shyam in case he knows of anything that happened June 12th around 10
> am to cause this slowness.

I don't know anything off the top of my head as to what might be causing this. I'll take a deeper look on Sunday, when I'm oncall.
Unfortunately New Relic doesn't currently support Perl applications, or that would be a wonderful solution. We can get NR server monitoring... that won't profile the Bugzilla app itself, but it might help pinpoint various kinds of server-level problems like CPU, network I/O, disk, etc. I've set this up on all the bugzilla nodes, but it won't work until bug 883606 is resolved (firewall rules between the bugzilla nodes and New Relic).

A quick look around on the web servers was unrevealing... no obvious memory, disk, or CPU concerns.

Catchpoint was also unrevealing. We do monitor bugzilla.mozilla.org, but only the homepage. I've just changed this to monitor an actual bug page, so once that has some data we might have something to go on. It can't know what's wrong within Bugzilla (being an external monitoring service that just does HTTP requests), but it can potentially tell us if the problem is location-specific, or in some way tied to a particular resource (javascript execution, etc).
Assignee: glob → server-ops-webops
Severity: major → normal
(In reply to Jake Maul [:jakem] from comment #8)
> or in some way tied to a particular resource (javascript execution, etc).

i'm using the response time from the apache logs to generate my graphs, so anything client-side can be excluded.
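
Not the actual graphing pipeline, just a sketch of the aggregation step with made-up helper names: bucket the server-side response times by minute, so slow periods stand out and anything client-side stays out of the picture.

from collections import defaultdict
from statistics import mean

# Bucket (timestamp, seconds) samples pulled from the access logs by minute
# and report the average and worst response time per bucket.
def per_minute_times(samples):
    buckets = defaultdict(list)
    for ts, secs in samples:
        minute = ts[:17]              # '13/Jun/2013:23:19:11 -0700' -> '13/Jun/2013:23:19'
        buckets[minute].append(secs)
    return {minute: (mean(times), max(times)) for minute, times in sorted(buckets.items())}

samples = [("13/Jun/2013:23:19:11 -0700", 28.4),
           ("13/Jun/2013:23:19:40 -0700", 0.3),
           ("13/Jun/2013:23:20:05 -0700", 0.4)]
for minute, (avg, worst) in per_minute_times(samples).items():
    print(f"{minute}  avg={avg:.1f}s  max={worst:.1f}s")
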
Attached image updated response graph
here's a graph which shows how bad things got on friday, and how times settled down over the weekend.  i took this snapshot at 9:30am monday morning and you can already see the times trending up.

i'll investigate adding some simple instrumentation to show_bug.
Depends on: 884177
i've committed bug 884177 which adds simple instrumentation to process_bug.  unfortunately i wasn't able to get a push today due to dc issues taking oncall's attention.
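
The actual instrumentation in bug 884177 is Perl inside Bugzilla; the sketch below only illustrates the general technique, in Python with invented section names: wrap named sections of a request in a timer so a slow request can be broken down after the fact.

import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")

@contextmanager
def timed(section, timings):
    # accumulate wall-clock time spent in a named section of the request
    start = time.monotonic()
    try:
        yield
    finally:
        timings[section] = timings.get(section, 0.0) + (time.monotonic() - start)

def handle_request():
    timings = {}
    with timed("load_bug", timings):
        time.sleep(0.05)              # stand-in for fetching the bug from the DB
    with timed("update", timings):
        time.sleep(0.02)              # stand-in for applying the changes
    with timed("bugmail", timings):
        time.sleep(0.10)              # stand-in for generating bugmail
    logging.info("process_bug timings: %s",
                 {k: round(v, 3) for k, v in timings.items()})

handle_request()
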
Severity: normal → major
sorry, the severity change was accidental.
Severity: major → normal
this definitely started on the 12th june.
there's a night-and-day change in performance at around [12/Jun/2013:10:08:00 -0700].
Right, so it was before the Thu 13 June bugzilla push (we didn't think it was code-related anyway). Adding Sylvie V to this bug so she can get notified of updates.
Any chance to know whether that's due to the REST API or our official WebServices API putting too much load on bmo, or due to heavy access from the web browser?

It could be someone using our API making too frequent calls to bmo.
adding mrz, corey and ravi for their review and assistance.
(In reply to Frédéric Buclin from comment #15)
> Any chance to know if that due to REST API or our official WebServices API
> putting too much load on bmo, or if that's due to heavy access from the web
> browser?

the requests per second are at normal levels, as is the percentage of requests coming from the bzapi proxy (~15% in case you're wondering).  i also don't see anything abnormal in the requests being made.

it's worth noting that there isn't anything exhibiting high load or resource usage according to our monitoring tools, which makes this particularly perplexing.
so far there's nothing of interest in the timing logs.

bugmail generation is the cause of most of the slow process_bug requests i've seen; however, this scales with the number of recipients and is a known issue (bug 877078 has a fix for that).
Depends on: 884817
finally managed to capture data for a period when bmo was being slow.

with the data we currently have, it appears to be in the 'template' section.  however this is somewhat misleading, as a lot of data is lazy-loaded, so the template triggers database activity by hitting a method on the bug object (in other words 'template' is more than just rendering time).

in bug 884817 i've moved the loading of comments from the database out of the template and into the preload section, but that change isn't on production yet.

i find it interesting that not every request is slow, nor is the slowdown limited to a single webhead.
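
To illustrate the lazy-loading point (in Python rather than Bugzilla's Perl, with made-up names): the first access to bug.comments inside the template is what actually hits the database, so that cost is charged to the 'template' section unless the data is preloaded beforehand.

import time

def fetch_comments_from_db(bug_id):
    time.sleep(0.2)                    # stand-in for a slow comments query
    return [f"comment on bug {bug_id}"]

class Bug:
    def __init__(self, bug_id):
        self.bug_id = bug_id
        self._comments = None

    @property
    def comments(self):
        if self._comments is None:     # lazy: first access triggers the query
            self._comments = fetch_comments_from_db(self.bug_id)
        return self._comments

    def preload(self):
        _ = self.comments              # force the query before rendering starts

def render_template(bug):
    return "\n".join(bug.comments)     # without preload, the query runs in here

bug = Bug(883018)
# bug.preload()                        # uncomment: the 0.2s leaves the 'template' section
start = time.monotonic()
render_template(bug)
print(f"template section took {time.monotonic() - start:.2f}s")
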
This is working much better yesterday and today. :glob, are you seeing any unusual slowness at the moment?
I am currently sitting and waiting for >30s to get an enter_bug.cgi?product=Core page up, even after two reloads.
Attached image updated response graph
(In reply to Jake Maul [:jakem] from comment #20)
> This is working much better yesterday and today. :glob, are you seeing any
> unusual slowness at the moment?

yes; things are still very bad.
This is still horrible; most of the day today I've been waiting between 15 and 40 seconds just to load a bug, not even making any changes to the db.
To clarify priorities, I think this is currently a severe hindrance to effective engineering work. Please give this top priority.
Hi Andreas,

We have 3 teams working on this, trying to find the root cause. Along the way we're making minor tweaks for performance issues we find, but we have not found anything that has been performing poorly since the 13th, when this was reported.

We're playing whack-a-mole and trying to make performance acceptable, but we haven't found a smoking gun just yet. We don't think it's productive to noisily update this bug with everything we are trying. We're also trying not to just guess but actually get data to prove that yes, *this* is what caused the problems starting the 13th. So far, we haven't found what "this" is yet.

This is one of our highest priorities. We've given a lot of time and brainpower to this already, and we're continuing to do so. Remember that this affects everyone, including those of us in IT who are trying to fix the problem, and that we have frustrating intermittent problems with Bugzilla too.

I'm unsure why you think this is not a top priority. Lack of information means we have no information to give, not that we have not been working on it. Like just about every department at Mozilla, we are understaffed and prioritizing as best we can.
team - let's post regular updates on progress and what we have knocked off and what not. It helps the collective troubleshooting to see the areas that have been reviewed and the feedback on them.  Albert/Shyam - let's get an incident opened and daily updates on what the status is.
(In reply to SylvieV from comment #27)
> team -lets post regular updates on progress and what we have knocked off and
> what not. It helps the collective troubleshooting to see the areas that have
> been reviewed and feedback on them.  Albert/Shyam - lets get an incident
> opened and daily updates on what the status is.

Not adding to the fire - just another data point - just adding this note and adding a person to the CC list was really slow for me too. I haven't been active in bugzilla in the last 2 weeks so I can't say if it's the same or not, but it's definitely slower than my usual experience.
I hope we can move past discussing whether this is a problem or not. Here is some data one of my engineers captured:

---


On the example bug of https://bugzilla.mozilla.org/show_bug.cgi?id=882994, window.performance.timing went from (with no bugzilla extensions installed):

{
  "fetchStart": 1,
  "domainLookupStart": 2,
  "domainLookupEnd": 2,
  "connectStart": 1,
  "connectEnd": 1,
  "secureConnectionStart": null,
  "requestStart": 273,
  "requestEnd": null,
  "domLoading": 2508,
  "domInteractive": 2965,
  "domContentLoadedEventStart": 3029,
  "domContentLoadedEventEnd": 3084,
  "loadEventStart": 3414,
  "loadEventEnd": 3421
}

to

{
  "fetchStart": 1,
  "domainLookupStart": 3,
  "domainLookupEnd": 3,
  "connectStart": 1,
  "connectEnd": 1,
  "secureConnectionStart": null,
  "requestStart": 274,
  "requestEnd": null,
  "domLoading": 21110,
  "domInteractive": 21357,
  "domContentLoadedEventStart": 21490,
  "domContentLoadedEventEnd": 21609,
  "loadEventStart": 22585,
  "loadEventEnd": 22594
}
Here's a summary of what we've done so far:

1) Verified that things haven't changed on the Bugzilla machines (from an IT standpoint alone). The only packages installed in the time period are:

Jun 03 11:39:41 Updated: audit-mozilla-libs-2.1.3-3.el6.moz13.x86_64
Jun 03 11:39:41 Updated: audit-mozilla-2.1.3-3.el6.moz13.x86_64
Jun 03 11:39:41 Updated: audispd-mozilla-plugins-2.1.3-3.el6.moz13.x86_64
Jun 03 11:39:41 Updated: audit-mozilla-libs-python-2.1.3-3.el6.moz13.x86_64
Jun 12 20:56:11 Updated: nagios-plugins-check-updates-1.6.3-1.el6.x86_64

The closest Bugzilla pushes before and after the slowness began are:

Bug 881160
Bug 882554
Bug 884670

2) Verified that the timing on the DBs (in terms of how long a backend node takes to respond) is under a second. This is true for both reads and writes.

3) Looked through puppet reports and confirmed that there were no changes during the time period when the slowness began, other than the change for Bug 879888 (an update to /etc/hosts)

4) Spent time going through logs, both on the load balancer and on individual webheads, and haven't seen any obvious red flags pointing to issues on the machines.

5) The DBAs have spent a bunch of time going over the Bugzilla DB and optimizing a few queries as well as tuning some parameters.


We're convening a meeting tomorrow to spend some time going through Bugzilla from the ground up and see what we can find.

I'll be point from IT moving forward and will keep you all updated on what we find.
Assignee: server-ops-webops → shyam
(In reply to Sheeri Cabral [:sheeri] from comment #26)
> We're also trying not to just guess but actually get data to prove that yes,
> *this* is what caused the problems starting the 13th.

the problems started on the 12th (see comment 13).

you can see this most clearly via http://bmo.glob.com.au/?year=2013&month=06&report=auth&max=5000 by comparing june with previous months.


what we do know is that it's related to load - outside peak US working times, we're not seeing as much slowdown.  when it does happen, it will be slow for anywhere from a few minutes up to tens of minutes before returning to normal performance for a while.  there's no discernible pattern to these slow periods, nor is it limited to a single webhead.
(In reply to Andreas Gal :gal from comment #29)
> I hope we can move past discussing whether this is a problem or not. Here is
> some data one of my engineers captured:

that's from https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.gaia/pB2mAyrmajg

> On the example bug of https://bugzilla.mozilla.org/show_bug.cgi?id=882994,
> window.performance.timing went from (with no bugzilla extensions installed):
> 
> {
>   "fetchStart": 1,
>   "domainLookupStart": 2,
>   "domainLookupEnd": 2,
>   "connectStart": 1,
>   "connectEnd": 1,
>   "secureConnectionStart": null,
>   "requestStart": 273,
>   "requestEnd": null,
>   "domLoading": 2508,
>   "domInteractive": 2965,
>   "domContentLoadedEventStart": 3029,
>   "domContentLoadedEventEnd": 3084,
>   "loadEventStart": 3414,
>   "loadEventEnd": 3421
> }
> 
> to
> 
> {
>   "fetchStart": 1,
>   "domainLookupStart": 3,
>   "domainLookupEnd": 3,
>   "connectStart": 1,
>   "connectEnd": 1,
>   "secureConnectionStart": null,
>   "requestStart": 274,
>   "requestEnd": null,
>   "domLoading": 21110,
>   "domInteractive": 21357,
>   "domContentLoadedEventStart": 21490,
>   "domContentLoadedEventEnd": 21609,
>   "loadEventStart": 22585,
>   "loadEventEnd": 22594
> }

in this case what changed to cause that slowness was the installation of the BugzillaJS firefox add-on, which isn't an add-on that the bugzilla team maintains.
:glob - thanx for correcting my timeline.

:gal - I think Sylvie was just confirming that even the VP of IT sees the problem. In other words - it's a top-priority issue that we know about, and suffer from.
both bugzilla database vips are on zlb2, which is occasionally being saturated over the day. can we move these to zlb4 to determine if the saturation is causing the timeouts?  it's a very likely cause.
this shows a surge in traffic around the 12th, which corresponds with the bugzilla slowness.  the peaks don't show complete link saturation; however, that's due to the rollup/average period in the graph.
I can correlate casey's data with what is in zlb, showing us that the releng-http VIP has seen massive amounts of traffic since the 12th, likely hitting limitations of the NIC in the zlb node.

I moved the bugzilla db VIPs from zlb2 to zlb4.  This has shown an immediate improvement in response times, so I'm pretty confident that this bug is resolved.  We will need to file other bugs on improving that zlb cluster to handle the needs of releng-http.
Depends on: 887304
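
For reference, a rough way to check whether a NIC on a Linux host is approaching line rate: sample the byte counters in /proc/net/dev twice and convert the deltas to Mbit/s. The 1 Gbit/s threshold below is an assumption, and this is not the monitoring the ZLB nodes actually use.

import time

def read_counters():
    # /proc/net/dev (Linux): two header lines, then one line per interface;
    # field 0 after the colon is rx bytes, field 8 is tx bytes.
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def throughput_mbps(interval=5):
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    for iface, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(iface, (rx1, tx1))
        yield iface, (rx1 - rx0) * 8 / interval / 1e6, (tx1 - tx0) * 8 / interval / 1e6

for iface, rx, tx in throughput_mbps():
    flag = "  <-- close to 1Gbit line rate" if max(rx, tx) > 900 else ""
    print(f"{iface}: rx {rx:.0f} Mbit/s, tx {tx:.0f} Mbit/s{flag}")
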
(In reply to Sheeri Cabral [:sheeri] from comment #33)
> :glob - thanx for correcting my timeline.
> 
> :gal - I think Sylvie was just confirming that even the VP of IT sees the
> problem. In other words - it's a top-priority issue that we know about, and
> suffer from.

correct  ;-)
(In reply to Corey Shields [:cshields] from comment #37)
> I can correlate casey's data with what is in zlb, showing us that the
> releng-http VIP has seen massive amounts of traffic since the 12th, likely
> hitting limitations of the NIC in the zlb node.
> 
> I moved the bugzilla db VIPs from zlb2 to zlb4..  This has shown an
> immediate improvement in response times.  So I'm pretty confident that this
> bug is resolved.  We will need to file others on improving that zlb cluster
> to handle the needs of releng-http.

This is great, Corey and Casey - thanks for this - I hope to hear more about this being resolved.  But what I would like to know is why the releng http load went up so significantly on the 12th and has stayed that way since.
I have added John and Bob from Releng to this bug.
(In reply to SylvieV from comment #39)
> This is great Corey - Casey - thanks for this - hope to hear more about this
> being resolved.  But what I would like to know is what are the reasons the
> releng http loads went up so significantly on the 12th and stayed that way
> since?

This was the point at which the releng web sites were moved from a single VM to the new relengweb cluster, which runs through this zlb node.

We will talk about our troubleshooting process here and how this layer got missed.  It was a bit too obvious to have gone unnoticed for so long.

Also, per glob we are safe to close this bug out.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: WebOps: Bugzilla → Infrastructure
Product: Infrastructure & Operations → bugzilla.mozilla.org