Closed Bug 417313 Opened 16 years ago Closed 16 years ago

graphs.mozilla.org having trouble keeping up

Categories: Webtools Graveyard :: Graph Server (defect)
Type: defect
Priority: Not set
Severity: major
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: anodelman, Assigned: xb95)
Attachments: (2 files)

I'm seeing a lot of the production talos machines timing out on attempts to send data to the graph server.  Can somebody have a look at the box and see if it is stuck on something (space? cpu?)

I just added 6 more machines reporting to the production graph server, so the load has increased recently.
Okay, most of the production talos machines are now burning due to this.  Upgrading to blocking.
Severity: critical → blocker
Assignee: server-ops → thardcastle
mrdb05 caved a little while ago, and I remember seeing IRC traffic mentioning that
the graph server was at fault...
The graph server itself seems healthy enough, but mrdb05 is still loaded by graph server queries.

20-30 copies of this query are executing and it looks like mrdb05 is having trouble keeping up:
SELECT dataset_values.dataset_id, dataset_values.time, dataset_values.value, dataset_branchinfo.bran
Status: NEW → ASSIGNED
Is there anything that we can do here in terms of indexing?  Or some other db magic?  Otherwise I'll start removing machines from production to lighten the load, which is something that I'd rather not do.
OS: Mac OS X → All
Hardware: PC → All
Wait -- is the load due to queries from the graphing client, or from submitting data?  If the load is due to the graphing client, we can lighten that up to reopen the tree.
Or rather, are there lots of getdata.cgi's running, or is it submit.cgi or whatever the other thing's called?
Full query:

SELECT dataset_values.dataset_id, dataset_values.time, dataset_values.value, 
       dataset_branchinfo.branchid, dataset_extra_data.data
FROM dataset_values 
LEFT JOIN dataset_branchinfo 
       ON dataset_values.dataset_id = dataset_branchinfo.dataset_id
          AND dataset_values.time = dataset_branchinfo.time
LEFT JOIN dataset_extra_data
       ON dataset_values.dataset_id = dataset_extra_data.dataset_id
          AND dataset_values.time = dataset_extra_data.time
WHERE dataset_values.dataset_id IN (49684)
ORDER BY dataset_values.dataset_id, dataset_values.time

Note that the dataset_id keeps changing, so the query cache is not buying us
anything here.  Plus, as we are constantly inserting into dataset_values, the
query cache is useless for that table anyway.

The table format is suboptimal because you're having to join against three
tables for data that is all associated with the same index.  So you've got 3x
the index used, 3x the operations required, to get this data.  In reality it
should be one table.  Vlad said something about this being this way because
more data was just tacked on later, etc.

There are several ways we can go about fixing this.  As a first and likely big
win, whatever is doing the queries (the graph server?) should understand how to
use multiple databases.  We have this data replicated, so the graph server
should be able to split its read traffic between multiple databases.  Most of
what is going on is reads, so we could get a big boost to performance by
splitting them up.

Presently you are hitting mrdb05 with all traffic, but if you were to split
half of your read traffic to mrdb06, then we could potentially avoid this.  The
writes still have to be done on both, but those really aren't the problem we're
running into here.

This is something we were talking about the other day.  Is this something that
can be done fairly easily in the existing system?
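To make that concrete, here is a minimal sketch of what read splitting could look like on the graph server side, assuming a MySQLdb-style connection module like graphsdb.py.  The hostnames and credentials are placeholders, not the real production values:

import random
import MySQLdb

# Placeholder hosts and credentials -- not the real production values.
READ_HOSTS = ["mrdb05.placeholder", "mrdb06.placeholder"]
WRITE_HOST = "mrdb05.placeholder"

def connect_for_reads():
    # Spread getdata.cgi-style SELECT traffic across the replicas.
    host = random.choice(READ_HOSTS)
    return MySQLdb.connect(host=host, user="graphs", passwd="placeholder", db="graphs")

def connect_for_writes():
    # Inserts from the uploader still go to the master.
    return MySQLdb.connect(host=WRITE_HOST, user="graphs", passwd="placeholder", db="graphs")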
Okay; this is all graph client stuff.  For now, can we disable graphs.mozilla.org temporarily?  The easiest way to do that is to just move getdata.cgi to something else, getdata-disabled.cgi.  We'll still have the numbers show up on tinderbox, so we won't be blind.
Oh, and kill all running getdata.cgi processes, hopefully to abort their db queries.
To describe in more detail:

mysql> describe dataset_branchinfo;
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| dataset_id | int(11)      | YES  | MUL | NULL    |       |
| time       | int(11)      | YES  |     | NULL    |       |
| branchid   | varchar(255) | YES  |     | NULL    |       |
+------------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe dataset_extra_data;
+------------+---------+------+-----+---------+-------+
| Field      | Type    | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+-------+
| dataset_id | int(11) | YES  | MUL | NULL    |       |
| time       | int(11) | YES  |     | NULL    |       |
| data       | text    | YES  |     | NULL    |       |
+------------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe dataset_values;    
+------------+---------+------+-----+---------+-------+
| Field      | Type    | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+-------+
| dataset_id | int(11) | YES  | MUL | NULL    |       |
| time       | int(11) | YES  | MUL | NULL    |       |
| value      | float   | YES  |     | NULL    |       |
+------------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

These tables all have the same keys (dataset_id, time).  They are all used in the above query in the same way.  Having to join on three tables, using three separate indexes, to pull data like this is kinda rough.  Instead, why not a table with five columns: (dataset_id, time, branchid, data, value)?

Is this data not 1:1?  It seems to be in the queries I've run.  Perhaps the historical data going all the way back isn't, but columns can be set NULL and dealt with programmatically in that case.  For recent data, i.e., all of the rows I've sampled, every column seems to be populated.

This is a win from the database design perspective.  A quicker solution is still going to be having traffic split up among the available databases, though... or I guess as Vlad suggests, just turning it off.
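To illustrate the idea (not an agreed-on schema -- the table name, index name, and column nullability here are just guesses), the merged layout and the resulting single-table query could look roughly like this:

# Hypothetical merged table; names are illustrative only.
MERGED_TABLE_DDL = """
CREATE TABLE dataset_combined (
    dataset_id INT NOT NULL,
    time       INT NOT NULL,
    value      FLOAT,
    branchid   VARCHAR(255),
    data       TEXT,
    KEY idx_dataset_time (dataset_id, time)
)
"""

# The three-way LEFT JOIN collapses into a single index range scan.
GETDATA_QUERY = """
SELECT dataset_id, time, value, branchid, data
FROM dataset_combined
WHERE dataset_id = %s
ORDER BY dataset_id, time
"""

def fetch_dataset(cursor, dataset_id):
    # cursor is a MySQLdb cursor; dataset_id comes from the CGI request.
    cursor.execute(GETDATA_QUERY, (dataset_id,))
    return cursor.fetchall()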
Stealing this bug from Trevor, as I'm working with Vlad and Alice to try to figure out what we can do here.

Also dropping severity to major: this still needs to be addressed ASAP, but the lower severity indicates that everything is running okay right now (at least as I understand it; please correct me if I'm wrong).
Assignee: thardcastle → mark
Severity: blocker → major
Status: ASSIGNED → NEW
please don't turn it off
Adding timestamps around starting/completing sending results to the graph server so that we can get a better idea how long the whole operation takes.
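The actual change is attachment 305860; for anyone skimming, the idea is roughly this, where send_results is a hypothetical stand-in for whatever function run_tests.py uses to post to the graph server:

import time

def timed_send(send_results, results):
    # send_results is a hypothetical stand-in for the real upload call.
    start = time.time()
    print("Starting sending results to graph server: %s" % time.ctime(start))
    send_results(results)
    end = time.time()
    print("Completed sending results: %s (%.1f seconds elapsed)" % (time.ctime(end), end - start))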
Attachment #305860 - Flags: review?(ccooper)
Attachment #305860 - Flags: review?(ccooper) → review?(rhelmer)
Attachment #305860 - Flags: review?(rhelmer) → review+
Checking in run_tests.py;
/cvsroot/mozilla/testing/performance/talos/run_tests.py,v  <--  run_tests.py
new revision: 1.26; previous revision: 1.25
done
Attaching a patch that adds a 'dbr' connection to graphsdb.py and then uses it in getdata.cgi ... this is really simple.  For now, dbr just points at the same database as db, but in production it can point at the proper location.

This may or may not buy us anything, but it will move the graph generation queries to another database, hopefully reducing blocking against the data inserts being done by the uploader.
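For reference, the shape of the change is roughly the following (a sketch only -- the real patch is attachment 306154, and the host names and credentials here are placeholders):

# graphsdb.py (sketch): one read/write connection, one read-only connection.
import MySQLdb

db  = MySQLdb.connect(host="graphs-db", user="graphs", passwd="placeholder", db="graphs")

# For now dbr points at the same place as db; in production it can be
# switched to a replica (e.g. mrdb06) without touching the CGIs.
dbr = MySQLdb.connect(host="graphs-db", user="graphs", passwd="placeholder", db="graphs")

# getdata.cgi (sketch): read-only queries go through dbr.
# from graphsdb import dbr
# cursor = dbr.cursor()
# cursor.execute(sql, params)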
Attachment #306154 - Flags: review?(anodelman)
Comment on attachment 306154 [details] [diff] [review]
use slave database for getdata

This all looks pretty non-controversial.  We should be careful when the changes are pushed that we test things out for a while and be prepared to back out.
Attachment #306154 - Flags: review?(anodelman) → review+
Checking in getdata.cgi;
/cvsroot/mozilla/webtools/new-graph/getdata.cgi,v  <--  getdata.cgi
new revision: 1.12; previous revision: 1.11
done
Checking in graphsdb.py;
/cvsroot/mozilla/webtools/new-graph/graphsdb.py,v  <--  graphsdb.py
new revision: 1.2; previous revision: 1.1
done
This is pushed, and verified.  I believe this resolves most of the resource contention that we were seeing the other day.

Build is fine to go ahead and add more machines to report to the graphs backend.  We will watch for additional pain points, but the database should be able to handle the inserts without any problems at this point, for at least some reasonable amount of additional load.

There are still improvements to be made in the graphing infrastructure, but not in the course of this particular server ops bug.  I will send an email to the relevant parties and the next steps will be determined, at which point I imagine more bugs will be opened to track the progress of that project.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Server Operations → Graph Server
Product: mozilla.org → Webtools
QA Contact: justin → graph.server
Product: Webtools → Webtools Graveyard