Closed Bug 417313 Opened 16 years ago Closed 16 years ago

graphs.mozilla.org having trouble keeping up

Categories: Webtools Graveyard :: Graph Server (defect)
Type: defect
Priority: Not set
Severity: major
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: anodelman, Assigned: xb95)
Attachments: (2 files)

I'm seeing a lot of the production talos machines timing out on attempts to send data to the graph server.  Can somebody have a look at the box and see if it is stuck on something (space? cpu?)

I just added 6 more machines reporting to the production graph server, so the load has increased recently.
Okay, most of the production talos machines are now burning due to this.  Upgrading to blocking.
Severity: critical → blocker
Assignee: server-ops → thardcastle
mrdb05 caved a little while ago, and I remember seeing IRC traffic mentioning that
the graph server was at fault...
The graph server itself seems healthy enough, but mrdb05 is still loaded by graph server queries.

20-30 copies of this query are executing and it looks like mrdb05 is having trouble keeping up:
SELECT dataset_values.dataset_id, dataset_values.time, dataset_values.value, dataset_branchinfo.bran
Status: NEW → ASSIGNED
Is there anything that we can do here in terms of indexing?  Or some other db magic?  Otherwise I'll start removing machines from production to lighten the load, which is something that I'd rather not do.
OS: Mac OS X → All
Hardware: PC → All
Wait -- is the load due to queries from the graphing client, or from submitting data?  If the load is due to the graphing client, we can lighten that up to reopen the tree.
Or rather, are there lots of getdata.cgi's running, or is it submit.cgi or whatever the other thing's called?
Full query:

SELECT dataset_values.dataset_id, dataset_values.time, dataset_values.value, 
       dataset_branchinfo.branchid, dataset_extra_data.data
FROM dataset_values 
LEFT JOIN dataset_branchinfo 
       ON dataset_values.dataset_id = dataset_branchinfo.dataset_id
          AND dataset_values.time = dataset_branchinfo.time
LEFT JOIN dataset_extra_data
       ON dataset_values.dataset_id = dataset_extra_data.dataset_id
          AND dataset_values.time = dataset_extra_data.time
WHERE dataset_values.dataset_id IN (49684)
ORDER BY dataset_values.dataset_id, dataset_values.time

Note that the dataset_id keeps changing, so the query cache is not buying us
anything here.  Plus, as we are constantly inserting into dataset_values, the
query cache is useless for that table anyway.

The table format is suboptimal because you're having to join against three
tables for data that is all associated with the same index.  So you've got 3x
the index used, 3x the operations required, to get this data.  In reality it
should be one table.  Vlad said something about this being this way because
more data was just tacked on later, etc.

There are several ways we can go about fixing this.  As a first and likely big
win, whatever is doing the queries (the graph server?) should understand how to
use multiple databases.  We have this data replicated, so the graph server
should be able to split its read traffic between multiple databases.  Most of
what is going on is reads, so we could get a big boost to performance by
splitting them up.

Presently you are hitting mrdb05 with all traffic, but if you were to split
half of your read traffic to mrdb06, then we could potentially avoid this.  The
writes still have to be done on both, but those really aren't the problem we're
running into here.

This is something we were talking about the other day.  Is this something that
can be done fairly easily in the existing system?
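To make that concrete, here is a minimal sketch of what read splitting could look like on the graph server side, assuming a MySQLdb-style connection module like graphsdb.py.  The hostnames and credentials are placeholders, not the real production values:

import random
import MySQLdb

# Placeholder hosts and credentials -- not the real production values.
READ_HOSTS = ["mrdb05.placeholder", "mrdb06.placeholder"]
WRITE_HOST = "mrdb05.placeholder"

def connect_for_reads():
    # Spread getdata.cgi-style SELECT traffic across the replicas.
    host = random.choice(READ_HOSTS)
    return MySQLdb.connect(host=host, user="graphs", passwd="placeholder", db="graphs")

def connect_for_writes():
    # Inserts from the uploader still go to the master.
    return MySQLdb.connect(host=WRITE_HOST, user="graphs", passwd="placeholder", db="graphs")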
Okay; this is all graph client stuff.  For now, can we disable graphs.mozilla.org temporarily?  The easiest way to do that is to just move getdata.cgi to something else, getdata-disabled.cgi.  We'll still have the numbers show up on tinderbox, so we won't be blind.
Oh, and kill all running getdata.cgi processes, hopefully to abort their db queries.
To describe in more detail:

mysql> describe dataset_branchinfo;
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| dataset_id | int(11)      | YES  | MUL | NULL    |       |
| time       | int(11)      | YES  |     | NULL    |       |
| branchid   | varchar(255) | YES  |     | NULL    |       |
+------------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe dataset_extra_data;
+------------+---------+------+-----+---------+-------+
| Field      | Type    | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+-------+
| dataset_id | int(11) | YES  | MUL | NULL    |       |
| time       | int(11) | YES  |     | NULL    |       |
| data       | text    | YES  |     | NULL    |       |
+------------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe dataset_values;    
+------------+---------+------+-----+---------+-------+
| Field      | Type    | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+-------+
| dataset_id | int(11) | YES  | MUL | NULL    |       |
| time       | int(11) | YES  | MUL | NULL    |       |
| value      | float   | YES  |     | NULL    |       |
+------------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

These tables all have the same keys (dataset_id, time).  They are all used in the above query in the same way.  Having to join on three tables, using three separate indexes, to pull data like this is kinda rough.  Instead, why not a table with five columns: (dataset_id, time, branchid, data, value)?

Is this data not 1:1?  It seems to be in the queries I've run.  Perhaps the historical data going all the way back isn't, but columns can be set NULL and dealt with programmatically in that case.  For recent data, i.e., all of the rows I've sampled, every column seems to be populated.

This is a win from the database design perspective.  A quicker solution is still going to be having traffic split up among the available databases, though... or I guess as Vlad suggests, just turning it off.
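To illustrate the idea (not an agreed-on schema -- the table name, index name, and column nullability here are just guesses), the merged layout and the resulting single-table query could look roughly like this:

# Hypothetical merged table; names are illustrative only.
MERGED_TABLE_DDL = """
CREATE TABLE dataset_combined (
    dataset_id INT NOT NULL,
    time       INT NOT NULL,
    value      FLOAT,
    branchid   VARCHAR(255),
    data       TEXT,
    KEY idx_dataset_time (dataset_id, time)
)
"""

# The three-way LEFT JOIN collapses into a single index range scan.
GETDATA_QUERY = """
SELECT dataset_id, time, value, branchid, data
FROM dataset_combined
WHERE dataset_id = %s
ORDER BY dataset_id, time
"""

def fetch_dataset(cursor, dataset_id):
    # cursor is a MySQLdb cursor; dataset_id comes from the CGI request.
    cursor.execute(GETDATA_QUERY, (dataset_id,))
    return cursor.fetchall()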
Stealing this bug from Trevor, as I'm working with Vlad and Alice to try to figure out what we can do here.

Also dropping severity to major: this still needs to be addressed ASAP, but the lower severity indicates that everything is running okay right now (at least as I understand it; please correct me if I'm wrong).
Assignee: thardcastle → mark
Severity: blocker → major
Status: ASSIGNED → NEW
please don't turn it off
Adding timestamps around starting/completing sending results to the graph server so that we can get a better idea how long the whole operation takes.
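The actual change is attachment 305860; for anyone skimming, the idea is roughly this, where send_results is a hypothetical stand-in for whatever function run_tests.py uses to post to the graph server:

import time

def timed_send(send_results, results):
    # send_results is a hypothetical stand-in for the real upload call.
    start = time.time()
    print("Starting sending results to graph server: %s" % time.ctime(start))
    send_results(results)
    end = time.time()
    print("Completed sending results: %s (%.1f seconds elapsed)" % (time.ctime(end), end - start))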
Attachment #305860 - Flags: review?(ccooper)
Attachment #305860 - Flags: review?(ccooper) → review?(rhelmer)
Attachment #305860 - Flags: review?(rhelmer) → review+
Checking in run_tests.py;
/cvsroot/mozilla/testing/performance/talos/run_tests.py,v  <--  run_tests.py
new revision: 1.26; previous revision: 1.25
done
Attaching a patch that adds a 'dbr' connection to graphsdb.py and then uses it in getdata.cgi ... this is really simple.  For now, dbr just points at the same database as db, but in production it can point at the proper location.

This may or may not buy us anything, but it will move the graph generation queries to another database, hopefully reducing blocking against the data inserts being done by the uploader.
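For reference, the shape of the change is roughly the following (a sketch only -- the real patch is attachment 306154, and the host names and credentials here are placeholders):

# graphsdb.py (sketch): one read/write connection, one read-only connection.
import MySQLdb

db  = MySQLdb.connect(host="graphs-db", user="graphs", passwd="placeholder", db="graphs")

# For now dbr points at the same place as db; in production it can be
# switched to a replica (e.g. mrdb06) without touching the CGIs.
dbr = MySQLdb.connect(host="graphs-db", user="graphs", passwd="placeholder", db="graphs")

# getdata.cgi (sketch): read-only queries go through dbr.
# from graphsdb import dbr
# cursor = dbr.cursor()
# cursor.execute(sql, params)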
Attachment #306154 - Flags: review?(anodelman)
Comment on attachment 306154 [details] [diff] [review]
use slave database for getdata

This all looks pretty non-controversial.  We should be careful when the changes are pushed that we test things out for a while and be prepared to back out.
Attachment #306154 - Flags: review?(anodelman) → review+
Checking in getdata.cgi;
/cvsroot/mozilla/webtools/new-graph/getdata.cgi,v  <--  getdata.cgi
new revision: 1.12; previous revision: 1.11
done
Checking in graphsdb.py;
/cvsroot/mozilla/webtools/new-graph/graphsdb.py,v  <--  graphsdb.py
new revision: 1.2; previous revision: 1.1
done
This is pushed, and verified.  I believe this resolves most of the resource contention that we were seeing the other day.

Build is fine to go ahead and add more machines to report to the graphs backend.  We will watch for additional pain points, but the database should be able to handle the inserts without any problems at this point, for at least some reasonable amount of additional load.

There are still improvements to be made in the graphing infrastructure, but not in the course of this particular server ops bug.  I will send an email to the relevant parties and the next steps will be determined, at which point I imagine more bugs will be opened to track the progress of that project.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Server Operations → Graph Server
Product: mozilla.org → Webtools
QA Contact: justin → graph.server
Product: Webtools → Webtools Graveyard