Finish running OPTIMIZE on tbpl_mozilla_org

RESOLVED FIXED

Status

Data & BI Services Team
DB: MySQL
--
major
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: Unfocused, Assigned: mpressman)

Tracking

other
All
Other
Bug Flags:
needs-treeclosure ?

Details

Attachments

(1 attachment)

We can't seem to be able to access any build summaries or logs (via TBPL), which is kind of a big issue :\
Blair has closed mozilla-central and mozilla-inbound for this, maybe more.

eg https://tbpl.mozilla.org/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound
the B for 'Win opt' on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=eb4f2872a6ca

The connection to tbpl stays open for multiple minutes waiting for a response, until Firefox says 'The connection was reset'.

AFAICT the RelEng parts of this are working fine. In http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz we have
    {
      "builder_id": 37827,
      "buildnumber": 2472,
      "endtime": 1343283972,
      "id": 13864233,
      "master_id": 31,
      "properties": {
        "basedir": "/e/builds/moz2_slave/m-in-w32",
        "branch": "mozilla-inbound",
        "builddir": "m-in-w32",
        "buildername": "WINNT 5.2 mozilla-inbound build",
        "buildid": "20120725222657",
        "buildnumber": 2472,
        "builduid": "03cd21a2173c468093d3aeac5379642d",
        "comments": "Bug 746883; fix the bustage possibly caused by 746896. r=mattwoodrow",
        "forced_clobber": false,
        "got_revision": "eb4f2872a6ca",
        "hashType": "sha512",
        "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1343280417/mozilla-inbound-win32-bm12-build1-build2472.txt.gz",
        "master": "http://buildbot-master12.build.scl1.mozilla.com:8001/",
        "periodic_clobber": false,
        "platform": "win32",
        "product": "firefox",
        "project": "",
        "purge_actual": "57.11GB",
        "purge_target": "12GB",
        "purged_clobber": false,
        "repository": "",
        "request_ids": [
          13912112
        ],
        "revision": "eb4f2872a6ca9294d642e798b202699eef9443b2",
        "scheduler": "mozilla-inbound",
        "slavebuilddir": "m-in-w32",
        "slavename": "w64-ix-slave12",
        "stage_platform": "win32",
        "toolsdir": "e:/builds/moz2_slave/m-in-w32/tools"
      },
      "reason": "scheduler",
      "request_ids": [
        13912112
      ],
      "requesttime": 1343280417,
      "result": 2,
      "slave_id": 1308,
      "starttime": 1343280432
    },

The log_url gets a 200 response and downloads fine. I think you'll need to poke at the tbpl logs to find out what's happening.
Assignee: server-ops-releng → server-ops-devservices
Component: Server Operations: RelEng → Server Operations: Developer Services
QA Contact: arich → shyam
(In reply to Nick Thomas [:nthomas] from comment #1)
> Blair has closed mozilla-central and mozilla-inbound for this, maybe more.

Also Fx-team.
Looking.
Assignee: server-ops-devservices → shyam
Seems like an ongoing attack of sorts :

10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=8c57277452b9&_=1343292500026 HTTP/1.1" 200 1196 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"
10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=9702de67ae1e&_=1343292500030 HTTP/1.1" 200 3022 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"
10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=b6b254f6c496&_=1343292500028 HTTP/1.1" 200 34265 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"

and top hits :

[root@pp-zlb09.phx ~]# grep tbpl /var/log/zeus/generic.access_2012-07-26-01 | awk {'print $1'} | sort -n | uniq -c | sort -nr
   7637 88.114.144.236
   7525 58.85.248.221
   4586 71.58.64.76
   4059 107.10.247.225
   3429 63.245.220.240
   2558 174.0.224.76
   1742 66.207.208.98
   1595 63.235.13.3
   1083 99.115.82.101
OTOH, there's also this :

[shyam@natasha ~]$ curl -H "Host: tbpl.mozilla.org" -v "http://generic1.webapp.phx1.mozilla.com:81/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound"
* About to connect() to generic1.webapp.phx1.mozilla.com port 81
*   Trying 10.8.81.91... connected
* Connected to generic1.webapp.phx1.mozilla.com (10.8.81.91) port 81
> GET /php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Accept: */*
> Host: tbpl.mozilla.org
> 
< HTTP/1.1 200 OK
< Date: Thu, 26 Jul 2012 08:53:52 GMT
< Server: Apache
< X-Backend-Server: generic1.webapp.phx1.mozilla.com
< Strict-Transport-Security: max-age=15768000 ; includeSubDomains
< Access-Control-Allow-Origin: *
< Cache-Control: max-age=900
< Expires: Thu, 26 Jul 2012 09:08:52 GMT
< Content-Length: 63
< Content-Type: text/html, charset=utf-8
Connection #0 to host generic1.webapp.phx1.mozilla.com left intact
* Closing connection #0
SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

The DB seems to have a ton of queries that are locking a table, investigating further.
Created attachment 646069 [details]
show full processlist from generic2.db.phx1

generic1.db.phx1 is disabled in Zeus, generic2 is locked up as shown above. This would probably be the primary cause of tbpl misbehaving at this time.
Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all?
(In reply to Ed Morley [:edmorley] from comment #7)
> Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all?

It does, I wonder if the master/slave relationship was changed. If not, generic2 is probably replicating all the changes made to generic1, but then generic2 wouldn't be r/w...

I've paged the DBAs. Sheeri is probably on PTO, but mpressman should be around. This is clearly a DB related issue that's causing tbpl some pain.
Assignee: shyam → server-ops-database
Component: Server Operations: Developer Services → Server Operations: Database
QA Contact: shyam → cshields
Depends on: 703967
As noted on IRC, the culprit is:

|   61367 | system user     |                   | tbpl_mozilla_org     | Connect     |  53247 | copy to tmp table                                              | OPTIMIZE TABLE runs_logs                                                                             |


That statement has been running for 53247s, i.e. the better part of a day, and has locked the whole table against writes for that period. Killing it should unlock everything. (Shyam said he just killed it.)
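For reference, a minimal sketch of how one might spot and kill a long-running statement like this (the thread id 61367 is the one from the processlist above; `INFORMATION_SCHEMA.PROCESSLIST` assumes MySQL 5.1.7 or later):

```sql
-- Find statements that have been running for more than an hour.
SELECT id, user, db, time, state, info
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE time > 3600
ORDER BY time DESC;

-- Kill the offending thread. For an in-progress OPTIMIZE/ALTER on InnoDB
-- this discards the temporary table copy and releases the write lock.
KILL 61367;
```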
From IRC :

[02:16:51] <AryehGregor> fox2mike, |   61367 | system user     |                   | tbpl_mozilla_org     | Connect     |  53247 | copy to tmp table | OPTIMIZE TABLE runs_logs |
[02:17:09]  * Optimizer fox2mike: Carry on soldier. Optimizer out!
[02:17:10] <AryehGregor> For InnoDB that's equivalent to ALTER TABLE.  Which takes out a write lock on the whole table.
[02:17:21] <AryehGregor> For as long as it takes to rebuild the entire table and its indexes from scratch.
[02:17:26] <AryehGregor> So, um, find what's doing that and stop it.  :)
[02:17:34] <fox2mike> system user
[02:17:41] <AryehGregor> Possibly a cron job or something.
[02:17:42] <fox2mike> isn't that internal mysql?
[02:17:53] <AryehGregor> Something has to be running the query, MySQL should not be running that itself.
[02:18:02] <AryehGregor> Although hmm, you're right that it's odd.
[02:18:07] <fox2mike> yeah
[02:18:11] <fox2mike> if something else is
[02:18:13] <Unfocused> Optimizer: that code doesn't look wrong, fwiw
[02:18:15] <fox2mike> we'd have that user.
[02:18:20] <fox2mike> not system user
[02:20:14] <AryehGregor> fox2mike, as a workaround, can you try killing the OPTIMIZE TABLE thread?
[02:27:30] <AryehGregor> fox2mike, you won't corrupt it.  OPTIMIZE on InnoDB is the same as ALTER, which works by locking the table, making a copy, and then atomically overwriting the original table with the copy.  If you kill it, it will just delete the copy and unlock the original.

Therefore I ran 

mysql> kill 61367;

And things are coming back to normal. 

I'm leaving this open to see if we need to re-run this optimize or to see what caused this. Lowering prio.
Severity: critical → major
No longer depends on: 703967
If you do need to optimize things, I suggest you run OPTIMIZE LOCAL TABLE instead of regular OPTIMIZE TABLE, so it's not replicated.  I was told the setup is master-master, so you can optimize the passive master while the active master handles all load, then switch the masters and optimize the other one.  The fact that this was running as "system user" indicates (thanks, #mysql!) that it was replicated -- which was possibly unintentional, but not very good for performance either way.
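The suggested approach can be sketched as follows (`LOCAL` and `NO_WRITE_TO_BINLOG` are synonyms; both suppress binary logging so the statement is not replicated to the other master):

```sql
-- Run on the passive master only; not written to the binlog,
-- so the active master never sees it.
OPTIMIZE LOCAL TABLE tbpl_mozilla_org.runs_logs;

-- Equivalent spelling:
OPTIMIZE NO_WRITE_TO_BINLOG TABLE tbpl_mozilla_org.runs_logs;
```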
Blocks: 703967
(Assignee) Comment 12, 5 years ago
Shyam is correct that the setup has changed from master-master to master-slave. Replication is still configured on generic1, but it is currently stopped there due to an error. It also appears that read_only is on for generic1, making sure that writes cannot be done on it, so it effectively presents itself as the slave.
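For the record, a sketch of the checks that would confirm this state (the specific output fields to look at are noted in comments):

```sql
-- On generic1: is replication running, and did it stop on an error?
-- Check Slave_IO_Running, Slave_SQL_Running, and Last_Error in the output.
SHOW SLAVE STATUS\G

-- Is the host refusing writes (i.e. behaving as a slave)?
SHOW VARIABLES LIKE 'read_only';
```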
(Assignee) Comment 13, 5 years ago
Also, comment 10 is correct in that a full table lock is acquired during an OPTIMIZE/ALTER TABLE operation. The table tbpl_mozilla_org.runs_logs is ~50GB on disk, which is why it was taking so long.
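A quick way to confirm a table's on-disk footprint is a query against information_schema (note these length figures are estimates for InnoDB, not exact file sizes):

```sql
-- Approximate size of the table (data + indexes), in GB.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = 'tbpl_mozilla_org'
  AND table_name = 'runs_logs';
```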
Is this still being actively worked? I'm _sometimes_ able to get summary info on TBPL, but opening a full log always fails with "Timeout generating log." This is pretty terrible. :(
(In reply to Justin Dolske [:Dolske] from comment #14)
> Is this still being actively worked? I'm _sometimes_ able to get summary
> info on TBPL, but opening a full log always fails with "Timeout generating
> log." This is pretty terrible. :(

Is this always on the same log?

As I added to the TryServer TBPL status message, "Timeout generating log." is unfortunately cached in the DB, so logs imported at the time of the issue won't recover. You can either access the logs directly (link in TBPL status message) or retrigger.
Just tested opening a variety of full logs via TBPL and wfm, so I suspect you are indeed trying a subset of the logs affected at the time of the issue.
Ah, I read that as meaning it was locally-cached and a shift-reload should clear it up!
(Assignee) Comment 18, 5 years ago
Just got an alert about this. If the table tbpl_mozilla_org.runs_logs needs to be *optimized*, I can tune the host to prioritize that and it should run a little faster. Additionally, we are using a little bit of swap due to assigning all the system memory to the innodb_buffer_pool; changing that will require a daemon restart. We are going to want to change the tuning options anyway, so when would be a good time to perform the restart and the maintenance?
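The buffer-pool change described here is a my.cnf setting, and on MySQL of this era it requires a server restart (online resizing of the buffer pool only arrived in 5.7.5). A hypothetical config fragment, with illustrative values rather than the actual production file:

```ini
# /etc/my.cnf -- illustrative, not the real production config
[mysqld]
# Leave headroom for the OS and per-connection buffers so the host
# does not dip into swap; was 12G, reduced to 8G in this bug.
innodb_buffer_pool_size = 8G
```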
Assignee: server-ops-database → mpressman
John/Ravi,

Going to tag this along with this downtime; we'd like to finish running the killed OPTIMIZE, and Matt wants to adjust some more tuning options too.
Flags: needs-treeclosure?
Summary: Unable to access build summaries and logs → Finish running OPTIMIZE on tbpl_mozilla_org

Comment 20, 5 years ago
Please update https://infra.etherpad.mozilla.org/81
(In reply to Ravi Pina [:ravi] from comment #20)
> Please update https://infra.etherpad.mozilla.org/81

I already have (after I made comment #19).
(Assignee) Comment 22, 5 years ago
tuning changes made to host and puppet
(Assignee) Comment 23, 5 years ago
The ALTER is finished; tbpl_mozilla_org.runs_logs is down from ~50GB to ~44GB. Additionally, innodb_buffer_pool was reduced from 12GB to 8GB.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team