Closed Bug 1125410 Opened 9 years ago Closed 9 years ago

Extremely slow job ingestion due to process-objects tasks being blocked on MySQL system locks

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Assigned: scabral)

References

Details

(Whiteboard: [data:consultative])

This try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=04642793714f

Has jobs listed as still running that are shown as completed here:
https://tbpl.mozilla.org/?tree=Try&rev=04642793714f

Something must be up with builds-4hr ingestion.

However the buildapi queue isn't backlogged:

[emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
Listing queues ...
buildapi        0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox  0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox  0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox      0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox   0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox  0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox   0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox   0
celeryev.01fa73d6-2404-4750-9779-8a4b8d45cb3c   0
celeryev.0eb8f7ba-8dc3-40f6-8957-78dd0d5fe68c   0
celeryev.262ab685-fce8-4110-86bb-1eaa6c4ba227   0
celeryev.2cca35cd-17c6-44dd-b3dd-7b02768a3eda   0
celeryev.49b12d47-2d91-403a-a6c9-aa2ff57bcb21   0
celeryev.54c84e3a-5169-45f7-bb72-a9a5ba478406   0
celeryev.6a47b627-9084-4044-9dbe-081183b5bb49   0
celeryev.872da14b-d941-4fc9-b874-5d802fbbed0c   0
celeryev.b2601f19-7c12-4fef-8dc2-043495dd5f84   0
default 3746
high_priority   0
log_parser      31929
log_parser_fail 0
log_parser_hp   0
pushlog 0
...done.

...and New Relic doesn't show any spike in errors that I can see:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors

The jobs in question finished in the last hour or so, so should still be in builds-4hr.
Could one of you look into this? We have less than 3 hours before the jobs are no longer in builds-4hr.
Flags: needinfo?(mdoglio)
Flags: needinfo?(cdawson)
I tried restarting the etl workers, to no effect.
The JSON from builds-4hr for a job that wasn't ingested successfully:

    {
      "builder_id": 256181, 
      "buildnumber": 469, 
      "endtime": 1422053135, 
      "id": 57302473, 
      "master_id": 80, 
      "properties": {
        "basedir": "/builds/slave/test", 
        "blobber_files": "{\"browser-chrome-chunked_raw.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/75f752e40c45fda90733aeb2ef77a2429dca37997d0652bb1a3c3d4121af149208dec939939a3a05e9ae3babeca80b4f17e91180bac4562277080fd03bbf14af\"}", 
        "branch": "try", 
        "build_url": "https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-04642793714f/try-linux64-asan/firefox-38.0a1.en-US.linux-x86_64-asan.tar.bz2", 
        "buildername": "Ubuntu ASAN VM 12.04 x64 try opt test mochitest-browser-chrome-1", 
        "buildid": "20150123105547", 
        "buildnumber": 469, 
        "builduid": "6dec03cde9df40b0b8bd2c546e076158", 
        "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-04642793714f/try-linux64-asan/try_ubuntu64-asan_vm_test-mochitest-browser-chrome-1-bm67-tests1-linux64-build469.txt.gz", 
        "master": "http://buildbot-master67.srv.releng.use1.mozilla.com:8201/", 
        "pgo_build": "False", 
        "platform": "linux64-asan", 
        "product": "firefox", 
        "project": "", 
        "repo_path": "try", 
        "repository": "", 
        "request_ids": [
          60052513
        ], 
        "request_times": {
          "60052513": 1422042025
        }, 
        "revision": "04642793714f93858f466a795fbf4fc83ded3b15", 
        "scheduler": "tests-try-ubuntu64-asan_vm-opt-unittest", 
        "script_repo_revision": "3ef2559bba65", 
        "script_repo_url": "https://hg.mozilla.org/build/mozharness", 
        "slavebuilddir": "test", 
        "slavename": "tst-linux64-spot-137", 
        "stage_platform": "linux64-asan"
      }, 
      "reason": "scheduler", 
      "request_ids": [
        60052513
      ], 
      "requesttime": 1422042025, 
      "result": 0, 
      "slave_id": 5551, 
      "starttime": 1422049501
    }, 

And from one that was fine:

    {
      "builder_id": 189655, 
      "buildnumber": 1219, 
      "endtime": 1422048872, 
      "id": 57297590, 
      "master_id": 74, 
      "properties": {
        "basedir": "/builds/slave/test", 
        "blobber_files": "{\"plain-chunked_raw.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/a841e290cf90580a288e8913f0497d64de6fd2c6a36cfe1009db33198b8b345c6eb31139b4143304896571a98710f2a93df467224df35ba7361ed174bb17d4b4\"}", 
        "branch": "try", 
        "build_url": "https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-04642793714f/try-linux64-asan/firefox-38.0a1.en-US.linux-x86_64-asan.tar.bz2", 
        "buildername": "Ubuntu ASAN VM 12.04 x64 try opt test mochitest-2", 
        "buildid": "20150123105547", 
        "buildnumber": 1219, 
        "builduid": "6dec03cde9df40b0b8bd2c546e076158", 
        "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-04642793714f/try-linux64-asan/try_ubuntu64-asan_vm_test-mochitest-2-bm53-tests1-linux64-build1219.txt.gz", 
        "master": "http://buildbot-master53.srv.releng.usw2.mozilla.com:8201/", 
        "pgo_build": "False", 
        "platform": "linux64-asan", 
        "product": "firefox", 
        "project": "", 
        "repo_path": "try", 
        "repository": "", 
        "request_ids": [
          60052522
        ], 
        "request_times": {
          "60052522": 1422042025
        }, 
        "revision": "04642793714f93858f466a795fbf4fc83ded3b15", 
        "scheduler": "tests-try-ubuntu64-asan_vm-opt-unittest", 
        "script_repo_revision": "3ef2559bba65", 
        "script_repo_url": "https://hg.mozilla.org/build/mozharness", 
        "slavebuilddir": "test", 
        "slavename": "tst-linux64-spot-379", 
        "stage_platform": "linux64-asan"
      }, 
      "reason": "scheduler", 
      "request_ids": [
        60052522
      ], 
      "requesttime": 1422042025, 
      "result": 0, 
      "slave_id": 5884, 
      "starttime": 1422046851
    },
I can't see any differences between those two.

I can't see anything major in /var/log/celery/celery_worker_buildapi.log; there is the odd error, but it looks like the kind of thing we see periodically.
Discussed with camd in #treeherder - I tried a |restart-jobs -p all|, but the issue persists.


The celerybeat log on rabbitmq1:
/var/log/celery/celerybeat.log
[2015-01-23 15:58:00,055: INFO/MainProcess] Scheduler: Sending due task fetch-buildapi-build4h-every-3-minute (fetch-buildapi-build4h)
[2015-01-23 15:58:00,056: INFO/MainProcess] Scheduler: Sending due task fetch-push-logs-every-minute (fetch-push-logs)
[2015-01-23 15:58:00,058: INFO/MainProcess] Scheduler: Sending due task fetch-buildapi-running-every-minute (fetch-buildapi-running)
[2015-01-23 15:58:00,063: INFO/MainProcess] Scheduler: Sending due task fetch-buildapi-pending-every-minute (fetch-buildapi-pending)

...shows the tasks are being scheduled.

I've checked the buildapi ingestion logs on etl[12]:
/var/log/celery/celery_worker_buildapi.log
... and see things like:
[2015-01-23 15:58:14,947: INFO/Worker-1] collection loading request: https://treeherder.mozilla.org/api/project/try/objectstore/
... which looks normal to me.

There are a lot of:
[2015-01-23 15:55:00,949: WARNING/Worker-4] property 'revision' not found in build4h
[2015-01-23 15:55:00,949: WARNING/Worker-4] property 'revision' not found in build4h
[2015-01-23 15:55:00,949: WARNING/Worker-4] skipping job on unsupported branch idle
[2015-01-23 15:55:00,949: WARNING/Worker-4] skipping job on unsupported branch idle

...but they should just be from unsupported job types (eg fuzzer, release builds, l10n etc).

I can only think that for some reason we think we've already ingested them and so are skipping them, or ...?
I noticed that MySQL was hung up on waiting for system locks for queries on the objectstore table in the mozilla_inbound_objectstore_1 db, but only that table:

18162659 | th_admin | 10.22.70.209:28287 | mozilla_inbound_objectstore_1 | Query | 7 | System lock | UPDATE `objectstore` SET `processed_state` = 'loading', `worker_id` = CONNECTION_ID() WHERE `processed_state` = 'ready' AND `error` = 'N' ORDER BY `id` LIMIT 100 

I could do a "flush tables" to push it through, but things kept getting stuck. I did a "check table" but things looked good, but we tried an OPTIMIZE TABLE anyway and things seem to be unstuck now.
As of now, things look like they're continuing to catch up.  I'm going to keep an eye on it, but for now I think we're going to be ok.

Thank you, sheeri!
Flags: needinfo?(cdawson)
Still not doing great, however.  Checking New Relic, our normal throughput for objectstore updates is around 50 requests per minute.  At this point, we're doing more like 15.  We still have no idea why.

https://rpm.newrelic.com/accounts/677903/applications/4180461/databases#id=570791484
The process-objects transaction (which shows the DB components as well as the rest), over 24 hours:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1422078098&tw[start]=1421991698&type=all#id=570791501

...it's quite clear how significant the DB slowdown is in comparison to the rest of the response time.

For the last 60 mins:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1422078225&tw[start]=1422074625&type=all#id=570791501

Does look pretty similar to how it was before (seems quite spiky).
Adjusting summary to my understanding of the situation - Sheeri please correct if needed.

So my understanding from the Vidyo call was that we still don't know why an OPTIMIZE TABLE was required?

There doesn't seem to have been a code change that caused a problem, and the size of the objectstores should be fairly constant (given we've reduced the data lifecycle to 4 months) - so it's not like the tables have ballooned to the point where they are now hitting perf issues.
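
(As a sanity check on that, the table sizes can be read straight from information_schema - a minimal query, with the schema/table names taken from the queries elsewhere in this bug; InnoDB's row and size figures here are approximate:)

    SELECT table_rows,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM information_schema.TABLES
    WHERE table_schema = 'mozilla_inbound_objectstore_1'
      AND table_name = 'objectstore';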

Yes, we're not using the slave DB at all (!!) - but this has been the case since the beginning. Fixing that (bug 1125464) will reduce load/locks in general, but isn't the root cause here.

Sheeri did mention that a hardware problem with the disk on treeherder1 (master) was a possibility.

The only other thing I can think of is a combination of:

* Taskcluster started submitting more data yesterday (though apparently not a drastic increase).
* This puts extra load on both jobs ingestion and log parsing.
* Yesterday morning two of the three log parsers became disconnected (bug 1125124), causing a 40,000 log parsing job backlog.
* Each log parse requires a fair few inserts into the jobs artefacts tables, plus each bug suggestion search hits the bugscache (the latter had a massive spike from <200rpm to 1000rpm: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1422116852&tw[start]=1421857652&type=all#id=762419894&sort_by=throughput - and the bugscache queries appear at the top of the throughput table for all DB queries for that timespan).
* Had we been using the slave for the bugscache queries (and the rest; bug 1125464), the master would have been spared much of this load.
* This spike in log artefact inserts and bugscache selects caused extra load on the DB master, which caused issues with the objectstore inserts...?

But other than that, I'm at a loss :-(
Depends on: 1125464
Summary: Completed jobs missing from Try push → Extremely slow job ingestion due to process-objects inserts being blocked on MySQL system locks
Summary: Extremely slow job ingestion due to process-objects inserts being blocked on MySQL system locks → Extremely slow job ingestion due to process-objects tasks being blocked on MySQL system locks
Depends on: 1125569
Flags: needinfo?(mdoglio)
Correct, we don't know for sure why OPTIMIZE TABLE helped. The facts were that an InnoDB table would hang with a "System lock", but no other queries were using that table (as per SHOW ENGINE INNODB STATUS, which shows innodb internals). I tried using "flush tables" to see if that would help, and it released whatever lock was on the table.

Given that, it seemed like it had something to do with difficulty reading the table. My 2 ideas then were that there was something wrong with the table in InnoDB itself, or that there was something wrong with the disk or memory. We restarted MySQL, which would have fixed any memory issues, so then we rebuilt the InnoDB table to see if that worked.

But this is guessing based on facts, not actually finding a definite root cause. Since it worked we might deduce that it was in fact something "bad" in the InnoDB table, but corruption usually ends up crashing the table/MySQL server, so that's not likely (though it is possible).

I would say that since rebuilding the InnoDB table using OPTIMIZE TABLE worked, it was probably something "bad" in InnoDB and not an underlying disk issue. Of course, if it's a bad memory sector or bad disk sector, it may crop up again in another form. So there's no definite answer, even though we're out of the woods on the actual issue.
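
(For reference, the command sequence described above, roughly - note that on InnoDB an OPTIMIZE TABLE is effectively a full table rebuild:)

    -- Look for locks/transactions touching the table (nothing else showed up here)
    SHOW ENGINE INNODB STATUS\G
    SHOW FULL PROCESSLIST;

    -- Close and reopen all tables, releasing whatever lock was held
    FLUSH TABLES;

    -- Verify integrity, then rebuild the table
    CHECK TABLE mozilla_inbound_objectstore_1.objectstore;
    OPTIMIZE TABLE mozilla_inbound_objectstore_1.objectstore;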
Whiteboard: [data:consultative]
Depends on: 1126943
Thank you for the extra info - sounds like there's nothing more we should do here for now - if it reoccurs we can dig in more :-)
Assignee: nobody → scabral
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
This seems to be happening again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Lots of jobs in the inbound objectstore waiting to be ingested:
http://pastebin.mozilla.org/8427532

Earliest loaded_timestamp is 1422564682, which corresponds to 01/29/2015 @ 8:51pm (UTC).
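
(For the record, the same thing can be checked directly on the master - a quick query, schema per the later comments in this bug; the session time zone is set so FROM_UNIXTIME reports UTC:)

    SET time_zone = '+00:00';
    -- 1422564682 -> 2015-01-29 20:51:22 UTC
    SELECT COUNT(*) AS ready_rows,
           FROM_UNIXTIME(MIN(loaded_timestamp)) AS oldest_loaded_utc
    FROM mozilla_inbound_objectstore_1.objectstore
    WHERE processed_state = 'ready';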

Looking at the process-objects transaction, there's a very clear drop-off in throughput and an increase in call time starting just over 3 hours ago:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1422573208&tw[start]=1422551608&type=Celery#id=570791501

Here's an example slow process-objects task, taking 27s:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&type=Celery#id=570791501&app_trace_id=5719247377&tab-detail_5719247377=trace_details
(it will take a while to fully load - the first view you see will change)

The DB server CPU usage got a lot lower (https://rpm.newrelic.com/accounts/263620/servers/5241973) and I don't see anything obvious on the MySQL plugin page (https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/2) other than a spike in slow queries. (But I see both of these as more of a symptom than the cause.)
Another one, 62s:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&type=Celery#id=570791501&app_trace_id=5720065903

Unfortunately we can't see the request attributes (bug 1124278, sigh), so we can't easily tell which databases' tables are having the slow queries.
Depends on: 1124278
Depends on: 1127614
Things I tried:
* restart-jobs -p all to restart services on all nodes
* purging the 'default' rabbitmq queue, which is the one that has the process-objects tasks (amongst others) - nothing is lost by purging it, since it only contains periodically scheduled tasks.
* purging the log_parser queue (note: not the log_parser_hp or log_parser_fail queues, since the jobs they handle are more critical) in case the load there was causing knock-on issues
Seems like we're now caught up? (A few jobs in ready state is expected, and when I checked a moment ago inbound was at zero jobs in ready state)

> SELECT id, loaded_timestamp FROM mozilla_inbound_objectstore_1.objectstore WHERE processed_state = 'ready'

+ ------- + --------------------- +
| id      | loaded_timestamp      |
+ ------- + --------------------- +
| 4665443 | 1422582496            |
| 4665444 | 1422582496            |
| 4665445 | 1422582496            |
| 4665446 | 1422582496            |
| 4665447 | 1422582496            |
| 4665448 | 1422582496            |
| 4665449 | 1422582496            |
| 4665450 | 1422582496            |
| 4665451 | 1422582496            |
| 4665452 | 1422582496            |
| 4665453 | 1422582496            |
| 4665454 | 1422582496            |
| 4665455 | 1422582496            |
| 4665456 | 1422582496            |
| 4665457 | 1422582496            |
| 4665458 | 1422582496            |
| 4665459 | 1422582496            |
| 4665460 | 1422582496            |
| 4665461 | 1422582496            |
| 4665462 | 1422582496            |
| 4665463 | 1422582496            |
| 4665464 | 1422582496            |
| 4665465 | 1422582496            |
| 4665466 | 1422582496            |
| 4665467 | 1422582496            |
| 4665468 | 1422582496            |
| 4665469 | 1422582496            |
| 4665470 | 1422582496            |
| 4665471 | 1422582496            |
| 4665472 | 1422582496            |
| 4665473 | 1422582496            |
| 4665474 | 1422582496            |
| 4665475 | 1422582496            |
| 4665476 | 1422582496            |
| 4665477 | 1422582496            |
| 4665478 | 1422582496            |
| 4665479 | 1422582496            |
| 4665480 | 1422582496            |
| 4665481 | 1422582496            |
| 4665482 | 1422582496            |
| 4665483 | 1422582496            |
| 4665484 | 1422582496            |
| 4665485 | 1422582496            |
| 4665486 | 1422582496            |
| 4665487 | 1422582496            |
| 4665488 | 1422582496            |
| 4665489 | 1422582496            |
| 4665490 | 1422582496            |
| 4665491 | 1422582496            |
| 4665492 | 1422582496            |
| 4665493 | 1422582496            |
| 4665494 | 1422582496            |
| 4665495 | 1422582496            |
| 4665496 | 1422582496            |
| 4665497 | 1422582496            |
| 4665498 | 1422582496            |
| 4665499 | 1422582496            |
| 4665500 | 1422582496            |
| 4665501 | 1422582496            |
| 4665502 | 1422582496            |
| 4665503 | 1422582496            |
| 4665504 | 1422582496            |
| 4665505 | 1422582496            |
| 4665506 | 1422582496            |
| 4665507 | 1422582496            |
| 4665508 | 1422582496            |
| 4665509 | 1422582496            |
| 4665510 | 1422582496            |
| 4665511 | 1422582496            |
| 4665512 | 1422582496            |
| 4665513 | 1422582496            |
| 4665514 | 1422582496            |
| 4665515 | 1422582496            |
| 4665516 | 1422582496            |
| 4665517 | 1422582496            |
| 4665518 | 1422582496            |
| 4665519 | 1422582496            |
| 4665520 | 1422582496            |
| 4665521 | 1422582496            |
| 4665522 | 1422582496            |
| 4665523 | 1422582496            |
| 4665524 | 1422582496            |
| 4665525 | 1422582496            |
| 4665526 | 1422582496            |
| 4665527 | 1422582496            |
| 4665528 | 1422582496            |
| 4665529 | 1422582496            |
| 4665530 | 1422582496            |
| 4665531 | 1422582496            |
| 4665532 | 1422582496            |
| 4665533 | 1422582496            |
| 4665534 | 1422582496            |
| 4665535 | 1422582496            |
| 4665536 | 1422582496            |
| 4665537 | 1422582496            |
| 4665538 | 1422582496            |
| 4665539 | 1422582496            |
| 4665540 | 1422582496            |
| 4665541 | 1422582496            |
| 4665542 | 1422582496            |
| NULL    | NULL                  |
+ ------- + --------------------- +
101 rows

> SELECT id, loaded_timestamp FROM try_objectstore_1.objectstore WHERE processed_state = 'ready'

+ ------- + --------------------- +
| id      | loaded_timestamp      |
+ ------- + --------------------- +
| 6886912 | 1422582465            |
| 6886913 | 1422582465            |
| 6886914 | 1422582465            |
| 6886915 | 1422582465            |
| 6886916 | 1422582465            |
| 6886917 | 1422582465            |
| 6886918 | 1422582465            |
| 6886919 | 1422582465            |
| 6886920 | 1422582465            |
| 6886921 | 1422582465            |
| 6886922 | 1422582465            |
| 6886923 | 1422582465            |
| 6886924 | 1422582465            |
| 6886925 | 1422582465            |
| 6886926 | 1422582465            |
| 6886927 | 1422582465            |
| 6886928 | 1422582465            |
| 6886929 | 1422582465            |
| 6886930 | 1422582465            |
| 6886931 | 1422582465            |
| 6886932 | 1422582465            |
| 6886933 | 1422582465            |
| 6886934 | 1422582465            |
| 6886935 | 1422582465            |
| 6886936 | 1422582465            |
| 6886937 | 1422582465            |
| 6886938 | 1422582465            |
| 6886939 | 1422582465            |
| 6886940 | 1422582465            |
| 6886941 | 1422582465            |
| 6886942 | 1422582465            |
| 6886943 | 1422582465            |
| 6886944 | 1422582465            |
| 6886945 | 1422582465            |
| 6886946 | 1422582465            |
| 6886947 | 1422582465            |
| 6886948 | 1422582465            |
| 6886949 | 1422582465            |
| 6886950 | 1422582465            |
| 6886951 | 1422582465            |
| 6886952 | 1422582465            |
| 6886953 | 1422582465            |
| 6886954 | 1422582465            |
| 6886955 | 1422582465            |
| 6886956 | 1422582465            |
| 6886957 | 1422582465            |
| NULL    | NULL                  |
+ ------- + --------------------- +
47 rows

> SELECT id, loaded_timestamp FROM fx_team_objectstore_1.objectstore WHERE processed_state = 'ready'

+ ------- + --------------------- +
| id      | loaded_timestamp      |
+ ------- + --------------------- +
| 1674055 | 1422582508            |
| 1674056 | 1422582508            |
| 1674057 | 1422582508            |
| 1674058 | 1422582508            |
| 1674059 | 1422582508            |
| 1674060 | 1422582508            |
| 1674061 | 1422582508            |
| 1674062 | 1422582508            |
| 1674063 | 1422582508            |
| 1674064 | 1422582508            |
| 1674065 | 1422582508            |
| 1674066 | 1422582508            |
| 1674067 | 1422582508            |
| 1674068 | 1422582508            |
| 1674069 | 1422582508            |
| 1674070 | 1422582508            |
| 1674071 | 1422582508            |
| 1674072 | 1422582508            |
| 1674073 | 1422582508            |
| 1674074 | 1422582508            |
| 1674075 | 1422582508            |
| 1674076 | 1422582508            |
| 1674077 | 1422582508            |
| 1674078 | 1422582508            |
| NULL    | NULL                  |
+ ------- + --------------------- +
25 rows
camd and I eyeballed Treeherder data against TBPL and the build API, and indeed m-i looks caught up now.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → WORKSFORME