Closed Bug 1490191 (bmo-db-went-away) Opened 6 years ago Closed 6 years ago

push daemon is not working properly and jobs are backing up

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

Production
defect
Not set
critical

Tracking


RESOLVED DUPLICATE of bug 1496697

People

(Reporter: dkl, Unassigned)

References

Details

Seems the container can no longer reach the MySQL database.

[dkl@ip-172-31-27-221 raw]$ tail -f bugzilla.admin.docker.push.log
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634156000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634186000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634216000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634246000000000,"EnvVersion":2,"Severity":3}
[...]
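
For context on the failure mode: the log entries above are 30 seconds apart and all fail on the same SELECT 1 FROM push statement, which is consistent with a long-lived database handle that went stale when the server restarted underneath it and was never re-established. The following is a minimal sketch of that pattern with a ping-based reconnect; it assumes DBI/DBD::mysql, and the DSN and credentials are illustrative placeholders rather than the daemon's actual configuration.

use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials for illustration; the real daemon takes
# these from Bugzilla's configuration.
my @connect_args = (
    'dbi:mysql:database=bugs;host=db.example.internal',
    'bugs', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);
my $dbh = DBI->connect(@connect_args);

while (1) {
    # A long-lived handle goes stale when the server restarts underneath it;
    # ping() returns false in that case, so reconnect rather than letting
    # every subsequent query fail with "MySQL server has gone away".
    $dbh = DBI->connect(@connect_args) unless $dbh && $dbh->ping;

    # Same statement as in the log excerpt above.
    my ($ok) = $dbh->selectrow_array('SELECT 1 FROM push');
    sleep 30;
}
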
I was able to restart the container, and all of the jobs have now cleared out.

[root@ip-172-31-36-200 dkl]# systemctl restart docker-push.service

Leaving this bug open so we can track down how the issue happened in the first place.
Flags: needinfo?(bobm)
See Also: → 1480128
The RO DB instance was restarted by AWS early on Sept 11. Below is the message logged in the RDS error log:

"
00:01:07 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
"

I have opened a support case at AWS to investigate this.
(In reply to David Lawrence [:dkl] from comment #1)
Any chance one of the custom metrics added was for push daemon events?
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #3)
> (In reply to David Lawrence [:dkl] from comment #1)
> Any chance one of the custom metrics added was for push daemon events?

Unfortunately, I wouldn't know that. I can go through the logs up to the point where it occurred, and maybe we can correlate it with some other event.
This has been found to correspond to the behavior of a known bug in a previous version of MySQL, though why it's happening on our version (5.6.35) I can't say. MySQL bug details here: https://bugs.mysql.com/bug.php?id=64948

As shown in this graph (https://screenshots.firefox.com/no8EP17WHQqxI7ed/us-west-2.console.aws.amazon.com), we see a spike in read traffic (orange) at the same time as a massive (~2.5 GB) freeing of memory (blue); write traffic (green) appears to drop at that point as well.

This appears to happen at or around 0000 GMT, though it doesn't seem to happen daily. It's also unknown why this has *been* happening but has only now resulted in an actual outage/failure of the DB.

From the MySQL bug, this is commonly induced "when running a stored routine containing CASE WHEN statements".

We will continue to investigate.
I'll try bumping the version of mysqlclient in the container.

But I think a good enough fix would be to just take advantage of the connection management that DBIx::Connector offers.
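
A minimal sketch of that approach, assuming a placeholder DSN and credentials rather than Bugzilla's actual configuration: in DBIx::Connector's fixup mode, the block is run against the cached handle, and if it fails because the connection turns out to be dead, the connector reconnects and retries the block once instead of surfacing "MySQL server has gone away" to the caller.

use strict;
use warnings;
use DBIx::Connector;

# Placeholder DSN and credentials for illustration only.
my $conn = DBIx::Connector->new(
    'dbi:mysql:database=bugs;host=db.example.internal',
    'bugs', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);

# In fixup mode the block runs optimistically on the cached handle; if it
# dies because the connection is gone (e.g. the RDS instance was restarted),
# DBIx::Connector reconnects and runs the block one more time.
my ($ok) = $conn->run(fixup => sub {
    $_->selectrow_array('SELECT 1 FROM push');
});
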
No longer depends on: bmo-db-connector-fix
This seems to be fixed.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Alias: bmo-db-went-awat
Alias: bmo-db-went-awat → bmo-db-went-away
No longer depends on: bmo-kills-mysql, bmo-db-connector-fix
Resolution: FIXED → DUPLICATE