Closed Bug 1490191 (bmo-db-went-away) Opened 6 years ago Closed 6 years ago

push daemon is not working properly and jobs are backing up

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

Production
defect
Not set
critical

Tracking


RESOLVED DUPLICATE of bug 1496697

People

(Reporter: dkl, Unassigned)

References

Details

Seems the container can no longer reach the MySQL database.

[dkl@ip-172-31-27-221 raw]$ tail -f bugzilla.admin.docker.push.log
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634156000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634186000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634216000000000,"EnvVersion":2,"Severity":3}
{"Fields":{"msg":"DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement \"SELECT 1 FROM push\"] at (eval 1274) line 17."},"Hostname":"ip-172-31-36-200.us-west-2.compute.internal","Logger":"STDERR","Pid":"8","Type":"Bugzilla.Extension.Push.Logger","Timestamp":1536634246000000000,"EnvVersion":2,"Severity":3}
[...]
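
For context on the failure mode: the log entries above are 30 seconds apart and all fail on the same SELECT 1 FROM push statement, which is consistent with a long-lived database handle that went stale when the server restarted underneath it and was never re-established. The following is a minimal sketch of that pattern with a ping-based reconnect; it assumes DBI/DBD::mysql, and the DSN and credentials are illustrative placeholders rather than the daemon's actual configuration.

use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials for illustration; the real daemon takes
# these from Bugzilla's configuration.
my @connect_args = (
    'dbi:mysql:database=bugs;host=db.example.internal',
    'bugs', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);
my $dbh = DBI->connect(@connect_args);

while (1) {
    # A long-lived handle goes stale when the server restarts underneath it;
    # ping() returns false in that case, so reconnect rather than letting
    # every subsequent query fail with "MySQL server has gone away".
    $dbh = DBI->connect(@connect_args) unless $dbh && $dbh->ping;

    # Same statement as in the log excerpt above.
    my ($ok) = $dbh->selectrow_array('SELECT 1 FROM push');
    sleep 30;
}
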
I was able to restart the container, and all of the jobs have now cleared out.

[root@ip-172-31-36-200 dkl]# systemctl restart docker-push.service

Leaving this bug open so we can track down how the issue happened in the first place.
Flags: needinfo?(bobm)
See Also: → 1480128
The RO DB instance was restarted by AWS early on Sept 11. Below is the message logged in the RDS error log:

"
00:01:07 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
"

I have opened a support case at AWS to investigate this.
(In reply to David Lawrence [:dkl] from comment #1)
Any chance one of the custom metrics added was for push daemon events?
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #3)
> (In reply to David Lawrence [:dkl] from comment #1)
> Any chance one of the custom metrics added was for push daemon events?

Unfortunately, I wouldn't know that. I can go through the logs up to the point where it occurred, and maybe we can correlate it with some other event.
This has been found to correspond to the behavior of a known bug in a previous version of MySQL, though why it's happening on our version (5.6.35) I can't say. MySQL bug details here: https://bugs.mysql.com/bug.php?id=64948

As shown in this graph (https://screenshots.firefox.com/no8EP17WHQqxI7ed/us-west-2.console.aws.amazon.com), we see a spike in read traffic (orange) at the same time as a massive (~2.5 GB) freeing of memory (blue); write traffic (green) appears to drop at that point as well.

This appears to happen at or around 0000 GMT, though it doesn't seem to happen daily. It's also unknown why this has *been* happening but has only now resulted in an actual outage/failure of the DB.

From the MySQL bug, this is commonly induced "when running a stored routine containing CASE WHEN statements".

We will continue to investigate.
I'll try bumping the version of mysqlclient in the container.

But I think a good enough fix would be to just take advantage of the connection management that DBIx::Connector offers.
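
A minimal sketch of that approach, assuming a placeholder DSN and credentials rather than Bugzilla's actual configuration: in DBIx::Connector's fixup mode, the block is run against the cached handle, and if it fails because the connection turns out to be dead, the connector reconnects and retries the block once instead of surfacing "MySQL server has gone away" to the caller.

use strict;
use warnings;
use DBIx::Connector;

# Placeholder DSN and credentials for illustration only.
my $conn = DBIx::Connector->new(
    'dbi:mysql:database=bugs;host=db.example.internal',
    'bugs', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);

# In fixup mode the block runs optimistically on the cached handle; if it
# dies because the connection is gone (e.g. the RDS instance was restarted),
# DBIx::Connector reconnects and runs the block one more time.
my ($ok) = $conn->run(fixup => sub {
    $_->selectrow_array('SELECT 1 FROM push');
});
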
No longer depends on: bmo-db-connector-fix
This seems to be fixed.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Alias: bmo-db-went-awat
Alias: bmo-db-went-awat → bmo-db-went-away
No longer depends on: bmo-kills-mysql, bmo-db-connector-fix
Resolution: FIXED → DUPLICATE