Created attachment 783985 [details] update_amo.log.gz

Metrics is facing a problem when running a large update process on the AMO DB. After the migration to the new infrastructure, the job no longer finishes, failing with a "communications link failure". I've already set the following variables, but it didn't help:

SET net_read_timeout = 10000;
SET net_write_timeout = 10000;
SET wait_timeout = 10000;

I've attached the job's log, which may help to see how long the job was running before the error. The problem may also be in the network layer, i.e., firewall/load balancer. The server is etl2.metrics.scl3.mozilla.com and we connect to the "addons_mozilla_org" DB on 10.32.126.30:3306.
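(For anyone reproducing this: it's worth confirming which values are actually in effect, since a session `SET` does not change the server-wide defaults. A generic MySQL diagnostic, not specific to the AMO setup:)

```sql
-- Effective timeout-related settings for the current session
SHOW SESSION VARIABLES LIKE '%timeout%';
-- Server-wide defaults, for comparison
SHOW GLOBAL VARIABLES LIKE '%timeout%';
```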
Try setting interactive_timeout higher; that's one you're likely to hit when running scripts. Normally a "communications link failure" is something you run into while trying to connect, not after you're already connected. The only two causes I can think of off the top of my head are a long-running query exceeding wait_timeout, or the amount of data being sent exceeding max_allowed_packet. I do see that max_allowed_packet on these servers is 32 MB. If raising interactive_timeout doesn't help, try raising max_allowed_packet to 1073741824 (1 GB) and trying again. If neither of these works, please re-comment in this bug. Thanks!
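(A sketch of the suggested changes, assuming an account with the SUPER privilege for the global one — max_allowed_packet is not settable per-session:)

```sql
-- Per-session: raise the idle timeout applied to interactive clients
SET SESSION interactive_timeout = 10000;
-- Server-wide: raise the maximum packet size to 1 GB (requires SUPER;
-- only takes effect for connections opened after the change)
SET GLOBAL max_allowed_packet = 1073741824;
```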
Assignee: server-ops-database → bjohnson
Status: NEW → ASSIGNED
I've added SET interactive_timeout = 10000; but it didn't help. I cannot change max_allowed_packet at the session level (it's a global-only variable). The processes actually worked before 12:00 GMT but started to fail afterwards.
This afternoon, the driver reported problems with the connection at these times:

Thu Aug 1 07:56:35 PDT 2013
Thu Aug 1 09:58:52 PDT 2013

These were the times when the exceptions were thrown. One thing common to all the connection problems is that the last packet was sent about 28 s before the error was reported: "The last packet successfully received from the server was 28,349 milliseconds ago. The last packet sent successfully to the server was 28,349 milliseconds ago."
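(Side note: Connector/J also has client-side timeouts that are independent of the server variables, and are worth ruling out when the driver itself drops the link. An illustrative connection URL — the property values here are examples, not what this job actually uses:)

```
jdbc:mysql://10.32.126.30:3306/addons_mozilla_org?connectTimeout=10000&socketTimeout=0
```

connectTimeout bounds the initial TCP connect in milliseconds; socketTimeout=0 means reads never time out on the client side, which points the finger at the server or the network when the link still dies.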
We had a quick Vidyo chat between :jason, :rfradinho, and me. So far it looks like all the MySQL config between the servers is identical (except where it shouldn't be: server-id et al.). MySQL isn't reporting anything of note, but the JDBC connection is always dying at the exact same "milliseconds since last packet". It worked in the mornings (UTC) but doesn't work in the UTC afternoon so far. :jason is looking into load-balancer timeouts and we'll double back once we have more info.
This issue occurs when our NetScaler nodes are failing over. I found these entries in the logs, which correspond to about the same times the driver reported issues:

ns.log.3.gz:Aug 1 14:56:21 <local0.alert> 10.32.8.10 08/01/2013:14:56:21 GMT 0-PPE-0 : EVENT STATECHANGE 285 0 : Device "remote node 10.32.8.11" - State DOWN
ns.log.2.gz:Aug 1 16:58:39 <local0.alert> 10.32.8.11 08/01/2013:16:58:39 GMT 0-PPE-0 : EVENT STATECHANGE 284 0 : Device "remote node 10.32.8.10" - State DOWN

We have an active support ticket with Citrix and are waiting on them for assistance/a resolution. For the time being I think we should open up a network flow from etl2.metrics.scl3.mozilla.com to the current master.
Network flows are open. @rfradinho, please use 10.32.126.21:3306 instead.
The NetScaler should be fixed now. Let us know if you have any more issues.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team