Closed Bug 900197 Opened 11 years ago Closed 11 years ago

Connection closed when writing to AMO db after moving to new infrastructure.

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rfradinho, Assigned: bjohnson)

Details

Attachments

(1 file)

update_amo.log.gz, 11 years ago, Ricardo Fradinho [:rfradinho], 6.30 KB, application/x-gzip

Ricardo Fradinho [:rfradinho]

Reporter

Description

11 years ago

Attached file: update_amo.log.gz

Metrics is facing a problem when running a large update process on AMO DB.
After the migration to new infrastructure, the job doesn't finish anymore due to a "communications link failure".
I've already set the following variables, but it didn't help:
SET net_read_timeout = 10000;
SET net_write_timeout = 10000;
SET wait_timeout = 10000;
I've attached the job's log, which may help show how long the job was running before the error.
The problem may also be at the network layer, i.e., the firewall or load balancer.
The server is etl2.metrics.scl3.mozilla.com and we connect to the "addons_mozilla_org" DB on 10.32.126.30:3306

Ricardo Fradinho [:rfradinho]

Reporter

Updated

11 years ago

Severity: normal → major

Brandon Johnson [:cyborgshadow]

Assignee

Comment 1

11 years ago

Try setting the interactive_timeout to be higher. That's one you're likely to encounter when running scripts.

Normally a "communications link failure" is something you run into when trying to connect, not after you are connected.

The only two things I can think of off the top of my head that would cause that here are a long-running query exceeding wait_timeout, or the amount of data being sent exceeding max_allowed_packet.

I do see that max_allowed_packet on these servers is 32 MB. If you don't succeed by raising the interactive_timeout, try raising max_allowed_packet to '1073741824' and trying again.
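
For reference, the two packet sizes mentioned above compare as follows. This is a small Python sketch of the arithmetic only; it assumes that the "32 MB" reported for these servers means 32 MiB (32 * 1024 * 1024 bytes), which the comment does not state explicitly. The names MIB and GIB are helpers introduced for this example.

```python
# Sizes in bytes. MIB and GIB are helper constants introduced for this example.
MIB = 1024 * 1024
GIB = 1024 * MIB

# Assumption: the "32 MB" seen on the servers means 32 MiB.
current_max_allowed_packet = 32 * MIB

# Value suggested in the comment above.
proposed_max_allowed_packet = 1073741824

print(current_max_allowed_packet)                                  # 33554432
print(proposed_max_allowed_packet == GIB)                          # True
print(proposed_max_allowed_packet // current_max_allowed_packet)   # 32
```

Under that assumption, the suggested value is exactly 1 GiB, or 32 times the current setting.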

If neither of these works, please comment again in this bug.

Thanks!

Assignee: server-ops-database → bjohnson

Status: NEW → ASSIGNED

Ricardo Fradinho [:rfradinho]

Reporter

Comment 2

11 years ago

I've added SET interactive_timeout = 10000; but it didn't help.
I cannot change max_allowed_packet at the session level.
The processes actually worked before 12:00 GMT but started to fail afterwards.

Ricardo Fradinho [:rfradinho]

Reporter

Comment 3

11 years ago

This afternoon, the driver reported problems with the connection at these times:
Thu Aug 1 07:56:35 PDT 2013
Thu Aug 1 09:58:52 PDT 2013
These were the times when the exceptions were thrown.
One thing common to all the connection problems is that the last packet was sent about 28s before the error was reported:
"The last packet successfully received from the server was 28,349 milliseconds ago.  The last packet sent successfully to the server was 28,349 milliseconds ago"
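
The idle time in that driver message can be compared against the session timeouts set earlier in this bug. The following is a minimal Python sketch, not part of the original job: the function name idle_milliseconds and the dictionary session_timeouts_s are hypothetical, and the timeout values are the ones quoted in the description and in comment 2 (MySQL expresses these timeouts in seconds).

```python
import re

# Driver message quoted in the comment above.
message = (
    "The last packet successfully received from the server was 28,349 "
    "milliseconds ago.  The last packet sent successfully to the server "
    "was 28,349 milliseconds ago"
)


def idle_milliseconds(text):
    """Return every '<n> milliseconds ago' value found in text, as ints."""
    matches = re.findall(r"([\d,]+) milliseconds ago", text)
    return [int(m.replace(",", "")) for m in matches]


# Session timeouts set by the reporter, in seconds.
session_timeouts_s = {
    "net_read_timeout": 10000,
    "net_write_timeout": 10000,
    "wait_timeout": 10000,
    "interactive_timeout": 10000,
}

idle_s = max(idle_milliseconds(message)) / 1000
print(idle_s)  # 28.349
# True if the connection died well before any of the configured timeouts.
print(all(idle_s < limit for limit in session_timeouts_s.values()))  # True
```

An idle time of roughly 28 seconds is far below every 10000-second timeout, which suggests these session settings were not what closed the connection.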

Brandon Johnson [:cyborgshadow]

Assignee

Comment 4

11 years ago

We had a quick vidyo chat between :jason, :rfradinho, and me.

So far it looks like all the MySQL config is identical between the servers, except where it shouldn't be (server-id et al.).

MySQL isn't reporting anything of note...but the jdbc connection is always dying at the exact same "milliseconds since last packet".

It worked in the morning (UTC) but has not worked in the afternoon (UTC) so far.

:jason's looking into load balancer timeouts and we'll double back once we have more info.

Jason Thomas [:jason]

Comment 5

11 years ago

This issue occurs when our netscaler nodes are failing over. I found these in the logs that correspond to about the same time the driver reported issues:

ns.log.3.gz:Aug  1 14:56:21 <local0.alert> 10.32.8.10 08/01/2013:14:56:21 GMT  0-PPE-0 : EVENT STATECHANGE 285 0 :  Device "remote node 10.32.8.11" - State DOWN
ns.log.2.gz:Aug  1 16:58:39 <local0.alert> 10.32.8.11 08/01/2013:16:58:39 GMT  0-PPE-0 : EVENT STATECHANGE 284 0 :  Device "remote node 10.32.8.10" - State DOWN

We have an active support ticket with Citrix and are waiting on their support for assistance with, or a resolution to, this issue.

For the time being I think we should open up a network flow from etl2.metrics.scl3.mozilla.com to the current master.
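
The match between the failover events and the driver errors can be checked by converting both sets of timestamps to one time zone. This is a minimal Python sketch using only the times quoted in comment 3 and in the log lines above; the variable names are introduced for this example, and PDT is taken as UTC-7.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))  # PDT is UTC-7
GMT = timezone.utc

# Times at which the driver threw exceptions (comment 3, PDT).
driver_errors = [
    datetime(2013, 8, 1, 7, 56, 35, tzinfo=PDT),
    datetime(2013, 8, 1, 9, 58, 52, tzinfo=PDT),
]

# Times of the "State DOWN" events in the netscaler logs (GMT).
failovers = [
    datetime(2013, 8, 1, 14, 56, 21, tzinfo=GMT),
    datetime(2013, 8, 1, 16, 58, 39, tzinfo=GMT),
]

# Seconds between each failover event and the matching driver error.
gaps = [(err - fo).total_seconds() for err, fo in zip(driver_errors, failovers)]
print(gaps)  # [14.0, 13.0]
```

In both cases the driver exception follows the logged state change by well under a minute.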

Jason Thomas [:jason]

Comment 6

11 years ago

Network flows are open. @rfradinho please use 10.32.126.21:3306 instead.

Jeremy Orem [:oremj]

Comment 7

•

11 years ago

The netscaler should be fixed now. Let us know if you have any more issues.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

10 years ago

Product: mozilla.org → Data & BI Services Team

Categories

(Data & BI Services Team :: DB: MySQL, task)
