771218 - Ongoing problems with Socorro staging

Reporter

Description

•

12 years ago

Yesterday, the two staging processors each died, unable to get connections to PostgreSQL or HBase. This has happened several times this week. When we have noticed and restarted, things come back to life. I'm assuming network or Zeus trouble - can please someone take a look? From a timing perspective: socorro-processor2.stage.metrics.phx1 stopped at 2012-07-04 05:46:36,199 socorro-processor1.stage.metrics.phx1 stopped at 2012-07-04 18:24:04,516

Ravi Pina [:ravi]

Comment 1

•

12 years ago

Are these times UTC?

Laura Thomson :laura

Reporter

Comment 2

•

12 years ago

Those times are from the logs on the machines, so I guess it's whatever the system time is.

Ravi Pina [:ravi]

Comment 3

•

12 years ago

Who from server ops is going to examine Zeus?

K Lars Lohn [:lars] [:klohn]

Comment 4

•

12 years ago

this problem continues. On restarting the processors on both socorro-processor2.stage.metrics.phx1 and socorro-processor1.stage.metrics.phx1 at 2012-07-05 10:30:31 they ran for eleven hours and then suddenly lost their connections to PostgreSQL at 2012-07-05 21:10:17 This problem is holding us back from getting our next release ready.

Brandon Burton [:solarce]

Assignee

Comment 5

•

12 years ago

Per IRC, I'll take a look at Zeus

Assignee: server-ops → bburton

Status: NEW → ASSIGNED

Brandon Burton [:solarce]

Assignee

Comment 6

•

12 years ago

After digging through the event log on pp-zlb09 I found the following entries, which show a blip between 9:44PM and 9:57PM PST yesterday, which is after the timestamp :lars noted but in the same hour I'm adding :jakem and :cshields for some input :ravi do you see anything on the network side for the below time stamps? [05/Jul/2012:21:44:43 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.65:9090': Timeout [05/Jul/2012:21:44:44 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.65:9090 nodefail Node 10.8.100.65 has failed - A monitor has detected a failure [05/Jul/2012:21:44:48 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.62:9090': Timeout [05/Jul/2012:21:44:49 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodefail Node 10.8.100.62 has failed - A monitor has detected a failure [05/Jul/2012:21:44:50 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.66:9090': Timeout [05/Jul/2012:21:44:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodefail Node 10.8.100.66 has failed - A monitor has detected a failure [05/Jul/2012:21:44:55 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.66:9090'. [05/Jul/2012:21:44:56 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodeworking Node 10.8.100.66 is working again [05/Jul/2012:21:44:56 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.63:9090': Timeout [05/Jul/2012:21:44:56 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodefail Node 10.8.100.63 has failed - A monitor has detected a failure [05/Jul/2012:21:44:58 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.67:9090': Timeout [05/Jul/2012:21:44:59 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodefail Node 10.8.100.67 has failed - A monitor has detected a failure [05/Jul/2012:21:45:01 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.63:9090'. [05/Jul/2012:21:45:02 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodeworking Node 10.8.100.63 is working again [05/Jul/2012:21:45:02 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.69:9090': Timeout [05/Jul/2012:21:45:03 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.69:9090 nodefail Node 10.8.100.69 has failed - A monitor has detected a failure [05/Jul/2012:21:45:24 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.67:9090'. [05/Jul/2012:21:45:25 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodeworking Node 10.8.100.67 is working again [05/Jul/2012:21:45:30 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.62:9090'. [05/Jul/2012:21:45:31 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodeworking Node 10.8.100.62 is working again [05/Jul/2012:21:45:37 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.63:9090': Monitor exited, exit code 1, no output generated [05/Jul/2012:21:45:38 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodefail Node 10.8.100.63 has failed - A monitor has detected a failure [05/Jul/2012:21:45:44 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.66:9090': Monitor exited, exit code 1, no output generated [05/Jul/2012:21:45:45 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodefail Node 10.8.100.66 has failed - A monitor has detected a failure [05/Jul/2012:21:45:47 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.67:9090': Monitor exited, exit code 1, no output generated [05/Jul/2012:21:45:48 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodefail Node 10.8.100.67 has failed - A monitor has detected a failure [05/Jul/2012:21:45:50 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.62:9090': Monitor exited, exit code 1, no output generated [05/Jul/2012:21:45:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodefail Node 10.8.100.62 has failed - A monitor has detected a failure [05/Jul/2012:21:45:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 pooldied Pool has no back-end nodes responding [05/Jul/2012:21:56:51 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.65:9090'. [05/Jul/2012:21:56:52 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.65:9090 nodeworking Node 10.8.100.65 is working again [05/Jul/2012:21:56:52 -0700] INFO pools/socorro-thrift-stage:9090 poolok Pool now has working nodes [05/Jul/2012:21:56:53 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.69:9090'. [05/Jul/2012:21:56:54 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.69:9090 nodeworking Node 10.8.100.69 is working again [05/Jul/2012:21:57:08 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.63:9090'. [05/Jul/2012:21:57:09 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodeworking Node 10.8.100.63 is working again [05/Jul/2012:21:57:15 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.66:9090'. [05/Jul/2012:21:57:16 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodeworking Node 10.8.100.66 is working again [05/Jul/2012:21:57:19 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.67:9090'. [05/Jul/2012:21:57:20 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodeworking Node 10.8.100.67 is working again [05/Jul/2012:21:57:21 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.62:9090'. [05/Jul/2012:21:57:22 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodeworking Node 10.8.100.62 is working again

K Lars Lohn [:lars] [:klohn]

Comment 7

•

12 years ago

please note: the problems in the logs in Comment #6 involve HBase. The problem that we're seeing is with Postgres. the postgres problem happened again at today at 2012-07-06 10:58:44. this time it affected socorro-processor2.stage.metrics.phx1, but not socorro-processor1.stage.metrics.phx1.

Ravi Pina [:ravi]

Comment 8

•

12 years ago

What is the actual check testing?

Aaron

Comment 9

•

12 years ago

Logs for the switches these hosts are connected to are clean for the times mentioned. I bumped logging level to "info" from "notice" to give us a better chance to catch something, but I don't see any indication of network issues. I also checked the core. It is clean. STP has been quiet for a number of days now. Still looking, but I'm not seeing anything on the network side.

Laura Thomson :laura

Reporter

Comment 10

•

12 years ago

Busted again right now, FWIW

Brandon Burton [:solarce]

Assignee

Comment 11

•

12 years ago

(In reply to Ravi Pina [:ravi] from comment #8) > What is the actual check testing? The check is a ping check

Brandon Burton [:solarce]

Assignee

Comment 12

•

12 years ago

Looking at socorro-processor2.stage.metrics.phx1 /var/log/messages only has puppetd run info /var/log/dmesg is auditd messages I don't see anything with yum since 6/27 and those were just audited changes I'm going to hand this to the SRE team to look at and see if there is anything in the hardware logs or maybe a network driver issue?

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Assignee: bburton → server-ops

[:jberkus] Josh Berkus

Comment 13

•

12 years ago

We got processors+stageDB working now. However, we don't know what caused the original problem; by the time we cleaned it up, the original issue had ceased.

Michael Burns [:mburns]

Updated

•

12 years ago

Priority: -- → P1

Summary: Ongoing problems with Socorro staging - network related? → Ongoing problems with Socorro staging

[:jberkus] Josh Berkus

Comment 14

•

12 years ago

Monitored all night last night. Processors reported connection failure at 4:10AM Pacific. In the pgbouncer logs, we have: 2012-07-11 04:08:31.272 2789 LOG Stats: 0 req/s, in 108 b/s, out 154 b/s,query 4383 us 2012-07-11 04:08:33.711 2789 LOG C-0x1bffa40: breakpad/processor@10.8.70.208:48083 closing because: client unexpected eof (age=57431) 2012-07-11 04:08:36.516 2789 LOG C-0x1bffe78: breakpad/processor@10.8.70.208:48107 closing because: client unexpected eof (age=57433) 2012-07-11 04:08:41.510 2789 LOG C-0x1bff770: breakpad/processor@10.8.70.208:48063 closing because: client unexpected eof (age=57438) 2012-07-11 04:08:42.519 2789 LOG C-0x1bff8d8: breakpad/processor@10.8.70.208:48075 closing because: client unexpected eof (age=57439) 2012-07-11 04:08:42.519 2789 LOG C-0x1bff338: breakpad/processor@10.8.70.208:48028 closing because: client unexpected eof (age=57439) 2012-07-11 04:08:48.380 2789 LOG C-0x1bff608: breakpad/processor@10.8.70.208:48043 closing because: client unexpected eof (age=57445) 2012-07-11 04:08:54.385 2789 LOG C-0x1bffd10: breakpad/processor@10.8.70.208:48102 closing because: client unexpected eof (age=57451) 2012-07-11 04:09:31.273 2789 LOG Stats: 0 req/s, in 111 b/s, out 156 b/s,query 4147 us 2012-07-11 04:10:31.274 2789 LOG Stats: 0 req/s, in 116 b/s, out 165 b/s,query 4073 us That seems to confirm that the problem is the loss of the network connection between the processor and the database server. Interestingly, though, the monitor doesn't lose its connection until 7AM: 2012-07-11 06:58:42.363 PDT,"monitor","breakpad",18123,"10.8.70.208:15950",4ffd7d30.46cb,1,"idle",2012-07-11 06:18:40 PDT,6/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,"" So, I think we need a Zeus expert to start looking at Zeus for periodically recurring issues. Thanks!

Brandon Burton [:solarce]

Assignee

Comment 15

•

12 years ago

Current assessment is that an issue with our leap second "fix" we pushed with puppet was the source of the issue, as it was causing the second time of the clock to be reset to :20 seconds somewhat randomly, this has been fixed and staging restarted We'll monitor for 24 hours and RF or investigate further, as needed.

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Assignee: server-ops → bburton

Laura Thomson :laura

Reporter

Comment 16

•

12 years ago

processor2 is having PG connection difficulties again, via mpressman: 2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq

Matt Pressman [:mpressman]

Comment 17

•

12 years ago

2012-07-11 12:17:18,486 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-11 12:17:18,486 ERROR - Thread-9 - error with no message from the libpq 2012-07-11 12:17:18,487 ERROR - Thread-9 - trace back follows: 2012-07-11 12:17:18,487 ERROR - Thread-9 - Traceback (most recent call last): 2012-07-11 12:17:18,488 ERROR - Thread-9 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId)) 2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq Something BAD happened in Thread-9: Traceback (most recent call last): File "/data/socorro/application/socorro/lib/threadlib.py", line 128, in run aFunction(arguments) File "/data/socorro/application/socorro/processor/processor.py", line 478, in processJobWithRetry result = self.processJob(jobTuple) File "/data/socorro/application/socorro/processor/processor.py", line 651, in processJob threadLocalDatabaseConnection.rollback() InterfaceError: connection already closed connection already closed

Matt Pressman [:mpressman]

Comment 18

•

12 years ago

There was a mention during the meeting about the file leapsecond_2012_06_30 in /tmp Is this supposed to be removed? It still exists on processor2

Matt Pressman [:mpressman]

Comment 19

•

12 years ago

so just to clear this up for the bug, chields mentioned that the existence of the file is of no concern.

Brandon Burton [:solarce]

Assignee

Comment 20

•

12 years ago

There was nothing logged in Zeus during the noon hour for socorro, that I can see I'm going to ask an SRE to check the server

Laura Thomson :laura

Reporter

Comment 21

•

12 years ago

processor2 is down again, started paging at Date/Time: 07-12-2012 04:58:36

Ashish Vijayaram [:ashish]

Comment 22

•

12 years ago

These might be a red-herring but I found these messages in socorro1.stage.db.phx1:/var/log/pgbouncer/pgbouncer-processor.log: 2012-07-12 08:23:32.150 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600) 2012-07-12 08:23:58.750 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 new connection to server 2012-07-12 08:24:04.843 2789 LOG S-0x1c1ca18: breakpad/processor@unix:5432 new connection to server 2012-07-12 08:30:53.063 2789 LOG S-0x1c1bc08: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600) they might just be suggestive of normal activity, I'm unsure.

[:jberkus] Josh Berkus

Comment 23

•

12 years ago

Ashish, That's normal activity.

[:jberkus] Josh Berkus

Comment 24

•

12 years ago

Interesting. I only see one error in the pgBouncer log for the period 04:00:00 to 06:00:00. One client unexpectedly disconnected. Otherwise, all activity durning that period looks normal, and in fact there's evidence of continuing processing during that period. Are you sure *all* processors disconnected, rather than just one of them?

K Lars Lohn [:lars] [:klohn]

Comment 25

•

12 years ago

only processor2 had trouble with connections on all threads. processor1 continued working without trouble.

[:jberkus] Josh Berkus

Comment 26

•

12 years ago

ah, that would be this then: 2012-07-12 04:53:00.124 2789 LOG C-0x1bfec30: breakpad/processor@10.8.70.208:1746 closing because: client unexpected eof (age=41861) as per above "client unexpected eof" means that the client vanished off the network without disconnecting. it means either network breakage or that the client crashed.

K Lars Lohn [:lars] [:klohn]

Comment 27

•

12 years ago

at the same time, the client reported: "DatabaseError: error with no message from the libpq" I believe that means, from the client perspective, the server vanished off the network." In the past in similar situations, we've gotten error messages that say, "server closed the connection unexpectedly"

[:jberkus] Josh Berkus

Comment 28

•

12 years ago

More evidence that this is a network issue. The monitor -- a separate process, one which connects to PostgreSQL directly rather than using pgbouncer -- is also randomly losing connections: Monitor Log: 2012-07-12 03:43:06,138 ERROR - jobCleanupThread - Caught Error: <class 'psycopg2.InterfaceError'> 2012-07-12 03:43:06,138 ERROR - jobCleanupThread - connection already closed PostgreSQL Log: 2012-07-12 03:43:06.048 PDT,"monitor","breakpad",29015,"10.8.70.208:8589",4ffea7df.7157,1,"idle",2012-07-12 03:33:03 PDT,10/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,""

Brandon Burton [:solarce]

Assignee

Comment 29

•

12 years ago

I've been double checking things, as per the above comments, connects to Postgres are dying from both socorroadm.stage and socorro-processor[1-2].stage socorroadm.stage is an ESX VM, socorro-processor[1-2].stage are HP blades Both the monitor and processor apps are using 10.8.70.126 as their databaseHost, which is a VIP on pp-zlb09 A review of the Zeus logs shows db errors only being recorded on 07/06 for the month of July [root@pp-zlb09.phx log]# grep socorro /usr/local/zeus/zxtm/log/errors | grep Jul | grep db [06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-processor-pool nodes/10.8.70.100:6433 nodefail Node 10.8.70.100 has failed - A monitor has detected a failure [06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-processor-pool pooldied Pool has no back-end nodes responding [06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool pooldied Pool has no back-end nodes responding [06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool nodes/10.8.70.100:6432 nodefail Node 10.8.70.100 has failed - A monitor has detected a failure [06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool pooldied Pool has no back-end nodes responding [06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-processor-pool nodes/10.8.70.100:6433 nodeworking Node 10.8.70.100 is working again [06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-processor-pool poolok Pool now has working nodes [06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool poolok Pool now has working nodes [06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool nodes/10.8.70.100:6432 nodeworking Node 10.8.70.100 is working again [06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool poolok Pool now has working nodes All the Postgres traffic to 10.8.70.126 gets sent to 10.8.70.125 I'd been focusing on trying to find a fault on the processor blades, but given c28 and IRC discussion, I am now reviewing the Postgres host

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Depends on: 773390

Brandon Burton [:solarce]

Assignee

Comment 30

•

12 years ago

Found possible storage / driver issue in dmesg log, opened bug 773390 to have Ops look into

Brandon Burton [:solarce]

Assignee

Comment 31

•

12 years ago

Per bug 773390 and IRC, updates have been applied to the db server, the processors and monitor have been restarted and :lars confirmed stage is running. We'll wait 24 hours to see if the issue persists.

K Lars Lohn [:lars] [:klohn]

Comment 32

•

12 years ago

no joy. staging processor #2 has again lost its connection to postgres. processor log: 2012-07-12 14:54:05,356 ERROR - Thread-10 - DatabaseError: error with no message from the libpq server log: 2012-07-12 14:53:40.052 6137 LOG C-0x15fc4a0: breakpad/processor@10.8.70.208:28101 closing because: client unexpected eof (age=4500)

K Lars Lohn [:lars] [:klohn]

Comment 33

•

12 years ago

just a notification that this problem is not confined to connecting to Postgres. This morning, both staging processors are having a bad time trying to talk to Hbase at 10.8.100.209 It started at 6:05 2012-07-13 06:05:58,810 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection 2012-07-13 06:05:58,811 CRITICAL - Thread-5 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'> 2012-07-13 06:05:58,811 CRITICAL - Thread-5 - the connection is not viable. retries fail:

K Lars Lohn [:lars] [:klohn]

Comment 34

•

12 years ago

interestingly, the monitor appears to have no trouble at all talking to HBase.

Brandon Burton [:solarce]

Assignee

Comment 35

•

12 years ago

:mpressman made a comment on irc over the weekend that things had been stable? Is that correct?

K Lars Lohn [:lars] [:klohn]

Comment 36

•

12 years ago

Both staging processors spent all day Sunday trying to connect to HBase without success. Today, the processors are wiling away their time trying to talk to a silent Postgres.

Brandon Burton [:solarce]

Assignee

Comment 37

•

12 years ago

I'm not sure what to look at next, I'm not seeing any consistent errors in the Zeus logs and this seems not to be isolated to specific hosts, but previously netops confirmed seeing no network issues from the processors hosts. I'll try to talk with Corey, Jason, and some other webops folks tomorrow morning to discuss what's the next best step

Robert Helmer [:rhelmer]

Comment 38

•

12 years ago

I think it'd help to come up with a simpler, reproducible test case. It's hard to tell if anything is fixed since we can't make the problem happen on demand, and it feels like we're still not 100% where the problem is (please correct me if this is wrong). Would it make sense to try to have some simple persistent connection across the hosts in question (e.g. netcat?) If *that* can't stay connected then we know it's not e.g. the software involved.

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Whiteboard: [pending webops discussion]

Ravi Pina [:ravi]

Comment 39

•

12 years ago

This bug is long and confusing. Can someone provide the data flow(s) with IP and port pairs here?

Laura Thomson :laura

Reporter

Comment 40

•

12 years ago

OK. Dropped connections and timeouts have been seen with the following flows: * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall whether it's direct (5432) or through pgbouncer(6432). solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL) Ravi, is that what you need?

Laura Thomson :laura

Reporter

Comment 41

•

12 years ago

IP details are here https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+(Socorro)#crash-statsmozillacomSoc orro-MonitorAdminhost

[:jberkus] Josh Berkus

Comment 42

•

12 years ago

Laura, > * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> > socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall > whether it's direct (5432) or through pgbouncer(6432). It's port 6433 actually. We've also seen dropped connections between the monitor and dev.db (port 5432). Not sure which server the monitor is.

[:jberkus] Josh Berkus

Comment 43

•

12 years ago

BTW, since we've never seen issues like this one in production, I'm wondering if this might be related to the virtualization system staging monitor & processors run on?

Brandon Burton [:solarce]

Assignee

Comment 44

•

12 years ago

(In reply to [:jberkus] Josh Berkus from comment #43) > BTW, since we've never seen issues like this one in production, I'm > wondering if this might be related to the virtualization system staging > monitor & processors run on? socorroadm.stage is a VM socorrop-processor[1-2].stage are HP blades.

Brandon Burton [:solarce]

Assignee

Comment 45

•

12 years ago

(In reply to Laura Thomson :laura from comment #40) > OK. Dropped connections and timeouts have been seen with the following > flows: > * socorroadm.stage.private.phx1.mozilla.com (monitor) -> > socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) > * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> > socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) > * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> > socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall > whether it's direct (5432) or through pgbouncer(6432). > solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL) > > Ravi, is that what you need? Monitor connections on VM: * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net) * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net) Processor connections on HP blades: * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net) * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)

Mark Reid [:mreid]

Comment 46

•

12 years ago

The metrics ADU export has been failing on Socorro Stage every night, and seems related to the ongoing problems. For example: ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend. Not a very informative error message, but at least it gives the time. I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job manually worked when I tried it earlier this week (24-07 11:38:57,118 to be specific). Please let me know if I can provide any other information.

K Lars Lohn [:lars] [:klohn]

Updated

•

12 years ago

Blocks: 752956

Brandon Burton [:solarce]

Assignee

Comment 47

•

12 years ago

(In reply to Ravi Pina [:ravi] from comment #39) > This bug is long and confusing. Can someone provide the data flow(s) with > IP and port pairs here? Ravi, the short is that we're seeing connection failures to Postgres and Hbase VIPs on Zeus. The VIPs are on separate Zeus servers (pp-zlb09 and pp-zlb10). When we see connection failures in the logs of the applications connecting to the VIPs, I am not seeing entries in the Zeus error logs at or around the time of the failures. The data flows are as follows Monitor connections on an ESX VM: * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net) * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net) Processor connections on two HP blades: * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net) * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net) I've worked with SREs to review the hardware of the Postgres master and the processor blades at points throughout diagnosing this bug and that has checked out so far. Given that and the failures from multiple servers to multiple VIPs on different Zeus nodes, it's hard not to look at the network as the commonality, but the Zeus cluster is also a common point. Is there additional info you'd need to help assess if there is a possible networking related issue? Would you recommend we try moving these VIPs to the internal (db) Zeus cluster as the next step?

Corey Shields [:cshields]

Comment 48

•

12 years ago

(In reply to Brandon Burton [:solarce] from comment #47) > Would you recommend we try moving these VIPs to the internal (db) Zeus > cluster as the next step? This needs to be done anyway, and would help in other situations as well. Are they using the same VIPs as anything prod? If not can we disrupt staging at any time to make this change?

[:jberkus] Josh Berkus

Comment 49

•

12 years ago

The staging VIPs are different from the prod ones.

[:jberkus] Josh Berkus

Comment 50

•

12 years ago

Oh, and stage is already pretty disrupted, so just let us know when you're taking it down. No confirmation needed.

Brandon Burton [:solarce]

Assignee

Comment 51

•

12 years ago

(In reply to [:jberkus] Josh Berkus from comment #50) > Oh, and stage is already pretty disrupted, so just let us know when you're > taking it down. No confirmation needed. I'll be working on this over the next 1-2 hours and you may see disruption as various points, I'll confirm in #breakpad when I'm ready to test out things

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Whiteboard: [pending webops discussion] → [moving Zeus VIPs to internal cluster]

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Whiteboard: [moving Zeus VIPs to internal cluster] → [testing VIPs on internal cluster]

Brandon Burton [:solarce]

Assignee

Comment 52

•

12 years ago

Hbase VIP moved at Noon, confirmed working by :lars PG VIPs (5432, 6432, 6433) moved at 4:34, awaiting confirmation from socorro team, :solarce confirmed telnet works.

[:jberkus] Josh Berkus

Comment 53

•

12 years ago

We need to watch the PG connections for a 24-hour period to determine if they are better. By Monday, we should know.

K Lars Lohn [:lars] [:klohn]

Comment 54

•

12 years ago

I regret to report that our problems have not been resolved. both staging processors and the monitor lost their ability to talk to both HBase and Postgres at various times over the weekend. socorro-processor1.stage.metrics.phx1: 2012-07-29 06:08:33,142 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection 2012-07-29 06:08:33,143 CRITICAL - Thread-2 - the connection is not viable. retries fail: socorro-processor2.stage.metrics.phx1: 2012-07-30 06:04:57,996 CRITICAL - Thread-9 - something's gone horribly wrong with the HBase connection 2012-07-30 06:04:57,997 CRITICAL - Thread-9 - the connection is not viable. retries fail: on the 29th, while processor1 was struggling with no HBase connection, processor2 was unable to talk to Postgres on some (but not all) threads. Because of being unable to talk to Postgres, we don't know what what was happening with HBase connectivity, as the processor won't try to talk to HBase if it cannot talk to Postgres. the monitor, on the 29th, suffered the same HBase connectivity problem as processor1 for about 40 minutes. Then it recovered. Six hours later, the monitor lost the Postgres connection. Recovery happened about 9 hours later. The monitor lost the HBase connection again on the 30th at the same time that processor2 did. Processor1 lost the HBase connection about 5 minutes later.

Laura Thomson :laura

Reporter

Comment 55

•

12 years ago

Escalating this to blocker. If we can't get staging sorted out, we will be unable to ship support for rapid betas. The trouble we have had with crontabber supports the fact that we can't ship high-risk changes without a stable staging environment.

Blocks: daily_beta_tracking

Severity: normal → blocker

Brandon Burton [:solarce]

Assignee

Comment 56

•

12 years ago

Attached file socorro stage issues — Details

Zeus saw failures for the Postgres VIP, but not the HBase VIP It was a couple failures on 7/27 and many on 7/29. None on 7/30 so far.

Brandon Burton [:solarce]

Assignee

Comment 57

•

12 years ago

Brandon Burton [:solarce]

Assignee

Comment 58

•

12 years ago

The data flows are now Monitor connections on an ESX VM: * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb01.phx.mozilla.net) * socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net) Processor connections on two HP blades: * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb01.phx.mozilla.net) * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net)

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Depends on: 778797

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Depends on: 778802

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Depends on: 778805

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Depends on: 778831

Brandon Burton [:solarce]

Assignee

Comment 59

•

12 years ago

Tracking Changes made so far 1. jberkus realized the socorro-processor virtual server was using the 5432 backend and not the 6433 backend. I fixed this [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433 2. cshields found that the virtual server connection timeout for postgres is set to 0 in prod, he made the same changes on stage in Zeus [30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web confmod Configuration file modified [30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web vsreconf Virtual server reconfigured on port 6432 [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433 [30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg confmod Configuration file modified [30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg vsreconf Virtual server reconfigured on port 5432 Tracking additional issues seen so far. 1. HBase nagios alerts 23:04:29 nagios-phx1 | [168] hp-node68.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1 23:06:49 nagios-phx1 | [172] hp-node64.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1 Neither Zeus nor the socorro processors saw an error. Aj has acked the alerts and we're waiting on feedback from :tmary about the state of the nodes 2. socorro-processor2 connections to postgres interrupted First from 13:43 - 13:47 2012-07-30 13:43:46,953 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 13:43:46,955 ERROR - Thread-1 - DatabaseError: error with no message from the libpq 2012-07-30 13:44:11,010 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 13:44:11,012 ERROR - Thread-2 - DatabaseError: error with no message from the libpq 2012-07-30 13:44:47,088 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 13:44:47,089 ERROR - Thread-8 - DatabaseError: error with no message from the libpq 2012-07-30 13:45:29,180 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 13:45:29,182 ERROR - Thread-11 - DatabaseError: error with no message from the libpq 2012-07-30 13:47:35,453 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 13:47:35,455 ERROR - Thread-7 - DatabaseError: error with no message from the libpq and then at 3:07, five minutes after the time out config change, so it could be related to that 2012-07-30 15:07:23,217 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,219 ERROR - Thread-9 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,222 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,224 ERROR - Thread-1 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,227 ERROR - Thread-10 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,228 ERROR - Thread-10 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,231 ERROR - Thread-3 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,233 ERROR - Thread-3 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,236 ERROR - Thread-4 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,238 ERROR - Thread-4 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,241 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,242 ERROR - Thread-2 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,245 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,247 ERROR - Thread-8 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,249 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,250 ERROR - Thread-7 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,254 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,255 ERROR - Thread-11 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,258 ERROR - Thread-5 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,259 ERROR - Thread-5 - DatabaseError: error with no message from the libpq 2012-07-30 15:07:23,262 ERROR - Thread-6 - Caught Error: <class 'psycopg2.DatabaseError'> 2012-07-30 15:07:23,264 ERROR - Thread-6 - DatabaseError: error with no message from the libpq

Brandon Burton [:solarce]

Assignee

Comment 60

•

12 years ago

Summary of today was Nagios checks added, a couple more Zeus settings (specifically postgres connection timeouts) were set on stage to make match prod, and a push to stage and prod was done (unrelated to issue, but a change none-the-less) Per IRC with cshields, lars, and jberkus, we'll monitor through tomorrow to see how it works with the Zeus changes

K Lars Lohn [:lars] [:klohn]

Comment 61

•

12 years ago

There have been neither Postgres nor HBase/Thrift connection problems overnight since Monday. This is very encouraging.

Severity: blocker → normal

K Lars Lohn [:lars] [:klohn]

Comment 62

•

12 years ago

earlier change of blocker to normal was inadvertent.

Severity: normal → blocker

K Lars Lohn [:lars] [:klohn]

Comment 63

•

12 years ago

sorry to report that soon after my comment #61, staging processors lost the connection to HBase: 2012-07-31 06:12:22,627 CRITICAL - Thread-11 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'> 2012-07-31 06:12:22,628 CRITICAL - Thread-11 - the connection is not viable. retries fail: this lasted through about 7:24 where the processors recovered.

Severity: blocker → normal

Brandon Burton [:solarce]

Assignee

Comment 64

•

12 years ago

This was seen on processor1, processor2, and socorroadm.stage, though more times on the processors processor1 2012-07-31 07:12:51,461 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:14,044 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:14,313 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:37,876 CRITICAL - Thread-10 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:51,143 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:51,369 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection 2012-07-31 07:24:13,510 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection 2012-07-31 07:24:50,623 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection processor2 2012-07-31 07:12:48,702 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection 2012-07-31 07:12:53,481 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection 2012-07-31 07:14:26,559 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection 2012-07-31 07:17:04,527 CRITICAL - Thread-8 - something's gone horribly wrong with the HBase connection 2012-07-31 07:19:05,589 CRITICAL - Thread-6 - something's gone horribly wrong with the HBase connection 2012-07-31 07:22:28,071 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection 2012-07-31 07:23:05,122 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection supportadm.stage 2012-07-31 07:14:41,999 DEBUG - MainThread - somethings gone horribly wrong with HBase 2012-07-31 07:14:53,071 DEBUG - MainThread - somethings gone horribly wrong with HBase Nagios did not alert anything for socorro stage today. Zeus did not see any errors, the last thing is sees is the config changes from yesterday. [30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web confmod Configuration file modified [30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web vsreconf Virtual server reconfigured on port 6432 [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified [30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433 [30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg confmod Configuration file modified [30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg vsreconf Virtual server reconfigured on port 5432 Also, we have not seen any Postgres errors since the changes yesterday. I'm going to get tcpdumps running on the active HBase nodes, the processors, and supportadm.stage

Brandon Burton [:solarce]

Assignee

Comment 65

•

12 years ago

Per :tmary on IRC, I enabled hp-node64 and hp-node68, so far they look fine to Zeus [31/Jul/2012:10:18:28 -0700] INFO pools/socorro-thrift-stage:9090 confmod Configuration file modified

Brandon Burton [:solarce]

Assignee

Comment 66

•

12 years ago

Updates on what I've done today. I started tcpdumps at 11:30am, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=778831#c2 (each running under screen, as root) I've been monitoring the logs with the following commands pp-zlb01 - grep 2012 /usr/local/zeus/zxtm/log/errors | grep socorro socorroadm.stage - grep DatabaseError /var/log/socorro/socorro-monitor.log and grep HBase /var/log/socorro/socorro-monitor.log socorro-processor[1-2].stage - grep HBase /var/log/socorro/socorro-processor and grep DatabaseError /var/log/socorro/socorro-processor.log

Brandon Burton [:solarce]

Assignee

Comment 67

•

12 years ago

… I've been monitoring the logs and have not seen any new issues since the errors lars and I noted between 6-7:30am. If we see further errors, working with NetOps to review the tcpdumps is the next best step. I'll be out on PTO tomorrow, until 8/8, so if there is anything I was doing that anyone is unsure how to also do, please let me know tonight, I'll check bugmail before I take off for the airport tomorrow.

K Lars Lohn [:lars] [:klohn]

Comment 68

•

12 years ago

on Wednesday, August 1 - the processors lost their connection to HBase at 6:10am pdt, the problem remains on-going at 6:56am pdt. 2012-08-01 06:10:12,203 CRITICAL - Thread-11 - the connection is not viable. retries fail: Interestingly, the monitor has not lost its connection to HBase and continues blithely along assigning jobs to processors that cannot do their work.

Severity: normal → blocker

K Lars Lohn [:lars] [:klohn]

Comment 69

•

12 years ago

there is a time consistency in the HBase outages. Since July 3rd, the outages all begin within 10 minutes of 6am. prior to that: July 3 - 7:30am and 10:00am July 2 - 12:50am July 1 - 3:00am

Peter Radcliffe [:pir]

Comment 70

•

12 years ago

This debugging is using up all the disk space on socorroadm.stage.private.phx1 with tcpdump logs: 19G /root/778831 That job can't keep running and storing those logs locally. I'll compress the old ones to buy some time but it'll still need to be killed or redone to copy things off-machine in the next 24h and solarce is on PTO. If no one else has any suggestions of where to copy that data I'm going ot need to kill the tcpdump job. (3.1Gb of logs in /var/log/socorro, much of which is in /var/log/socorro/syslog, isn't helping but that's small compared to the tcpdumps).

K Lars Lohn [:lars] [:klohn]

Comment 71

•

12 years ago

right on schedule, the HBase connection to staging processors and monitor failed at 6:02am PDT. So we don't know why, but we can predict the time of failure. The error message is the same as all previous times: 2012-08-02 06:02:24,807 CRITICAL - MainThread - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'> 2012-08-02 06:02:24,807 CRITICAL - MainThread - the connection is not viable. retries fail:

Daniel Maher [:phrawzty]

Comment 72

•

12 years ago

(In reply to K Lars Lohn [:lars] [:klohn] from comment #71) > right on schedule, the HBase connection to staging processors and monitor > failed at 6:02am PDT. So we don't know why, but we can predict the time of > failure. The error message is the same as all previous times: Since we can predict the failures, could we not set up a cron job to engage and disengage the tcpdump around the time at which the behaviour is expected to occur ? (re: comment #70)

Laura Thomson :laura

Reporter

Comment 73

•

12 years ago

Update: we now know there are three separate causes of failure, and we have a plan. 1. PG connection timeouts. When these were moved to a new VIP, it was noticed that the timeout value was set too low (300ms) and different from prod. This has been fixed. 2. PG connection timeouts on a Sunday around noon. This is due to the data being refreshed from prod -> stage. Being on a Sunday, this is the least critical of the three, but I would like to address it if possible. That is in the hands of mpressman. 3. HBase connection timeouts each day at around 6am. This is due to the data being refreshed from prod->stage. This is on tmary to fix.

Peter Radcliffe [:pir]

Comment 74

•

12 years ago

:laura - if these three failure modes are now known, do you still need the tcpdump data to diagnose or can I kill the process before it fills up disk again?

Laura Thomson :laura

Reporter

Comment 75

•

12 years ago

pir: I think you can kill it for now at least. Thanks!

Peter Radcliffe [:pir]

Comment 76

•

12 years ago

laura: killed, dumps compressed. There's enough slack space after compression that the dumps don't need to be deleted immediately, but if we're sure they're not required I'll delete them too.

T [:tmary] Meyarivan

Comment 77

•

12 years ago

> 3. HBase connection timeouts each day at around 6am. This is due to the > data being refreshed from prod->stage. This is on tmary to fix. No of simultaneous mappers has been reduced to 8 - during tests with reduced number of mappers, there were no errors in logs --

Daniel Einspanjer [:dre] [:deinspanjer]

Comment 78

•

12 years ago

The export of ADI data to Socorro staging postgres failed again last night. are there still known outstanding problems to be resolved with the database?

Laura Thomson :laura

Reporter

Comment 79

•

12 years ago

(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78) > The export of ADI data to Socorro staging postgres failed again last night. > are there still known outstanding problems to be resolved with the database? I didn't think so, but apparently I was wrong. cshields, solarce: can somebody take a look at what happened with Postgres? dre: tmary: did the Hbase adjustments get us through the 6am work without problems this morning?

Corey Shields [:cshields]

Comment 80

•

12 years ago

(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78) > The export of ADI data to Socorro staging postgres failed again last night. > are there still known outstanding problems to be resolved with the database? I need exact times and source/dest IPs/ports for this please... (or point me in the right direction to RTFM)

Mark Reid [:mreid]

Comment 81

•

12 years ago

(In reply to Mark Reid from comment #46) > The metrics ADU export has been failing on Socorro Stage every night, and > seems related to the ongoing problems. For example: > > ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend. > > Not a very informative error message, but at least it gives the time. > > I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job > manually worked when I tried it earlier this week (24-07 11:38:57,118 to be > specific). Please let me know if I can provide any other information. These errors are still happening each night. Here is the latest log message: ERROR 02-08 20:46:00,788 - Delete adu records - An error occurred, processing will be stopped: Couldn't execute SQL: DELETE FROM raw_adu WHERE date = '2012-08-02' An I/O error occured while sending to the backend.

Mark Reid [:mreid]

Comment 82

•

12 years ago

(In reply to Corey Shields [:cshields] from comment #80) > I need exact times and source/dest IPs/ports for this please... (or point > me in the right direction to RTFM) Time in my previous comment. Source: etl1.metrics.scl3.mozilla.com Dest: socorro-stage-zlb.db.phx1.mozilla.com Port: 5432 User: breakpad_metrics

K Lars Lohn [:lars] [:klohn]

Comment 83

•

12 years ago

the HBase trouble was not resolved. I see at least 50 HBase connection retry failures in the processors between 6am and 7:30am PDT.

[:jberkus] Josh Berkus

Comment 84

•

12 years ago

Daniel, It's more likely that our changes to the network require updates to your configuration. Please send: a) what time the ADI update runs on staging b) what IP address (or URI) you're hitting c) username/password (via email). Thanks!

Daniel Einspanjer [:dre] [:deinspanjer]

Comment 85

•

12 years ago

Josh, All the information you are requesting is in comment 81 and comment 82 above with the exception of the password. We can e-mail you the password we are using, but given the fact we can run manually and it completes, I don't think it is likely to be a password issue.

[:jberkus] Josh Berkus

Comment 86

•

12 years ago

Daniel, Yes, Mreid found me on IRC. The password is verified correct. He says the job runs at 20:46 Pacific, +/- 5min.

Laura Thomson :laura

Reporter

Comment 87

•

12 years ago

(In reply to T [:tmary] Meyarivan from comment #77) > > 3. HBase connection timeouts each day at around 6am. This is due to the > > data being refreshed from prod->stage. This is on tmary to fix. > > No of simultaneous mappers has been reduced to 8 - during tests with reduced > number of mappers, there were no errors in logs > > -- Following up: reducing the mappers from 8 to 1 solves the problem, but it also means the refresh will take 8-10 hours each day to complete. I've asked tmary to spec out adding some more nodes and file a bug for hardware. It may be complicated since he says there is no space in the current set of switches (for hp-nodeXX servers) and rewiring will be required. He also said once prod is upgraded we may be able to borrow nodes from sec-phx1. Daniel, can you comment on that? I'm not sure I want to borrow from prod.

Laura Thomson :laura

Reporter

Comment 88

•

12 years ago

With regard to Postgres/ADI, I suggest we watch the network during the import and try to see what's causing the failure. Need an ops/pg volunteer - maybe both.

Derek Moore [:dmoore]

Comment 89

•

12 years ago

(In reply to Laura Thomson :laura from comment #87) > asked tmary to spec out adding some more nodes and file a bug for hardware. > It may be complicated since he says there is no space in the current set of > switches (for hp-nodeXX servers) and rewiring will be required. Let me know if this problem can be solved with additional space. DC Ops has additional capacity coming online soon in phx1, we can make accommodations for you.

K Lars Lohn [:lars] [:klohn]

Comment 90

•

12 years ago

I was out on Friday, Aug 3, but on returning, I find new timing to the HBase failures in the staging processor logs. On Aug 3, there are periodic CRITICAL log entries throughout the day. On Aug 4, I find HBase connection failures begin at 02:16 and lasting for three minutes. The problem recurs at 02:50 and then, in brief bursts, continues every fifteen to sixty minutes through the night. It is 07:40 now, the last failure that I see was at 07:19. I will continue to watch periodically through the weekend.

K Lars Lohn [:lars] [:klohn]

Comment 91

•

12 years ago

on Monday morning, processor2 continues to have connectivity issues with HBase. Processor1 is having no trouble at all. We're not out of the woods on this issue.

Laura Thomson :laura

Reporter

Comment 92

•

12 years ago

Another update: ADIs did not get processed again today, which is a PG/network issue.

Ravi Pina [:ravi]

Comment 93

•

12 years ago

(In reply to Laura Thomson :laura from comment #92) > Another update: ADIs did not get processed again today, which is a > PG/network issue. I don't know what that means.

Mark Reid [:mreid]

Comment 94

•

12 years ago

Re: Postgres & ADI - the export succeeded last night. Looks like it was failing due to a bug on our end. I'll keep an eye on it.

Laura Thomson :laura

Reporter

Comment 95

•

12 years ago

Summary of the four issues found (and some fixed): - (FIXED) PostgreSQL connections failing, fixed by changing virtual server connection timeout for postgres to 0 same as in prod (comment #59) - (FIXED) HBase refresh used up too many resources, making HBase drop connections. Fixed by reducing the number of mappers from 8 to 1 (comment #77). Now the refresh takes 8 hours, so we should add capacity - bug to be filed. - (FIXED) ADI import was timing out due to a bug on the Metrics end. - (WONTFIX) PostgreSQL connections timeout during data refresh, starting Sundays at noon PT. We can live with this. It took us a long time to converge on these because, since the problems all began around the same time, we assumed a single root cause. We'll try not to make that (understandable) mistake again. I'm going to call this good, and ask solarce to close the bug. Thanks very much to all ops, netops, metrics, DBAs, and Socorro team who worked on getting us stable.

Brandon Burton [:solarce]

Assignee

Comment 96

•

12 years ago

Marking as resolved. Thanks so much to everyone who helped dig into and resolve these issues.

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Whiteboard: [testing VIPs on internal cluster]

Ravi Pina [:ravi]

Comment 97

•

12 years ago

I'm glad we were able to get this resolved. Awesome work, everyone!

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → mozilla.org Graveyard