Closed Bug 771218 Opened 8 years ago Closed 8 years ago

Ongoing problems with Socorro staging

Categories

(mozilla.org Graveyard :: Server Operations, task, P1, blocker)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: bburton)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Yesterday, the two staging processors each died, unable to get connections to PostgreSQL or HBase.  This has happened several times this week.  When we have noticed and restarted, things come back to life.

I'm assuming network or Zeus trouble - can please someone take a look?

From a timing perspective:
socorro-processor2.stage.metrics.phx1 stopped at 2012-07-04 05:46:36,199
socorro-processor1.stage.metrics.phx1 stopped at 2012-07-04 18:24:04,516
Are these times UTC?
Those times are from the logs on the machines, so I guess it's whatever the system time is.
Who from server ops is going to examine Zeus?
this problem continues.  On restarting the processors on both socorro-processor2.stage.metrics.phx1 and socorro-processor1.stage.metrics.phx1 at 2012-07-05 10:30:31 they ran for eleven hours and then suddenly lost their connections to PostgreSQL at 2012-07-05 21:10:17

This problem is holding us back from getting our next release ready.
Per IRC, I'll take a look at Zeus
Assignee: server-ops → bburton
Status: NEW → ASSIGNED
After digging through the event log on pp-zlb09 I found the following entries, which show a blip between 9:44PM and 9:57PM PST yesterday, which is after the timestamp :lars noted but in the same hour

I'm adding :jakem and :cshields for some input

:ravi do you see anything on the network side for the below time stamps?


[05/Jul/2012:21:44:43 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.65:9090': Timeout
[05/Jul/2012:21:44:44 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.65:9090	nodefail	Node 10.8.100.65 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:48 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.62:9090': Timeout
[05/Jul/2012:21:44:49 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.62:9090	nodefail	Node 10.8.100.62 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:50 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.66:9090': Timeout
[05/Jul/2012:21:44:51 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.66:9090	nodefail	Node 10.8.100.66 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:55 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.66:9090'.
[05/Jul/2012:21:44:56 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.66:9090	nodeworking	Node 10.8.100.66 is working again
[05/Jul/2012:21:44:56 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.63:9090': Timeout
[05/Jul/2012:21:44:56 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.63:9090	nodefail	Node 10.8.100.63 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:58 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.67:9090': Timeout
[05/Jul/2012:21:44:59 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.67:9090	nodefail	Node 10.8.100.67 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:01 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.63:9090'.
[05/Jul/2012:21:45:02 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.63:9090	nodeworking	Node 10.8.100.63 is working again
[05/Jul/2012:21:45:02 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.69:9090': Timeout
[05/Jul/2012:21:45:03 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.69:9090	nodefail	Node 10.8.100.69 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:24 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.67:9090'.
[05/Jul/2012:21:45:25 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.67:9090	nodeworking	Node 10.8.100.67 is working again
[05/Jul/2012:21:45:30 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.62:9090'.
[05/Jul/2012:21:45:31 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.62:9090	nodeworking	Node 10.8.100.62 is working again
[05/Jul/2012:21:45:37 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.63:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:38 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.63:9090	nodefail	Node 10.8.100.63 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:44 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.66:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:45 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.66:9090	nodefail	Node 10.8.100.66 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:47 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.67:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:48 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.67:9090	nodefail	Node 10.8.100.67 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:50 -0700]	WARN	monitors/socorro-thrift-check	monitorfail	Monitor has detected a failure in node '10.8.100.62:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:51 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	nodes/10.8.100.62:9090	nodefail	Node 10.8.100.62 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:51 -0700]	SERIOUS	pools/socorro-thrift-stage:9090	pooldied	Pool has no back-end nodes responding
[05/Jul/2012:21:56:51 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.65:9090'.
[05/Jul/2012:21:56:52 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.65:9090	nodeworking	Node 10.8.100.65 is working again
[05/Jul/2012:21:56:52 -0700]	INFO	pools/socorro-thrift-stage:9090	poolok	Pool now has working nodes
[05/Jul/2012:21:56:53 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.69:9090'.
[05/Jul/2012:21:56:54 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.69:9090	nodeworking	Node 10.8.100.69 is working again
[05/Jul/2012:21:57:08 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.63:9090'.
[05/Jul/2012:21:57:09 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.63:9090	nodeworking	Node 10.8.100.63 is working again
[05/Jul/2012:21:57:15 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.66:9090'.
[05/Jul/2012:21:57:16 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.66:9090	nodeworking	Node 10.8.100.66 is working again
[05/Jul/2012:21:57:19 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.67:9090'.
[05/Jul/2012:21:57:20 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.67:9090	nodeworking	Node 10.8.100.67 is working again
[05/Jul/2012:21:57:21 -0700]	INFO	monitors/socorro-thrift-check	monitorok	Monitor is working for node '10.8.100.62:9090'.
[05/Jul/2012:21:57:22 -0700]	INFO	pools/socorro-thrift-stage:9090	nodes/10.8.100.62:9090	nodeworking	Node 10.8.100.62 is working again
please note: the problems in the logs in Comment #6 involve HBase.  The problem that we're seeing is with Postgres.

the postgres problem happened again at today at 2012-07-06 10:58:44.  this time it affected socorro-processor2.stage.metrics.phx1, but not socorro-processor1.stage.metrics.phx1.
What is the actual check testing?
Logs for the switches these hosts are connected to are clean for the times mentioned.  I bumped logging level to "info" from "notice" to give us a better chance to catch something, but I don't see any indication of network issues.  

I also checked the core.  It is clean.  STP has been quiet for a number of days now. Still looking, but I'm not seeing anything on the network side.
Busted again right now, FWIW
(In reply to Ravi Pina [:ravi] from comment #8)
> What is the actual check testing?

The check is a ping check
Looking at socorro-processor2.stage.metrics.phx1

/var/log/messages only has puppetd run info
/var/log/dmesg is auditd messages
I don't see anything with yum since 6/27 and those were just audited changes

I'm going to hand this to the SRE team to look at and see if there is anything in the hardware logs or maybe a network driver issue?
Assignee: bburton → server-ops
We got processors+stageDB working now.  However, we don't know what caused the original problem; by the time we cleaned it up, the original issue had ceased.
Priority: -- → P1
Summary: Ongoing problems with Socorro staging - network related? → Ongoing problems with Socorro staging
Monitored all night last night.  Processors reported connection failure at 4:10AM Pacific.  In the pgbouncer logs, we have:

2012-07-11 04:08:31.272 2789 LOG Stats: 0 req/s, in 108 b/s, out 154 b/s,query 4383 us
2012-07-11 04:08:33.711 2789 LOG C-0x1bffa40: breakpad/processor@10.8.70.208:48083 closing because: client unexpected eof (age=57431)
2012-07-11 04:08:36.516 2789 LOG C-0x1bffe78: breakpad/processor@10.8.70.208:48107 closing because: client unexpected eof (age=57433)
2012-07-11 04:08:41.510 2789 LOG C-0x1bff770: breakpad/processor@10.8.70.208:48063 closing because: client unexpected eof (age=57438)
2012-07-11 04:08:42.519 2789 LOG C-0x1bff8d8: breakpad/processor@10.8.70.208:48075 closing because: client unexpected eof (age=57439)
2012-07-11 04:08:42.519 2789 LOG C-0x1bff338: breakpad/processor@10.8.70.208:48028 closing because: client unexpected eof (age=57439)
2012-07-11 04:08:48.380 2789 LOG C-0x1bff608: breakpad/processor@10.8.70.208:48043 closing because: client unexpected eof (age=57445)
2012-07-11 04:08:54.385 2789 LOG C-0x1bffd10: breakpad/processor@10.8.70.208:48102 closing because: client unexpected eof (age=57451)
2012-07-11 04:09:31.273 2789 LOG Stats: 0 req/s, in 111 b/s, out 156 b/s,query 4147 us
2012-07-11 04:10:31.274 2789 LOG Stats: 0 req/s, in 116 b/s, out 165 b/s,query 4073 us

That seems to confirm that the problem is the loss of the network connection between the processor and the database server.  

Interestingly, though, the monitor doesn't lose its connection until 7AM:

2012-07-11 06:58:42.363 PDT,"monitor","breakpad",18123,"10.8.70.208:15950",4ffd7d30.46cb,1,"idle",2012-07-11 06:18:40 PDT,6/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,""

So, I think we need a Zeus expert to start looking at Zeus for periodically recurring issues.  Thanks!
Blocks: 761171
Current assessment is that an issue with our leap second "fix" we pushed with puppet was the source of the issue, as it was causing the second time of the clock to be reset to :20 seconds somewhat randomly, this has been fixed and staging restarted

We'll monitor for 24 hours and RF or investigate further, as needed.
Assignee: server-ops → bburton
processor2 is having PG connection difficulties again, via mpressman:
2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
2012-07-11 12:17:18,486 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-11 12:17:18,486 ERROR - Thread-9 - error with no message from the libpq
2012-07-11 12:17:18,487 ERROR - Thread-9 - trace back follows:
2012-07-11 12:17:18,487 ERROR - Thread-9 - Traceback (most recent call last):
2012-07-11 12:17:18,488 ERROR - Thread-9 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob
    threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId))
2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
Something BAD happened in Thread-9:
Traceback (most recent call last):
  File "/data/socorro/application/socorro/lib/threadlib.py", line 128, in run
    aFunction(arguments)
  File "/data/socorro/application/socorro/processor/processor.py", line 478, in processJobWithRetry
    result = self.processJob(jobTuple)
  File "/data/socorro/application/socorro/processor/processor.py", line 651, in processJob
    threadLocalDatabaseConnection.rollback()
InterfaceError: connection already closed
connection already closed
There was a mention during the meeting about the file leapsecond_2012_06_30 in /tmp 

Is this supposed to be removed? It still exists on processor2
so just to clear this up for the bug, chields mentioned that the existence of the file is of no concern.
There was nothing logged in Zeus during the noon hour for socorro, that I can see

I'm going to ask an SRE to check the server
processor2 is down again, started paging at
Date/Time: 07-12-2012 04:58:36
These might be a red-herring but I found these messages in socorro1.stage.db.phx1:/var/log/pgbouncer/pgbouncer-processor.log:

2012-07-12 08:23:32.150 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600)
2012-07-12 08:23:58.750 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 new connection to server
2012-07-12 08:24:04.843 2789 LOG S-0x1c1ca18: breakpad/processor@unix:5432 new connection to server
2012-07-12 08:30:53.063 2789 LOG S-0x1c1bc08: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600)

they might just be suggestive of normal activity, I'm unsure.
Ashish,

That's normal activity.
Interesting.  I only see one error in the pgBouncer log for the period 04:00:00 to 06:00:00.  One client unexpectedly disconnected.  Otherwise, all activity durning that period looks normal, and in fact there's evidence of continuing processing during that period.

Are you sure *all* processors disconnected, rather than just one of them?
only processor2 had trouble with connections on all threads.  processor1 continued working without trouble.
ah, that would be this then:

2012-07-12 04:53:00.124 2789 LOG C-0x1bfec30: breakpad/processor@10.8.70.208:1746 closing because: client unexpected eof (age=41861)

as per above "client unexpected eof" means that the client vanished off the network without disconnecting.  it means either network breakage or that the client crashed.
at the same time, the client reported: "DatabaseError: error with no message from the libpq"

I believe that means, from the client perspective, the server vanished off the network."  
In the past in similar situations,  we've gotten error messages that say, "server closed the connection unexpectedly"
More evidence that this is a network issue.  The monitor -- a separate process, one which connects to PostgreSQL directly rather than using pgbouncer -- is also randomly losing connections:

Monitor Log:

2012-07-12 03:43:06,138 ERROR - jobCleanupThread - Caught Error: <class 'psycopg2.InterfaceError'>
2012-07-12 03:43:06,138 ERROR - jobCleanupThread - connection already closed

PostgreSQL Log:

2012-07-12 03:43:06.048 PDT,"monitor","breakpad",29015,"10.8.70.208:8589",4ffea7df.7157,1,"idle",2012-07-12 03:33:03 PDT,10/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,""
I've been double checking things, as per the above comments, connects to Postgres are dying from both socorroadm.stage and socorro-processor[1-2].stage

socorroadm.stage is an ESX VM, socorro-processor[1-2].stage are HP blades

Both the monitor and processor apps are using 10.8.70.126 as their databaseHost, which is a VIP on pp-zlb09

A review of the Zeus logs shows db errors only being recorded on 07/06 for the month of July

[root@pp-zlb09.phx log]# grep socorro /usr/local/zeus/zxtm/log/errors | grep Jul | grep db
[06/Jul/2012:15:25:02 -0700]	SERIOUS	pools/db-socorro01-rw-processor-pool	nodes/10.8.70.100:6433	nodefail	Node 10.8.70.100 has failed - A monitor has detected a failure
[06/Jul/2012:15:25:02 -0700]	SERIOUS	pools/db-socorro01-rw-processor-pool	pooldied	Pool has no back-end nodes responding
[06/Jul/2012:15:25:02 -0700]	SERIOUS	pools/db-socorro01-rw-web-pool	pooldied	Pool has no back-end nodes responding
[06/Jul/2012:15:25:02 -0700]	SERIOUS	pools/db-socorro01-rw-web-pool	nodes/10.8.70.100:6432	nodefail	Node 10.8.70.100 has failed - A monitor has detected a failure
[06/Jul/2012:15:25:02 -0700]	SERIOUS	pools/db-socorro01-rw-web-pool	pooldied	Pool has no back-end nodes responding
[06/Jul/2012:15:33:23 -0700]	INFO	pools/db-socorro01-rw-processor-pool	nodes/10.8.70.100:6433	nodeworking	Node 10.8.70.100 is working again
[06/Jul/2012:15:33:23 -0700]	INFO	pools/db-socorro01-rw-processor-pool	poolok	Pool now has working nodes
[06/Jul/2012:15:33:23 -0700]	INFO	pools/db-socorro01-rw-web-pool	poolok	Pool now has working nodes
[06/Jul/2012:15:33:23 -0700]	INFO	pools/db-socorro01-rw-web-pool	nodes/10.8.70.100:6432	nodeworking	Node 10.8.70.100 is working again
[06/Jul/2012:15:33:23 -0700]	INFO	pools/db-socorro01-rw-web-pool	poolok	Pool now has working nodes

All the Postgres traffic to 10.8.70.126 gets sent to 10.8.70.125

I'd been focusing on trying to find a fault on the processor blades, but given c28 and IRC discussion, I am now reviewing the Postgres host
Depends on: 773390
Found possible storage / driver issue in dmesg log, opened bug 773390 to have Ops look into
Per bug 773390 and IRC, updates have been applied to the db server, the processors and monitor have been restarted and :lars confirmed stage is running.

We'll wait 24 hours to see if the issue persists.
no joy.

staging processor #2 has again lost its connection to postgres.

processor log:
2012-07-12 14:54:05,356 ERROR - Thread-10 - DatabaseError: error with no message from the libpq

server log:
2012-07-12 14:53:40.052 6137 LOG C-0x15fc4a0: breakpad/processor@10.8.70.208:28101 closing because: client unexpected eof (age=4500)
just a notification that this problem is not confined to connecting to Postgres.  This morning, both staging processors are having a bad time trying to talk to Hbase at 10.8.100.209

It started at 6:05

2012-07-13 06:05:58,810 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-13 06:05:58,811 CRITICAL - Thread-5 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-07-13 06:05:58,811 CRITICAL - Thread-5 - the connection is not viable.  retries fail:
interestingly, the monitor appears to have no trouble at all talking to HBase.
:mpressman made a comment on irc over the weekend that things had been stable?

Is that correct?
Both staging processors spent all day Sunday trying to connect to HBase without success.  Today, the processors are wiling away their time trying to talk to a silent Postgres.
I'm not sure what to look at next, I'm not seeing any consistent errors in the Zeus logs and this seems not to be isolated to specific hosts, but previously netops confirmed seeing no network issues from the processors hosts.

I'll try to talk with Corey, Jason, and some other webops folks tomorrow morning to discuss what's the next best step
I think it'd help to come up with a simpler, reproducible test case. It's hard to tell if anything is fixed since we can't make the problem happen on demand, and it feels like we're still not 100% where the problem is (please correct me if this is wrong).

Would it make sense to try to have some simple persistent connection across the hosts in question (e.g. netcat?) If *that* can't stay connected then we know it's not e.g. the software involved.
Whiteboard: [pending webops discussion]
This bug is long and confusing.  Can someone provide the data flow(s) with IP and port pairs here?
OK.  Dropped connections and timeouts have been seen with the following flows:
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090  (Thrift/HBase)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432?  I don't recall whether it's direct (5432) or through pgbouncer(6432). solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL)

Ravi, is that what you need?
Laura,

> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432?  I don't recall
> whether it's direct (5432) or through pgbouncer(6432).

It's port 6433 actually.

We've also seen dropped connections between the monitor and dev.db (port 5432).  Not sure which server the monitor is.
BTW, since we've never seen issues like this one in production, I'm wondering if this might be related to the virtualization system staging monitor & processors run on?
(In reply to [:jberkus] Josh Berkus from comment #43)
> BTW, since we've never seen issues like this one in production, I'm
> wondering if this might be related to the virtualization system staging
> monitor & processors run on?

socorroadm.stage is a VM
socorrop-processor[1-2].stage are HP blades.
(In reply to Laura Thomson :laura from comment #40)
> OK.  Dropped connections and timeouts have been seen with the following
> flows:
> * socorroadm.stage.private.phx1.mozilla.com (monitor) ->
> socorro-thrift-stage.zlb.phx1.mozilla.com:9090  (Thrift/HBase)
> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432?  I don't recall
> whether it's direct (5432) or through pgbouncer(6432).
> solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL)
> 
> Ravi, is that what you need?

Monitor connections on VM:

* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090  (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)

Processor connections on HP blades:

* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)
The metrics ADU export has been failing on Socorro Stage every night, and seems related to the ongoing problems.  For example:

ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend.

Not a very informative error message, but at least it gives the time.

I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job manually worked when I tried it earlier this week (24-07 11:38:57,118 to be specific).  Please let me know if I can provide any other information.
(In reply to Ravi Pina [:ravi] from comment #39)
> This bug is long and confusing.  Can someone provide the data flow(s) with
> IP and port pairs here?

Ravi, the short is that we're seeing connection failures to Postgres and Hbase VIPs on Zeus. The VIPs are on separate Zeus servers (pp-zlb09 and pp-zlb10). When we see connection failures in the logs of the applications connecting to the VIPs, I am not seeing entries in the Zeus error logs at or around the time of the failures.

The data flows are as follows

Monitor connections on an ESX VM:

* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090  (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)

Processor connections on two HP blades:

* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)

I've worked with SREs to review the hardware of the Postgres master and the processor blades at points throughout diagnosing this bug and that has checked out so far. Given that and the failures from multiple servers to multiple VIPs on different Zeus nodes, it's hard not to look at the network as the commonality, but the Zeus cluster is also a common point.

Is there additional info you'd need to help assess if there is a possible networking related issue?

Would you recommend we try moving these VIPs to the internal (db) Zeus cluster as the next step?
(In reply to Brandon Burton [:solarce] from comment #47)
> Would you recommend we try moving these VIPs to the internal (db) Zeus
> cluster as the next step?

This needs to be done anyway, and would help in other situations as well.  Are they using the same VIPs as anything prod?  If not can we disrupt staging at any time to make this change?
The staging VIPs are different from the prod ones.
Oh, and stage is already pretty disrupted, so just let us know when you're taking it down.  No confirmation needed.
(In reply to [:jberkus] Josh Berkus from comment #50)
> Oh, and stage is already pretty disrupted, so just let us know when you're
> taking it down.  No confirmation needed.

I'll be working on this over the next 1-2 hours and you may see disruption as various points, I'll confirm in #breakpad when I'm ready to test out things
Whiteboard: [pending webops discussion] → [moving Zeus VIPs to internal cluster]
Whiteboard: [moving Zeus VIPs to internal cluster] → [testing VIPs on internal cluster]
Hbase VIP moved at Noon, confirmed working by :lars

PG VIPs (5432, 6432, 6433) moved at 4:34, awaiting confirmation from socorro team, :solarce confirmed telnet works.
We need to watch the PG connections for a 24-hour period to determine if they are better.

By Monday, we should know.
I regret to report that our problems have not been resolved.

both staging processors and the monitor lost their ability to talk to both HBase and Postgres at various times over the weekend.

socorro-processor1.stage.metrics.phx1:

2012-07-29 06:08:33,142 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-29 06:08:33,143 CRITICAL - Thread-2 - the connection is not viable.  retries fail: 

socorro-processor2.stage.metrics.phx1:
2012-07-30 06:04:57,996 CRITICAL - Thread-9 - something's gone horribly wrong with the HBase connection
2012-07-30 06:04:57,997 CRITICAL - Thread-9 - the connection is not viable.  retries fail:  


on the 29th, while processor1 was struggling with no HBase connection, processor2 was unable to talk to Postgres on some (but not all) threads.  Because of being unable to talk to Postgres, we don't know what what was happening with HBase connectivity, as the processor won't try to talk to HBase if it cannot talk to Postgres.

the monitor, on the 29th, suffered the same HBase connectivity problem as processor1 for about 40 minutes.  Then it recovered.

Six hours later, the monitor lost the Postgres connection.  Recovery happened about 9 hours later.

The monitor lost the HBase connection again on the 30th at the same time that processor2 did.  Processor1 lost the HBase connection about 5 minutes later.
Escalating this to blocker.  If we can't get staging sorted out, we will be unable to ship support for rapid betas.  The trouble we have had with crontabber supports the fact that we can't ship high-risk changes without a stable staging environment.
Severity: normal → blocker
Attached file socorro stage issues
Zeus saw failures for the Postgres VIP, but not the HBase VIP

It was a couple failures on 7/27 and many on 7/29. None on 7/30 so far.
Hardware involved

Postgres

socorro1.stage.db.phx1 - https://inventory.mozilla.org/en-US/systems/show/1550/

Processors

socorro-processor1.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3921/
socorro-processor2.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3923/

HBase
+ means disabled node in Zeus

10.8.100.62:9090 - hp-node62.phx1 - https://inventory.mozilla.org/en-US/systems/show/2568/
10.8.100.63:9090 - hp-node63.phx1 - https://inventory.mozilla.org/en-US/systems/show/2569/
+10.8.100.64:9090 - hp-node64.phx1 - https://inventory.mozilla.org/en-US/systems/show/2570/
10.8.100.65:9090 - hp-node65.phx1 - https://inventory.mozilla.org/en-US/systems/show/2571/
10.8.100.66:9090 - hp-node66.phx1 - https://inventory.mozilla.org/en-US/systems/show/2572/
10.8.100.67:9090 - hp-node67.phx1 - https://inventory.mozilla.org/en-US/systems/show/2572/
+10.8.100.68:9090 - hp-node68.phx1 - https://inventory.mozilla.org/en-US/systems/show/2574/
10.8.100.69:9090 - hp-node69.phx1 - https://inventory.mozilla.org/en-US/systems/show/2575/

Zeus

pp-zlb01.phx1 - https://inventory.mozilla.org/en-US/systems/show/1545/
pp-zlb02.phx1 - https://inventory.mozilla.org/en-US/systems/show/1561/
The data flows are now

Monitor connections on an ESX VM:

* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090  (Thrift/HBase) (on pp-zlb01.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net)

Processor connections on two HP blades:

* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb01.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net)
Depends on: 778797
Depends on: 778802
Depends on: 778805
Depends on: 778831
Tracking Changes made so far

1. jberkus realized the socorro-processor virtual server was using the 5432 backend and not the 6433 backend. I fixed this

[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	confmod	Configuration file modified
[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	vsreconf	Virtual server reconfigured on port 6433

2. cshields found that the virtual server connection timeout for postgres is set to 0 in prod, he made the same changes on stage in Zeus

[30/Jul/2012:15:02:03 -0700]	INFO	vservers/db-socorro-stage-web	confmod	Configuration file modified
[30/Jul/2012:15:02:03 -0700]	INFO	vservers/db-socorro-stage-web	vsreconf	Virtual server reconfigured on port 6432
[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	confmod	Configuration file modified
[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	vsreconf	Virtual server reconfigured on port 6433
[30/Jul/2012:15:02:05 -0700]	INFO	vservers/db-socorro-stage-pg	confmod	Configuration file modified
[30/Jul/2012:15:02:05 -0700]	INFO	vservers/db-socorro-stage-pg	vsreconf	Virtual server reconfigured on port 5432

Tracking additional issues seen so far.

1. HBase nagios alerts

23:04:29     nagios-phx1 | [168] hp-node68.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1
23:06:49     nagios-phx1 | [172] hp-node64.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1

Neither Zeus nor the socorro processors saw an error. Aj has acked the alerts and we're waiting on feedback from :tmary about the state of the nodes

2. socorro-processor2 connections to postgres interrupted

First from 13:43 - 13:47

2012-07-30 13:43:46,953 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:43:46,955 ERROR - Thread-1 - DatabaseError: error with no message from the libpq
2012-07-30 13:44:11,010 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:44:11,012 ERROR - Thread-2 - DatabaseError: error with no message from the libpq
2012-07-30 13:44:47,088 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:44:47,089 ERROR - Thread-8 - DatabaseError: error with no message from the libpq
2012-07-30 13:45:29,180 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:45:29,182 ERROR - Thread-11 - DatabaseError: error with no message from the libpq
2012-07-30 13:47:35,453 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:47:35,455 ERROR - Thread-7 - DatabaseError: error with no message from the libpq

and then at 3:07, five minutes after the time out config change, so it could be related to that

2012-07-30 15:07:23,217 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,219 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,222 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,224 ERROR - Thread-1 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,227 ERROR - Thread-10 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,228 ERROR - Thread-10 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,231 ERROR - Thread-3 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,233 ERROR - Thread-3 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,236 ERROR - Thread-4 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,238 ERROR - Thread-4 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,241 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,242 ERROR - Thread-2 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,245 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,247 ERROR - Thread-8 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,249 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,250 ERROR - Thread-7 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,254 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,255 ERROR - Thread-11 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,258 ERROR - Thread-5 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,259 ERROR - Thread-5 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,262 ERROR - Thread-6 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,264 ERROR - Thread-6 - DatabaseError: error with no message from the libpq
Summary of today was Nagios checks added, a couple more Zeus settings (specifically postgres connection timeouts) were set on stage to make match prod, and a push to stage and prod was done (unrelated to issue, but a change none-the-less)

Per IRC with cshields, lars, and jberkus, we'll monitor through tomorrow to see how it works with the Zeus changes
There have been neither Postgres nor HBase/Thrift connection problems overnight since Monday.  This is very encouraging.
Severity: blocker → normal
earlier change of blocker to normal was inadvertent.
Severity: normal → blocker
sorry to report that soon after my comment #61, staging processors lost the connection to HBase: 

2012-07-31 06:12:22,627 CRITICAL - Thread-11 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-07-31 06:12:22,628 CRITICAL - Thread-11 - the connection is not viable.  retries fail: 

this lasted through about 7:24 where the processors recovered.
Severity: blocker → normal
This was seen on processor1, processor2, and socorroadm.stage, though more times on the processors

processor1

2012-07-31 07:12:51,461 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:14,044 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:14,313 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:37,876 CRITICAL - Thread-10 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:51,143 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:51,369 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:24:13,510 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection
2012-07-31 07:24:50,623 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection

processor2

2012-07-31 07:12:48,702 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-31 07:12:53,481 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:14:26,559 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:17:04,527 CRITICAL - Thread-8 - something's gone horribly wrong with the HBase connection
2012-07-31 07:19:05,589 CRITICAL - Thread-6 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:28,071 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection
2012-07-31 07:23:05,122 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection

supportadm.stage

2012-07-31 07:14:41,999 DEBUG - MainThread - somethings gone horribly wrong with HBase
2012-07-31 07:14:53,071 DEBUG - MainThread - somethings gone horribly wrong with HBase

Nagios did not alert anything for socorro stage today.

Zeus did not see any errors, the last thing is sees is the config changes from yesterday.

[30/Jul/2012:15:02:03 -0700]	INFO	vservers/db-socorro-stage-web	confmod	Configuration file modified
[30/Jul/2012:15:02:03 -0700]	INFO	vservers/db-socorro-stage-web	vsreconf	Virtual server reconfigured on port 6432
[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	confmod	Configuration file modified
[30/Jul/2012:15:02:04 -0700]	INFO	vservers/db-socorro-stage-processor	vsreconf	Virtual server reconfigured on port 6433
[30/Jul/2012:15:02:05 -0700]	INFO	vservers/db-socorro-stage-pg	confmod	Configuration file modified
[30/Jul/2012:15:02:05 -0700]	INFO	vservers/db-socorro-stage-pg	vsreconf	Virtual server reconfigured on port 5432

Also, we have not seen any Postgres errors since the changes yesterday.

I'm going to get tcpdumps running on the active HBase nodes, the processors, and supportadm.stage
Per :tmary on IRC, I enabled hp-node64 and hp-node68, so far they look fine to Zeus

[31/Jul/2012:10:18:28 -0700]	INFO	pools/socorro-thrift-stage:9090	confmod	Configuration file modified
Updates on what I've done today.

I started tcpdumps at 11:30am, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=778831#c2 (each running under screen, as root)

I've been monitoring the logs with the following commands

pp-zlb01 - grep 2012 /usr/local/zeus/zxtm/log/errors | grep socorro
socorroadm.stage - grep DatabaseError /var/log/socorro/socorro-monitor.log and grep HBase /var/log/socorro/socorro-monitor.log
socorro-processor[1-2].stage - grep HBase /var/log/socorro/socorro-processor and grep DatabaseError /var/log/socorro/socorro-processor.log
… I've been monitoring the logs and have not seen any new issues since the errors lars and I noted between 6-7:30am.

If we see further errors, working with NetOps to review the tcpdumps is the next best step. 

I'll be out on PTO tomorrow, until 8/8, so if there is anything I was doing that anyone is unsure how to also do, please let me know tonight, I'll check bugmail before I take off for the airport tomorrow.
on Wednesday, August 1 - the processors lost their connection to HBase at 6:10am pdt, the problem remains on-going at 6:56am pdt.

2012-08-01 06:10:12,203 CRITICAL - Thread-11 - the connection is not viable.  retries fail: 

Interestingly, the monitor has not lost its connection to HBase and continues blithely along assigning jobs to processors that cannot do their work.
Severity: normal → blocker
there is a time consistency in the HBase outages.  

Since July 3rd, the outages all begin within 10 minutes of 6am.

prior to that:

July 3 - 7:30am  and  10:00am
July 2 - 12:50am
July 1 - 3:00am
This debugging is using up all the disk space on socorroadm.stage.private.phx1 with tcpdump logs:
19G     /root/778831

That job can't keep running and storing those logs locally. I'll compress the old ones to buy some time but it'll still need to be killed or redone to copy things off-machine in the next 24h and solarce is on PTO. If no one else has any suggestions of where to copy that data I'm going ot need to kill the tcpdump job.

(3.1Gb of logs in /var/log/socorro, much of which is in /var/log/socorro/syslog, isn't helping but that's small compared to the tcpdumps).
right on schedule, the HBase connection to staging processors and monitor failed at 6:02am PDT.  So we don't know why, but we can predict the time of failure.  The error message is the same as all previous times:

2012-08-02 06:02:24,807 CRITICAL - MainThread - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-08-02 06:02:24,807 CRITICAL - MainThread - the connection is not viable.  retries fail:
(In reply to K Lars Lohn [:lars] [:klohn] from comment #71)
> right on schedule, the HBase connection to staging processors and monitor
> failed at 6:02am PDT.  So we don't know why, but we can predict the time of
> failure.  The error message is the same as all previous times:

Since we can predict the failures, could we not set up a cron job to engage and disengage the tcpdump around the time at which the behaviour is expected to occur ? (re: comment #70)
Update: we now know there are three separate causes of failure, and we have a plan.

1.  PG connection timeouts.  When these were moved to a new VIP, it was noticed that the timeout value was set too low (300ms) and different from prod.  This has been fixed.
2.  PG connection timeouts on a Sunday around noon.  This is due to the data being refreshed from prod -> stage.  Being on a Sunday, this is the least critical of the three, but I would like to address it if possible.  That is in the hands of mpressman.
3.  HBase connection timeouts each day at around 6am.  This is due to the data being refreshed from prod->stage.  This is on tmary to fix.
:laura - if these three failure modes are now known, do you still need the tcpdump data to diagnose or can I kill the process before it fills up disk again?
pir: I think you can kill it for now at least.  Thanks!
laura: killed, dumps compressed. There's enough slack space after compression that the dumps don't need to be deleted immediately, but if we're sure they're not required I'll delete them too.
> 3.  HBase connection timeouts each day at around 6am.  This is due to the
> data being refreshed from prod->stage.  This is on tmary to fix.

No of simultaneous mappers has been reduced to 8 - during tests with reduced number of mappers, there were no errors in logs

--
The export of ADI data to Socorro staging postgres failed again last night.  are there still known outstanding problems to be resolved with the database?
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78)
> The export of ADI data to Socorro staging postgres failed again last night. 
> are there still known outstanding problems to be resolved with the database?

I didn't think so, but apparently I was wrong.

cshields, solarce: can somebody take a look at what happened with Postgres?

dre: tmary: did the Hbase adjustments get us through the 6am work without problems this morning?
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78)
> The export of ADI data to Socorro staging postgres failed again last night. 
> are there still known outstanding problems to be resolved with the database?

I need exact times and source/dest IPs/ports for this please...  (or point me in the right direction to RTFM)
(In reply to Mark Reid from comment #46)
> The metrics ADU export has been failing on Socorro Stage every night, and
> seems related to the ongoing problems.  For example:
> 
> ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend.
> 
> Not a very informative error message, but at least it gives the time.
> 
> I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job
> manually worked when I tried it earlier this week (24-07 11:38:57,118 to be
> specific).  Please let me know if I can provide any other information.

These errors are still happening each night.  Here is the latest log message:

ERROR 02-08 20:46:00,788 - Delete adu records - An error occurred, processing will be stopped:
Couldn't execute SQL: DELETE FROM raw_adu WHERE date = '2012-08-02'

An I/O error occured while sending to the backend.
(In reply to Corey Shields [:cshields] from comment #80)
> I need exact times and source/dest IPs/ports for this please...  (or point
> me in the right direction to RTFM)

Time in my previous comment.

Source: etl1.metrics.scl3.mozilla.com
Dest: socorro-stage-zlb.db.phx1.mozilla.com
Port: 5432
User: breakpad_metrics
the HBase trouble was not resolved.  I see at least 50 HBase connection retry failures in the processors between 6am and 7:30am PDT.
Daniel,

It's more likely that our changes to the network require updates to your configuration.  Please send:

a) what time the ADI update runs on staging
b) what IP address (or URI) you're hitting
c) username/password (via email).

Thanks!
Josh,

All the information you are requesting is in comment 81 and comment 82 above with the exception of the password.  We can e-mail you the password we are using, but given the fact we can run manually and it completes, I don't think it is likely to be a password issue.
Daniel,

Yes, Mreid found me on IRC.  The password is verified correct.  

He says the job runs at 20:46 Pacific, +/- 5min.
(In reply to T [:tmary] Meyarivan from comment #77)
> > 3.  HBase connection timeouts each day at around 6am.  This is due to the
> > data being refreshed from prod->stage.  This is on tmary to fix.
> 
> No of simultaneous mappers has been reduced to 8 - during tests with reduced
> number of mappers, there were no errors in logs
> 
> --

Following up: reducing the mappers from 8 to 1 solves the problem, but it also means the refresh will take 8-10 hours each day to complete.  I've asked tmary to spec out adding some more nodes and file a bug for hardware.  It may be complicated since he says there is no space in the current set of switches (for hp-nodeXX servers) and rewiring will be required.  He also said once prod is upgraded we may be able to borrow nodes from sec-phx1.  Daniel, can you comment on that?  I'm not sure I want to borrow from prod.
With regard to Postgres/ADI, I suggest we watch the network during the import and try to see what's causing the failure.   Need an ops/pg volunteer - maybe both.
(In reply to Laura Thomson :laura from comment #87)
> asked tmary to spec out adding some more nodes and file a bug for hardware. 
> It may be complicated since he says there is no space in the current set of
> switches (for hp-nodeXX servers) and rewiring will be required. 

Let me know if this problem can be solved with additional space. DC Ops has additional capacity coming online soon in phx1, we can make accommodations for you.
I was out on Friday, Aug 3, but on returning, I find new timing to the HBase failures in the staging processor logs.  

On Aug 3, there are periodic CRITICAL log entries throughout the day.  On Aug 4, I find HBase connection failures begin at 02:16 and lasting for three minutes.  The problem recurs at 02:50 and then, in brief bursts, continues every fifteen to sixty minutes through the night.  It is 07:40 now, the last failure that I see was at 07:19.

I will continue to watch periodically through the weekend.
on Monday morning, processor2 continues to have connectivity issues with HBase.  Processor1 is having no trouble at all.  We're not out of the woods on this issue.
Another update: ADIs did not get processed again today, which is a PG/network issue.
(In reply to Laura Thomson :laura from comment #92)
> Another update: ADIs did not get processed again today, which is a
> PG/network issue.

I don't know what that means.
Re: Postgres & ADI - the export succeeded last night.  Looks like it was failing due to a bug on our end.  I'll keep an eye on it.
Summary of the four issues found (and some fixed):
- (FIXED) PostgreSQL connections failing, fixed by changing virtual server connection timeout for postgres to 0 same as in prod (comment #59)
- (FIXED) HBase refresh used up too many resources, making HBase drop connections.  Fixed by reducing the number of mappers from 8 to 1 (comment #77).  Now the refresh takes 8 hours, so we should add capacity - bug to be filed.
- (FIXED) ADI import was timing out due to a bug on the Metrics end.
- (WONTFIX) PostgreSQL connections timeout during data refresh, starting Sundays at noon PT.  We can live with this.

It took us a long time to converge on these because, since the problems all began around the same time, we assumed a single root cause.  We'll try not to make that (understandable) mistake again.

I'm going to call this good, and ask solarce to close the bug.  Thanks very much to all ops, netops, metrics, DBAs, and Socorro team who worked on getting us stable.
Marking as resolved. Thanks so much to everyone who helped dig into and resolve these issues.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: [testing VIPs on internal cluster]
I'm glad we were able to get this resolved.  Awesome work, everyone!
Product: mozilla.org → mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.