Closed Bug 822661 Opened 12 years ago Closed 12 years ago

Can't connect from crashanalysis.dmz.phx1 to tp-socorro01-ro-zeus any more

Categories

(Socorro :: Database, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Assigned: selenamarie)

For creating my custom crash analysis reports, I need to connect from crashanalysis.dmz.phx1 to the secondary Socorro DB at tp-socorro01-ro-zeus.phx1.mozilla.com:6432. This worked fine until yesterday, but today I get this error:

Warning: pg_pconnect(): Unable to connect to PostgreSQL server

This blocks us from doing stability analysis for our upcoming Firefox releases, so I'm filing it with blocker severity.
Moving to webops
Assignee: server-ops → server-ops-webops
Component: Server Operations → Server Operations: Web Operations
QA Contact: shyam → nmaul
Assignee: server-ops-webops → dgherman
[root@crashanalysis.dmz.phx1 ~]# nc -vz tp-socorro01-ro-zeus.phx1.mozilla.com 6432
Connection to tp-socorro01-ro-zeus.phx1.mozilla.com 6432 port [tcp/pgbouncer] succeeded!

Can you give more details please?
Severity: blocker → normal
[rkaiser@crashanalysis.dmz.phx1 crash-report-tools]$ psql -h tp-socorro01-ro-zeus.phx1.mozilla.com -p 6432 -U analyst breakpad
psql:
[rkaiser@crashanalysis.dmz.phx1 crash-report-tools]$

Note how it doesn't even ask me for a password. Maybe it's actually the PostgreSQL instance there that is unhappy.
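The two checks above disagree because they test different layers: `nc -vz` only shows that something accepted a TCP connection on port 6432, while `psql` needs a working PostgreSQL (or pgbouncer) endpoint behind it. The following is a minimal Python sketch of the TCP-level half of that check, not a tool used in this bug. The function name `tcp_port_open` and its timeout default are assumptions made for illustration.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established.

    This is only a transport-level check, comparable to `nc -vz`. A True
    result does not prove that a database behind the port will complete
    a PostgreSQL login.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `tcp_port_open("tp-socorro01-ro-zeus.phx1.mozilla.com", 6432)` from the analysis host would presumably have returned True here, even though the login attempt failed, which is why a protocol-level test such as the `psql` call is still needed.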
The VIP points to tp-socorro01-master02.phx1.mozilla.com:6432. But on master02, even though I see postgres processes running, this port is not open:

[root@tp-socorro01-master02.phx1 ~]# ps aux | grep postg
postgres 20183 0.0 0.2 8949436 202328 ? S Dec17 0:12 /usr/pgsql-9.2/bin/postmaster -p 5432 -D /pgdata/9.2/data
postgres 20185 0.0 0.0 177220 1552 ? Ss Dec17 0:00 postgres: logger process
postgres 20186 0.5 11.5 8953632 8558908 ? Ss Dec17 5:34 postgres: startup process recovering 0000001C00000EE50000004B
postgres 20191 0.0 9.5 8954192 7052992 ? Ss Dec17 0:46 postgres: checkpointer process
postgres 20192 0.0 6.1 8953536 4579428 ? Ss Dec17 0:14 postgres: writer process
postgres 20194 0.0 0.0 179620 1792 ? Ss Dec17 0:14 postgres: stats collector process
postgres 20593 0.3 0.0 8964492 5156 ? Ss Dec17 3:17 postgres: wal receiver process streaming EE5/4B9934A0
[root@tp-socorro01-master02.phx1 ~]# netstat -tunap | grep 6432
[root@tp-socorro01-master02.phx1 ~]#
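The output above can be read mechanically: the "startup process recovering" and "wal receiver process streaming" titles suggest the node is running as a standby, and the empty `netstat` result means nothing is listening on 6432. Below is a rough Python sketch of that reading, assuming the process titles look as they do in this paste (PostgreSQL 9.2) and that `netstat -tunap` prints the local address in the fourth column. Both helper names are hypothetical.

```python
def summarize_postgres_processes(ps_output):
    """Classify a host from `ps aux` text, based on postgres process titles."""
    lines = [ln for ln in ps_output.splitlines() if "postgres:" in ln]
    return {
        # A startup process in "recovering" state indicates WAL replay (standby).
        "in_recovery": any(
            "startup process" in ln and "recovering" in ln for ln in lines
        ),
        # A WAL receiver in "streaming" state indicates streaming replication.
        "wal_receiver_streaming": any(
            "wal receiver process" in ln and "streaming" in ln for ln in lines
        ),
    }


def port_listening(netstat_output, port):
    """Return True if `netstat -tunap` text shows a LISTEN socket on `port`."""
    suffix = ":" + str(port)
    for ln in netstat_output.splitlines():
        fields = ln.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address, state, ...
        if len(fields) >= 6 and fields[3].endswith(suffix) and fields[5] == "LISTEN":
            return True
    return False
```

Applied to the paste above, the first helper would report a standby that is in recovery with a streaming WAL receiver, and the second would report no listener on 6432, matching the conclusion that the port is not open on master02.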
The replica is currently broken. Matt's not in yet. I'm going to kick off a base backup to try to fix this. The underlying problem is that something is deleting the WAL before it can be replayed on the replica. I'm guessing this is a PgX script. In lieu of documentation, I'm pulling Josh Berkus in to see if we can sort out why the WAL disappears prematurely.
Component: Server Operations: Web Operations → Database
Product: mozilla.org → Socorro
QA Contact: nmaul
Depends on: 822382
Replica is now working and access to _ro is restored.
Assignee: dgherman → sdeckelmann
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED