Backups failed due to bug in 9.2.1 through 9.2.4 - coordinate debugging with a debugging patch

RESOLVED FIXED

Status

Data & BI Services Team
DB: MySQL
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: selenamarie, Assigned: selenamarie)

Tracking

Details

We had a security and bugfix release for PostgreSQL today. These include fixing issues with routine maintenance commands like CREATE INDEX CONCURRENTLY and some replication bugs.

At the earliest convenient time, we should upgrade to 9.2.2.

This process is just: 

* Install new packaged version of 9.2
* Stop socorro
* Stop/start databases
* Restart socorro
Assignee: server-ops-database → mpressman
9.2.3 will be out by the time it's possible to upgrade.
Summary: Schedule 9.2.2 upgrade for Socorro databases → Schedule 9.2.3 upgrade for Socorro databases
We hit a bug in 9.2.1 today: 

# Make a backup!
Tue Feb 19 00:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013021900/data
## Making base backup
transaction log start point: 1097/BC7A0460
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 1098/14940E60
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1


This bug was fixed here:
http://www.postgresql.org/message-id/CABUevEyv+YCQ26eXFKgBetDekeSp_YCJs80m1N=bFy3a47-GOg@mail.gmail.com


And the commit is present in 9.2.3


We have 9.2.3 available on backup4, but the server must be upgraded to fix this problem.
Summary: Schedule 9.2.3 upgrade for Socorro databases → Backups failed due to bug in 9.2.1 - Schedule 9.2.3 upgrade for Socorro databases
Blocks: 823507
Tue Feb 19 18:04:10 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013021918/data
## Making base backup
transaction log start point: 109D/3F26C6D0
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25

Ugh. Still happening.
Blocks: 848498
Matt --

Could you confirm that the pg_basebackup is now working since the upgrade.

Thanks!
-selena
Flags: needinfo?(mpressman)
Since the update to 9.2.4 on master01 we have not hit this issue, here is the output from /var/log/base_backup.log

Fri Apr  5 17:58:18 UTC 2013
# Cleaning up old backup directories
Nothing to remove! Only 5 backups present.
# Done!
Fri Apr  5 17:58:18 UTC 2013
# Make a backup!
Sat Apr  6 16:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013040616/data
## Making base backup
transaction log start point: 1211/AF8076C0
pg_basebackup: starting background WAL receiver
transaction log end point: 1212/25DB3D20
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
# Done with backup!
Sat Apr  6 17:58:13 UTC 2013
# Cleaning up old backup directories
2013032800
# Done!
Sat Apr  6 17:58:14 UTC 2013
# Make a backup!
Sun Apr  7 16:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013040716/data
## Making base backup
transaction log start point: 1218/B33B490
pg_basebackup: starting background WAL receiver
transaction log end point: 1218/768B8788
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
# Done with backup!
Sun Apr  7 17:58:44 UTC 2013
# Cleaning up old backup directories
2013032900
# Done!
Sun Apr  7 17:58:45 UTC 2013
# Make a backup!
Mon Apr  8 16:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013040816/data
## Making base backup
transaction log start point: 121D/3022C9C0
pg_basebackup: starting background WAL receiver
transaction log end point: 121D/9ABB6538
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
# Done with backup!
Mon Apr  8 17:58:52 UTC 2013
# Cleaning up old backup directories
2013040100
# Done!
Mon Apr  8 17:58:52 UTC 2013
# Make a backup!
Tue Apr  9 16:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013040916/data
## Making base backup
transaction log start point: 1222/58023B00
pg_basebackup: starting background WAL receiver
transaction log end point: 1222/D65F8F58
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
# Done with backup!
Tue Apr  9 17:59:01 UTC 2013
# Cleaning up old backup directories
2013040300
# Done!
Flags: needinfo?(mpressman)
Great! Thank you
apparently this is still a problem, since the upgrade we have not had an issue until today's backup:

# Make a backup!
Wed Apr 10 16:02:01 UTC 2013
## Destructively clean up local /data/pgbasebackups/9.2/2013041016/data
## Making base backup
transaction log start point: 1226/CC2A4360
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 1228/48FFE918
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
(In reply to Matt Pressman [:mpressman] from comment #7)
> apparently this is still a problem, since the upgrade we have not had an
> issue until today's backup:

Ugh. Alright. I'm going to have a look at compiling a debugging version of PostgreSQL to see if we can catch this in the act. 

I have to address a couple other tickets before that unfortunately.
Flags: needinfo?(sdeckelmann)
Matt, Selena - is this still a problem? backups failing seems like a non-trivial thing, is there an update on this bug?
(In reply to Sheeri Cabral [:sheeri] from comment #9)
> Matt, Selena - is this still a problem? backups failing seems like a
> non-trivial thing, is there an update on this bug?

I'm not sure -- maybe matt could check on the backups? Also, monitoring for the backups through nagios would be great.
Flags: needinfo?(sdeckelmann)
Still happening: 

# Make a backup!
Wed Sep  4 16:02:01 UTC 2013
## Making base backup
transaction log start point: 14D5/E1440C28
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14D6/53FBDBF8
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
# Done with backup!
Wed Sep  4 18:37:03 UTC 2013
# Cleaning up old backup directories
2013090116
# Done!


Looks like we get a few backups succeeding, sometimes.
Summary: Backups failed due to bug in 9.2.1 - Schedule 9.2.3 upgrade for Socorro databases → Backups failed due to bug in 9.2.1 through 9.2.4 - coordinate debugging with a debugging patch
I modified the base_backup.sh script so that if it catches an error, it will re-run until it completes successfully. There is a nagios check for the postgres backups, however it doesn't look like it is applied. Additionally, that check it could/should probably be more robust. I'll work on that as well.
Blocks: 914155
Blocks: 922156
Assignee: mpressman → sdeckelmann
No longer blocks: 922156
Depends on: 922156
Progress today:

* installed the following on backup4 to support a build: 
> sudo yum install -y bison-devel readline-devel zlib-devel openssl-devel wget
> sudo yum groupinstall -y 'Development Tools'

* Built a copy of Postgres from commit 061b88c732952c59741374806e1e41c1ec845d50, which includes a fix for pg_basebackup. 

:mpressman is kicking off a backup now to test it.
Had to recompile the 9.2 version. 9.3 is not backward compatible for streaming replication.
No longer depends on: 922156
Confirmed that this fix is in. 

Version 9.2.6 of Postgres will include the fix.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.