
Stage database refresh fails to complete, doesn't apply passwords/config, since August 25

Status: RESOLVED FIXED
Product: Data & BI Services Team
Component: DB: MySQL
Reported: 4 years ago; last modified: 3 years ago
People: Reporter: selenamarie; Assigned: sheeri
Tracking: Blocks: 1 bug
Sun Aug 25 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14A7/3266F2D8
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14A8/B75C2F30
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
Sun Sep  1 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14C7/F04A8A10
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14C9/65F73228
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
Sun Sep  8 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E5/C8B32330
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14E7/DC7BB2F8
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1


Looks like the base backup isn't finishing successfully, with that 'streaming header too small' error from a known bug.

We need a workaround for this so that stage comes up successfully. Because the stage passwords weren't applied, the database is effectively offline for socorro stage.
The solution might be similar to what was done for bug 904200.
Blocks: 823507
The stage refresh failing to apply passwords and config is an issue with the stage environment, not backup reliability: the database that's created is usable, but because the script doesn't catch the error on early termination of pg_basebackup, the passwords don't get applied.

At that point, the rest of the staging environment assumes Postgres is completely down, because password auth fails.
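The failure mode described above boils down to a missing exit-status check. A minimal sketch, with `make_base_backup` and `apply_passwords` as hypothetical stand-ins for the real steps (the actual script runs pg_basebackup and `psql breakpad -f bin/update_staging_passwords.sql`):

```shell
# Sketch of the failure mode: if the script never checks pg_basebackup's
# exit status, the later password/config step is silently skipped.
# make_base_backup and apply_passwords are hypothetical stand-ins.
applied=no
make_base_backup() { false; }      # simulate "child process exited with error 1"
apply_passwords() { applied=yes; } # would run update_staging_passwords.sql

if make_base_backup; then
    apply_passwords
    echo "passwords applied"
else
    # Checking the status at least makes the failure loud instead of silent.
    echo "base backup failed; passwords NOT applied" >&2
fi
```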
Summary: Stage database refresh fails to complete since August 25 → Stage database refresh fails to complete, doesn't apply passwords/config, since August 25
I discussed with :sheeri the current status, and clarified that the refresh script running on stage is not retrying on failure.
Blocks: 914039
Depends on: 914226
I will apply the same fix on stage as was applied on backup4 to account for the pg_basebackup bug
(Assignee)

Comment 5

4 years ago
I double-checked and I see that the refresh was run in a screen session yesterday, and it looks pretty successful:

[postgres@socorro1.stage.db.phx1 ~]$ /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh 
Sun Sep  8 16:15:48 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E9/33A03B08
pg_basebackup: starting background WAL receiver
transaction log end point: 14E9/9AE82D60
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
## Copy config back
`/pgdata/9.2/conf/pg_hba.conf' -> `/pgdata/9.2/data/pg_hba.conf'
`/pgdata/9.2/conf/pg_hba-test.conf' -> `/pgdata/9.2/data/pg_hba-test.conf'
`/pgdata/9.2/conf/pg_ident.conf' -> `/pgdata/9.2/data/pg_ident.conf'
`/pgdata/9.2/conf/postgresql.conf' -> `/pgdata/9.2/data/postgresql.conf'
`/pgdata/9.2/conf/postgresql_test.conf' -> `/pgdata/9.2/data/postgresql_test.conf'
`/pgdata/9.2/conf/recovery.conf' -> `/pgdata/9.2/data/recovery.conf'
## Starting replica back up
server starting
Waiting for postgres to start up
LOG:  loaded library "pg_stat_statements"
psql: FATAL:  the database system is starting up
[postgres@socorro1.stage.db.phx1 ~]$ cd /data/socorro/application
[postgres@socorro1.stage.db.phx1 application]$ PYTHONPATH=. /usr/bin/python /usr/bin/alembic -c /var/lib/pgsql/.alembic.ini upgrade head
INFO  [alembic.migration] Context impl PostgresqlImpl.
INFO  [alembic.migration] Will assume transactional DDL.
INFO  [alembic.migration] Running upgrade 2209ca57dcc6 -> 389f5501023b, bug 812536 drop hang report
[postgres@socorro1.stage.db.phx1 application]$ cd -
/var/lib/pgsql
[postgres@socorro1.stage.db.phx1 ~]$ psql breakpad -f bin/update_staging_passwords.sql
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
[postgres@socorro1.stage.db.phx1 ~]$


I'm not sure what's missing from that set of commands, since it seems the staging passwords were in fact updated?
All should be working; as I mentioned earlier, I will apply the same retry.
Assignee: server-ops-database → mpressman
Put pg_basebackup into a while loop, checking status and rerunning until it finishes successfully.
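The retry fix described here might look roughly like the following sketch. `run_basebackup` stands in for the real pg_basebackup invocation and is simulated to fail twice before succeeding, so the loop logic can be demonstrated; the attempt cap is an assumption, not necessarily what the committed script does:

```shell
# Sketch of the retry loop: rerun the base backup until it succeeds,
# with a cap so a permanently broken backup still fails the refresh.
attempts=0
run_basebackup() {
    attempts=$((attempts + 1))
    # Simulate the "streaming header too small" failure on the first two tries;
    # the real script would call pg_basebackup against the master here.
    [ "$attempts" -ge 3 ]
}

tries=0
max_tries=5
until run_basebackup; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
        echo "base backup failed after $max_tries attempts" >&2
        exit 1
    fi
    echo "pg_basebackup failed (attempt $tries), cleaning up and retrying"
    # The real script would destructively clean /pgdata/9.2/data here
    # before retrying, since a partial base backup is unusable.
    sleep 1
done
echo "base backup completed after $attempts attempts"
```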
Committed revision 74592.
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> I double-checked and I see that the refresh was run in a screen session
> yesterday, and it looks pretty successful:

Interesting!  Well, here's what the Postgres logs say (times in UTC):

2013-09-09 04:51:42.878 GMT,,,32037,,522d537d.7d25,11,,2013-09-09 04:50:05 GMT,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""

2013-09-09 04:55:01.339 GMT,"breakpad_rw","breakpad",32564,"10.8.70.202:16463",522d54a5.7f34,1,"authentication",2013-09-09 04:55:01 GMT,3/47,0,FATAL,28P01,"password authentication failed for user ""breakpad_rw""",,,,,,,,,""

The last logline related to passwords just before I applied the passwords SQL file: 

2013-09-09 19:16:25.268 GMT,"breakpad_metrics","breakpad",8080,"10.8.70.202:1668",522e1e89.1f90,1,"authentication",2013-09-09 19:16:25 GMT,4/11117,0,FATAL,28P01,"password authentication failed for user ""breakpad_metrics""",,,,,,,,,""

And the first expected SQL errors start getting logged very shortly thereafter: 

2013-09-09 19:22:48.903 GMT,"breakpad_rw","breakpad",8896,"10.8.70.202:53352",522e2008.22c0,1,"INSERT",2013-09-09 19:22:48 GMT,2/18494,309809491,ERROR,23505,"duplicate key value violates unique constraint ""raw_crashes_20130909_uuid""","Key (uuid)=(21970e5d-469f-4537-afa8-4c5162130909) already exists."
We have the screen log, the Postgres logs, and we know this cron job is in place:

0 2 * * 7 postgres /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh >> /var/log/postgresql/stage_refresh.log 2>&1


Could you help reconstruct a timeline of what happened on Sunday through Monday, starting with the cron job?
(Assignee)

Comment 10

4 years ago
Sure:

Sunday, 8 Sept UTC
At 0900 UTC (0200 Pacific), the automated backup began to run, and failed due to the pg_basebackup bug in bug 818989
At 2315 UTC (1615 Pacific), Matt logged in, saw that the refresh had died, and restarted it. 
Monday 9 Sept UTC 
At 0450 UTC, archive recovery complete (about 5.5 hours after it began)
At 0455 UTC, breakpad_rw authentication failed
At 1915 UTC (1215 Pacific, 1515 Eastern), breakpad_metrics authentication failed for the last time before passwords were applied.
Great, thanks. 

In my mind, there are a few options for making the refresh process completely automated: 

1) Restore from our backups rather than creating a new base backup on Sunday
** Based on what I was told about the failures + retries, it seems like using Saturday's backups for this would work
2) Fix the cronjob so that even if the base backup fails at the end, the passwords get applied without having to re-run the entire restore again manually
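Option 2 could be sketched as follows; `make_base_backup` and `apply_passwords` are hypothetical stand-ins for the real pg_basebackup run and the `update_staging_passwords.sql` step, and the simulated failure is only for illustration:

```shell
# Sketch of option 2: a nonzero pg_basebackup exit at the very end should
# not skip the password/config step, since the data directory it produced
# is already usable (per the analysis earlier in this bug).
applied=no
make_base_backup() { false; }      # simulate the trailing streaming failure
apply_passwords() { applied=yes; } # would run update_staging_passwords.sql

if ! make_base_backup; then
    echo "pg_basebackup exited nonzero; data dir is usable, continuing" >&2
fi
apply_passwords
echo "passwords applied despite backup exit status"
```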

Matt & I had discussed #1 at one point. 

What option can be implemented for this Sunday?
(Assignee)

Comment 12

4 years ago
3) The fix that was put in place on Monday, as per comment 7, is sufficient, as you and Laura agreed to on Monday in IRC.
Just want to check: in comment #7, that's the retry in the script, but I thought the script was still being run manually and the password fix then applied?  I read comment #11 as being about making everything completely automated, so no human has to do anything on Sundays. Is that right?
(Assignee)

Comment 14

4 years ago
The script *was* being run manually, because the retry *wasn't* in the script. We put the retry in the script on Monday.

The retry has been in the *backups* for a while. It was not in the *refresh* on stage, because the refresh on stage did not start having this problem until a few weeks ago. Once it started having the problem and we identified it, we put the fix in place.
(Assignee)

Updated

4 years ago
Assignee: mpressman → scabral
Status: NEW → ASSIGNED
(Assignee)

Comment 15

4 years ago
The refresh failed yesterday due to pg_hba.conf on socorro1 not matching socorro3. I'm working on making them the same without any loss from either (adding what's on 1 to 3, and adding what's on 3 to 1).
(Assignee)

Comment 16

4 years ago
transaction log start point: 1551/36E20338
pg_basebackup: starting background WAL receiver
pg_basebackup: could not create directory "/pgdata/9.2/data/pg_log": File exists

This is odd, because /pgdata/9.2/data was empty before the pg_basebackup was started. :(
(Assignee)

Updated

4 years ago
Depends on: 922203
(Assignee)

Comment 17

4 years ago
I changed the refresh time from 2 am to 8 am for the stage refresh, because it seems to actually complete later in the day, but not at 2 am.
revision 76166.
(Assignee)

Comment 18

4 years ago
That didn't help; I'm changing the refresh time back to 2 am. Matt is putting in a tested fix for this weekend's stage refresh.
This is now working: the stage refresh has completed successfully for the last two weeks. And now that we have an updated pg_basebackup binary from 9.3, the script fix shouldn't be needed, as the binary handles the zero-length packet.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team