Closed Bug 914155 Opened 11 years ago Closed 11 years ago

Stage database refresh fails to complete, doesn't apply passwords/config, since August 25

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: scabral)

References

Details

Sun Aug 25 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14A7/3266F2D8
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14A8/B75C2F30
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1

Sun Sep 1 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14C7/F04A8A10
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14C9/65F73228
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1

Sun Sep 8 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E5/C8B32330
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14E7/DC7BB2F8
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1

Looks like the basebackup isn't finishing successfully, with that 'streaming header too small' error from the pg_basebackup bug. We need a workaround for this so that stage comes up successfully. Because the stage passwords weren't applied, the database is effectively offline for socorro stage.
Solution might be similar to what is done for bug 904200
Blocks: 823507
The stage refresh failing to apply passwords and config is an issue with the stage environment, not with backup reliability -- the database that's created is usable, but because the script doesn't catch the error when pg_basebackup terminates early, the passwords don't get applied. At that point, the rest of the staging environment assumes Postgres is completely down, because password auth fails.
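For reference, a minimal sketch of the kind of error handling being described, assuming the refresh is a shell script calling pg_basebackup directly; the flags and layout here are illustrative, not the actual contents of create_replica_stage.sh:

# Handle a pg_basebackup failure explicitly instead of letting the script
# die partway through, which is what leaves the passwords and config unapplied.
if ! pg_basebackup -D /pgdata/9.2/data -X stream; then
    echo "pg_basebackup exited non-zero; base backup may be incomplete" >&2
    # decide here: retry, or continue to the config/password steps anyway
fi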
Summary: Stage database refresh fails to complete since August 25 → Stage database refresh fails to complete, doesn't apply passwords/config, since August 25
I discussed the current status with :sheeri, and clarified that the refresh script running on stage is not retrying on failure.
Blocks: 914039
Depends on: 914226
I will apply the same fix on stage as was applied on backup4 to account for the pg_basebackup bug
I double-checked and I see that the refresh was run in a screen session yesterday, and it looks pretty successful:

[postgres@socorro1.stage.db.phx1 ~]$ /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh
Sun Sep 8 16:15:48 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for exist
No config files found
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E9/33A03B08
pg_basebackup: starting background WAL receiver
transaction log end point: 14E9/9AE82D60
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
## Copy config back
`/pgdata/9.2/conf/pg_hba.conf' -> `/pgdata/9.2/data/pg_hba.conf'
`/pgdata/9.2/conf/pg_hba-test.conf' -> `/pgdata/9.2/data/pg_hba-test.conf'
`/pgdata/9.2/conf/pg_ident.conf' -> `/pgdata/9.2/data/pg_ident.conf'
`/pgdata/9.2/conf/postgresql.conf' -> `/pgdata/9.2/data/postgresql.conf'
`/pgdata/9.2/conf/postgresql_test.conf' -> `/pgdata/9.2/data/postgresql_test.conf'
`/pgdata/9.2/conf/recovery.conf' -> `/pgdata/9.2/data/recovery.conf'
## Starting replica back up
server starting
Waiting for postgres to start up
LOG: loaded library "pg_stat_statements"
psql: FATAL: the database system is starting up

[postgres@socorro1.stage.db.phx1 ~]$ cd /data/socorro/application
[postgres@socorro1.stage.db.phx1 application]$ PYTHONPATH=. /usr/bin/python /usr/bin/alembic -c /var/lib/pgsql/.alembic.ini upgrade head
INFO [alembic.migration] Context impl PostgresqlImpl.
INFO [alembic.migration] Will assume transactional DDL.
INFO [alembic.migration] Running upgrade 2209ca57dcc6 -> 389f5501023b, bug 812536 drop hang report
[postgres@socorro1.stage.db.phx1 application]$ cd -
/var/lib/pgsql
[postgres@socorro1.stage.db.phx1 ~]$ psql breakpad -f bin/update_staging_passwords.sql
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
[postgres@socorro1.stage.db.phx1 ~]$

I'm not sure what's missing from that set of commands, since it seems the staging passwords were in fact updated?
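One way to confirm from the shell that the password reset actually took effect would be to authenticate as one of the reset roles. This is a hypothetical check, not something run in the bug; the host and role are taken from the logs above, and STAGE_RW_PASSWORD is a placeholder for whatever update_staging_passwords.sql sets:

# Try a trivial query as breakpad_rw over TCP; success means password auth
# works again, failure reproduces what the app tier was seeing.
PGPASSWORD="$STAGE_RW_PASSWORD" \
  psql -h socorro1.stage.db.phx1 -U breakpad_rw -d breakpad -c 'SELECT 1;' \
  || echo "breakpad_rw still cannot authenticate" >&2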
All should be working; as I mentioned earlier, I will apply the same retry fix.
Assignee: server-ops-database → mpressman
Put the basebackup into a while loop that checks the exit status and re-runs it until it finishes successfully. Committed revision 74592.
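For the record, a minimal sketch of that loop, assuming a bash script with pg_basebackup flags shown only for illustration; the actual change in r74592 may differ:

# Re-run pg_basebackup until it exits 0, so the refresh can then safely
# continue to the config copy and password steps.
until pg_basebackup -D /pgdata/9.2/data -X stream; do
    echo "$(date): pg_basebackup failed, retrying" >&2
    sleep 60
done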
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> I double-checked and I see that the refresh was run in a screen session
> yesterday, and it looks pretty successful:

Interesting! Well, here's what the Postgres logs say (times in UTC):

2013-09-09 04:51:42.878 GMT,,,32037,,522d537d.7d25,11,,2013-09-09 04:50:05 GMT,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""
2013-09-09 04:55:01.339 GMT,"breakpad_rw","breakpad",32564,"10.8.70.202:16463",522d54a5.7f34,1,"authentication",2013-09-09 04:55:01 GMT,3/47,0,FATAL,28P01,"password authentication failed for user ""breakpad_rw""",,,,,,,,,""

The last logline related to passwords just before I applied the passwords SQL file:

2013-09-09 19:16:25.268 GMT,"breakpad_metrics","breakpad",8080,"10.8.70.202:1668",522e1e89.1f90,1,"authentication",2013-09-09 19:16:25 GMT,4/11117,0,FATAL,28P01,"password authentication failed for user ""breakpad_metrics""",,,,,,,,,""

And the first expected SQL errors start getting logged very shortly thereafter:

2013-09-09 19:22:48.903 GMT,"breakpad_rw","breakpad",8896,"10.8.70.202:53352",522e2008.22c0,1,"INSERT",2013-09-09 19:22:48 GMT,2/18494,309809491,ERROR,23505,"duplicate key value violates unique constraint ""raw_crashes_20130909_uuid""","Key (uuid)=(21970e5d-469f-4537-afa8-4c5162130909) already exists."
We have the screen log, the Postgres logs, and we know this cron job is in place:

0 2 * * 7 postgres /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh >> /var/log/postgresql/stage_refresh.log 2>&1

Could you help reconstruct a timeline of what happened on Sunday through Monday, starting with the cron job?
Sure:

Sunday, 8 Sept UTC
At 0900 UTC (0200 Pacific), the automated backup began to run, and failed due to the pg_basebackup bug in bug 818989.
At 2315 UTC (1615 Pacific), Matt logged in, saw that the refresh had died, and restarted it.

Monday, 9 Sept UTC
At 0450 UTC, archive recovery completed (about 5.5 hours after it began).
At 0455 UTC, breakpad_rw authentication failed.
At 1915 UTC (1215 Pacific, 1515 Eastern), breakpad_metrics authentication failed for the last time before the passwords were applied.
Great, thanks. In my mind, there are a few options for making the refresh process completely automated:

1) Restore from our backups rather than creating a new basebackup on Sunday
** Based on what I was told about the failures + retries, it seems like using Saturday's backups for this would work
2) Fix the cronjob so that even if the base backup fails at the end, the passwords get applied without having to re-run the entire restore again manually

Matt & I had discussed #1 at one point. What option can be implemented for this Sunday?
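A rough sketch of what option 2 could look like in the cron-driven wrapper, under the assumption that the password apply is a separate step from the base backup itself (the paths are the ones from the screen log above; this is not the current script):

# Run the refresh; if the base backup reports an error at the end, note it
# but still apply the staging passwords so stage stays usable.
if ! /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh; then
    echo "base backup reported an error; applying staging passwords anyway" >&2
fi
psql breakpad -f /var/lib/pgsql/bin/update_staging_passwords.sql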
3) The fix that was put in place on Monday, as per comment 7, is sufficient, as you and Laura agreed to on Monday in IRC.
Just want to check: in comment #7, that's the retry in the script, but I thought the script was still being run manually and the password fix then applied? I read comment #11 as being about making everything completely automated, so no human has to do anything on Sundays. Is that right?
The script *was* being run manually, because the retry *wasn't* in the script. We put the retry in the script on Monday. The retry has been in the *backups* for a while. It was not in the *refresh* on stage, because the refresh on stage did not start having this problem until a few weeks ago. Once it started having the problem and we identified it, we put the fix in place.
Assignee: mpressman → scabral
Status: NEW → ASSIGNED
The refresh failed yesterday due to pg_hba.conf on socorro1 not matching socorro3. I'm working on making them the same without any loss from either (adding what's on 1 to 3, and adding what's on 3 to 1).
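As an illustration only, one way to see what each host has that the other lacks before merging (socorro3's full hostname is an assumption here, and the conf path is the one used elsewhere in this bug):

# Show entries present on one host but not the other, ignoring comments,
# so the union of both files can be written back to each pg_hba.conf.
diff \
  <(grep -v '^#' /pgdata/9.2/conf/pg_hba.conf | sort) \
  <(ssh socorro3.stage.db.phx1 "grep -v '^#' /pgdata/9.2/conf/pg_hba.conf" | sort)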
transaction log start point: 1551/36E20338
pg_basebackup: starting background WAL receiver
pg_basebackup: could not create directory "/pgdata/9.2/data/pg_log": File exists

This is odd, because /pgdata/9.2/data was empty before the pg_basebackup was started. :(
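Whatever the cause turns out to be, a defensive guard in the retry loop (an assumption on my part, not the deployed script) would make sure the target really is empty before each pg_basebackup attempt:

# Wipe any partial contents of the data directory, including dotfiles,
# so each pg_basebackup attempt starts from an empty target.
find /pgdata/9.2/data -mindepth 1 -delete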
I changed the stage refresh time from 2 am to 8 am, because the refresh seems to complete when run later in the day, but not at 2 am. Committed revision 76166.
That didn't help, so I'm changing the refresh time back to 2 am; Matt is putting in a tested fix for this weekend's stage refresh.
This is now working: the stage refresh has completed successfully for the last two weeks. Now that we have an updated pg_basebackup binary from 9.3, the script fix shouldn't be needed, as the binary itself handles the zero-length packet.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team