
Stage database refresh fails to complete, doesn't apply passwords/config, since August 25

Status: RESOLVED FIXED
Product: Data & BI Services Team
Component: DB: MySQL
Reported: 4 years ago; last modified: 3 years ago
People: Reporter: selenamarie; Assigned: sheeri
Tracking: Blocks: 1 bug
Sun Aug 25 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14A7/3266F2D8
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14A8/B75C2F30
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
Sun Sep  1 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14C7/F04A8A10
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14C9/65F73228
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1
Sun Sep  8 02:00:01 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
waiting for server to shut down...... done
server stopped
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E5/C8B32330
pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25
transaction log end point: 14E7/DC7BB2F8
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: child process exited with error 1


Looks like the base backup isn't finishing successfully, with that 'streaming header too small' error from a known bug.

We need a workaround for this so that stage comes up successfully. Because the stage passwords weren't applied, the database is effectively offline for socorro stage.
The solution might be similar to what was done for bug 904200.
Blocks: 823507
The stage refresh failing to apply passwords and config is an issue with the stage environment, not backup reliability: the database that's created is usable, but because the script doesn't catch the error on early termination of pg_basebackup, the passwords don't get applied.

At that point, the rest of the staging environment assumes Postgres is completely down, because password auth fails.
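The failure mode described above boils down to a missing exit-status check. A minimal sketch, with `make_base_backup` and `apply_passwords` as hypothetical stand-ins for the real steps (the actual script runs pg_basebackup and `psql breakpad -f bin/update_staging_passwords.sql`):

```shell
# Sketch of the failure mode: if the script never checks pg_basebackup's
# exit status, the later password/config step is silently skipped.
# make_base_backup and apply_passwords are hypothetical stand-ins.
applied=no
make_base_backup() { false; }      # simulate "child process exited with error 1"
apply_passwords() { applied=yes; } # would run update_staging_passwords.sql

if make_base_backup; then
    apply_passwords
    echo "passwords applied"
else
    # Checking the status at least makes the failure loud instead of silent.
    echo "base backup failed; passwords NOT applied" >&2
fi
```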
Summary: Stage database refresh fails to complete since August 25 → Stage database refresh fails to complete, doesn't apply passwords/config, since August 25
I discussed with :sheeri the current status, and clarified that the refresh script running on stage is not retrying on failure.
Blocks: 914039
Depends on: 914226
I will apply the same fix on stage as was applied on backup4 to account for the pg_basebackup bug
(Assignee)

Comment 5

4 years ago
I double-checked and I see that the refresh was run in a screen session yesterday, and it looks pretty successful:

[postgres@socorro1.stage.db.phx1 ~]$ /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh 
Sun Sep  8 16:15:48 PDT 2013
# Make a backup!
## Ensure directories /pgdata/9.2/conf and /pgdata/9.2/data for  exist
No config files found
## Destructively clean up local /pgdata/9.2/data
## Making base backup
transaction log start point: 14E9/33A03B08
pg_basebackup: starting background WAL receiver
transaction log end point: 14E9/9AE82D60
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
## Copy config back
`/pgdata/9.2/conf/pg_hba.conf' -> `/pgdata/9.2/data/pg_hba.conf'
`/pgdata/9.2/conf/pg_hba-test.conf' -> `/pgdata/9.2/data/pg_hba-test.conf'
`/pgdata/9.2/conf/pg_ident.conf' -> `/pgdata/9.2/data/pg_ident.conf'
`/pgdata/9.2/conf/postgresql.conf' -> `/pgdata/9.2/data/postgresql.conf'
`/pgdata/9.2/conf/postgresql_test.conf' -> `/pgdata/9.2/data/postgresql_test.conf'
`/pgdata/9.2/conf/recovery.conf' -> `/pgdata/9.2/data/recovery.conf'
## Starting replica back up
server starting
Waiting for postgres to start up
LOG:  loaded library "pg_stat_statements"
psql: FATAL:  the database system is starting up
[postgres@socorro1.stage.db.phx1 ~]$ cd /data/socorro/application
[postgres@socorro1.stage.db.phx1 application]$ PYTHONPATH=. /usr/bin/python /usr/bin/alembic -c /var/lib/pgsql/.alembic.ini upgrade head
INFO  [alembic.migration] Context impl PostgresqlImpl.
INFO  [alembic.migration] Will assume transactional DDL.
INFO  [alembic.migration] Running upgrade 2209ca57dcc6 -> 389f5501023b, bug 812536 drop hang report
[postgres@socorro1.stage.db.phx1 application]$ cd -
/var/lib/pgsql
[postgres@socorro1.stage.db.phx1 ~]$ psql breakpad -f bin/update_staging_passwords.sql
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
ALTER ROLE
[postgres@socorro1.stage.db.phx1 ~]$


I'm not sure what's missing from that set of commands, since it seems the staging passwords were in fact updated?
All should be working; as I mentioned earlier, I will apply the same retry.
Assignee: server-ops-database → mpressman
Put pg_basebackup into a while loop, checking status and rerunning until it finishes successfully.
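The retry fix described here might look roughly like the following sketch. `run_basebackup` stands in for the real pg_basebackup invocation and is simulated to fail twice before succeeding, so the loop logic can be demonstrated; the attempt cap is an assumption, not necessarily what the committed script does:

```shell
# Sketch of the retry loop: rerun the base backup until it succeeds,
# with a cap so a permanently broken backup still fails the refresh.
attempts=0
run_basebackup() {
    attempts=$((attempts + 1))
    # Simulate the "streaming header too small" failure on the first two tries;
    # the real script would call pg_basebackup against the master here.
    [ "$attempts" -ge 3 ]
}

tries=0
max_tries=5
until run_basebackup; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
        echo "base backup failed after $max_tries attempts" >&2
        exit 1
    fi
    echo "pg_basebackup failed (attempt $tries), cleaning up and retrying"
    # The real script would destructively clean /pgdata/9.2/data here
    # before retrying, since a partial base backup is unusable.
    sleep 1
done
echo "base backup completed after $attempts attempts"
```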
Committed revision 74592.
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> I double-checked and I see that the refresh was run in a screen session
> yesterday, and it looks pretty successful:

Interesting!  Well, here's what the Postgres logs say (times in UTC):

2013-09-09 04:51:42.878 GMT,,,32037,,522d537d.7d25,11,,2013-09-09 04:50:05 GMT,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""

2013-09-09 04:55:01.339 GMT,"breakpad_rw","breakpad",32564,"10.8.70.202:16463",522d54a5.7f34,1,"authentication",2013-09-09 04:55:01 GMT,3/47,0,FATAL,28P01,"password authentication failed for user ""breakpad_rw""",,,,,,,,,""

The last logline related to passwords just before I applied the passwords SQL file: 

2013-09-09 19:16:25.268 GMT,"breakpad_metrics","breakpad",8080,"10.8.70.202:1668",522e1e89.1f90,1,"authentication",2013-09-09 19:16:25 GMT,4/11117,0,FATAL,28P01,"password authentication failed for user ""breakpad_metrics""",,,,,,,,,""

And the first expected SQL errors start getting logged very shortly thereafter: 

2013-09-09 19:22:48.903 GMT,"breakpad_rw","breakpad",8896,"10.8.70.202:53352",522e2008.22c0,1,"INSERT",2013-09-09 19:22:48 GMT,2/18494,309809491,ERROR,23505,"duplicate key value violates unique constraint ""raw_crashes_20130909_uuid""","Key (uuid)=(21970e5d-469f-4537-afa8-4c5162130909) already exists."
We have the screen log, the Postgres logs, and we know this cron job is in place:

0 2 * * 7 postgres /var/lib/pgsql/bin/create_replica_stage.sh /var/lib/pgsql/bin/refresh_variables.sh >> /var/log/postgresql/stage_refresh.log 2>&1


Could you help reconstruct a timeline of what happened on Sunday through Monday, starting with the cron job?
(Assignee)

Comment 10

4 years ago
Sure:

Sunday, 8 Sept UTC
At 0900 UTC (0200 Pacific), the automated backup began to run, and failed due to the pg_basebackup bug in bug 818989
At 2315 UTC (1615 Pacific), Matt logged in, saw that the refresh had died, and restarted it. 
Monday 9 Sept UTC 
At 0450 UTC, archive recovery complete (about 5.5 hours after it began)
At 0455 UTC, breakpad_rw authentication failed
At 1915 UTC (1215 Pacific, 1515 Eastern), breakpad_metrics authentication failed for the last time before passwords were applied.
Great, thanks. 

In my mind, there are a few options for making the refresh process completely automated: 

1) Restore from our backups rather than creating a new base backup on Sunday
** Based on what I was told about the failures + retries, it seems like using Saturday's backups for this would work
2) Fix the cronjob so that even if the base backup fails at the end, the passwords get applied without having to re-run the entire restore again manually
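Option 2 could be sketched as follows; `make_base_backup` and `apply_passwords` are hypothetical stand-ins for the real pg_basebackup run and the `update_staging_passwords.sql` step, and the simulated failure is only for illustration:

```shell
# Sketch of option 2: a nonzero pg_basebackup exit at the very end should
# not skip the password/config step, since the data directory it produced
# is already usable (per the analysis earlier in this bug).
applied=no
make_base_backup() { false; }      # simulate the trailing streaming failure
apply_passwords() { applied=yes; } # would run update_staging_passwords.sql

if ! make_base_backup; then
    echo "pg_basebackup exited nonzero; data dir is usable, continuing" >&2
fi
apply_passwords
echo "passwords applied despite backup exit status"
```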

Matt & I had discussed #1 at one point. 

What option can be implemented for this Sunday?
(Assignee)

Comment 12

4 years ago
3) The fix that was put in place on Monday, as per comment 7, is sufficient, as you and Laura agreed to on Monday in IRC.
Just want to check: in comment #7, that's the retry in the script, but I thought the script was still being run manually and the password fix then applied?  I read comment #11 as being about making everything completely automated, so no human has to do anything on Sundays. Is that right?
(Assignee)

Comment 14

4 years ago
The script *was* being run manually, because the retry *wasn't* in the script. We put the retry in the script on Monday.

The retry has been in the *backups* for a while. It was not in the *refresh* on stage, because the refresh on stage did not start having this problem until a few weeks ago. Once it started having the problem and we identified it, we put the fix in place.
(Assignee)

Updated

4 years ago
Assignee: mpressman → scabral
Status: NEW → ASSIGNED
(Assignee)

Comment 15

4 years ago
The refresh failed yesterday due to pg_hba.conf on socorro1 not matching socorro3. I'm working on making them the same without any loss from either (adding what's on 1 to 3, and adding what's on 3 to 1).
(Assignee)

Comment 16

4 years ago
transaction log start point: 1551/36E20338
pg_basebackup: starting background WAL receiver
pg_basebackup: could not create directory "/pgdata/9.2/data/pg_log": File exists

This is odd, because /pgdata/9.2/data was empty before the pg_basebackup was started. :(
(Assignee)

Updated

4 years ago
Depends on: 922203
(Assignee)

Comment 17

4 years ago
I changed the refresh time from 2 am to 8 am for the stage refresh, because it seems to actually complete later in the day, but not at 2 am.
revision 76166.
(Assignee)

Comment 18

4 years ago
That didn't help; I'm changing the refresh time back to 2 am. Matt is putting in a tested fix for this weekend's stage refresh.
This is now working: the stage refresh has completed successfully for the last two weeks. And now that we have an updated pg_basebackup binary from 9.3, the script fix shouldn't be needed, as the binary handles the zero-length packet.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team