Closed Bug 720755 Opened 13 years ago Closed 12 years ago

Prepare staging DB servers in PHX

Categories

(Socorro :: Database, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jberkus, Assigned: mpressman)

References

Details

(Whiteboard: [qa-])

We need to complete setup of PG database servers in PHX. This includes:
1) getting information about the number and configuration of servers
2) configuring server OS / SW / FS setup to match production
3) installing and configuring PostgreSQL to match production sans replication
4) modifying snapshot scripts to work on new infrastructure
Jason tells me we are somewhere in 1-3 and that mpressman is in charge. So, Matt, where are we?
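For steps 2 and 3, one way to sanity-check that a new host matches production is to diff the effective PostgreSQL settings on both machines. The sketch below is hypothetical and not part of the Socorro setup scripts; the hostnames and the config path are assumptions.

  #!/bin/bash
  # Hypothetical check: compare non-default postgresql.conf settings between an
  # assumed production host and the new PHX stage host. Hostnames and the
  # config path are assumptions, not the actual Socorro layout.
  PROD=master01.db.phx1.mozilla.com
  STAGE=socorro1.stage.db.phx1.mozilla.com
  CONF=/var/lib/pgsql/data/postgresql.conf

  # Diff the non-comment, non-blank lines of postgresql.conf on both hosts.
  diff <(ssh "$PROD"  "grep -Ev '^[[:space:]]*(#|$)' $CONF" | sort) \
       <(ssh "$STAGE" "grep -Ev '^[[:space:]]*(#|$)' $CONF" | sort)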
1-4 are completed; the final step is to get the data over to the new dev/stage servers and get it replicating from prod. A full data transfer takes about six hours. We'd like to get the full data set for backup purposes and to minimize the need for constant refreshes. Having one server constantly replicating will allow a simple failover to "refresh" while continuously providing a full backup.
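Since the plan is to keep one server continuously replicating from prod so that a "refresh" is just a failover, here is a minimal sketch of how a streaming replica was typically brought up on the PostgreSQL 9.x of that era. It is not the actual Socorro procedure; the master hostname, replication role, and paths are assumptions.

  #!/bin/bash
  # Hypothetical streaming-replica setup (PostgreSQL 9.1-era). Assumes a
  # 'replication' role and a matching pg_hba.conf entry already exist on the
  # master; host and paths are assumptions.
  MASTER=master01.db.phx1.mozilla.com
  PGDATA=/var/lib/pgsql/data

  # Base backup from the master, including the WAL needed to start.
  pg_basebackup -h "$MASTER" -U replication -D "$PGDATA" -x -P

  # Tell the new standby to keep streaming from the master.
  {
    echo "standby_mode = 'on'"
    echo "primary_conninfo = 'host=$MASTER port=5432 user=replication'"
  } > "$PGDATA/recovery.conf"

  pg_ctl -D "$PGDATA" start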
Matt, can we meet up about this? At this point, I have no idea what the actual server configuration is ... how many servers we have, what storage they have, etc.
Josh, sure, I'm free anytime tomorrow and Friday; just ping me and we can set something up.
Could I get a status update here please?
mpressman, jason and I are chatting this morning.
I've done a writeup of the setup plan on Mana: https://mana.mozilla.org/wiki/display/websites/PHX+Staging+And+Dev+DB+Config Note that this wiki page has a long list of TODOs to make the new setup a reality. Also, I'm not personally dealing at all with the requirements for moving the rest of the dev infrastructure to PHX.
Matt, can you please update this bug?
As I said, 1-3 are completed:
1) We have two servers that match production in disk layout and size, with 60GB of memory
2) done
3) done
4) I am currently loading the latest refresh onto socorro1.stage.db.phx1 using the current snapshot scripts
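For context on what a "refresh" load involves, the sketch below shows the general dump-and-restore shape such a step can take. It is not Socorro's actual snapshot scripts; the source host and dump path are assumptions (the breakpad database name appears later in this bug).

  #!/bin/bash
  # Hypothetical refresh sketch, not the real Socorro snapshot scripts.
  SRC=master01.db.phx1.mozilla.com            # assumed source of the snapshot
  DST=socorro1.stage.db.phx1.mozilla.com
  DB=breakpad
  DUMP=/data/backups/breakpad.dump            # assumed dump location

  pg_dump -h "$SRC" -Fc "$DB" -f "$DUMP"      # custom-format dump
  createdb -h "$DST" "$DB"                    # assumes the DB doesn't exist on stage yet
  pg_restore -h "$DST" -d "$DB" -j 4 "$DUMP"  # restore with 4 parallel jobs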
How's this going?
PostgreSQL for socorro stage is up and running in phx and it matches what we have in sjc.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Mpressman, Please give me access to these servers so that I can check them out. Thanks!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I cannot access either of these servers by name.

[pgexperts@cm-vpn01 ~]$ ssh -A socorro1.stage.db.phx1.mozilla.com
ssh: connect to host socorro1.stage.db.phx1.mozilla.com port 22: Connection timed out
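A quick way to tell a DNS failure from a blocked route before filing the network bug (standard tools only, nothing Socorro-specific):

  #!/bin/bash
  # Triage for the timeout above: does the name resolve, and is port 22
  # reachable at all from the VPN host?
  HOST=socorro1.stage.db.phx1.mozilla.com

  host "$HOST"                                   # DNS check
  nc -z -w 5 "$HOST" 22 && echo "port 22 open" \
                        || echo "port 22 unreachable (likely a missing network flow)"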
Ping? Do I need to open a separate IT bug about the routes?
Opened up bug 730897 for network flows to the db hosts.
Whiteboard: [qa-]
You should now be able to access the stage hosts using the pgexperts username.
OK, now waiting for sudo.
Just updating here: my understanding from discussions with mpressman and jberkus is that stage is ready to go, but devdb is not all the way there (it's insanely complicated). mpressman, can you update this bug with an ETA?
Stage is definitely ready to go; the architecture for dev is a little different and will be running multiple instances of pg. At this point one instance is up. This one is similar to devdb in sjc, and I'm currently loading the most recent refresh to make it match. The next step is to bring up the second instance, replicating directly from master01. This will allow for much simpler and quicker snapshots from prod. I will update here at each step until completed. Each step will be transparent and will not affect any work on the host. So please log in, and if you notice anything wrong with something I have said is completed, let me know.
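For reference, here is a minimal sketch of what running an additional PostgreSQL instance on the same host looks like, with its own data directory and port. The paths and port number are assumptions; the instance that replicates from master01 would be seeded with a base backup (as in the earlier replication sketch) rather than initdb.

  #!/bin/bash
  # Hypothetical second instance on one host; paths and port are assumptions.
  PGDATA2=/var/lib/pgsql/data2
  PORT2=5433

  initdb -D "$PGDATA2"                                 # separate cluster
  echo "port = $PORT2" >> "$PGDATA2/postgresql.conf"   # non-default port
  pg_ctl -D "$PGDATA2" -l "$PGDATA2/server.log" start

  psql -p "$PORT2" -d postgres -c "SELECT version();"  # sanity check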
Matt, The database names you've used on dev are "breakpad" and "breakpad2w". These aren't the names on the wiki doc; was there a reason for the change of name? If so, which database is which?
Target Milestone: 2.4.4 → 2.5.3
Where are we on this?
Stage in phx matches sjc, as was discussed last Wednesday. We have the two instances installed and running on dev in phx; however, there are some port issues causing script problems in puppet, which should be resolved shortly.
Target Milestone: 3 → 2.4.4
(In reply to Matt Pressman [:mpressman] from comment #21)
> Stage in phx matches sjc, as was discussed last Wednesday. We have the two
> instances installed and running on dev in phx; however, there are some port
> issues causing script problems in puppet, which should be resolved shortly.

Updated port information in puppet.
So is this good to go? If so, I'd like to close this bug and open a "train josh on the new dev/stage infrastructure" bug.
I just tried to use the new DevDB for testing. The only running instance of PostgreSQL is ReplayDB currently. So apparently this server isn't finished yet.
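A quick way to see which instances are actually up on a host, and on which ports (standard tools; run as root to see process names in netstat):

  #!/bin/bash
  ps -ef | grep '[p]ostgres'                  # one postmaster per running instance
  netstat -lntp 2>/dev/null | grep postgres   # listening ports per instance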
So, status? Still waiting to hear on this.
Since this bug has taken the place not only of stage, but the rest of the dev hosts, I will provide the current status here. The sjc hosts and their phx equivalents are:
- DevDB: socorro1.dev.dmz.sjc1.mozilla.com -> socorro1.dev.db.phx1.mozilla.com
- StageDB: socorro1.dev.db.sjc1.mozilla.com -> socorro1.stage.db.phx1.mozilla.com
- CrashDev: socorro1.dev.dmz.sjc1.mozilla.com -> socorro1.dev.dmz.phx1.mozilla.com
StageDB is up and running. CrashDev was brought online yesterday; it is a new host that wasn't planned on, as DevDB was going to contain both the DevDB and CrashDev instances. Because of this change, DevDB needs some configuration changes, which will occur shortly and will make DevDB ready for work. Finally, the CrashDev instance is being loaded right now. All told, these hosts should be ready within the next 2-3 hours.
Correction from above:
- DevDB: socorro1.dev.db.sjc1.mozilla.com -> socorro1.dev.db.phx1.mozilla.com
- StageDB: socorro1.dev.stage.sjc1.mozilla.com -> socorro1.stage.db.phx1.mozilla.com
- CrashDev: socorro1.dev.dmz.sjc1.mozilla.com -> socorro1.dev.dmz.phx1.mozilla.com
socorro1.dev.dmz.phx1 is now ready to go. Thanks to jason for his work and for getting the rhn issue figured out.
Matt, I'm confused. Is the new DMZ running its own Postgres? I thought the idea was to have it connect to the devDB database server in order to avoid the current issues we have in SJC with available memory and disk space.
The new DMZ server in phx is much improved over what was running in sjc. Not only is the cpu much more powerful (8 cores vs. 4), it's also significantly faster, which will allow the multiple services to run without impact. The RAM is also doubled.
The final host, socorro1.dev.db.phx1.mozilla.com, is now up. To complete testing on my end, I'm now running through all the steps for a data refresh. This will confirm that all current scripts for managing the data are working, and will provide a fresh set of data to work on in phx. Fortunately, the refresh will now complete in 1-2 hours, rather than the 6-8 in sjc.
The full process for Updating Staging and Dev Snapshots of Postgres Database has been successfully run using the phx hosts. All are now up and available for use, and I will be modifying the wiki docs to reflect the phx hosts in place of the sjc hosts after I get some sleep :). Here are the hosts now available; the naming conventions are the same, with phx in place of sjc:
socorro1.dev.db.phx1.mozilla.com
socorro1.stage.db.phx1.mozilla.com
socorro1.dev.dmz.phx1.mozilla.com
Please note, there are multiple instances running on several of the hosts, but using different ports. This is for use of the former replaydb real-time replication from the master, as well as a backup instance.
I understand we have a couple of issues here:
1. Our original plan, because of the chronic disk space issues with crash-stats-dev, called for the database for that to be on dev.db. This database (for crash-stats-dev) is on dev.dmz instead, the server which has all the app code and the mini-hbase as well - and it's out of disk space and down now.
2. git pull on stageDB is broken. This used to work automatically (hourly) on oldstage; it now doesn't even work manually.
mpressman, what is the ETA for these two things to be fixed? These items are blocking us from moving forward, and item 2 is specifically blocking us from releasing tomorrow.
It was my understanding, on finding this out yesterday, that the stage code pull process is done by devs, and that the cron on the old stage was an errant process left for convenience. I am willing to go either way, but I made the call to keep the process the same, to make sure that the code that is pushed is the code that is on the box, for easier debugging. Regardless, jason pushed the latest bits anyway to get past any release issues. With regards to the space issues for crash-stats-dev: I have another instance on devdb that we can use. It's already up and running and only requires me to load the latest dump.
Ok. We're moving toward auto-updating stage for all components (not just DB) anyway, so we will want to get that set up. How long till we can be up and running with the new dump on devdb?
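If the hourly pull does come back as a cron job, it might look something like the entry below. This is a hypothetical sketch, not the actual Socorro deployment mechanism; the checkout path, branch, and log file are assumptions.

  # Hypothetical crontab entry: hourly fast-forward-only pull of the stage checkout.
  0 * * * * cd /data/socorro && git pull --ff-only origin master >> /var/log/socorro-pull.log 2>&1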
I'm loading it now; it should take about 30 minutes and then we'll be ready to go. I'll update here and post in breakpad.
The loading is complete on the second instance running on socorro1.dev.db.phx1.mozilla.com; it will now replace the crash-stats-dev instance that was previously running on socorro1.dev.dmz.phx1.mozilla.com. This new instance runs on port 4321 instead of the standard 5432. Please let me know what I can do to make this a smoother transition.
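Anything connecting to the replacement instance (including a manual check) has to pass the non-default port explicitly, e.g.:

  #!/bin/bash
  # Sanity check of the replacement crash-stats-dev instance on port 4321.
  # The role and database names are assumptions.
  psql -h socorro1.dev.db.phx1.mozilla.com -p 4321 -U breakpad_rw -d breakpad -c 'SELECT 1;'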
So, because of issues discovered by the developers, I've just found out that while a new database was spun up on socorro1.dev.dmz.phx1.mozilla.com, the application on crash-stats-dev was never pointed at it, nor were ports opened and passwords set. At this point, it's gone critical because dev.dmz is out of disk space again. Will do what I can to fix it ...
Laura, git pull issue has been resolved on an ad-hoc basis. Still waiting for the long-term solution from mpressman.
These are the remaining tasks:
1. change the socorro config on crash-stats-dev to point at socorro1.dev.db
2. change postgres passwords for this access
3. check ports and connections and make sure they're all clear so this works
I was under the impression these were already done as per comment 37. (I didn't realize until today that this hadn't been done, because we didn't ship user-facing DB changes in the last couple of releases.) We are now blocked. mpressman, are you, or someone else from IT, available to do this today? jberkus is available to help.
Severity: normal → blocker
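A hedged sketch of tasks 2 and 3 from the list above (the role name, superuser, and database are assumptions; task 1 depends on the Socorro config layout, which isn't shown here):

  #!/bin/bash
  # Task 2: set a new password for the assumed application role.
  psql -h socorro1.dev.db.phx1.mozilla.com -p 4321 -U postgres \
       -c "ALTER ROLE breakpad_rw WITH PASSWORD 'new-password-here';"

  # Task 3, run from crash-stats-dev: is the port reachable and can we log in?
  nc -z -w 5 socorro1.dev.db.phx1.mozilla.com 4321 && echo "port 4321 open"
  psql -h socorro1.dev.db.phx1.mozilla.com -p 4321 -U breakpad_rw -d breakpad -c 'SELECT 1;'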
Bug 754403 to open up network flows from crash-stats-dev to socorro1.dev.db.
At laura's request, I have broken this down into a bunch of separate bugs, for better tracking. Bugs are listed in the dependencies, except for 754466, which does not block this even though it's related.
Depends on: 754465, 754458, 754460, 754461, 754462
Depends on: 772540
[jumping on an old bug to get the Cc:s mostly, please let me know if people would prefer a new bug]

socorro1.dev.dmz.phx1.mozilla.com filled up its root disk with directories like:
62916  socorro-install-9946-LqU
131016 socorro-crashstats-install-16823-LpR
They're 60-130Mb or so each; there was 27Gb in /tmp.

I checked that none of these were currently open/running and cleaned them up. The process that installs these really should clean them up afterwards. Is it automated currently, or would a cron job removing them if they're over a certain age work?
(In reply to Peter Radcliffe [:pir] from comment #44)
> [jumping on an old bug to get the Cc:s mostly, please let me know if people
> would prefer a new bug]
>
> socorro1.dev.dmz.phx1.mozilla.com filled up its root disk with directories
> like:
> 62916  socorro-install-9946-LqU
> 131016 socorro-crashstats-install-16823-LpR
> They're 60-130Mb or so each; there was 27Gb in /tmp.
>
> I checked that none of these were currently open/running and cleaned them
> up. The process that installs these really should clean them up afterwards.
> Is it automated currently, or would a cron job removing them if they're over
> a certain age work?

A new bug would be best. I'll go ahead and clean these up; there are update scripts in /data/bin that should be doing this. Thanks!
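The cron job pir suggests could be as simple as the sketch below; the one-day age threshold and the exact name patterns are assumptions taken from the directory names quoted above.

  #!/bin/bash
  # Hypothetical cleanup for stale install directories in /tmp; run daily from cron.
  find /tmp -maxdepth 1 -type d \
       \( -name 'socorro-install-*' -o -name 'socorro-crashstats-install-*' \) \
       -mtime +1 -exec rm -rf {} +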
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED