Closed
Bug 1311022
Opened 8 years ago
Closed 7 years ago
Disable backing up the hg users directories
Categories
(Infrastructure & Operations :: Infrastructure: Tools, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rtucker, Assigned: rtucker)
Details
Backing up hg has become problematic: a run can take more than 24 hours because of the number of files (63 million or so) and the total size (~500GB). If we stop backing up hg_qtree/mozilla/users/*, we can keep the backup window where it needs to be, hopefully without reducing the redundancy of the backups too much, since anything in the users subdirectory should still be recoverable from the user who owns it. We always have the previous 2 nights of everything (including users/*) as snapshots on the filers. Thoughts?
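For a sense of scale, here is a back-of-envelope calculation using the figures above (about 63 million files, roughly 500GB, a 24-hour window). It is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and a perfectly even rate, and the function name is illustrative.

```python
def required_rates(file_count, total_bytes, window_hours):
    """Return (files per second, megabytes per second) needed to finish in the window."""
    seconds = window_hours * 3600
    return file_count / seconds, total_bytes / seconds / 1e6


# Figures quoted in the description; 500e9 assumes 1 GB = 10^9 bytes.
files_per_s, mb_per_s = required_rates(63_000_000, 500e9, 24)
print(f"{files_per_s:.0f} files/s, {mb_per_s:.1f} MB/s")
```

This prints `729 files/s, 5.8 MB/s`. The byte rate is modest, which suggests (but does not prove) that the file count, more than the volume of data, is what stretches the window.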
Assignee
Updated•8 years ago
Flags: needinfo?(gps)
Comment 1•8 years ago
The users/* directory is (in my mind at least) a lower SLA than other parts of hg.mozilla.org. Nothing of production importance should be running from a path with users/* in it. (Although there are some exceptions.) I'm tentatively OK with removing backups from users/. Although I'd prefer to have a backup policy slightly better than 2 days retention of filer snapshots. If someone deletes a user repo or a user repo becomes corrupt, we'll want to have more than 2 days of backups. Is there any way we can decrease the backup interval of the users/ directory to say once or twice a week? Could we put users/ on a separate export and/or change its filer snapshot retention policy to be longer than 2 days?
Flags: needinfo?(gps) → needinfo?(rtucker)
Assignee
Comment 2•8 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> The users/* directory is (in my mind at least) a lower SLA than other parts
> of hg.mozilla.org. Nothing of production importance should be running from a
> path with users/* in it. (Although there are some exceptions.)
>
> I'm tentatively OK with removing backups from users/. Although I'd prefer to
> have a backup policy slightly better than 2 days retention of filer
> snapshots. If someone deletes a user repo or a user repo becomes corrupt,
> we'll want to have more than 2 days of backups.
>
> Is there any way we can decrease the backup interval of the users/ directory
> to say once or twice a week? Could we put users/ on a separate export and/or
> change its filer snapshot retention policy to be longer than 2 days?
I can rework this to just do a full backup once a week of users/*.
Flags: needinfo?(rtucker)
Comment 3•8 years ago
I'm fine with doing a full backup of users/* once a week and having filer snapshots for 2 days. But I'd like some combination of {fubar, arr, hwine} to sign off on this as well. I'm not sure if there are historical expectations around backup policy we might be violating.
Flags: needinfo?(klibby)
Flags: needinfo?(hwine)
Flags: needinfo?(arich)
Assignee
Comment 4•8 years ago
(In reply to Gregory Szorc [:gps] from comment #3)
> I'm fine with doing a full backup of users/* once a week and having filer
> snapshots for 2 days.
>
> But I'd like some combination of {fubar, arr, hwine} to sign off on this as
> well. I'm not sure if there are historical expectations around backup policy
> we might be violating.
Perfect, thanks so much for your attention to this. I'll wait for additional comment before reworking the backup jobs.
Comment 5•8 years ago
If we're only backing up once a week, we should probably keep snapshots for that full week so that we have coverage of incremental changes in between full backups. How much churn do we see in those directories (e.g. how much would it cost us to keep a week of snapshots instead of two days)?
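The coverage concern can be sketched as a small simulation. The model is an assumption, not something stated in this bug: one full backup at the start of each interval, one nightly snapshot, and the check is made on the day just before the next full backup runs. The function name is illustrative.

```python
def unrecoverable_nights(backup_interval_days, daily_snapshots):
    """Count nights since the last full backup that nothing retained covers,
    evaluated on the day just before the next full backup runs."""
    # Nights are numbered 0 (the night of the last full backup) up to
    # backup_interval_days - 1 (last night).
    last_night = backup_interval_days - 1
    # The filer keeps only the most recent `daily_snapshots` nightly snapshots.
    kept = set(range(last_night, last_night - daily_snapshots, -1))
    kept.add(0)  # the full backup itself
    return sum(1 for night in range(backup_interval_days) if night not in kept)


print(unrecoverable_nights(7, 2))  # weekly full backup, two nightly snapshots
print(unrecoverable_nights(7, 6))  # weekly full backup, six nightly snapshots
```

Under this model the first call reports 4 uncovered nights and the second reports 0. So with a weekly full backup, six nightly snapshots are the minimum that leaves no night uncovered.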
Flags: needinfo?(arich)
Comment 6•8 years ago
I can't answer the question in comment 5 as asked, because snapshots are volume-level, not directory-level. That said, 2 nightlies and 6 'every 4 hours from 0800-2000 PT' / "hourlies" is 2270MB. So, tiny. I'll just toss this FYI out there: if we need some vastly different policy on the mozilla/users directory, we can split that out into a 'junction mount' on the filer side so it becomes a separate volume to me/same path to you, but that's a disruptive move (takes a downtime to manage the copyover). It's been done on the bundles directory before.
Assignee
Comment 7•8 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> If we're only backing up once a week, we should probably keep snapshots for
> that full week so that we have coverage of incremental changes in between
> full backups. How much churn do we see in those directories (e.g. how much
> would it cost us to keep a week of snapshots instead of two days)?
Do we have any stats for when we've had to restore anything from the users portion of the hg repo?
Comment 8•8 years ago
NI on the storage guys for their input on both increasing snapshots and junction mount. I'd *like* to have longer snapshots in general, but they can get ugly with longer time frames and unusual update patterns (e.g. a user making multiple clones of m-c and nuking them). I am generally inclined toward the junction mount idea as we've toyed with making the user repo stuff separate for a long time and this is a good step in that direction.
(In reply to Rob Tucker [:rtucker] from comment #7)
> Do we have any stats for when we've had to restore anything from the users
> portion of the hg repo?
I don't think we've EVER had to do a restore for any portion of hg.
Flags: needinfo?(klibby) → needinfo?(cknowles)
Comment 9•8 years ago
I don't have a particular problem with increasing the snapshot retention on /vol/hg. The concerns you stated, I share, but with the amount of runway we have on the snap reserve, I'm not too worried. What're you thinking on schedule? (Q weeklies, R dailies, S 'hourlies'), and, if we junctionmount, what kind of retention for 'main' vs 'users'? As to junction mounts, I'll take a swing at what kind of outage time frame we'll need and get back with the bug after some experiments (likely next week).
Flags: needinfo?(cknowles)
Comment 10•8 years ago
I'm fine with a lower backup SLA for user repos. That is long overdue. (Releng uses are reasonable with any of the proposals.) However, that begs the question of where do we publish the SLA? This will be a change in inferred expectations. I wouldn't expect anyone to have any issue with that, as long as we don't surprise folks. Since it's semi-related, hg.m.o is the only public hosting service for hg repos the size of m-c. (bitbucket is now enforcing warnings above 1G, with a hard cap at 2G.)
Flags: needinfo?(hwine)
Comment 11•8 years ago
(In reply to Greg Cox [:gcox] from comment #9)
> I don't have a particular problem with increasing the snapshot retention on
> /vol/hg. The concerns you stated, I share, but with the amount of runway we
> have on the snap reserve, I'm not too worried. What're you thinking on
> schedule? (Q weeklies, R dailies, S 'hourlies'), and, if we junctionmount,
> what kind of retention for 'main' vs 'users'?
At the very least, daily coverage all the way back to whatever the backup frequency is; if we go with weekly on user repos, then 6 (7?) dailies. I'm not sure if a weekly there is worthwhile or not. I'd also like to have 4 dailies on the main repos for coverage over a long weekend.
(In reply to Hal Wine [:hwine] (use NI) from comment #10)
> However, that begs the question of where do we publish the SLA? This will
> be a change in inferred expectations. I wouldn't expect anyone to have any
> issue with that, as long as we don't surprise folks.
What is the inferred expectation?
Comment 12•8 years ago
volume snapshot policy create -vserver devservices -policy hg -enabled true -comment "bug 1311022 comment 11" -schedule1 daily -count1 4 -schedule2 4hour8to20 -count2 6
volume snapshot policy create -vserver devservices -policy hg_users -enabled true -comment "bug 1311022 comment 11" -schedule1 daily -count1 7 -schedule2 4hour8to20 -count2 6
volume modify -vserver devservices -volume hg -snapshot-policy hg
volume modify -vserver devservices -volume hg_users -snapshot-policy hg_users
# ^ future volume for the junction mount, currently rsync'ing in
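To see roughly how far back each schedule in the hg_users policy reaches, the retention counts can be turned into timestamps. This is a sketch under assumptions: the firing hours below are guesses based on the schedule names and the "every 4 hours from 0800-2000 PT" description in comment 6, not values read from the filer. The helper name and the sample date are made up.

```python
from datetime import datetime, timedelta


def retained_snapshots(now, hours, count):
    """Return the `count` most recent firing times at or before `now`
    for a schedule that fires every day at the given hours (newest first)."""
    times = []
    day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    while len(times) < count:
        for hour in sorted(hours, reverse=True):
            fired = day + timedelta(hours=hour)
            if fired <= now and len(times) < count:
                times.append(fired)
        day -= timedelta(days=1)
    return times


DAILY = (0,)                    # assumption: 'daily' fires at midnight
FOUR_HOURLY = (8, 12, 16, 20)   # assumption: '4hour8to20' fires at these hours

now = datetime(2016, 11, 1, 13, 0)  # arbitrary sample time
print(retained_snapshots(now, DAILY, 7)[-1])        # oldest of 7 dailies
print(retained_snapshots(now, FOUR_HOURLY, 6)[-1])  # oldest of 6 four-hourlies
```

With these assumed hours, the oldest retained daily is 2016-10-26 00:00:00 and the oldest retained four-hourly snapshot is 2016-10-31 08:00:00. In other words, the six four-hourly snapshots reach back about a day and a half, and the dailies carry the rest of the week.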
Comment 13•8 years ago
We've had bug 1238145 on file to establish a real-time replica of hg.mozilla.org in AWS for a while. That's blocked on connectivity from AWS back into SCL3. Once that is deployed, we can start taking and retaining EBS snapshots for backups. That could *potentially* lead to not having to run backups in SCL3. So if these backups/snapshots in SCL3 become too problematic, that's likely our exit vector.
Assignee
Comment 14•8 years ago
Per conversation with :gcox, I did the work in bacula to split users out into its own separate backup job, run every Sunday as a full backup. The backup job for the rest of hg remains the same, except that it now excludes the users/* path.
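The split can be pictured as a simple routing rule. The sketch below is not the bacula configuration itself. The job names are hypothetical, and the users root path is an assumption pieced together from paths mentioned elsewhere in this bug.

```python
from datetime import date
from pathlib import PurePosixPath

# Assumed mount layout; not confirmed by the actual backup configuration.
USERS_ROOT = PurePosixPath("/mnt/hg/hg_qtree/mozilla/users")


def job_for(path):
    """Return the (hypothetical) name of the job that would back up `path`."""
    p = PurePosixPath(path)
    if p == USERS_ROOT or USERS_ROOT in p.parents:
        return "hg-users"  # separate job: weekly full backup
    return "hg-main"       # unchanged job, which now excludes users/*


def users_job_runs_on(day):
    """The users job runs as a full backup every Sunday."""
    return day.weekday() == 6  # Monday is 0, Sunday is 6


# Example paths are made up for illustration.
print(job_for("/mnt/hg/hg_qtree/mozilla/users/example/repo"))
print(job_for("/mnt/hg/hg_qtree/mozilla/example-repo"))
print(users_job_runs_on(date(2016, 10, 23)))
```

The three calls print `hg-users`, `hg-main`, and `True` (23 October 2016 was a Sunday). Matching on path components rather than string prefixes keeps a sibling directory such as users_old from being routed to the users job by accident.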
Assignee
Comment 15•8 years ago
The work completed in comment #14 allowed the backups to complete within the window as hoped. The 2 jobs (hg and the users/* subsection) ran concurrently and took ~16 hours. I looked through what is being backed up out of the hg mount (excluding users/*, as it is now separate) and found a few things that I'm wondering if we can stop backing up:
/mnt/hg/hg_qtree/dead_repositories/
/mnt/hg/hg_qtree/bak/
Assignee
Comment 16•7 years ago
Everything here has been working perfectly. Thanks for all the help, going to R/F
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED