Closed Bug 1311022 Opened 8 years ago Closed 7 years ago

Disable backing up the hg users directories

Categories

(Infrastructure & Operations :: Infrastructure: Tools, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rtucker, Assigned: rtucker)

Details

Backing up hg has become problematic: it can take more than 24 hours, due to the number of files (63 million or so) as well as the total size (~500GB).
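
As a rough back-of-the-envelope check on the figures above, the sketch below converts them into average rates. It assumes the run takes exactly 24 hours (the report only says more than 24 hours) and treats 500GB as 500 * 10^9 bytes; the function name is hypothetical.

```python
def average_rates(files: int, size_bytes: float, hours: float) -> tuple[float, float]:
    """Return (files per second, megabytes per second) averaged over a backup run."""
    seconds = hours * 3600
    return files / seconds, size_bytes / seconds / 1e6


# Assumed inputs: 63 million files, ~500GB, a 24-hour run.
files_per_s, mb_per_s = average_rates(63_000_000, 500e9, 24)
print(f"{files_per_s:.0f} files/s, {mb_per_s:.1f} MB/s")
```

Under those assumptions the run averages roughly 729 files/s at under 6 MB/s, which suggests the file count, rather than the raw size, is what stretches the window.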

If we can eliminate backing up hg_qtree/mozilla/users/*, then we can keep the backup window where it needs to be, hopefully without reducing the redundancy of the backups by too much, since anything in the users subdirectory should still be recoverable from the users themselves.

We always have the previous 2 nights of everything (including users/*) as snapshots on the filers.

Thoughts?
Flags: needinfo?(gps)
The users/* directory is (in my mind at least) a lower SLA than other parts of hg.mozilla.org. Nothing of production importance should be running from a path with users/* in it. (Although there are some exceptions.)

I'm tentatively OK with removing backups from users/. Although I'd prefer to have a backup policy slightly better than 2 days retention of filer snapshots. If someone deletes a user repo or a user repo becomes corrupt, we'll want to have more than 2 days of backups.

Is there any way we can reduce the backup frequency of the users/ directory to, say, once or twice a week? Could we put users/ on a separate export and/or change its filer snapshot retention policy to be longer than 2 days?
Flags: needinfo?(gps) → needinfo?(rtucker)
(In reply to Gregory Szorc [:gps] from comment #1)
> The users/* directory is (in my mind at least) a lower SLA than other parts
> of hg.mozilla.org. Nothing of production importance should be running from a
> path with users/* in it. (Although there are some exceptions.)
> 
> I'm tentatively OK with removing backups from users/. Although I'd prefer to
> have a backup policy slightly better than 2 days retention of filer
> snapshots. If someone deletes a user repo or a user repo becomes corrupt,
> we'll want to have more than 2 days of backups.
> 
> Is there any way we can reduce the backup frequency of the users/ directory
> to say once or twice a week? Could we put users/ on a separate export and/or
> change its filer snapshot retention policy to be longer than 2 days?

I can rework this to just do a full backup once a week of users/*.
Flags: needinfo?(rtucker)
I'm fine with doing a full backup of users/* once a week and having filer snapshots for 2 days.

But I'd like some combination of {fubar, arr, hwine} to sign off on this as well. I'm not sure if there are historical expectations around backup policy we might be violating.
Flags: needinfo?(klibby)
Flags: needinfo?(hwine)
Flags: needinfo?(arich)
(In reply to Gregory Szorc [:gps] from comment #3)
> I'm fine with doing a full backup of users/* once a week and having filer
> snapshots for 2 days.
> 
> But I'd like some combination of {fubar, arr, hwine} to sign off on this as
> well. I'm not sure if there are historical expectations around backup policy
> we might be violating.

Perfect, thanks so much for your attention to this. I'll wait for additional comment before reworking the backup jobs.
If we're only backing up once a week, we should probably keep snapshots for that full week so that we have coverage of incremental changes in between full backups. How much churn do we see in those directories (e.g. how much would it cost us to keep a week of snapshots instead of two days)?
Flags: needinfo?(arich)
I can't answer the question in comment 5 as asked, because snapshots are volume-level, not directory-level.  That said, 2 nightlies and 6 'every 4 hours from 0800-2000 PT' / "hourlies" is 2270MB.
So, tiny.

I'll just toss this FYI out there: if we need some vastly different policy on the mozilla/users directory, we can split that out into a 'junction mount' on the filer side so it becomes a separate volume to me/same path to you, but that's a disruptive move (takes a downtime to manage the copyover).  It's been done on the bundles directory before.
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> If we're only backing up once a week, we should probably keep snapshots for
> that full week so that we have coverage of incremental changes in between
> full backups. How much churn do we see in those directories (e.g. how much
> would it cost us to keep a week of snapshots instead of two days)?

Do we have any stats for when we've had to restore anything from the users portion of the hg repo?
NI on the storage guys for their input on both increasing snapshots and junction mount. I'd *like* to have longer snapshots in general, but they can get ugly with longer time frames and unusual update patterns (e.g. a user making multiple clones of m-c and nuking them).

I am generally inclined toward the junction mount idea as we've toyed with making the user repo stuff separate for a long time and this is a good step in that direction.


(In reply to Rob Tucker [:rtucker] from comment #7)
> Do we have any stats for when we've had to restore anything from the users
> portion of the hg repo?

I don't think we've EVER had to do a restore for any portion of hg.
Flags: needinfo?(klibby) → needinfo?(cknowles)
I don't have a particular problem with increasing the snapshot retention on /vol/hg.  The concerns you stated, I share, but with the amount of runway we have on the snap reserve, I'm not too worried.  What're you thinking on schedule?  (Q weeklies, R dailies, S 'hourlies'), and, if we junctionmount, what kind of retention for 'main' vs 'users'?

As to junction mounts, I'll take a swing at what kind of outage time frame we'll need and get back with the bug after some experiments (likely next week).
Flags: needinfo?(cknowles)
I'm fine with a lower backup SLA for user repos. That is long overdue. (Releng uses are reasonable with any of the proposals.)

However, that begs the question of where do we publish the SLA?  This will be a change in inferred expectations. I wouldn't expect anyone to have any issue with that, as long as we don't surprise folks.

Since it's semi-related, hg.m.o is the only public hosting service for hg repos the size of m-c. (bitbucket is now enforcing warnings above 1G, with a hard cap at 2G.)
Flags: needinfo?(hwine)
(In reply to Greg Cox [:gcox] from comment #9)
> I don't have a particular problem with increasing the snapshot retention on
> /vol/hg.  The concerns you stated, I share, but with the amount of runway we
> have on the snap reserve, I'm not too worried.  What're you thinking on
> schedule?  (Q weeklies, R dailies, S 'hourlies'), and, if we junctionmount,
> what kind of retention for 'main' vs 'users'?

At the very least, daily coverage all the way back to whatever the backup frequency is; if we go with weekly on user repos, then 6 (7?) dailies. I'm not sure if a weekly there is worthwhile or not. I'd also like to have 4 dailies on the main repos for coverage over a long weekend.
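
The "6 (7?)" question can be sketched as simple arithmetic. This is one possible reading, not a statement of policy: it assumes one daily snapshot per day and counts only the days that fall strictly between two consecutive full backups, with an optional extra day of overlap. The function name and the `overlap_days` parameter are hypothetical.

```python
def dailies_needed(backup_interval_days: int, overlap_days: int = 0) -> int:
    """Daily snapshots needed to cover the days between two full backups.

    Assumes one snapshot per day; overlap_days adds extra margin that
    overlaps with the day a full backup runs.
    """
    if backup_interval_days < 1:
        raise ValueError("backup interval must be at least one day")
    return (backup_interval_days - 1) + overlap_days


print(dailies_needed(7))                  # weekly fulls, no overlap
print(dailies_needed(7, overlap_days=1))  # weekly fulls, one day of overlap
```

With weekly fulls this gives 6 dailies without overlap and 7 with one day of margin. The 4 dailies on the main repos are a separate consideration (long-weekend coverage) that this sketch does not model.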

(In reply to Hal Wine [:hwine] (use NI) from comment #10)
> However, that begs the question of where do we publish the SLA?  This will
> be a change in inferred expectations. I wouldn't expect anyone to have any
> issue with that, as long as we don't surprise folks.

What is the inferred expectation?
volume snapshot policy create -vserver devservices -policy hg -enabled true -comment "bug 1311022 comment 11" -schedule1 daily -count1 4 -schedule2 4hour8to20 -count2 6
volume snapshot policy create -vserver devservices -policy hg_users -enabled true -comment "bug 1311022 comment 11" -schedule1 daily -count1 7 -schedule2 4hour8to20 -count2 6

volume modify -vserver devservices -volume hg -snapshot-policy hg
volume modify -vserver devservices -volume hg_users -snapshot-policy hg_users
# ^ future volume for the junction mount, currently rsync'ing in
We've had bug 1238145 on file to establish a real-time replica of hg.mozilla.org in AWS for a while. That's blocked on connectivity from AWS back into SCL3. Once that is deployed, we can start taking and retaining EBS snapshots for backups. That could *potentially* lead to not having to run backups in SCL3. So if these backups/snapshots in SCL3 become too problematic, that's likely our exit vector.
Per conversation with :gcox, I did the work in bacula to split out users into its own separate backup job, to be run every Sunday as a full backup.

The backup job for the rest of hg remained the same, except for obviously excluding the users/* path.
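
The split can be pictured with the sketch below. It is not the bacula configuration itself, only a model of which job ends up covering a given path. The job names are hypothetical, and the absolute location of the users directory is an assumption pieced together from the paths mentioned elsewhere in this bug.

```python
from pathlib import PurePosixPath

# Assumed location of the users directory on the hg mount.
USERS_ROOT = PurePosixPath("/mnt/hg/hg_qtree/mozilla/users")


def backup_job_for(path: str) -> str:
    """Return the (hypothetical) name of the job that would back up `path`."""
    p = PurePosixPath(path)
    # Anything strictly below users/ goes to the weekly full job.
    if USERS_ROOT in p.parents:
        return "hg-users-weekly-full"
    # Everything else stays in the existing hg job.
    return "hg-main"


print(backup_job_for("/mnt/hg/hg_qtree/mozilla/users/someone/repo"))
print(backup_job_for("/mnt/hg/hg_qtree/mozilla/mozilla-central"))
```

The first path lands in the weekly users job and the second in the main job, mirroring the users/* exclusion described above.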
The work completed in comment #14 allowed for the backups to complete within the window as hoped.

The 2 jobs (hg and the users/* subsection) ran concurrently and took ~16 hours.

I looked through what was being backed up out of the hg mount (excluding users/*, as it is now separate) and found a few things that I'm wondering if we can stop backing up:

/mnt/hg/hg_qtree/dead_repositories/
/mnt/hg/hg_qtree/bak/
Everything here has been working perfectly.
Thanks for all the help, going to R/F
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED