Closed Bug 1079642 Opened 5 years ago Closed 5 years ago

Prod Collectors using wrong filesystem storage class

Categories

(Socorro :: Infra, task, major)

x86_64
Linux
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Assigned: dmaher)

Attachments

(1 file)

It appears that on October 1, 2013, all the production collectors were reverted to the old-style crash storage classes, which don't clean up after themselves.

They need to be migrated back to FSTemporaryStorage.
I've mitigated the problem for three months on C1, C2, C3 with:

    sudo rm -rf ~socorro/primaryCrashStore/2013????

I'll hit the other collectors tomorrow when these are done.
All six prod collectors have now had 90 days of old empty directories purged. We will have to clear out all of them eventually, but for now we have a 90-day buffer before we encounter the problem again.

Each evening when the load is low, I'm going to continue this process of clearing out old directories.
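
For the recurring evening cleanup, a date-aware selection could stand in for the hand-written `2013????` glob. The following is a minimal sketch, assuming the entries directly under the store are directories named `YYYYMMDD` (the glob suggests this, but the thread does not confirm it). The function name `dated_dirs_before` is hypothetical, and it only lists candidates; it deletes nothing.

```python
import datetime
import os


def dated_dirs_before(root, cutoff):
    """Return sorted paths of YYYYMMDD-named directories under root dated before cutoff.

    Assumption: top-level entries are named by date, e.g. 20131001.
    Entries that do not parse as a calendar date are skipped, never selected.
    """
    selected = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if len(name) != 8 or not name.isdigit() or not os.path.isdir(path):
            continue
        try:
            day = datetime.datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a real calendar date
        if day < cutoff:
            selected.append(path)
    return selected
```

Reviewing the returned list before removing anything keeps the destructive step separate from the selection logic.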
New plan -- disable puppet on all, remove half of the collectors from zeus, let them drain, upgrade them, return them to zeus, then repeat the last 4 steps for the other half and re-enable puppet on all.

It's a little more manual and it involves touching zeus. It also cuts our capacity in half, but we can handle 3x normal peak load, so half the collectors should still cover roughly 1.5x normal peak; if we do it during off-peak hours it should be fine. If phrawzty can help with zeus we can do it early PST and everything will work out.

There's no rush; we can do it when Phrawzty gets back next week iff he has permissions to modify zeus still. If not, we can schedule with webops.
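
The two-batch ordering described in this plan can be sketched as follows. This is illustrative only: the function `rolling_upgrade` and its four callables are hypothetical hooks standing in for the manual steps (changing the Zeus pool, waiting for the drain, applying the upgrade), not an existing tool. Disabling puppet beforehand and re-enabling it afterwards are left outside the sketch.

```python
def rolling_upgrade(collectors, remove, drain, upgrade, restore):
    """Upgrade collectors in two batches so that one half keeps serving traffic.

    remove, drain, upgrade and restore are hypothetical per-host callables.
    Returns the two batches in the order they were processed.
    """
    half = (len(collectors) + 1) // 2
    batches = [collectors[:half], collectors[half:]]
    for batch in batches:
        if not batch:
            continue
        for host in batch:
            remove(host)    # take the host out of the load balancer
        for host in batch:
            drain(host)     # wait until pending crashes have been moved off
        for host in batch:
            upgrade(host)   # apply the new configuration and restart services
        for host in batch:
            restore(host)   # return the host to the load balancer
    return batches
```

The second batch is only touched after every host in the first batch has been restored, which is what keeps at least half the capacity in service at all times.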
The steps of the plan (see https://etherpad.mozilla.org/Atkhh07tOK for discussion and alternative plans):

1) Disable puppet on all
2) Update svn config with the new configuration
2.1) In both 'collector.ini' and 'crashmover.ini', replace 'socorro.external.fs.crashstorage.FSLegacyDatedRadixTreeStorage' with 'socorro.external.fs.crashstorage.FSTemporaryStorage'
3) Remove half of the collectors from zeus, let them drain
4) After they have drained completely:
4.1) mv $SOCORRO_HOME/primaryCrashStore $SOCORRO_HOME/retired/primaryCrashStore
4.2) mkdir $SOCORRO_HOME/primaryCrashStore
4.3) chown -R apache:socorro $SOCORRO_HOME/primaryCrashStore
4.4) chmod -R g+ws $SOCORRO_HOME/primaryCrashStore
4.5) chmod -R o+rx $SOCORRO_HOME/primaryCrashStore
5) Run puppet manually on the removed set
5.1) Restart Apache & crashmovers
5.2) Verify that the logged current config is correct
6) Return them to Zeus
7) Watch for trouble
8) Repeat steps 3-7 with the other set of collectors
9) Enable puppet on all
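
Step 4 of the list above (retire the old store, create a fresh one, fix its permissions) could be scripted roughly as below. This is a sketch under assumptions: the store is `$SOCORRO_HOME/primaryCrashStore` throughout, the `retired` directory is on the same filesystem, and the function name `reset_crash_store` is hypothetical. The `chown` to apache:socorro needs root and is deliberately left to the operator.

```python
import os
import stat


def reset_crash_store(socorro_home, store_name="primaryCrashStore"):
    """Move the existing store aside and create an empty replacement.

    Mirrors steps 4.1, 4.2, 4.4 and 4.5; ownership (4.3) is not changed here.
    Returns the path of the retired store.
    """
    store = os.path.join(socorro_home, store_name)
    retired_dir = os.path.join(socorro_home, "retired")
    os.makedirs(retired_dir, exist_ok=True)
    retired = os.path.join(retired_dir, store_name)
    if os.path.exists(retired):
        raise FileExistsError(retired)  # refuse to clobber an earlier retirement
    os.rename(store, retired)           # 4.1 (assumes the same filesystem)
    os.mkdir(store)                     # 4.2
    mode = stat.S_IMODE(os.stat(store).st_mode)
    # 4.4 and 4.5: group write plus setgid, and read/execute for others
    mode |= stat.S_IWGRP | stat.S_ISGID | stat.S_IROTH | stat.S_IXOTH
    os.chmod(store, mode)
    return retired
```

Renaming instead of deleting keeps the drained store available under `retired` in case anything still needs to be recovered from it.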
(In reply to Chris Lonnen :lonnen from comment #3)
> There's no rush; we can do it when Phrawzty gets back next week iff he has
> permissions to modify zeus still. If not, we can schedule with webops.

It would appear that I can still log into the Zeus admin panel, so that's good.
This manipulation is currently scheduled for Wednesday, 22 October 2014, 07:00:00 UTC [1].

[1] http://www.timeanddate.com/worldclock/meetingdetails.html?year=2014&month=10&day=22&hour=7&min=0&sec=0&p1=195&p2=179&p3=224
Assignee: nobody → dmaher
Severity: normal → major
Attached patch: updated FS type
$ svn ci -m 'update FS type; bug 1079642'
Sending        collector.ini
Sending        crashmover.ini
Transmitting file data ..
Committed revision 95255.
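
The change committed to the two files amounts to swapping one storage class name for another. A minimal sketch of that substitution follows, using plain text replacement rather than ini parsing because the actual option names in `collector.ini` and `crashmover.ini` are not shown in this bug. The helper `swap_storage_class` and the key name in the sample line are hypothetical.

```python
OLD = "socorro.external.fs.crashstorage.FSLegacyDatedRadixTreeStorage"
NEW = "socorro.external.fs.crashstorage.FSTemporaryStorage"


def swap_storage_class(text, old=OLD, new=NEW):
    """Return (new_text, count): text with every occurrence of old replaced by new."""
    count = text.count(old)
    return text.replace(old, new), count


# Hypothetical config line; the real key name is an assumption.
sample = "crashstorage_class=" + OLD + "\n"
updated, replaced = swap_storage_class(sample)
```

Returning the replacement count makes it easy to notice a file in which nothing matched, for example because it had already been migrated.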
Status: NEW → ASSIGNED
The manipulation[1] is complete.


[1] https://etherpad.mozilla.org/ep/pad/view/ro.DcUxGDURaBv/rev.1629
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Blocks: 1087311
Blocks: 1087414