Prod Collectors using wrong filesystem storage class

RESOLVED FIXED

Status

Socorro
Infra
--
major
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: lars, Assigned: phrawzty)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
It appears that on October 1, 2013, all the production collectors were reverted to the old-style crash storage classes that don't clean up after themselves.

They need to be migrated back to FSTemporaryStorage.
(Reporter)

Comment 1

3 years ago
I've mitigated the problem for three months on C1, C2, C3 with:

    sudo rm -rf ~socorro/primaryCrashStore/2013????

I'll hit the other collectors tomorrow when these are done.
(Reporter)

Comment 2

3 years ago
All six prod collectors have now had 90 days of old empty directories purged. We will have to clear out all of it eventually, but for now we have a 90-day buffer before we encounter the problem again.

Each evening when the load is low, I'm going to continue this process of clearing out old directories.
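
For the nightly cleanup, a dry-run listing of candidates may be safer than a bare glob. The following is only a sketch: it assumes the top-level entries under primaryCrashStore are named YYYYMMDD (as the `2013????` glob suggests), and the function name `old_dated_dirs` and the `keep_days` parameter are hypothetical. It only lists directories and deletes nothing.

```python
import datetime as dt
import os
import tempfile


def old_dated_dirs(root, today, keep_days=90):
    """List YYYYMMDD-named directories under root dated before the cutoff.

    Assumption: dated directories sit directly under root and are named
    with exactly eight digits. Anything else is ignored.
    """
    cutoff = today - dt.timedelta(days=keep_days)
    found = []
    for name in sorted(os.listdir(root)):
        if len(name) != 8 or not name.isdigit():
            continue  # not a dated directory; leave it alone
        if not os.path.isdir(os.path.join(root, name)):
            continue
        try:
            day = dt.datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a valid calendar date
        if day < cutoff:
            found.append(name)
    return found


if __name__ == "__main__":
    # Demo against a throwaway directory, not a real crash store.
    demo_root = tempfile.mkdtemp()
    for name in ("20130101", "20140930", "notadate"):
        os.mkdir(os.path.join(demo_root, name))
    print(old_dated_dirs(demo_root, dt.date(2014, 10, 10)))
```

The output can be reviewed by hand before anything is passed to rm.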

Comment 3

3 years ago
New plan -- disable puppet on all, remove half of the collectors from zeus, let them drain, upgrade them, return them to zeus, then repeat the last 4 steps for the other half and re-enable puppet on all.

It's a little more manually intensive and it involves touching zeus. It also cuts our capacity in half, but we can handle 3x normal peak load so if we do it off peak hours it should be fine. If phrawzty can help with zeus we can do it early PST and everything will work out.
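
A rough check of the capacity claim, as a sketch only. It assumes capacity scales linearly with the number of collectors in zeus and that load is balanced evenly across them; neither is confirmed here. The function name `headroom_after_removal` is hypothetical. The count of six collectors comes from comment 2.

```python
def headroom_after_removal(total, removed, peak_multiple):
    """Multiple of normal peak load the remaining collectors can absorb.

    Assumption: capacity is proportional to the number of collectors
    still in rotation.
    """
    if not 0 <= removed < total:
        raise ValueError("removed must be between 0 and total - 1")
    return peak_multiple * (total - removed) / total


if __name__ == "__main__":
    # Six collectors, three drained, full pool handles 3x normal peak.
    print(headroom_after_removal(6, 3, 3.0))  # prints 1.5
```

Under those assumptions, half the pool still covers 1.5x normal peak, which leaves margin even before moving the work to off-peak hours.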

There's no rush; we can do it when Phrawzty gets back next week iff he has permissions to modify zeus still. If not, we can schedule with webops.
(Reporter)

Comment 4

3 years ago
The steps of the plan (see https://etherpad.mozilla.org/Atkhh07tOK for discussion and plan alternatives):

1) Disable puppet on all
2) Update svn config with the new configuration
2.1) in both 'collector.ini' and 'crashmover.ini', replace 'socorro.external.fs.crashstorage.FSLegacyDatedRadixTreeStorage' with 'socorro.external.fs.crashstorage.FSTemporaryStorage'
3) Remove half of the collectors from zeus, let them drain
4) After the drain is complete
4.1) mv $SOCORRO_HOME/primaryCrashStore $SOCORRO_HOME/retired/primaryCrashStore
4.2) mkdir $SOCORRO_HOME/primaryCrashStore
4.3) chown -R apache:socorro $SOCORRO_HOME/primaryCrashStore
4.4) chmod -R g+ws $SOCORRO_HOME/primaryCrashStore
4.5) chmod -R o+rx $SOCORRO_HOME/primaryCrashStore
5) Run puppet manually on the removed set
5.1) restart Apache & crashmovers
5.2) verify that logged current config is correct
6) Return them to Zeus
7) Watch for trouble
8) Repeat 3-7 with the other set of collectors
9) Enable puppet on all
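
Step 2.1 could be sketched as a plain string substitution that also reports how many occurrences were changed, which helps with the verification in 5.2. This is an illustration only: the helper name `swap_storage_class` and the `crashstorage_class` key in the demo are hypothetical, and the real ini files may name the setting differently.

```python
OLD = "socorro.external.fs.crashstorage.FSLegacyDatedRadixTreeStorage"
NEW = "socorro.external.fs.crashstorage.FSTemporaryStorage"


def swap_storage_class(text):
    """Return (updated_text, count) with every OLD class path replaced by NEW."""
    count = text.count(OLD)
    return text.replace(OLD, NEW), count


if __name__ == "__main__":
    # Hypothetical ini fragment; the key name is an assumption.
    sample = "crashstorage_class=" + OLD + "\n"
    updated, changed = swap_storage_class(sample)
    print(changed)
    print(updated, end="")
```

A count of zero for either file would suggest the file was already migrated or uses a different class than expected.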
(Assignee)

Comment 5

3 years ago
(In reply to Chris Lonnen :lonnen from comment #3)
> There's no rush; we can do it when Phrawzty gets back next week iff he has
> permissions to modify zeus still. If not, we can schedule with webops.

It would appear that I can still log into the Zeus admin panel, so that's good.
(Assignee)

Comment 6

3 years ago
This manipulation is currently scheduled for Wednesday, 22 October 2014, 07:00:00 UTC [1].

[1] http://www.timeanddate.com/worldclock/meetingdetails.html?year=2014&month=10&day=22&hour=7&min=0&sec=0&p1=195&p2=179&p3=224
Assignee: nobody → dmaher
Severity: normal → major
(Assignee)

Comment 7

3 years ago
Created attachment 8509298
updated FS type
(Assignee)

Comment 8

3 years ago
$ svn ci -m 'update FS type; bug 1079642'
Sending        collector.ini
Sending        crashmover.ini
Transmitting file data ..
Committed revision 95255.
Status: NEW → ASSIGNED
(Assignee)

Comment 9

3 years ago
The manipulation[1] is complete.


[1] https://etherpad.mozilla.org/ep/pad/view/ro.DcUxGDURaBv/rev.1629
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Updated

3 years ago
Blocks: 1087311
(Assignee)

Updated

3 years ago
Blocks: 1087414