Need Data Ageing Policy for Socorro

RESOLVED FIXED in 2.1

Status

Product: Socorro
Component: General
Priority: --
Severity: major
Status: RESOLVED FIXED
Reported: 7 years ago
Modified: 7 years ago

People

(Reporter: jberkus, Assigned: laura)

Tracking

Version: Trunk
Platform: x86, Mac OS X

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

7 years ago
Before we can go "full throttle" on Socorro, or indeed before we can last another 6 months, we need to have a policy on "data ageing", i.e. what data we keep for how long, and what data we get rid of.  This is not a single number, but a set of numbers across the various different kinds of data we keep.  At the least, we need to decide how long we keep:

1. Base-level data
   a. raw dumps on HBase (Daniel suggests 6 months)
   b. processed dumps on HBase
   c. processed reports in PostgreSQL
2. Summary Data
   a. hourly summary and counts (TCBS etc.)
   b. daily summary and counts (would be a rollup of (a))
   c. monthly/per product summary and counts (would be a rollup of (a))

Of the two, expiring (1) is far more important than expiring (2), which is comparatively much smaller.  However, both kinds of data need to expire eventually and be purged, so let's set a policy and automate it.

Expiration times could be based on the calendar, or based on product release dates.
(Reporter)

Comment 1

7 years ago
Oh, I forgot:

3. Other data
   a. e-mail campaigns
   b. raw_adu
   c. probably other stuff I'm not thinking about right now ...
Comment 2

Does this need to be global or per product, i.e. one policy for Firefox and a different one for products with fewer users (Camino, SeaMonkey and Thunderbird)?
(Assignee)

Comment 3

7 years ago
(In reply to comment #2)
> Does this need to be global or per product - ie have a policy for Ff a
> different one for product with less users (Camino, SeaMonkey and Thunderbird) ?

Right now everything but Fx is a drop in the data bucket, so I'm less concerned about those.  If you have input though Ludo, please let us know.
(Reporter)

Comment 4

7 years ago
Data sizes:

Currently the PG database is ~330GB in size.  With 10% throttling of the major release versions of FF, that grows at about 12GB per week.  This means that if we were concerned about disk space on master01 alone, we could allow data to persist for about another 30 weeks before we started running out of disk space.

HOWEVER, there are other considerations:

1) We currently make full data copies for the relay server and devDB.  These servers have less disk space, and in fact only have room for another 8 weeks of data.

2) The size of the PG database adds to the following:
   a) the amount of time required to resync if the relay server gets out of sync
   b) the amount of time required to make archival backups, should we start doing so
   c) the amount of time required to fail back if required

Therefore, within the next 8 weeks, we either need to increase the amount of disk space available to relayDB and devDB, or we need to start purging data.

See also bug 635098
(Reporter)

Comment 5

7 years ago
Oh, other data:

* We currently have 41 weeks of data on the Postgres database.
* Currently devDB and relayDB have 500GB each available, and master01 has 830GB.
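
For anyone who wants to redo the projection as the sizes change, the arithmetic behind a disk-headroom estimate can be sketched as below. This is only a minimal sketch, not the calculation used for the figures in Comment 4: the function name and the `usable_fraction` parameter are hypothetical, and this bug does not say how much free-space margin was assumed. With raw capacity the result comes out higher than the 8 and 30 weeks quoted there, which suggests those estimates hold back some margin or account for other data on the same disks.

```python
def weeks_of_headroom(capacity_gb, used_gb, growth_gb_per_week, usable_fraction=1.0):
    """Estimate how many weeks remain before a disk fills up.

    usable_fraction is a hypothetical safety margin (e.g. 0.85 means
    "treat the disk as full at 85%"); it is not a figure from this bug.
    """
    if growth_gb_per_week <= 0:
        raise ValueError("growth_gb_per_week must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    free_gb = capacity_gb * usable_fraction - used_gb
    # Never report negative headroom: a full disk has zero weeks left.
    return max(0.0, free_gb / growth_gb_per_week)


# Figures from this bug: ~330GB used, ~12GB/week growth at 10% throttling.
relay_weeks = weeks_of_headroom(500, 330, 12)   # devDB / relayDB capacity
master_weeks = weeks_of_headroom(830, 330, 12)  # master01 capacity
print(f"relayDB/devDB: {relay_weeks:.1f} weeks, master01: {master_weeks:.1f} weeks")
```

Passing a `usable_fraction` below 1 shortens the estimate; which value is appropriate depends on how full we are willing to let each server get.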
(Reporter)

Updated

7 years ago
Blocks: 635098
Comment 6

Josh, out of these 330GB, can we break the numbers down by version?

I.e. how much do 3.5 and earlier represent, how much does 3.6, etc.?
(Reporter)

Comment 7

7 years ago
Clarification per e-mail:

The 30 weeks and 8 weeks projections were based on the idea that we start throttling FF4 crashes within the next couple of days.

If we continue FF4 at 100%, we only have 3-4 weeks on RelayDB and DevDB.
(Assignee)

Updated

7 years ago
Assignee: nobody → laura
(Reporter)

Comment 8

7 years ago
Laura,

Update on this?
(Reporter)

Comment 9

7 years ago
Ludovic,

I don't currently have a machine where I can run a query which will answer that question, since it would involve scanning the entire database.   Unfortunately, DevDB is far too slow, and I can't run such a report on prod.  Possibly StageDB will become available for this purpose sometime soon.
(Assignee)

Comment 10

7 years ago
Waiting on data reconciliation from Lars.
(Assignee)

Comment 11

7 years ago
PS this isn't a code bug so it doesn't block 1.7.8 freeze.

Lars will have the reconciliation finished by end of this week.  We'll give that data to CrashKill next week, and they'll get back to us with comments.
(Assignee)

Updated

7 years ago
Target Milestone: 1.7.8 → 2.0
(Assignee)

Updated

7 years ago
Target Milestone: 2.0 → 2.1
(Assignee)

Comment 12

7 years ago
Policy drafted and sent to the CrashKill team.
(Assignee)

Comment 13

7 years ago
Got signoff on policy, here it is:
Keep indefinitely:
- Crashes/ADU  (ideally put this into some kind of metrics DB, Pentaho or whatever)
- Data on crash-analysis (revisit as needed)

Raw crashes: Delete after 6 months

Processed crashes:
- Nightly and Aurora crashes - 3 months
- Beta crashes - 6 months
- Release crashes - 12 months 

If non-Firefox projects have longer needs (please let me know) that may be okay, depending on how hard that is to implement.  I believe for now we're going to delete everything over a year old until we get some infrastructure in place for the granularity we want.
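
To make the policy above easier to automate, here is one way it could be written down as data plus a small check. This is a minimal sketch under stated assumptions, not Socorro code: the names (`is_expired`, the `kind` and `channel` strings, `KEEP_INDEFINITELY`) are hypothetical, a month is approximated as 30 days because the policy does not define one, and channels not listed in the policy fall back to the interim "delete everything over a year old" rule.

```python
from datetime import date, timedelta

# Retention periods in months, taken from the signed-off policy.
RAW_CRASH_MONTHS = 6
PROCESSED_CRASH_MONTHS = {
    "nightly": 3,
    "aurora": 3,
    "beta": 6,
    "release": 12,
}
# Interim rule from the policy: delete everything over a year old.
INTERIM_MAX_MONTHS = 12
# Data the policy says to keep indefinitely (hypothetical kind names).
KEEP_INDEFINITELY = {"crashes_per_adu", "crash_analysis"}
# Assumption: one month is treated as 30 days.
DAYS_PER_MONTH = 30


def is_expired(kind, crash_date, today, channel=None):
    """Return True if a record of this kind is past its retention period."""
    if kind in KEEP_INDEFINITELY:
        return False
    if kind == "raw":
        months = RAW_CRASH_MONTHS
    elif kind == "processed":
        # Unlisted channels fall back to the interim one-year rule.
        months = PROCESSED_CRASH_MONTHS.get(channel, INTERIM_MAX_MONTHS)
    else:
        raise ValueError(f"unknown data kind: {kind!r}")
    return today - crash_date > timedelta(days=months * DAYS_PER_MONTH)


today = date(2011, 6, 1)
print(is_expired("processed", date(2011, 1, 1), today, channel="nightly"))
print(is_expired("processed", date(2011, 1, 1), today, channel="release"))
```

Keeping the periods in a table rather than in the purge logic means a longer period for a non-Firefox product could later be added as another entry, if that turns out to be needed.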
Comment 14

This sounds reasonable to me for now.
Component: Socorro → General
Product: Webtools → Socorro
(Assignee)

Comment 15

7 years ago
Calling this done.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED