Closed Bug 1266827 Opened 8 years ago Closed 8 years ago

s3 pseudo-filenames prevent crash listings by date

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

The pseudo-filename for a raw crash is:

    {prefix}/{version}/{name_of_thing}/{id}

where {id} is an ooid which encodes the depth and date at the end. Thus the pseudo-filename has the date the crash came in at the very end.

One thing we want to be able to do is look at crashes by date. This helps when reprocessing crashes for specific dates.

The problem here is that we've got tons of objects in our bucket in s3. Listing all the objects is not feasible and the only way to pare down the list to the ones for a specific date is to filter on prefix, but since the date is at the end of the pseudo-filename, we can't do that.

This bug covers changing the schema so that we can do a prefix filter and get the list of crashes for a specific date.
I was looking into this as part of the collector rewrite (bug #1145703), but I think we should do it before we do the rewrite.

I think there are a couple of possibilities:

1. change the schema to {prefix}/{version}/{name_of_thing}/{date}/{id}
2. change the ooid and reverse the order of things so it's date first, then depth, then the rest of the generated uuid

I suspect it'll be easier to change how ooids are generated. Right now, we generate a uuid, then drop the last 7 characters and replace them with the depth and then YYMMDD for the date.

If we switch the order, then we have to figure out a way to have the ooid functions (socorrolib/lib/ooid.py) know how to parse it to extract depth and date for both the old and new formats. Also, I'm not sure if there will be problems with the uuid to ooid transform and anything that uses that.
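For reference, the generation scheme described above looks roughly like this (a simplified sketch -- the real functions live in socorrolib/lib/ooid.py and may differ in names and details):

```python
import datetime as dt
import uuid

DEFAULT_DEPTH = 2


def create_new_ooid(timestamp=None, depth=DEFAULT_DEPTH):
    """Generate an ooid: a uuid4 with the last 7 characters
    replaced by a depth digit plus the YYMMDD submission date."""
    if timestamp is None:
        timestamp = dt.datetime.utcnow()
    new_uuid = str(uuid.uuid4())
    return "%s%d%02d%02d%02d" % (
        new_uuid[:-7],
        depth,
        timestamp.year % 100,
        timestamp.month,
        timestamp.day,
    )


def date_from_ooid(ooid):
    """Pull the submission date back out of the last 6 characters."""
    return dt.datetime(
        2000 + int(ooid[-6:-4]), int(ooid[-4:-2]), int(ooid[-2:])
    )


def depth_from_ooid(ooid):
    """Pull the depth digit out (7th character from the end)."""
    return int(ooid[-7])
```

Note how both extraction functions index from the end of the string -- which is exactly why a prefix filter on the key can't select by date.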
Spent some time fleshing out the two thoughts in the previous comment.


Possibility 1:

The pseudo-filename is determined by socorro/external/boto/connection_context.py build_key. If we were to change the pseudo-filename, I think that's the only place we have to do it.

So then we'd have v1 schemas for raw crashes:

    {prefix}/v1/{name_of_thing}/{id}

and v2 schemas for raw crashes:

    {prefix}/v2/{name_of_thing}/{submit-date}/{id}

We'll have to support v1 and v2 schemas. One way to do this is to set a date and all crashes submitted before that date (determined from the ooid) get v1 schema and all crashes submitted afterwards get a v2 schema. We'd be looking at name_of_thing and the date extracted from the id.

Anything that isn't a raw crash continues to get a v1 schema.
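The cutover logic above could be sketched like this (CUTOVER_DATE and the function names are hypothetical; the real build_key is in socorro/external/boto/connection_context.py):

```python
import datetime as dt

# hypothetical cutover date: crashes submitted on/after it get v2 keys
CUTOVER_DATE = dt.date(2016, 5, 1)


def date_from_ooid(ooid):
    """Submission date encoded in the last 6 characters (YYMMDD)."""
    return dt.date(2000 + int(ooid[-6:-4]), int(ooid[-4:-2]), int(ooid[-2:]))


def build_key(prefix, name_of_thing, crash_id):
    """Pick the v1 or v2 pseudo-filename based on the submission date."""
    if name_of_thing == "raw_crash" and date_from_ooid(crash_id) >= CUTOVER_DATE:
        return "%s/v2/%s/%s/%s" % (
            prefix,
            name_of_thing,
            date_from_ooid(crash_id).strftime("%y%m%d"),
            crash_id,
        )
    # older raw crashes and everything else keep the v1 schema
    return "%s/v1/%s/%s" % (prefix, name_of_thing, crash_id)
```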

This seems like a good idea, but we'll have problems with any other code that's building keys but not using this function. We might find some problems after running existing tests. Looking through the code for anything that uses s3 or boto might unearth some problems, too.


Possibility 2:

Ooids look like this:

    xxxxxxxx-xxxx-xxxx-xxxx-xxxxxEYYMMDD

where E is the depth digit and YYMMDD is the date submitted. That's generated by creating a uuid, encoding it in hex, and then replacing the last 7 characters with EYYMMDD.

Depth is used by fs crash storage and that's it (as far as I can tell).

The date is used in a bunch of places in the processor and in the fs crash storage.

If we were to flip this, we could end up with this:

    YYMMDDEx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

That gives us a date prefix we can query with.

However, this is all a bunch of hex and I can't think of a way to distinguish between old ooids and new ooids without changing something fundamental about the structure. We need to do that in order to correctly extract date and depth from an existing ooid. Thus I'm balking at the idea because I haven't a clue how changing the structure affects other things. Is anything expecting the current structure exactly like it is? Is anything else looking at the last 6 digits for the date without using the functions in the ooid.py module?


I like possibility 1 the most right now. If that plan is sound, then this could be a quick fix that we do and push this week, and then starting next week we get v2 schemas and the world becomes just a little bit more beautiful.

Any thoughts on the above plans?

Are there other plans?
Grabbing this to work on.

I think Rob and Chris have both telepathically communicated that they like possibility 1 and we should investigate that some more.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
I think there are a few ways to do this and all of them are problematic in different ways.

This is a hand-wavey explanation and I'm probably being overly fast and loose with terms. My apologies in advance.

One possibility is to change the CrashStorageBase (and all the subclasses) so that it takes an additional argument for any method in CrashStorageBase that eventually results in a call to ConnectionContextBase.build_key() with the boto classes. That's probably most of the methods. Then we update all the boto classes. Then we have to go through the codebase for anything that uses crash storage and change all the callers for all the crash storage classes. Then we have to go through the tests and update all the tests. Etc. This is probably a lot of work. I suspect we'll miss some stuff and we'll have fallout issues. The problem here is that we're updating tons of stuff. The nice thing is that we maintain the API abstraction.

Another possibility is to create a subclass of RegionalS3ConnectionContext (that's the one we're using now) that overrides build_key to do something different for certain types of things (e.g. "raw_crash"). Maybe we make it configurable and take in a list of strings with the default being ['raw_crash'].

In this scenario, we're adding a new class and some new tests. Everything else stays the same. Then we push that out and update the configuration for all the things that use BotoS3CrashStorage to use the new connection context class. We could miss some things here, too, but I suspect it's less likely because we're not making many changes and the ones we are making are additive, so that's good.

The downside is that we're sort of breaking the API abstraction, because now the boto connection context has to know about raw_crash, which is a special case of some other part of the system. Further, we really should change the version number on the pseudo-filenames, but the version number is set by the caller way up the stack, so with this possibility we'd probably be switching the version number in .build_key() which is ... not great. (To be honest, I'm not sure if the version number is about the structure being saved or the place it's being saved at, so maybe this isn't a big deal?)
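A sketch of what that subclass could look like. The base class here is a minimal stand-in -- the real RegionalS3ConnectionContext has a different constructor and build_key signature, and the subclass name is made up:

```python
class RegionalS3ConnectionContext(object):
    """Minimal stand-in for the real class in
    socorro/external/boto/connection_context.py."""

    def build_key(self, prefix, name_of_thing, crash_id):
        return "%s/v1/%s/%s" % (prefix, name_of_thing, crash_id)


class DatePrefixConnectionContext(RegionalS3ConnectionContext):
    """Hypothetical subclass: only the listed types get the new schema."""

    def __init__(self, special_types=("raw_crash",)):
        # in Socorro this would come in through configman configuration
        self.special_types = set(special_types)

    def build_key(self, prefix, name_of_thing, crash_id):
        if name_of_thing in self.special_types:
            # note: the version bump to v2 happens down here in build_key
            return "%s/v2/%s/%s/%s" % (
                prefix, name_of_thing, crash_id[-6:], crash_id
            )
        return super(DatePrefixConnectionContext, self).build_key(
            prefix, name_of_thing, crash_id
        )
```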

There are some other possibilities that are along the lines of the above and probably mish-mashes of them.


At the moment, I'm inclined to go with option 2 because it's way faster to implement and way less risky. Further, we can undo it quickly with a configuration change. This also doesn't affect any other users since the functionality is being added as a new component.

Does that analysis sound accurate?

Are there other possible ways to attack this?

If I implement the second option, will anyone shun me at MozLondon?


needinfo'ing lonnen and rhelmer for thoughts on the above. Anyone else is welcome to comment, but I want an ok from at least one of them before I go further.
Flags: needinfo?(rhelmer)
Flags: needinfo?(chris.lonnen)
I threw together an implementation for the easier option in comment #4. While doing that, it occurred to me it's easier to tweak the connection context classes so that key building is done by another component and then write our own that does fancy things with raw_crash. Trying that out now--it feels like a winner.
I threw together a rough pass of making key building the role of a separate component and then building our rules into a key builder class. That works nicely. I think it gets around the api abstraction issues because it's implemented in a way that's specific to our configuration and it doesn't incur problems with class hierarchy.

I talked to rhelmer about it on IRC and gathered that he thinks similarly.

I'm going to move forward with that. Nixing the needinfo?s I had set.
Flags: needinfo?(rhelmer)
Flags: needinfo?(chris.lonnen)
In a PR: https://github.com/mozilla/socorro/pull/3332

A couple of things popped up while we worked through this.

First, s3 stores things in partitions, and the way it figures out which partition to store an object in is by looking at the key. Everything suggests that you want "random things" as far left in the key as possible so that the partition used is as random as possible. Sequential keys are super bad because then the load ends up directed at the same small set of partitions and a heavy load gets bottlenecked on the performance of those nodes.

Second, the version in the pseudo-filename refers to the pathname schema--not the object. The version is specific to the boto classes. Given that, I moved all the version setting into the key builders rather than have it exposed higher in the call stack to things that don't know anything about pseudo-filenames.

Next steps:

1. the PR needs to be reviewed
2. we test everything on -stage

Rough test plan:

1. we change the configuration on stage setting the keybuilder to the DatePrefixKeyBuilder component
2. we submit some crashes to the -stage collector

   Do new raw_crash objects have the new key?
   Are old raw_crash objects accessible by the processor and webapp and whatever else that accesses them?

3. we wait a day, then submit some crashes to the -stage collector

   Do today's raw_crash objects have today's submission date in the key?


I think if all that is good, we should be good to go.

Are there problems with that test plan or things that it's missing?
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1e8c708030b643514953db94ff663d98c35c2d1e
Fixes bug 1266827: Add keybuilder class to s3 crash storage

This breaks out key building into a separate class allowing us to
specify which keybuilder class to use.

It creates a new DatePrefixKeyBuilder that takes the date portion from
the end of the crash_id and an entropy portion from the beginning of the
crash_id and puts those two things before the crash_id in the
pseudo-filename.

This new pseudo-filename has entropy to reduce problems with s3
partitioning, but allows us to generate a listing of all raw crashes
submitted on a given date (takes 4096 bucket list requests, but it's
doable).

This is backwards-compatible with the current pseudo-filenames because
it changes .build_key() to .build_keys() which returns a list of keys to
try in order of "goodness". That way .fetch() can try the keys in order
until it finds the object in question.

https://github.com/mozilla/socorro/commit/ad296e5bb57f3b2a923d8647d1d0c9923daf1186
Merge pull request #3332 from willkg/1266827-s3-pseudof

Fixes bug 1266827: Add keybuilder class to s3 crash storage
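The build_keys() behavior described in the commit message could be sketched like this (a rough approximation -- the real implementation is in the linked PR and may differ in details; the 3-character entropy slice is inferred from the "4096 bucket list requests" figure above, since 16**3 = 4096):

```python
class DatePrefixKeyBuilder(object):
    """Sketch of the new keybuilder: raw crashes get an
    entropy/date-prefixed v2 key plus a v1 fallback key."""

    def build_keys(self, prefix, name_of_thing, crash_id):
        """Return candidate keys in order of "goodness" so that
        .fetch() can try each until it finds the object."""
        keys = []
        if name_of_thing == "raw_crash":
            entropy = crash_id[:3]  # 16**3 = 4096 possible prefixes
            date = crash_id[-6:]    # YYMMDD from the end of the crash id
            keys.append("%s/v2/%s/%s/%s/%s" % (
                prefix, name_of_thing, entropy, date, crash_id
            ))
        # old-style v1 key last, for backwards compatibility
        keys.append("%s/v1/%s/%s" % (prefix, name_of_thing, crash_id))
        return keys
```

This keeps the "random" entropy portion left-most in the key for s3 partitioning, while still allowing a date-scoped listing by iterating the 4096 entropy prefixes for a given date.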
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Rough test plan for -stage now that things are merged:

1. check logs for errors
2. check sentry for errors
3. change configuration setting keybuilder to DatePrefixKeyBuilder

   consulate kv set socorro/common/resource.boto.keybuilder_class socorro.external.boto.connection_context.DatePrefixKeyBuilder

4. send some crashes to stage today and see if they get the date prefix (tests the new key code)
5. make sure crashes we're siphoning from prod get processed (tests the backwards-compatible key code)
6. wait a day and check 4 and 5 again
I did steps 3 and 4 and see the crashes with the new key name, but since I made the configuration change so late in the day, I'm seeing tomorrow's date because it's already tomorrow in UTC.

I'm not seeing any sentry email, so I think that's fine. I'll check with Peter tomorrow.

Outstanding things:

5. test backwards-compatible key code -- maybe we should try re-processing old crashes on stage? I'm not sure what other things we could do to make this work.

6. wait a day and check 4 again
I checked that the webapp can pull raw_crash objects for new crashes and old crashes which means that .fetch() is working as expected.

Seems like everything is fine. I think this is good to go.
Peter pushed the code to production an hour ago.

I made the consulate change on the -prod admin node, then went and checked s3 and I'm seeing raw_crash objects showing up with the new pseudo-filename. I think we're all good here.