Closed Bug 1131083 Opened 9 years ago Closed 9 years ago

configure PHX staging to use symbols from S3

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

References

Details

Attachments

(1 file, 2 obsolete files)

Symbols are now available in S3, stage should use them.
Blocks: 1097216
No longer blocks: 1071724
Attached patch patch (obsolete) — Splinter Review
Switch staging processors over to S3.

Note that I am going to go ahead and commit this today and start testing on stage, but would still appreciate review! I'll update this patch if I find any problems while testing.
Attachment #8561482 - Flags: review?(dmaher)
Sending        processor.ini
Transmitting file data .
Committed revision 100291.
Attachment #8561482 - Flags: review?(dmaher)
(In reply to Robert Helmer [:rhelmer] from comment #2)
> Sending        processor.ini
> Transmitting file data .
> Committed revision 100291.

Too late to review it now! :P
Seriously though, I do have questions - and I just spent a few seconds looking for how to add comments to a line before I realised I wasn't on Github. :P

* What's [companion_process] ?
* In [companion_process]:
  * Where'd the size for the cache come from?
  * Does verbosity=0 turn off logging, or what's the deal?
* In [destination]:
  * rule_sets was modified but remains commented out; that on purpose?
* In [[[BreakpadStackwalkerRule2015]]]:
  * chatty=False became commented; should we perhaps prefer explicitness here?  i.e. chatty=True
  * How are we handling access to the bucket defined in private_symbols_url ?
(In reply to Daniel Maher [:phrawzty] from comment #3)
> (In reply to Robert Helmer [:rhelmer] from comment #2)
> > Sending        processor.ini
> > Transmitting file data .
> > Committed revision 100291.
> 
> Too late to review it now! :P

Not too late! :) It's just on stage, we want to get this right before going to prod.

(In reply to Daniel Maher [:phrawzty] from comment #4)
> Seriously though, I do have questions - and I just spent a few seconds
> looking for how to add comments to a line before I realised I wasn't on
> Github. :P
> 
> * What's [companion_process] ?

It's a new feature in the Socorro processor that runs a process alongside the normal processing job (in this case, a cleanup job that runs in a separate thread).

> * In [companion_process]:
>   * Where'd the size for the cache come from?
>   * Does verbosity=0 turn off logging, or what's the deal?

Yes, verbosity=0 turns off that logging; there are docs in the class (revealed with --help).

> * In [destination]:
>   * rule_sets was modified but remains commented out; that on purpose?

Oops, that's modifying a default so it needs to be uncommented; I'll fix that.

> * In [[[BreakpadStackwalkerRule2015]]]:
>   * chatty=False became commented; should we perhaps prefer explicitness
> here?  i.e. chatty=True

This is a good point; the default might be too quiet. Do we want to see a log line every time
any symbol file is cleaned up from the symbol cache?

>   * How are we handling access to the bucket defined in private_symbols_url ?

We discussed using ACLs for PHX in bug 1119369 comment 10, until we have an HTTP proxy ready (which we could do in a similar manner to this companion_process: an HTTP proxy that the processor invokes).
Missed one:

(In reply to Daniel Maher [:phrawzty] from comment #4)
> * In [companion_process]:
>   * Where'd the size for the cache come from?

Totally arbitrary default, we can set it to whatever we like (I think lars picked it.)

Also note that this class supports specifying human-readable sizes like "1G" which we should probably use.
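For illustration only, the processor.ini fragment under discussion might look something like the sketch below; the option and class names here are assumptions pieced together from the comments in this bug, not copied from the actual staging config:

```ini
; hypothetical sketch of the [companion_process] section being discussed
[companion_process]
    ; cleanup job run in a separate thread alongside normal processing;
    ; this class path is an assumption, not taken from the real config
    companion_class = socorro.processor.symbol_cache_manager.SymbolLRUCacheManager
    ; human-readable sizes like "1G" are supported; the arbitrary default
    ; was later bumped to 30G on staging
    symbol_cache_size = 1G
    ; 0 silences per-file cleanup logging
    verbosity = 0
```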
(In reply to Robert Helmer [:rhelmer] from comment #6)
> Missed one:
> 
> (In reply to Daniel Maher [:phrawzty] from comment #4)
> > * In [companion_process]:
> >   * Where'd the size for the cache come from?
> 
> Totally arbitrary default, we can set it to whatever we like (I think lars
> picked it.)
> 
> Also note that this class supports specifying human-readable sizes like "1G"
> which we should probably use.

1GB is probably far too small even as a sane default... we'll probably want ~30GB or so (assuming we can get away with that on processor nodes)
From the data in bug 981079 comment 1 (we could re-run that analysis if need be) 35GB should give us a 99% hit rate, we could get a 95% hit rate with just 11GB.
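The hit-rate numbers above come from that earlier analysis; the general shape of such an analysis is to replay a sequence of symbol-file requests against a simulated LRU cache of a fixed byte budget and count hits. A toy sketch (file names, sizes, and the request sequence below are made up, not from bug 981079):

```python
# Toy LRU-cache hit-rate simulation, in the spirit of the analysis
# referenced above. All sizes and requests here are invented examples.
from collections import OrderedDict

def hit_rate(requests, sizes, cache_bytes):
    """requests: iterable of file names; sizes: name -> size in bytes."""
    cache = OrderedDict()  # name -> size, least recently used first
    used = 0
    hits = total = 0
    for name in requests:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        size = sizes[name]
        # evict least-recently-used files until the new file fits
        while used + size > cache_bytes and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= cache_bytes:
            cache[name] = size
            used += size
    return hits / total

sizes = {"xul.sym": 500, "ntdll.sym": 100, "libc.sym": 50}
reqs = ["xul.sym", "ntdll.sym", "xul.sym", "libc.sym", "xul.sym"]
print(hit_rate(reqs, sizes, cache_bytes=600))  # → 0.4
```

Running this over the real request logs with the real file sizes, at increasing cache budgets, is what yields curves like "95% at 11GB, 99% at 35GB".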
I know we talked about this because my diagram has an entry for it:
http://people.mozilla.org/~tmielczarek/Proposed%20Symbols%20Data%20Flow.svg

...but if it's not a huge hassle we should spin up a simple HTTP cache in PHX and route the stackwalk s3 traffic through that. That should cut down the cold start time a bit, although we'll still have a delay when a processor fetches a large symbol file that's not in the HTTP cache. Maybe we can just temporarily get a longer processor timeout to handle those cases; we should be able to estimate that by looking at our symbol file sizes and s3->PHX transfer rates.
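The timeout estimate suggested here is simple arithmetic; the file size and transfer rate below are made-up placeholders, not measured values:

```python
# Back-of-the-envelope worst-case fetch time, as suggested above.
# Both inputs are assumed placeholder values for illustration.
largest_symbol_file_mb = 900    # e.g. a large xul.sym
transfer_rate_mb_per_s = 5      # assumed S3 -> PHX throughput
worst_case_fetch_s = largest_symbol_file_mb / transfer_rate_mb_per_s
print(worst_case_fetch_s)       # → 180.0, i.e. a 3-minute fetch
```

The processor timeout would then need some headroom above that worst case; the 10-minute value tried on staging later in this bug is in that ballpark.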
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> I know we talked about this because my diagram has an entry for it:
> http://people.mozilla.org/~tmielczarek/Proposed%20Symbols%20Data%20Flow.svg
> 
> ...but if it's not a huge hassle we should spin up a simple HTTP cache in
> PHX and route the stackwalk s3 traffic through that. That should cut down
> the cold start time a bit, although we'll still have a delay when a
> processor fetches a large symbol file that's not in the HTTP cache. Maybe we
> can just temporarily get a longer processor timeout to handle those cases,
> we should be able to estimate that by looking at our symbol file sizes and
> s3->PHX transfer rates.

I have this running on staging processor1 with a longer processor timeout (10 minutes), and I haven't seen any stackwalker timeouts in several days.

I couldn't find a caching proxy that did what we wanted, so I am running https://github.com/rhelmer/caching-s3-proxy on processor1... need to figure out where exactly we want to run this in PHX. I had been thinking the admin box, but firewall rules right now won't allow it.
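The core idea behind a caching S3 proxy like the one linked above is small: serve objects from a local disk cache and go to S3 only on a miss. A minimal sketch of just that caching logic (not the actual caching-s3-proxy code; the S3 fetch is injectable here so the cache behavior can be shown without AWS):

```python
# Minimal disk-cache-in-front-of-S3 sketch. The fetch_from_s3 callable
# stands in for a real S3 GET; names here are illustrative only.
import hashlib
import os

class CachingFetcher:
    def __init__(self, cache_dir, fetch_from_s3):
        self.cache_dir = cache_dir
        self.fetch_from_s3 = fetch_from_s3  # callable: key -> bytes
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # hash the key so arbitrary S3 key names map to safe file names
        digest = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):            # cache hit: serve from disk
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch_from_s3(key)      # cache miss: fetch and store
        with open(path, "wb") as f:
            f.write(data)
        return data
```

In the real deployment this sits behind an HTTP frontend (mod_wsgi on port 80, per the comment above) so processors can fetch symbols over plain HTTP.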
Status: NEW → ASSIGNED
Depends on: 1135706
Bringing up VMs specifically to host this, so there isn't any confusion down the road about where this is running and what it is.
VM is up, seems like it's reachable fine from processors (I guess because it's private and in the same VLAN).

I went ahead and configured staging processors to use it:

Transmitting file data .
Committed revision 101205.
uwsgi on rhel6/python 2.6 is a nightmare; I got all the python deps packaged (using fpm) and working under mod_wsgi, and switched to port 80:

Sending        processor.ini
Transmitting file data .
Committed revision 101208.

I'll get this all reviewed before switching prod over.
So, the caching S3 proxy seems to work, but I don't like that it's relatively new code and also a SPOF for the processors.

I played around with bucket policies last night, and it looks like I was able to get a source-IP bucket policy working without breaking access for IAM users.

If this all works, then we can just let processors hit the private and public S3 buckets directly, and use the caching and cleanup we've already landed for processor.
phrawzty, would you mind double-checking the permissions and bucket policy on org.mozilla.crash-stats.staging.symbols-private ?

It looks like outgoing connections from staging processors both come from the same IP. Is this safe to use and depend on until we're out of PHX? (In AWS we'll do something different, like policies based on IAM roles.)

Assuming this all works, we'll want to apply it to the real bucket (not the staging one) and add the IP(s) for production as well.
Attachment #8561482 - Attachment is obsolete: true
Attachment #8569890 - Flags: review?(dmaher)
Comment on attachment 8569890 [details]
bucket policy for org.mozilla.crash-stats.staging.symbols-private

> 			"Action": "s3:*",
This works but it makes me very, very nervous.  Recommend implementing a tighter group of allowed actions.
(In reply to Daniel Maher [:phrawzty] from comment #16)
> Comment on attachment 8569890 [details]
> bucket policy for org.mozilla.crash-stats.staging.symbols-private
> 
> > 			"Action": "s3:*",
> This works but it makes me very, very nervous.  Recommend implementing a
> tighter group of allowed actions.

Oh thanks for catching this, I think we really only need GetObject. I'll test it out.
OK, I've removed the "list" permission and set the policy to GetObject specifically; testing from stage processors, it seems to work as intended.
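The final reviewed policy (attachment 8569915) isn't reproduced in this page, but a policy of the shape described in these comments would look roughly like the sketch below. The source IP is an RFC 5737 documentation placeholder, not the real PHX egress address:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStagingProcessorsReadOnly",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::org.mozilla.crash-stats.staging.symbols-private/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "192.0.2.1/32" }
      }
    }
  ]
}
```

Scoping Action to s3:GetObject (rather than "s3:*", as in the first draft) limits the IP-based grant to object reads only, which was the point of the review feedback above.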
Attachment #8569890 - Attachment is obsolete: true
Attachment #8569890 - Flags: review?(dmaher)
Attachment #8569915 - Flags: review?(dmaher)
Comment on attachment 8569915 [details]
bucket policy for org.mozilla.crash-stats.staging.symbols-private

Looks good.
Attachment #8569915 - Flags: review?(dmaher) → review+
Thanks! I went ahead and enabled the same policy+perms for the production bucket.

I reverted the staging config so it points straight to S3 again:

Sending        processor.ini
Transmitting file data .
Committed revision 101230.

Also, I bumped up the cache size to 30G:

Sending        processor.ini
Transmitting file data .
Committed revision 101231.

I think we're done here, just need to do more testing and do the equiv for production when ready.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Robert Helmer [:rhelmer] from comment #14)
> So, the caching S3 proxy seems to work, but I don't like that it's
> relatively new code and also a SPOF for the processors.

I can understand where you're coming from. The only thing we wanted that we don't have now is an HTTP cache in front of s3 to speed up downloads a bit. Would it be worthwhile to revisit spinning up a proxxy instance in PHX for that purpose?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> (In reply to Robert Helmer [:rhelmer] from comment #14)
> > So, the caching S3 proxy seems to work, but I don't like that it's
> > relatively new code and also a SPOF for the processors.
> 
> I can understand where you're coming from. The only thing that we wanted
> that we don't have now is to have something as an HTTP cache in front of s3
> to speed up downloads a bit. Would it be worthwhile to revisit spinning up a
> proxxy instance in PHX for that purpose?

Hmmm, yeah, if we don't need to do S3 auth then proxxy (the core of which, AFAICT, is just nginx config, right?) would probably be just fine. We do have both production and staging s3proxy VMs spun up and ready to go, so there's nothing blocking this.

Do we really need a central proxy, though? The processors keep a local cache on each machine; it's a little slow to warm up, but I haven't seen any timeouts or other problems on staging. Mostly I'm worried about the proxy machine going down and then having to reprocess, which is annoying but not fatal. (I'm not sure we'd be able to detect this with our current monitoring either.)
(In reply to Robert Helmer [:rhelmer] from comment #22)
> fatal (I'm not sure we'd be able to detect this with our current monitoring
> either)

Is now a good time to mention that we have *no* monitoring in Amazon?
(In reply to Daniel Maher [:phrawzty] from comment #23)
> (In reply to Robert Helmer [:rhelmer] from comment #22)
> > fatal (I'm not sure we'd be able to detect this with our current monitoring
> > either)
> 
> Is now a good time to mention that we have *no* monitoring in Amazon?

I just filed bug 1138424 to track this. I am less worried right now since we also have no production services (yet).