Closed Bug 1131083 Opened 9 years ago Closed 9 years ago

configure PHX staging to use symbols from S3

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

References

Details

Attachments

(1 file, 2 obsolete files)

Symbols are now available in S3, stage should use them.
Blocks: 1097216
No longer blocks: 1071724
Attached patch patch (obsolete) — Splinter Review
Switch staging processors over to S3.

Note that I am going to go ahead and commit this today and start testing on stage, but would still appreciate review! I'll update this patch if I find any problems while testing.
Attachment #8561482 - Flags: review?(dmaher)
Sending        processor.ini
Transmitting file data .
Committed revision 100291.
Attachment #8561482 - Flags: review?(dmaher)
(In reply to Robert Helmer [:rhelmer] from comment #2)
> Sending        processor.ini
> Transmitting file data .
> Committed revision 100291.

Too late to review it now! :P
Seriously though, I do have questions - and I just spent a few seconds looking for how to add comments to a line before I realised I wasn't on Github. :P

* What's [companion_process] ?
* In [companion_process]:
  * Where'd the size for the cache come from?
  * Does verbosity=0 turn off logging, or what's the deal?
* In [destination]:
  * rule_sets was modified but remains commented out; that on purpose?
* In [[[BreakpadStackwalkerRule2015]]]:
  * chatty=False became commented; should we perhaps prefer explicitness here?  i.e. chatty=True
  * How are we handling access to the bucket defined in private_symbols_url ?
(In reply to Daniel Maher [:phrawzty] from comment #3)
> (In reply to Robert Helmer [:rhelmer] from comment #2)
> > Sending        processor.ini
> > Transmitting file data .
> > Committed revision 100291.
> 
> Too late to review it now! :P

Not too late! :) It's just on stage, we want to get this right before going to prod.

(In reply to Daniel Maher [:phrawzty] from comment #4)
> Seriously though, I do have questions - and I just spent a few seconds
> looking for how to add comments to a line before I realised I wasn't on
> Github. :P
> 
> * What's [companion_process] ?

It's a new feature in the Socorro processor that runs a process alongside the normal processing job (in this case, a cleanup job that runs in a separate thread).

> * In [companion_process]:
>   * Where'd the size for the cache come from?
>   * Does verbosity=0 turn off logging, or what's the deal?

Yes, verbosity=0 turns off that logging; there are docs in the class (revealed with --help).

> * In [destination]:
>   * rule_sets was modified but remains commented out; that on purpose?

Oops, that's modifying a default so it needs to be uncommented; I'll fix that.

> * In [[[BreakpadStackwalkerRule2015]]]:
>   * chatty=False became commented; should we perhaps prefer explicitness
> here?  i.e. chatty=True

This is a good point; the default might be too quiet. Do we want to see a log line every time
any symbol file is cleaned up from the symbol cache?

>   * How are we handling access to the bucket defined in private_symbols_url ?

We discussed using ACLs for PHX in bug 1119369 comment 10, until we have an HTTP proxy ready (which we could do in a similar manner to this companion_process: an HTTP proxy that the processor invokes).
Missed one:

(In reply to Daniel Maher [:phrawzty] from comment #4)
> * In [companion_process]:
>   * Where'd the size for the cache come from?

Totally arbitrary default, we can set it to whatever we like (I think lars picked it.)

Also note that this class supports specifying human-readable sizes like "1G" which we should probably use.
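For illustration only, the processor.ini fragment under discussion might look something like the sketch below; the option and class names here are assumptions pieced together from the comments in this bug, not copied from the actual staging config:

```ini
; hypothetical sketch of the [companion_process] section being discussed
[companion_process]
    ; cleanup job run in a separate thread alongside normal processing;
    ; this class path is an assumption, not taken from the real config
    companion_class = socorro.processor.symbol_cache_manager.SymbolLRUCacheManager
    ; human-readable sizes like "1G" are supported; the arbitrary default
    ; was later bumped to 30G on staging
    symbol_cache_size = 1G
    ; 0 silences per-file cleanup logging
    verbosity = 0
```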
(In reply to Robert Helmer [:rhelmer] from comment #6)
> Missed one:
> 
> (In reply to Daniel Maher [:phrawzty] from comment #4)
> > * In [companion_process]:
> >   * Where'd the size for the cache come from?
> 
> Totally arbitrary default, we can set it to whatever we like (I think lars
> picked it.)
> 
> Also note that this class supports specifying human-readable sizes like "1G"
> which we should probably use.

1GB is probably far too small even as a sane default... we'll probably want ~30GB or so (assuming we can get away with that on processor nodes)
From the data in bug 981079 comment 1 (we could re-run that analysis if need be) 35GB should give us a 99% hit rate, we could get a 95% hit rate with just 11GB.
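The hit-rate numbers above come from that earlier analysis; the general shape of such an analysis is to replay a sequence of symbol-file requests against a simulated LRU cache of a fixed byte budget and count hits. A toy sketch (file names, sizes, and the request sequence below are made up, not from bug 981079):

```python
# Toy LRU-cache hit-rate simulation, in the spirit of the analysis
# referenced above. All sizes and requests here are invented examples.
from collections import OrderedDict

def hit_rate(requests, sizes, cache_bytes):
    """requests: iterable of file names; sizes: name -> size in bytes."""
    cache = OrderedDict()  # name -> size, least recently used first
    used = 0
    hits = total = 0
    for name in requests:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        size = sizes[name]
        # evict least-recently-used files until the new file fits
        while used + size > cache_bytes and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= cache_bytes:
            cache[name] = size
            used += size
    return hits / total

sizes = {"xul.sym": 500, "ntdll.sym": 100, "libc.sym": 50}
reqs = ["xul.sym", "ntdll.sym", "xul.sym", "libc.sym", "xul.sym"]
print(hit_rate(reqs, sizes, cache_bytes=600))  # → 0.4
```

Running this over the real request logs with the real file sizes, at increasing cache budgets, is what yields curves like "95% at 11GB, 99% at 35GB".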
I know we talked about this because my diagram has an entry for it:
http://people.mozilla.org/~tmielczarek/Proposed%20Symbols%20Data%20Flow.svg

...but if it's not a huge hassle we should spin up a simple HTTP cache in PHX and route the stackwalk s3 traffic through that. That should cut down the cold start time a bit, although we'll still have a delay when a processor fetches a large symbol file that's not in the HTTP cache. Maybe we can just temporarily get a longer processor timeout to handle those cases; we should be able to estimate that by looking at our symbol file sizes and s3->PHX transfer rates.
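The timeout estimate suggested here is simple arithmetic; the file size and transfer rate below are made-up placeholders, not measured values:

```python
# Back-of-the-envelope worst-case fetch time, as suggested above.
# Both inputs are assumed placeholder values for illustration.
largest_symbol_file_mb = 900    # e.g. a large xul.sym
transfer_rate_mb_per_s = 5      # assumed S3 -> PHX throughput
worst_case_fetch_s = largest_symbol_file_mb / transfer_rate_mb_per_s
print(worst_case_fetch_s)       # → 180.0, i.e. a 3-minute fetch
```

The processor timeout would then need some headroom above that worst case; the 10-minute value tried on staging later in this bug is in that ballpark.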
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> I know we talked about this because my diagram has an entry for it:
> http://people.mozilla.org/~tmielczarek/Proposed%20Symbols%20Data%20Flow.svg
> 
> ...but if it's not a huge hassle we should spin up a simple HTTP cache in
> PHX and route the stackwalk s3 traffic through that. That should cut down
> the cold start time a bit, although we'll still have a delay when a
> processor fetches a large symbol file that's not in the HTTP cache. Maybe we
> can just temporarily get a longer processor timeout to handle those cases,
> we should be able to estimate that by looking at our symbol file sizes and
> s3->PHX transfer rates.

I have this running on staging processor1 with a longer processor timeout (10 minutes), and I haven't seen any stackwalker timeouts in several days.

I couldn't find a caching proxy that did what we wanted, so I am running https://github.com/rhelmer/caching-s3-proxy on processor1... need to figure out where exactly we want to run this in PHX. I had been thinking the admin box, but firewall rules right now won't allow it.
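The core idea behind a caching S3 proxy like the one linked above is small: serve objects from a local disk cache and go to S3 only on a miss. A minimal sketch of just that caching logic (not the actual caching-s3-proxy code; the S3 fetch is injectable here so the cache behavior can be shown without AWS):

```python
# Minimal disk-cache-in-front-of-S3 sketch. The fetch_from_s3 callable
# stands in for a real S3 GET; names here are illustrative only.
import hashlib
import os

class CachingFetcher:
    def __init__(self, cache_dir, fetch_from_s3):
        self.cache_dir = cache_dir
        self.fetch_from_s3 = fetch_from_s3  # callable: key -> bytes
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # hash the key so arbitrary S3 key names map to safe file names
        digest = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):            # cache hit: serve from disk
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch_from_s3(key)      # cache miss: fetch and store
        with open(path, "wb") as f:
            f.write(data)
        return data
```

In the real deployment this sits behind an HTTP frontend (mod_wsgi on port 80, per the comment above) so processors can fetch symbols over plain HTTP.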
Status: NEW → ASSIGNED
Depends on: 1135706
Bringing up VMs specifically to host this, so there isn't any confusion down the road about where this is running and what it is.
VM is up, seems like it's reachable fine from processors (I guess because it's private and in the same VLAN).

I went ahead and configured staging processors to use it:

Transmitting file data .
Committed revision 101205.
uwsgi on rhel6/python 2.6 is a nightmare; I got all the python deps packaged (using fpm) and working under mod_wsgi, and switched to port 80:

Sending        processor.ini
Transmitting file data .
Committed revision 101208.

I'll get this all reviewed before switching prod over.
So, the caching S3 proxy seems to work, but I don't like that it's relatively new code and also a SPOF for the processors.

I played around with bucket policies last night, and it looks like I was able to get a source-IP bucket policy working without breaking access for IAM users.

If this all works, then we can just let processors hit the private and public S3 buckets directly, and use the caching and cleanup we've already landed for processor.
phrawzty, would you mind double-checking the permissions and bucket policy on org.mozilla.crash-stats.staging.symbols-private ?

It looks like outgoing connections from staging processors both come from the same IP. Is this safe to use and depend on until we're out of PHX? (In AWS we'll do something different, like policies based on IAM roles.)

Assuming this all works, we'll want to apply it to the real bucket (not the staging one) and add the IP(s) for production as well.
Attachment #8561482 - Attachment is obsolete: true
Attachment #8569890 - Flags: review?(dmaher)
Comment on attachment 8569890 [details]
bucket policy for org.mozilla.crash-stats.staging.symbols-private

> 			"Action": "s3:*",
This works but it makes me very, very nervous.  Recommend implementing a tighter group of allowed actions.
(In reply to Daniel Maher [:phrawzty] from comment #16)
> Comment on attachment 8569890 [details]
> bucket policy for org.mozilla.crash-stats.staging.symbols-private
> 
> > 			"Action": "s3:*",
> This works but it makes me very, very nervous.  Recommend implementing a
> tighter group of allowed actions.

Oh thanks for catching this, I think we really only need GetObject. I'll test it out.
OK, I've removed the "list" permission and set the policy to GetObject specifically; testing from stage processors, it seems to work as intended.
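The final reviewed policy (attachment 8569915) isn't reproduced in this page, but a policy of the shape described in these comments would look roughly like the sketch below. The source IP is an RFC 5737 documentation placeholder, not the real PHX egress address:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStagingProcessorsReadOnly",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::org.mozilla.crash-stats.staging.symbols-private/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "192.0.2.1/32" }
      }
    }
  ]
}
```

Scoping Action to s3:GetObject (rather than "s3:*", as in the first draft) limits the IP-based grant to object reads only, which was the point of the review feedback above.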
Attachment #8569890 - Attachment is obsolete: true
Attachment #8569890 - Flags: review?(dmaher)
Attachment #8569915 - Flags: review?(dmaher)
Comment on attachment 8569915 [details]
bucket policy for org.mozilla.crash-stats.staging.symbols-private

Looks good.
Attachment #8569915 - Flags: review?(dmaher) → review+
Thanks! I went ahead and enabled the same policy+perms for the production bucket.

I reverted the staging config so it points straight to S3 again:

Sending        processor.ini
Transmitting file data .
Committed revision 101230.

Also, I bumped up the cache size to 30G:

Sending        processor.ini
Transmitting file data .
Committed revision 101231.

I think we're done here, just need to do more testing and do the equiv for production when ready.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Robert Helmer [:rhelmer] from comment #14)
> So, the caching S3 proxy seems to work, but I don't like that it's
> relatively new code and also a SPOF for the processors.

I can understand where you're coming from. The only thing we wanted that we don't have now is an HTTP cache in front of s3 to speed up downloads a bit. Would it be worthwhile to revisit spinning up a proxxy instance in PHX for that purpose?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> (In reply to Robert Helmer [:rhelmer] from comment #14)
> > So, the caching S3 proxy seems to work, but I don't like that it's
> > relatively new code and also a SPOF for the processors.
> 
> I can understand where you're coming from. The only thing that we wanted
> that we don't have now is to have something as an HTTP cache in front of s3
> to speed up downloads a bit. Would it be worthwhile to revisit spinning up a
> proxxy instance in PHX for that purpose?

Hmmm, yeah, if we don't need to do S3 auth then proxxy (the core of which, AFAICT, is just nginx config, right?) would probably be just fine. We do have both production and staging s3proxy VMs spun up and ready to go, so there's nothing blocking this.

Do we really need a central proxy, though? The processors keep a local cache on each machine; it's a little slow to warm up, but I haven't seen any timeouts or other problems on staging. Mostly I'm worried about the proxy machine going down and then having to reprocess, which is annoying but not fatal. (I'm not sure we'd be able to detect this with our current monitoring either.)
(In reply to Robert Helmer [:rhelmer] from comment #22)
> fatal (I'm not sure we'd be able to detect this with our current monitoring
> either)

Is now a good time to mention that we have *no* monitoring in Amazon?
(In reply to Daniel Maher [:phrawzty] from comment #23)
> (In reply to Robert Helmer [:rhelmer] from comment #22)
> > fatal (I'm not sure we'd be able to detect this with our current monitoring
> > either)
> 
> Is now a good time to mention that we have *no* monitoring in Amazon?

I just filed bug 1138424 to track this. I am less worried right now since we also have no production services (yet).