Closed Bug 1383067 Opened 7 years ago Closed 6 years ago

Point stackwalker to a third URL; Tecken

Categories

(Socorro :: Processor, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: willkg)

Attachments

(1 file)

Right now, when we start stackwalker we provide it exactly 2 `--symbols-url` arguments [0]. We should make it 3. The third one would be Tecken's, and it should be optional and disabled by default. Then we can manually enable it (consulate!) on stage just to see that it works in principle.

Backstory: Tecken does two things for us:
1) It can do lookups on Microsoft's server on the fly to fill in missing symbols.
2) It can make a retroactive report of ALL symbols that were missing yesterday. That report is used (by TaskCluster) to retrofit missing symbols. (Apple and Microsoft are sometimes slow to publish their public symbol files.)
Also, Tecken hasn't yet proven, in stage or prod, that it can handle the load that stackwalker might be sending to us. That's why we're starting gently.

The work involves adding that third `--symbols-url` argument and making it optional.
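A minimal sketch of what "optional third URL" could look like when assembling the arguments. The function name, the `tecken_url` parameter, and the example.com URLs are made up for illustration; the real code in breakpad_transform_rules.py may build its command line differently, and the exact flag syntax stackwalker expects is an assumption here.

```python
def build_symbols_url_args(public_url, private_url, tecken_url=None):
    """Return the --symbols-url arguments for a stackwalker invocation.

    tecken_url is a hypothetical optional setting. When it is None or
    empty, only the two existing URLs are passed, so the default
    behavior is unchanged.
    """
    urls = [public_url, private_url]
    if tecken_url:
        urls.append(tecken_url)

    args = []
    for url in urls:
        # One flag per URL, in the order the URLs should be checked.
        args.extend(["--symbols-url", url])
    return args


if __name__ == "__main__":
    # Placeholder URLs, not the real buckets.
    print(build_symbols_url_args(
        "https://example.com/public/v1",
        "https://example.com/private/v1",
    ))
```

Leaving `tecken_url` unset keeps the invocation at two URLs, which is the "not enabled by default" behavior described above.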


[0] https://github.com/mozilla-services/socorro/blob/cbe62c9257cbcb89f67df320762778445cf9ad4b/socorro/processor/breakpad_transform_rules.py#L492-L495
Depends on: 1365672
Oh, one thing I don't know that we covered in our chat in SF (although maybe I mentioned it): the dump_syms binary you're using for Tecken doesn't currently support dumping some information we need from Microsoft's symbols for 64-bit binaries. There's also a slight additional complication: to dump that information, dump_syms needs access to the .exe or .dll file that goes along with the .pdb file. For Microsoft binaries those are also available from the symbol server (that's what we needed the code_id and code_file parameters for). You can see the code that handles that in my cron script here:
https://hg.mozilla.org/users/tmielczarek_mozilla.com/fetch-win32-symbols/file/602ba1af9091/symsrv-fetch.py#l132
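For anyone following along, here is a rough sketch of how a symbol-server download URL for a binary could be put together from code_file and code_id. It assumes the usual `<file>/<id>/<file>` symbol-server layout and Microsoft's public server base URL; it is not taken from the cron script linked above, and the sample values are made up.

```python
from urllib.parse import quote

# Assumed base URL for Microsoft's public symbol server.
MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_server_url(filename, identifier, base=MS_SYMBOL_SERVER):
    """Build <base>/<filename>/<identifier>/<filename>.

    For a .pdb this would be called with debug_file and debug_id;
    for the matching .exe or .dll, with code_file and code_id.
    """
    name = quote(filename)
    return f"{base}/{name}/{identifier}/{name}"


if __name__ == "__main__":
    # Made-up code_file and code_id values, for illustration only.
    print(symbol_server_url("example.dll", "5A0B0C0D1F000"))
```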
Let's make this bug about adding that third `--symbols-url` to the stackwalker invocation. 
Clearly we have more work to do in Tecken to support these 64-bit things. Moved to https://bugzilla.mozilla.org/show_bug.cgi?id=1385344
Grabbing this to work on at some point soon.

Things this is waiting on:

1. Peter wants to do another load test of the Tecken -stage environment
2. Miles is taking over symbols.mozilla.org for -prod

Once those are done, then we can do this. We need to make it configurable because it'd be super to have Socorro -stage point to Tecken -stage. Further, we may want this to run on Socorro -stage for some period of time before we switch to -prod to address any performance discoveries.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Here's how to test this...:

If you download https://symbols.mozilla.org/missingsymbols.csv you get a CSV file whose headers are: "debug_file,debug_id,code_file,code_id"
When Visual Studio (aka. Microsoft Debuggers) hits us, it DOESN'T include the extra query string parameters on the download URLs that stackwalker does. Thus all entries in missingsymbols.csv that LACK the code_file and code_id values are coming from Visual Studio.

If you start pointing stackwalker to symbols.mozilla.org, there should be entries in there WITH code_file and code_id. If you're too impatient to wait a day you can add ?today=1 to the URL and you get the latest and greatest.
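The check described above could be scripted along these lines. This is only a sketch: it works on CSV text you have already downloaded, and the sample rows are invented to show the two cases (with and without code_file and code_id).

```python
import csv
import io


def split_missing_symbols(csv_text):
    """Split rows into (with_code_info, without_code_info).

    Rows that carry both code_file and code_id are the ones expected
    to come from stackwalker; rows lacking them are expected to come
    from Visual Studio.
    """
    with_code, without_code = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("code_file") and row.get("code_id"):
            with_code.append(row)
        else:
            without_code.append(row)
    return with_code, without_code


# Made-up sample rows, for illustration only.
SAMPLE = """debug_file,debug_id,code_file,code_id
example.pdb,0123456789ABCDEF0123456789ABCDEF1,example.dll,5A0B0C0D1F000
other.pdb,FEDCBA9876543210FEDCBA98765432101,,
"""

if __name__ == "__main__":
    with_code, without_code = split_missing_symbols(SAMPLE)
    print(len(with_code), "rows with code_file and code_id")
    print(len(without_code), "rows without")
```

If stackwalker is pointed at symbols.mozilla.org, the first count should become non-zero once the report includes its requests.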
We need to do this soon. I'm making it a P1 so it doesn't fall off my radar again.
Priority: -- → P1
Depends on: 1407997
I landed the fixes to the mdsw configuration. There's still some follow-up to do.

After the follow-up is done, the next step here is to rework the code so that we have a "urls" configuration parameter that takes a comma-separated list of urls rather than a "public-url" and a "private-url". When I have a PR reviewed, we can do some url shuffling and then we should be able to land this.

Having said that, we're not going to make it before the change freeze. Earliest we can do this is after the freeze has lifted. So ... like end of November or early December. I'm bumping it down to P2 since this is blocked on the change freeze.
Priority: P1 → P2
I'm going to drop the symbols_public_url and symbols_private_url for a single symbols_urls configuration that takes a comma delimited list of strings.

We'll need to make configuration changes to the -stage and -prod configuration before this lands:

# Add the new symbols_urls key that has both the public and private urls in it in that order
consulate kv set socorro/processor/processor.raw_to_processed_transform.BreakpadStackwalkerRule2015.symbols_urls "https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1,https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/v1"

# Remove the old keys
consulate kv rm socorro/processor/processor.raw_to_processed_transform.BreakpadStackwalkerRule2015.private_symbols_url
consulate kv rm socorro/processor/processor.raw_to_processed_transform.BreakpadStackwalkerRule2015.public_symbols_url
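
For reference, turning the comma-delimited symbols_urls value into an ordered list could look roughly like this. The function name is hypothetical and the actual PR may rely on the configuration library's own list conversion instead; the point is just that order is preserved and stray whitespace or empty entries are dropped.

```python
def parse_symbols_urls(value):
    """Split a comma-delimited string into an ordered list of URLs.

    Whitespace around entries is stripped and empty entries (for
    example from a trailing comma) are ignored. Order is preserved
    because it determines which URL gets checked first.
    """
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    value = (
        "https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1,"
        "https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/v1"
    )
    for url in parse_symbols_urls(value):
        print(url)
```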
Is the intention to first change how the code works but use the same URLs, and then, once that dust settles, actually change the URLs?

At the moment, https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1 === https://symbols.mozilla.org actually. 

I guess we have the advantage of being able to try it out on Stage first.
I chatted with Peter on IRC about this. I want to keep the PR as is, but we'll keep this bug open for the next step where we switch the urls list to include Tecken.

We could do that in one of three ways:

1. tecken, private bucket

Tecken ends up with wrong bookkeeping for missing symbols. Peter posits it's not that wrong and we don't care. I'm on the fence about how I feel about that.

We'd probably want to do something in the processor such that if tecken went down, we'd notice and either stop processing or reprocess those crashes later.

2. private bucket, tecken

Tecken has the right bookkeeping for missing symbols.

We only have two lookups to do. Depending on whether looking up a symbol in s3 directly or through tecken is faster, it might slow the processor down. This does let us take advantage of tecken caching symbol lookups. That might help depending on the characteristics of what symbols are getting looked up.

We'd probably want to do something in the processor such that if tecken went down, we'd notice and either stop processing or reprocess those crashes later.

3. public bucket, private bucket, tecken

This keeps the order of buckets the same. Tecken ends up with the correct bookkeeping for missing symbols.

This adds a third lookup which will slow the processor down, but maybe not meaningfully.

If tecken went down, it wouldn't affect processing at all.
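
To make the bookkeeping difference between the three orderings concrete, here is a toy simulation. Everything in it is hypothetical: the symbol names, the contents of each source, and the assumption that a source records a "missing symbol" whenever it is asked for something it does not have. It also assumes Tecken serves the same symbols as the public bucket.

```python
def lookup(symbol, order, contents):
    """Walk sources in order.

    Returns (name of the source that served the symbol or None,
    list of sources that were asked and did not have it).
    """
    misses = []
    for name in order:
        if symbol in contents[name]:
            return name, misses
        misses.append(name)
    return None, misses


# Hypothetical contents of each source.
CONTENTS = {
    "public": {"public_only.sym"},
    "private": {"private_only.sym"},
    "tecken": {"public_only.sym"},
}

ORDERINGS = {
    "1": ["tecken", "private"],
    "2": ["private", "tecken"],
    "3": ["public", "private", "tecken"],
}

if __name__ == "__main__":
    for label, order in ORDERINGS.items():
        for symbol in ("private_only.sym", "nowhere.sym"):
            served_by, misses = lookup(symbol, order, CONTENTS)
            print(label, symbol, "served by", served_by,
                  "| tecken logged a miss:", "tecken" in misses)
```

In this toy model, ordering 1 has Tecken log a miss for a symbol that the private bucket then serves, which is the "wrong bookkeeping" case. In orderings 2 and 3, Tecken only logs a miss for symbols that none of the earlier sources have.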


I want to think about this more, but not today. Also, we should have ops around before making this kind of change.
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/9134fe9b55720c0ad349114f97266c637b6dcfd9
fixes bug 1383067 - redo symbols-urls conf in BreakpadStackwalker2015

This drops the private-symbols-url and public-symbols-url configuration
variables for a single symbols-urls variable that takes a comma-delimited
list of strings.

Now we can add arbitrary number of urls in the order we want them
checked.

https://github.com/mozilla-services/socorro/commit/8f1ae738ab743d4d3ba586a4fb769fa760e872e8
Merge pull request #4232 from willkg/1383067-urls

fixes bug 1383067 - redo symbols-urls conf in BreakpadStackwalker2015
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reopening this to track the configuration changes.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I added the socorro/processor/processor.raw_to_processed_transform.BreakpadStackwalkerRule2015.symbols_urls key to -stage. Will wait for that to deploy, then check the processor, then remove the keys we no longer need.
-stage deployed and I verified the processor is working correctly with the new configuration.

Then I removed the dead keys while keeping an eye on the processor and everything is groovy.

This is good to go to -prod.
Whoops--the other thing outstanding here is actually adding tecken. We'll do that via configuration changes after everything is settled on -prod.
We did a -prod push just now. I updated the configuration and verified the processors are still working and symbolicating.

So--that's all done. Next step is to add Tecken.

Peter says to use this url: https://symbols.mozilla.org

consulate kv set socorro/processor/processor.raw_to_processed_transform.BreakpadStackwalkerRule2015.symbols_urls "https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1,https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/v1,https://symbols.mozilla.org"

He said we should change this in -stage first. I updated -stage and verified it's working.
Looked like symbols was throwing errors along the lines of some symbol exceeding the character length max. I switched -stage back to the public and private urls. I'll work with Peter on a fix.
Depends on: 1423708
I switched -stage to use tecken as the third url again and everything seems fine.
After talking with Peter, I set -prod to use tecken as the third url just now.
Peter: Are we cool? Can we leave this set?
Flags: needinfo?(peterbe)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #20)
> Peter: Are we cool? Can we leave this set?

Yes. I checked Datadog and the number didn't rise significantly.
Flags: needinfo?(peterbe)
Awesome! I'll leave it and mark this FIXED.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED