Bug 1361854 (opened 7 years ago, closed 7 years ago)

Store Not Found symbol lookups

Categories

(Socorro :: Symbols, task)

RESOLVED FIXED

People

(Reporter: peterbe, Unassigned)

Tecken (the Symbol Download part) currently supports HEAD requests to check whether a symbol *exists* in whichever S3 buckets Tecken is configured to iterate through. You can also do a GET, which will either 302 redirect to an absolute S3 URL or respond with a 404 Not Found.
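
For illustration, here's roughly what that looks like from a client's point of view (a sketch; the hostname and symbol path are made up, and the requests library is used just for brevity):

import requests

# Hypothetical symbol URL: /<debug file>/<debug id>/<filename>
url = 'https://symbols.example.com/xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2/xul.sym'

# HEAD: does the symbol exist in any of the configured S3 buckets?
response = requests.head(url)
print(response.status_code)  # 200 if it exists, 404 if not

# GET: either a 302 redirect to an absolute S3 URL, or a 404 Not Found
response = requests.get(url, allow_redirects=False)
if response.status_code == 302:
    print('Symbol lives at', response.headers['Location'])
else:
    print('Not found:', response.status_code)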

What we should do is make that 404 Not Found a much bigger deal. We'll keep a database of all failed lookups. That way we can use Tecken from the Socorro Processor as a fallback and build up a database of symbols that the Socorro Processor wanted but couldn't get.

This database can then be used for reporting. Ted currently needs it exposed as a CSV-formatted URL that he can feed into his Taskcluster job that fills in those missing symbols.

This opens up the opportunity to expand what happens when a symbol isn't in any of the S3 buckets. For example, Tecken could (if the filename matches a pattern) take the opportunity to try to fetch it from MSDN (and store it permanently for the next time it's asked for). Or it could do other advanced lookups against Apple's repositories of symbols.
One important detail that needs attention...

In the Socorro Processor, when it (currently) analyzes which symbols couldn't be found [0], it also has a "code_id". Currently the Tecken Downloader only knows how to look up by debug file (e.g. xul.pdb), debug ID, and filename (e.g. xul.sym).


[0] https://github.com/mozilla/socorro/blob/55beaf1281e7b522e0526b2aa2bf74d15f6c1263/socorro/processor/mozilla_transform_rules.py#L815
Do we want to keep track of *all* 404s or just 404s from the Socorro processor?
Blocks: 1361809
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> One important detail that needs attention...
> 
> In the Socorro Processor, when it (currently) analyzes which symbols
> couldn't be found [0], it also has a "code_id". Currently the Tecken
> Downloader only knows how to look up by debug file (e.g. xul.pdb), debug
> ID, and filename (e.g. xul.sym).

I feel like I remember discussing this at some point in the past, but perhaps I'm mistaken. For 64-bit Windows binaries, we need the DLL/EXE along with the PDB to dump all the symbols we need (some of them are in the binary itself). They're both available from Microsoft's symbol server, but we need the code_id / code_file to fetch the binary. We could make the processor send them in an HTTP header or query string or something like that.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> (In reply to Peter Bengtsson [:peterbe] from comment #1)
> > One important detail that needs attention...
> > 
> > In the Socorro Processor, when it (currently) analyzes which symbols
> > couldn't be found [0], it also has a "code_id". Currently the Tecken
> > Downloader only knows how to look up by debug file (e.g. xul.pdb), debug
> > ID, and filename (e.g. xul.sym).
> 
> I feel like I remember discussing this at some point in the past, but
> perhaps I'm mistaken. For 64-bit Windows binaries, we need the DLL/EXE along
> with the PDB to dump all the symbols we need (some of them are in the binary
> itself). They're both available from Microsoft's symbol server, but we need
> the code_id / code_file to fetch the binary. We could make the processor
> send them in an HTTP header or query string or something like that.

So that requires a change to stackwalker.cc, right? To append `?code_id=...` to the URL before it does the actual curl.
Do you want me to file a dedicated bug specifically about that?

Tecken already picks up the debug name, debug ID, and file name from the URL [0]. All I need to do, in the Django view function, is `code_id = request.GET.get('code_id')` and store NULL if it's not there.

I can start working on this very soon. 


[0] https://github.com/mozilla/tecken/blob/master/tecken/download/urls.py#L19
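
In other words, something like this in the download view (a sketch only; the view signature and the two helper functions are hypothetical, not the actual Tecken code):

from django import http

def download_symbol(request, symbol, debugid, filename):
    # Optional extras the Socorro Processor could append to the query string;
    # stored as NULL in the database when absent.
    code_file = request.GET.get('code_file')
    code_id = request.GET.get('code_id')

    url = find_symbol_in_s3(symbol, debugid, filename)  # hypothetical helper
    if url:
        return http.HttpResponseRedirect(url)  # the 302 to the absolute S3 URL

    # hypothetical helper that records the failed lookup
    log_symbol_not_found(symbol, debugid, filename, code_file, code_id)
    return http.HttpResponseNotFound('Symbol Not Found')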
Wow! I didn't realize this, but the amount of S3 traffic for symbols that the Socorro Processor sends is enormous.

I enabled logs on the two S3 buckets for almost two hours, then downloaded them all and ran some analytics.

CODE   PRIVATE   COUNT    RATE (queries/sec)
404    false     214162   36.32
403    true      133977   24.71
404    true           6    6.00
200    false      21575    3.66
200    true         217    0.040
304    false        158    0.028
301    false          4    0.014
500    true           2    0.0053
403    false          9    0.0020
500    false          2    0.00041


What this means is that we get, for example, 24.7 requests per second on the private bucket that result in a 403. 

Mind you: because of IAM role magic, if you try to query a key that doesn't exist in https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/ you get a 403 instead of a 404. For example:

(From within the webeng AWS org...)
$ # the private bucket
$ curl -v https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/DOESNOTEXIST.TXT
...
403 Forbidden
...
$ # the public bucket
$ curl -v https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/DOESNOTEXIST.TXT
...
404 Not Found
...

So, in total the Socorro Processor asks S3 for 61 KEYS PER SECOND THAT RESULT IN A NOT FOUND (36.3/sec 404s on the public bucket plus 24.7/sec 403s on the private one).


Doing 61 writes/second into Postgres (even if 99% of them are SQL UPDATEs) is not particularly attractive. Relational databases aren't ideal for many rapid writes.

Let's use Redis for this, with expiring keys, since we're only ever interested in the missing symbols from the last day anyway.
Also useful to know are the average and median total times it takes for S3 to respond.

CODE   PRIVATE   COUNT    AVERAGE (ms)   MEDIAN (ms)
200    true         217            205           111
200    false      21575            170            56
304    false        158             35             5
404    false     214162             20             5
403    true      133977             11             5

(filtered out those with counts < 10)

The "Total Time" is...:

"The number of milliseconds the request was in flight from the server's perspective. This value is measured from the time your request is received to the time that the last byte of the response is sent. Measurements made from the client's perspective might be longer due to network latency. "
-- http://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

This does NOT include the download time from the client's point of view. But it's certainly interesting that a not-found response takes roughly 16 ms on average (20 ms for the public bucket's 404s, 11 ms for the private bucket's 403s). Meaning, S3 is extremely fast at figuring out that it does NOT have a file.
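
Weighting by request count instead of taking the simple average of the two numbers gives essentially the same figure:

# Average "total time" for a not-found response, weighted by request count
avg_ms = (20 * 214162 + 11 * 133977) / (214162 + 133977)
print('%.1f ms' % avg_ms)  # 16.5 ms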
Depends on: 1362512
For what it's worth, I took the list of ALL URLs and counted the most common file extensions:

59405 .sym
 5590 .ptr
 4144 .dll
 4133 .dl_
 1130 .pdb
 1117 .pd_
  142 .ex_
  141 .exe
   43 .api
   40 .ap_
I have a solution in code now. The idea is that it writes down, in the Redis cache (not the Redis used for storing parsed symbols), every symbol it can't find in S3. The solution is an incrementing key made up of a prefix, date, symbol, debug ID, filename, code_file, and code_id. E.g. `missingsymbols:2017-05-08:eps2003csp11.pdb|8963A03F74A5494C9F0271964B7185E11|eps2003csp11.sym|eps2003csp11.dll|56582C6Fd5000`
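
The write path is roughly this (a sketch, not the exact Tecken code; redis-py, with a one-day expiry since only the last day of misses matters):

import datetime
import redis

client = redis.StrictRedis()  # the cache Redis, not the one storing parsed symbols

def record_missing_symbol(symbol, debugid, filename, code_file=None, code_id=None):
    today = datetime.date.today().isoformat()
    key = 'missingsymbols:{}:{}|{}|{}|{}|{}'.format(
        today, symbol, debugid, filename, code_file or '', code_id or '')
    # INCR creates the key with value 1 the first time; repeated misses
    # for the same symbol on the same day just bump the counter.
    client.incr(key)
    client.expire(key, 60 * 60 * 24)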

These are fairly fat keys, and since we expect roughly 300K keys per day, they would take up a bit of space in Redis.

Running an ongoing measurement in Docker, and noting the total number of keys compared to used_memory, I get this slope (y is used_memory in bytes, x is the number of keys):
y = 208.77x + 780000

(see https://docs.google.com/spreadsheets/d/1Tdjml3IpHziV_MfBzQd7b-Hii8S-BXt2GOCszX-__Ug/edit?usp=sharing)

Meaning, at 300K keys we expect this to use 63411000 bytes ~= 60 MB. In other words, very small.
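
For reference, the arithmetic behind that:

# used_memory (bytes) as a function of the number of keys
keys = 300000
used_memory = 208.77 * keys + 780000
print('%d bytes ~= %.0f MB' % (used_memory, used_memory / 2 ** 20))
# 63411000 bytes ~= 60 MB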
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED