Closed
Bug 1361854
Opened 7 years ago
Closed 7 years ago
Store Not Found symbol lookups
Categories
(Socorro :: Symbols, task)
Socorro
Symbols
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: peterbe, Unassigned)
References
Details
Tecken (the Symbol Download part) currently supports HEAD requests to see if a symbol *exists* in whichever S3 buckets that Tecken is configured to iterate through. You can also do a GET which will either 302 redirect to an absolute S3 URL or a 404 Not Found. What we should do is make that 404 Not Found a lot bigger deal. We'll keep a database of all failed look ups. That way we can use Tecken from Socorro Processor as a fallback and build up a database of symbols that the Socorro Processor wanted but couldn't get. This database can then be used to report on. Ted currently needs it to be a CSV formatted URL that he can feed into his Taskcluster job that fills in those missing symbols. This opens up the opportunity to expand the functionality on what to do if the symbol isn't in any of the S3 buckets. For example, it could (if the filename matches a pattern) take the opportunity to try to fetch it from MSDN (and store it permanently for the next time it's asked for). Or, it could do other advanced lookups against Apple's repositories of symbols.
Reporter | ||
Comment 1•7 years ago
|
||
One important detail that needs attention... In the Socorro Processor, when it (currently) analysis which symbols couldn't be found [0], it also as a "code_id". Currently Tecken Downloader only knows to look up by debug file (e.g. xul.pdb), debug ID and filename (e.g. xul.sym). [0] https://github.com/mozilla/socorro/blob/55beaf1281e7b522e0526b2aa2bf74d15f6c1263/socorro/processor/mozilla_transform_rules.py#L815
Comment 2•7 years ago
|
||
Do we want to keep track of *all* 404s or just 404s from the Socorro processor?
Comment 3•7 years ago
|
||
(In reply to Peter Bengtsson [:peterbe] from comment #1) > One important detail that needs attention... > > In the Socorro Processor, when it (currently) analysis which symbols > couldn't be found [0], it also as a "code_id". Currently Tecken Downloader > only knows to look up by debug file (e.g. xul.pdb), debug ID and filename > (e.g. xul.sym). I feel like I remember discussing this at some point in the past, but perhaps I'm mistaken. For 64-bit Windows binaries, we need the DLL/EXE along with the PDB to dump all the symbols we need (some of them are in the binary itself). They're both available from Microsoft's symbol server, but we need the code_id / code_file to fetch the binary. We could make the processor send them in a HTTP header or query string or something like that.
Reporter | ||
Comment 4•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3) > (In reply to Peter Bengtsson [:peterbe] from comment #1) > > One important detail that needs attention... > > > > In the Socorro Processor, when it (currently) analysis which symbols > > couldn't be found [0], it also as a "code_id". Currently Tecken Downloader > > only knows to look up by debug file (e.g. xul.pdb), debug ID and filename > > (e.g. xul.sym). > > I feel like I remember discussing this at some point in the past, but > perhaps I'm mistaken. For 64-bit Windows binaries, we need the DLL/EXE along > with the PDB to dump all the symbols we need (some of them are in the binary > itself). They're both available from Microsoft's symbol server, but we need > the code_id / code_file to fetch the binary. We could make the processor > send them in a HTTP header or query string or something like that. So that requires a change to stackwalker.cc, right? To append `?code_id=...` to the URL before it does the actual curl. Do you want me to file you a designated bug specifically about that? Tecken already picks up the debug name, debug id, and file name from the URL [0]. All I need to do, in the Django view function, is to do `code_id = request.GET.get('code_id')` and store it as a NULL if it's not there. I can start working on this very soon. [0] https://github.com/mozilla/tecken/blob/master/tecken/download/urls.py#L19
Reporter | ||
Comment 5•7 years ago
|
||
Wow! I didn't realize this but the amount of S3 traffic for symbols that Socorro Processor sends is enormous. I enabled logs on the two S3 buckets for almost two hours. Then I downloaded them all and ran some analytics. CODE PRIVATE COUNT RATE query/sec 404 false 214162 36.3171103951 403 true 133977 24.7053291536 404 true 6 6.0 200 false 21575 3.65863998643 200 true 217 0.040088675411 304 false 158 0.0284530884207 301 false 4 0.0138408304498 500 true 2 0.00534759358289 403 false 9 0.00201117318436 500 false 2 0.000411946446962 What this means is that we get, for example, 24.7 requests per second on the private bucket that result in a 403. Mind you! because of IAM role magic, if you try to query a key that doesn't exist on https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/ you get a 403. I.e. (From within the webeng AWS org...) $ # the private bucket $ curl -v https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-private/DOESNOTEXIST.TXT ... 403 Forbidden ... $ # the public bucket $ curl -v https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/DOESNOTEXIST.TXT ... 404 Not Found ... So, in total Socorro Processor asks S3 for 61 KEYS PER SECOND THAT RESULT IN A NOT FOUND. Doing 61 writes / second into Postgres (even if 99% of them are SQL UPDATES) is not particularly attractive. Relational databases aren't ideal for many rapid writes. Let's use Redis for this and let's use expiring keys since we're only ever interested in what were the missing symbols in the last day anyway.
Reporter | ||
Comment 6•7 years ago
|
||
Also useful to know is the average and median total time it takes for S3 to respond. CODE PRIVATE COUNT AVERAGE MEDIAN 200 true 217 205 111 200 false 21575 170 56 304 false 158 35 5.0 404 false 214162 20 5.0 403 true 133977 11 5 (filtered out those with counts < 10) The "Total Time" is...: "The number of milliseconds the request was in flight from the server's perspective. This value is measured from the time your request is received to the time that the last byte of the response is sent. Measurements made from the client's perspective might be longer due to network latency. " -- http://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html This does NOT include the download time from the client's point of view. But it's certainly interesting that a 404 takes on average (20+11)/2 ~= 16 Milliseconds. Meaning, S3 is extremely fast at figuring out that it does NOT have a file.
Reporter | ||
Comment 7•7 years ago
|
||
For what it's worth, I took the list of ALL URLs and sorted by the most common file extension on the URLs: 59405 .sym 5590 .ptr 4144 .dll 4133 .dl_ 1130 .pdb 1117 .pd_ 142 .ex_ 141 .exe 43 .api 40 .ap_
Reporter | ||
Comment 8•7 years ago
|
||
I have a solution in code now. The idea is that it writes down every symbol it can't find in S3 in the Redis cache. (Not the Redis used for storing parsed symbols). The solution is to use an incrementing key made up of a prefix, date, symbol, debug id, filename, code_file, code_id. E.g. `missingsymbols:2017-05-08:eps2003csp11.pdb|8963A03F74A5494C9F0271964B7185E11|eps2003csp11.sym|eps2003csp11.dll|56582C6Fd5000` These are fairly fat keys and since we expected roughly 300K keys per day, the size in Redis would take up a bit of space. Running a ongoing measurement in Docker, and noting the total number of keys compared to used_memory, I get this slope: y = 208.77x + 780000 (see https://docs.google.com/spreadsheets/d/1Tdjml3IpHziV_MfBzQd7b-Hii8S-BXt2GOCszX-__Ug/edit?usp=sharing) Meaning, at 300k keys we expect this to use up 63411000.0 ~= 60MB. In other words, very small.
Reporter | ||
Comment 9•7 years ago
|
||
PR https://github.com/mozilla-services/tecken/pull/72
Comment 10•7 years ago
|
||
Commit pushed to master at https://github.com/mozilla-services/tecken https://github.com/mozilla-services/tecken/commit/0cfad89e0e1d16de47790150138a58e0acc00fb3 fixes bug 1361854 - Store Not Found symbol lookups (#72)
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•