Bug 1127532 Opened 9 years ago Closed 9 years ago

host "snappy" server (symbolapi.m.o)

Categories: Socorro :: Infra
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rhelmer; Assigned: rhelmer

There's currently a bespoke VM hosting symbolapi.m.o which runs https://github.com/vdjeric/Snappy-Symbolication-Server/

Snappy just needs access to our S3 symbols; it should not need access to any other services.
Should we put this on an EC2 node?

Currently this only needs access to public data (the S3 symbols-public bucket which is totally open now), so this might be a good candidate for Heroku. Especially as we want vladan to be able to get access to it for debugging if necessary, and we want it to auto-deploy.
Flags: needinfo?(dmaher)
(In reply to Robert Helmer [:rhelmer] from comment #1)
> Should we put this on an EC2 node?
> 
> Currently this only needs access to public data (the S3 symbols-public
> bucket which is totally open now), so this might be a good candidate for
> Heroku. Especially as we want vladan to be able to get access to it for
> debugging if necessary, and we want it to auto-deploy.

Maybe answering my own question, but this seems to be too slow for Heroku which times out web requests after 30 seconds.
The 30-second timeout only applies until the first byte is sent; after that there's a rolling 55 second window to accommodate streaming things. Are you worried about the initial byte?
(In reply to Chris Lonnen :lonnen from comment #3)
> The 30-second timeout only applies until the first byte is sent; after that
> there's a rolling 55 second window to accommodate streaming things. Are you
> worried about the initial byte?

Snappy downloads and parses (potentially multiple) symbol files per incoming request, and it does this before sending the first byte.

It *was* taking over 30s to return a response before I realized how different the defaults are from the sample.ini; also, the default path for the MRU cache wasn't working on Heroku.

It seems faster now:

curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/

Vladan, this is just a temporary URL (it's running on my own Heroku account) but would you mind checking to see if it seems to be working correctly and is fast enough ^?
Flags: needinfo?(vdjeric)
- We don't need HTTPS for symbolication

- I tried your curl command, and it took 20 seconds to satisfy that toy request. It caches it afterwards.
* By contrast, the current symbolication server (using NFS symbol mount) answers a similar toy request almost instantaneously.
* It's a developer tool, but I'd like it to perform faster. I think when we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much faster.

- I tried symbolicating a brief profile captured with the profiler extension against that herokuapp URL. It took it 40 seconds to return a response.
* The profiler extension couldn't parse the response: "profiler.filteredThreadSamples is undefined"
* Ted, do you know what this error is about?
Flags: needinfo?(vdjeric) → needinfo?(ted)
(In reply to Vladan Djeric (:vladan) from comment #5)
> - We don't need HTTPS for symbolication
> 
> - I tried your curl command, and it took 20 seconds to satisfy that toy
> request. It caches it afterwards.
> * By contrast, the current symbolication server (using NFS symbol mount)
> answers a similar toy request almost instantaneously.
> * It's a developer tool, but I'd like it to perform faster. I think when
> we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much
> faster.
> 
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.


I spun up an EC2 micro instance in the same zone as the symbols S3 bucket. Heroku is in
a different AWS zone, and may be slower for other reasons.

How does this compare:
http://ec2-54-191-238-159.us-west-2.compute.amazonaws.com
(In reply to Robert Helmer [:rhelmer] from comment #2)
> (In reply to Robert Helmer [:rhelmer] from comment #1)
> > Should we put this on an EC2 node?
> > 
> > Currently this only needs access to public data (the S3 symbols-public
> > bucket which is totally open now), so this might be a good candidate for
> > Heroku. Especially as we want vladan to be able to get access to it for
> > debugging if necessary, and we want it to auto-deploy.
> 
> Maybe answering my own question, but this seems to be too slow for Heroku
> which times out web requests after 30 seconds.

I have no strong opinion either way.  If we host it on EC2 then we're responsible for the infrastructure, so that will need to be config managed (etc), which is fine.  If we go Heroku, then we don't manage the infra, which is also fine.  Either way, however, we still need to manage things like deployment, monitoring, asset management, and so forth.  We don't currently have a good policy in place for managing not-AWS, so that could be a consideration, though we should probably think about the best way to do that going forward. :)
Flags: needinfo?(dmaher)
(In reply to Vladan Djeric (:vladan) from comment #5)
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.
> * The profiler extension couldn't parse the response:
> "profiler.filteredThreadSamples is undefined"
> * Ted, do you know what this error is about?

I don't know what this error is, sorry. The responses from my test queries look the same from both the existing server and the test servers. I used the profiler extension to profile against a local server running the new code when I wrote those patches and it worked fine here.

In terms of timing:
luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.659s
user	0m0.007s
sys	0m0.003s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.381s
user	0m0.005s
sys	0m0.009s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.392s
user	0m0.003s
sys	0m0.006s

All 3 servers seem to be about as fast. However, the Heroku server did take a long time to respond to my first request (which I didn't think to time, of course). I think Heroku might spin down the service when it's not in use. If that's true, that's probably bad for performance here: the server caches parsed symbols in memory, so restarting means it loses its entire cache. In fact, it'd be doubly bad because as an optimization it attempts to reload previously-used symbols on startup, so if your request caused the server to be spun up you'd have to wait extra long while it parsed all those symbols.
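
For what it's worth, that cold-start effect should be visible from the outside just by timing the same request twice in a row after the dyno has been idle for a while. A rough sketch, reusing the toy request and Heroku URL from comment 4 (it only tells you something if the dyno really was asleep before the first request):

# first request after idle: pays dyno startup plus cache-prefill cost
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/

# same request again: dyno is awake and the symbols are already parsed in memory
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/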
Flags: needinfo?(ted)
(In reply to Daniel Maher [:phrawzty] from comment #7)
> (In reply to Robert Helmer [:rhelmer] from comment #2)
> > (In reply to Robert Helmer [:rhelmer] from comment #1)
> > > Should we put this on an EC2 node?
> > > 
> > > Currently this only needs access to public data (the S3 symbols-public
> > > bucket which is totally open now), so this might be a good candidate for
> > > Heroku. Especially as we want vladan to be able to get access to it for
> > > debugging if necessary, and we want it to auto-deploy.
> > 
> > Maybe answering my own question, but this seems to be too slow for Heroku
> > which times out web requests after 30 seconds.
> 
> I have no strong opinion either way.  If we host it on EC2 then we're
> responsible for the infrastructure, so that will need to be config managed
> (etc), which is fine.  If we go Heroku, then we don't manage the infra,
> which is also fine.  Either way, however, we still need to manage things
> like deployment, monitoring, asset management, and so forth.  We don't
> currently have a good policy in place for managing not-AWS, so that could be
> a consideration, though we should probably think about the best way to do
> that going forward. :)


From the testing Ted and Vladan have done in this bug, this sounds like a better fit for EC2 in any case.
I retested with the profiler extension. I pointed the profiler.symbolicationUrl value to the 3 servers and timed how long it took to fetch symbols for a startup profile.

Heroku: ~45 seconds for the first (uncached) request, and it gave back an invalid response ("profiler.filteredThreadSamples is undefined"). Maybe it hit a timeout? After that, it took about 10 seconds to respond (correctly).

EC2: ~60 seconds for the first (uncached) request, but it did return a correct response.  About 10 seconds thereafter.

symbolapi.mozilla.org: 10 seconds for all requests. I think symbolapi.mozilla.org had the advantage of pre-fetching, or of having already been used by others to symbolicate today's Nightly.

In any case, I agree we should probably host on EC2. I don't think a micro instance is sufficient; we want more RAM for caching and a faster CPU for faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache size config should be adjusted for the amount of RAM available.

Eventually, we should restore the pre-fetching functionality somehow but I'm ok with leaving that for later.
Note that it would be useful to know where the time is being spent during that first uncached request, whether it's in fetching or in processing the fetched .sym files. A few printfs ought to be enough. I'd appreciate it if someone could get those numbers; I don't have time to do that this week as I'm helping with the Telemetry/FHR unification, which is on a pretty tight schedule, and a few other things as well.
(In reply to Vladan Djeric (:vladan) from comment #11)
> Note that it would be useful to know where the time is being spent during
> that first uncached request, whether it's in fetching or in processing the
> fetched .sym files. A few printfs ought to be enough. I'd appreciate it if
> someone could get those numbers, I don't have time to do that this week as
> I'm helping with the Telemetry/FHR unification which is on a pretty tight
> schedule, and a few other things as well.

I am curious about this too, I'll work on it this week.
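
One rough way to split that without touching the code: time the raw .sym download for a module on its own, then a full symbolication request that needs the same module; the difference is roughly parse plus cache-fill overhead. A sketch (the symbols.mozilla.org URL layout used here and the choice of the EC2 test host are assumptions for illustration):

# raw fetch cost for one module's .sym file
time curl -sL -o /dev/null https://symbols.mozilla.org/xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2/xul.sym

# full symbolication request touching the same module, against the EC2 test instance from comment 6
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-191-238-159.us-west-2.compute.amazonaws.com/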
(In reply to Vladan Djeric (:vladan) from comment #10)
> In any case, I agree we should probably host on EC2. I don't think a micro
> instance is sufficient, we want more RAM for caching and a faster CPU for
> faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache
> size config should be adjusted for the amount of RAM available.
> 
> Eventually, we should restore the pre-fetching functionality somehow but I'm
> ok with leaving that for later.

Sounds like a plan. Thanks!
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Vladan Djeric (:vladan) from comment #11)
> > Note that it would be useful to know where the time is being spent during
> > that first uncached request, whether it's in fetching or in processing the
> > fetched .sym files.

One thing I notice right away is that fetching a symbol file with e.g. curl on the same box takes almost no time, so presumably the problem is elsewhere. Real numbers coming soon.
My gut feeling is that it's what I described in comment 8--the Heroku dyno is being put to sleep (described here: https://devcenter.heroku.com/articles/dynos#dyno-sleeping ) which means the server process is terminated. When you make a new request it has to start again, and on startup it attempts to re-fill the memory cache using its MRU list from the last run, which adds extra overhead.
Just so we're comparing apples to apples here I picked two symbol files that symbolapi is extremely unlikely to have precached--xul.pdb from the Firefox 25 and 26 releases:

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m2.027s
user	0m0.005s
sys	0m0.005s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m23.855s
user	0m0.011s
sys	0m0.004s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m11.729s
user	0m0.005s
sys	0m0.005s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m2.316s
user	0m0.010s
sys	0m0.000s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m27.827s
user	0m0.005s
sys	0m0.009s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m9.889s
user	0m0.000s
sys	0m0.010s

The timing is fairly consistent there with the existing symbolapi taking ~2s, the Heroku dyno taking 20+s and the EC2 instance taking ~10s.
To repeat these experiments usefully you'd need to pick a new xul.pdb file that's unlikely to be cached. Old Firefox releases (or Thunderbird or whatever) are good candidates. If you're spinning up a new server you can just repeat one of those commands with your new URL, since a new server will have an empty cache. (Note that if you stop and restart a server it'll prefill its cache so that's not valid.)
So we could make Heroku faster by using higher-CPU dynos, but at that price it's probably not worth it.

Repeating the experiment in comment 16, first hit with a few different instance types:

* t2.medium - ~5s
* c4.xlarge - ~3s

I stopped the server, removed /tmp/snappy-mru-symbols.json, started it again, and repeated this several times; the timing seems pretty consistent.
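
For anyone repeating this, each run looks roughly like the following. The script name, start command, and port are assumptions about this particular setup rather than anything the project prescribes; the request is one of the cold-cache ones from comment 16:

# stop the running server (however it was started on the instance)
pkill -f symbolicationWebService
# drop the MRU list so the cache isn't prefilled on the next startup
rm -f /tmp/snappy-mru-symbols.json
# start the server again with a genuinely empty cache, give it a moment, then time the first request
python symbolicationWebService.py &
sleep 5
time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://localhost:8080/; echo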
For a more realistic workload here's a request body packet captured from the profiler extension:
https://gist.github.com/luser/7054086aac022ab0ac01
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> For a more realistic workload here's a request body packet captured from the
> profiler extension:
> https://gist.github.com/luser/7054086aac022ab0ac01

This is great, thanks! I'll use this to compare the current symbolapi.m.o against the instance types we're considering, I want to make sure this service stays fast.
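
Something along these lines should work as a repeatable benchmark (the gist raw URL and local filename are assumptions; swap in whichever server is being measured):

# save the captured request body locally, then replay it and time the whole round trip
curl -sL -o profiler-request.json https://gist.githubusercontent.com/luser/7054086aac022ab0ac01/raw
time curl -s -o /dev/null -d @profiler-request.json http://symbolapi.mozilla.org/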
So is Snappy on EC2 now?
(In reply to Vladan Djeric (:vladan) from comment #21)
> So is Snappy on EC2 now?

Not yet, we're working on building out our AWS infra this quarter.
Assignee: nobody → rhelmer
Blocks: 1118288
Status: NEW → ASSIGNED
When are the NFS-mounted symbols going away?
Flags: needinfo?(rhelmer)
(In reply to Vladan Djeric (:vladan) from comment #23)
> When are the NFS-mounted symbols going away?

I don't think we have a specific date yet, but we won't shut it down until the replacement in AWS is ready. We're trying to get all of symbols and all of crash-stats done this quarter.
Flags: needinfo?(rhelmer)
Right, once we fix all the deps of bug 1071724 we'll no longer be using the NFS mount for anything. Once we've switched everyone to using the upload API (bug 1085557, bug 1130138) and land bug 1085530 we'll no longer be storing new symbols in NFS so s3 will be the repository of record. Actually turning off the NFS mount is just a formality at that point, as we'll no longer be putting any new data into it.

Obviously we should finalize the Snappy migration before we stop putting new symbols in NFS.
:phrawzty packaged this up for us, and we will spin up an EC2 node for it next:

https://github.com/mozilla/Snappy-Symbolication-Server/commit/172344b7b141a3385ef2382678dbbbd5ab8b86d6
We have this working fine in staging:

http://symbolapi.mocotoolsstaging.net/

Going to bring up the prod version soon; the last step will be transitioning DNS.

We need to be done with our AWS move first, so I'm flipping the dependencies around.
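
As a quick smoke test, the toy request from comment 4 can be pointed at the staging host, e.g.:

curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' http://symbolapi.mocotoolsstaging.net/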
No longer blocks: 1118288
Depends on: 1118288
Prod infra is up as well:

http://symbolapi.mocotoolsprod.net/

We need to do some more testing of our AWS env; the last step here is going to be switching the symbolapi.mozilla.org DNS to point to the above.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED