Bug 1127532 Opened 9 years ago Closed 9 years ago

host "snappy" server (symbolapi.m.o)

Categories: Socorro :: Infra
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rhelmer; Assigned: rhelmer

There's currently a bespoke VM hosting symbolapi.m.o which runs https://github.com/vdjeric/Snappy-Symbolication-Server/

Snappy just needs access to our S3 symbols; it should not need access to any other services.
Should we put this on an EC2 node?

Currently this only needs access to public data (the S3 symbols-public bucket which is totally open now), so this might be a good candidate for Heroku. Especially as we want vladan to be able to get access to it for debugging if necessary, and we want it to auto-deploy.
Flags: needinfo?(dmaher)
(In reply to Robert Helmer [:rhelmer] from comment #1)
> Should we put this on an EC2 node?
> 
> Currently this only needs access to public data (the S3 symbols-public
> bucket which is totally open now), so this might be a good candidate for
> Heroku. Especially as we want vladan to be able to get access to it for
> debugging if necessary, and we want it to auto-deploy.

Maybe answering my own question, but this seems to be too slow for Heroku which times out web requests after 30 seconds.
The 30-second timeout only applies until the first byte is sent; after that there's a rolling 55 second window to accommodate streaming things. Are you worried about the initial byte?
(In reply to Chris Lonnen :lonnen from comment #3)
> The 30-second timeout only applies until the first byte is sent; after that
> there's a rolling 55 second window to accommodate streaming things. Are you
> worried about the initial byte?

Snappy downloads and parses (potentially multiple) symbol files per incoming request, and it does this before sending the first byte.

It *was* taking over 30s to return a response before I realized how different the defaults are from the sample.ini; also, the default path for the MRU cache wasn't working on Heroku.

It seems faster now:

curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/

Vladan, this is just a temporary URL (it's running on my own Heroku account) but would you mind checking to see if it seems to be working correctly and is fast enough ^?
Flags: needinfo?(vdjeric)
- We don't need HTTPS for symbolication

- I tried your curl command, and it took 20 seconds to satisfy that toy request. It caches it afterwards.
* By contrast, the current symbolication server (using NFS symbol mount) answers a similar toy request almost instantaneously.
* It's a developer tool, but I'd like it to perform faster. I think when we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much faster.

- I tried symbolicating a brief profile captured with the profiler extension against that herokuapp URL. It took it 40 seconds to return a response.
* The profiler extension couldn't parse the response: "profiler.filteredThreadSamples is undefined"
* Ted, do you know what this error is about?
Flags: needinfo?(vdjeric) → needinfo?(ted)
(In reply to Vladan Djeric (:vladan) from comment #5)
> - We don't need HTTPS for symbolication
> 
> - I tried your curl command, and it took 20 seconds to satisfy that toy
> request. It caches it afterwards.
> * By contrast, the current symbolication server (using NFS symbol mount)
> answers a similar toy request almost instantaneously.
> * It's a developer tool, but I'd like it to perform faster. I think when
> we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much
> faster.
> 
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.


I spun up an EC2 micro instance in the same zone as the symbols S3 bucket. Heroku is in
a different AWS zone, and may be slower for other reasons.

How does this compare:
http://ec2-54-191-238-159.us-west-2.compute.amazonaws.com
(In reply to Robert Helmer [:rhelmer] from comment #2)
> (In reply to Robert Helmer [:rhelmer] from comment #1)
> > Should we put this on an EC2 node?
> > 
> > Currently this only needs access to public data (the S3 symbols-public
> > bucket which is totally open now), so this might be a good candidate for
> > Heroku. Especially as we want vladan to be able to get access to it for
> > debugging if necessary, and we want it to auto-deploy.
> 
> Maybe answering my own question, but this seems to be too slow for Heroku
> which times out web requests after 30 seconds.

I have no strong opinion either way.  If we host it on EC2 then we're responsible for the infrastructure, so that will need to be config managed (etc), which is fine.  If we go Heroku, then we don't manage the infra, which is also fine.  Either way, however, we still need to manage things like deployment, monitoring, asset management, and so forth.  We don't currently have a good policy in place for managing not-AWS, so that could be a consideration, though we should probably think about the best way to do that going forward. :)
Flags: needinfo?(dmaher)
(In reply to Vladan Djeric (:vladan) from comment #5)
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.
> * The profiler extension couldn't parse the response:
> "profiler.filteredThreadSamples is undefined"
> * Ted, do you know what this error is about?

I don't know what this error is, sorry. The responses from my test queries look the same from both the existing server and the test servers. I used the profiler extension to profile against a local server running the new code when I wrote those patches and it worked fine here.

In terms of timing:
luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.659s
user	0m0.007s
sys	0m0.003s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.381s
user	0m0.005s
sys	0m0.009s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real	0m0.392s
user	0m0.003s
sys	0m0.006s

All 3 servers seem to be about as fast. However, the Heroku server did take a long time to respond to my first request (which I didn't think to time, of course). I think Heroku might spin down the service when it's not in use. If that's true, that's probably bad for performance here: the server caches parsed symbols in memory, so restarting means it loses its entire cache. In fact, it'd be doubly bad because as an optimization it attempts to reload previously-used symbols on startup, so if your request caused the server to be spun up you'd have to wait extra long while it parsed all those symbols.
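
For what it's worth, that cold-start effect should be visible from the outside just by timing the same request twice in a row after the dyno has been idle for a while. A rough sketch, reusing the toy request and Heroku URL from comment 4 (it only tells you something if the dyno really was asleep before the first request):

# first request after idle: pays dyno startup plus cache-prefill cost
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/

# same request again: dyno is awake and the symbols are already parsed in memory
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/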
Flags: needinfo?(ted)
(In reply to Daniel Maher [:phrawzty] from comment #7)
> (In reply to Robert Helmer [:rhelmer] from comment #2)
> > (In reply to Robert Helmer [:rhelmer] from comment #1)
> > > Should we put this on an EC2 node?
> > > 
> > > Currently this only needs access to public data (the S3 symbols-public
> > > bucket which is totally open now), so this might be a good candidate for
> > > Heroku. Especially as we want vladan to be able to get access to it for
> > > debugging if necessary, and we want it to auto-deploy.
> > 
> > Maybe answering my own question, but this seems to be too slow for Heroku
> > which times out web requests after 30 seconds.
> 
> I have no strong opinion either way.  If we host it on EC2 then we're
> responsible for the infrastructure, so that will need to be config managed
> (etc), which is fine.  If we go Heroku, then we don't manage the infra,
> which is also fine.  Either way, however, we still need to manage things
> like deployment, monitoring, asset management, and so forth.  We don't
> currently have a good policy in place for managing not-AWS, so that could be
> a consideration, though we should probably think about the best way to do
> that going forward. :)


From the testing Ted and Vladan have done in this bug, this sounds like a better fit for EC2 in any case.
I retested with the profiler extension. I pointed the profiler.symbolicationUrl value to the 3 servers and timed how long it took to fetch symbols for a startup profile.

Heroku: ~45 seconds for the first (uncached) request, and it gave back an invalid response ("profiler.filteredThreadSamples is undefined"). Maybe it hit a timeout? After that, it took about 10 seconds to respond (correctly).

EC2: ~60 seconds for the first (uncached) request, but it did return a correct response.  About 10 seconds thereafter.

symbolapi.mozilla.org: 10 seconds for all requests. I think symbolapi.mozilla.org had the advantage of pre-fetching, or of having already been used by others to symbolicate today's Nightly.

In any case, I agree we should probably host on EC2. I don't think a micro instance is sufficient; we want more RAM for caching and a faster CPU for faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache size config should be adjusted for the amount of RAM available.

Eventually, we should restore the pre-fetching functionality somehow but I'm ok with leaving that for later.
Note that it would be useful to know where the time is being spent during that first uncached request, whether it's in fetching or in processing the fetched .sym files. A few printfs ought to be enough. I'd appreciate it if someone could get those numbers; I don't have time to do that this week as I'm helping with the Telemetry/FHR unification, which is on a pretty tight schedule, and a few other things as well.
(In reply to Vladan Djeric (:vladan) from comment #11)
> Note that it would be useful to know where the time is being spent during
> that first uncached request, whether it's in fetching or in processing the
> fetched .sym files. A few printfs ought to be enough. I'd appreciate it if
> someone could get those numbers, I don't have time to do that this week as
> I'm helping with the Telemetry/FHR unification which is on a pretty tight
> schedule, and a few other things as well.

I am curious about this too, I'll work on it this week.
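
One rough way to split that without touching the code: time the raw .sym download for a module on its own, then a full symbolication request that needs the same module; the difference is roughly parse plus cache-fill overhead. A sketch (the symbols.mozilla.org URL layout used here and the choice of the EC2 test host are assumptions for illustration):

# raw fetch cost for one module's .sym file
time curl -sL -o /dev/null https://symbols.mozilla.org/xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2/xul.sym

# full symbolication request touching the same module, against the EC2 test instance from comment 6
time curl -s -o /dev/null -d '{"stacks":[[[0,11723767]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-191-238-159.us-west-2.compute.amazonaws.com/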
(In reply to Vladan Djeric (:vladan) from comment #10)
> In any case, I agree we should probably host on EC2. I don't think a micro
> instance is sufficient, we want more RAM for caching and a faster CPU for
> faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache
> size config should be adjusted for the amount of RAM available.
> 
> Eventually, we should restore the pre-fetching functionality somehow but I'm
> ok with leaving that for later.

Sounds like a plan. Thanks!
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Vladan Djeric (:vladan) from comment #11)
> > Note that it would be useful to know where the time is being spent during
> > that first uncached request, whether it's in fetching or in processing the
> > fetched .sym files.

One thing I notice right away is that fetching a symbol file with e.g. curl on the same box takes almost no time, so presumably the problem is elsewhere. Real numbers coming soon.
My gut feeling is that it's what I described in comment 8--the Heroku dyno is being put to sleep (described here: https://devcenter.heroku.com/articles/dynos#dyno-sleeping ) which means the server process is terminated. When you make a new request it has to start again, and on startup it attempts to re-fill the memory cache using its MRU list from the last run, which adds extra overhead.
Just so we're comparing apples to apples here I picked two symbol files that symbolapi is extremely unlikely to have precached--xul.pdb from the Firefox 25 and 26 releases:

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m2.027s
user	0m0.005s
sys	0m0.005s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m23.855s
user	0m0.011s
sys	0m0.004s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m11.729s
user	0m0.005s
sys	0m0.005s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m2.316s
user	0m0.010s
sys	0m0.000s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m27.827s
user	0m0.005s
sys	0m0.009s

luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real	0m9.889s
user	0m0.000s
sys	0m0.010s

The timing is fairly consistent there with the existing symbolapi taking ~2s, the Heroku dyno taking 20+s and the EC2 instance taking ~10s.
To repeat these experiments usefully you'd need to pick a new xul.pdb file that's unlikely to be cached. Old Firefox releases (or Thunderbird or whatever) are good candidates. If you're spinning up a new server you can just repeat one of those commands with your new URL, since a new server will have an empty cache. (Note that if you stop and restart a server it'll prefill its cache so that's not valid.)
So we could make Heroku faster by using higher-CPU dynos, but at that price it's probably not worth it.

Repeating the experiment in comment 16, first hit with a few different instance types:

* t2.medium - ~5s
* c4.xlarge - ~3s

I stopped the server, removed /tmp/snappy-mru-symbols.json, started it again, and repeated this several times; the timing seems pretty consistent.
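
For anyone repeating this, each run looks roughly like the following. The script name, start command, and port are assumptions about this particular setup rather than anything the project prescribes; the request is one of the cold-cache ones from comment 16:

# stop the running server (however it was started on the instance)
pkill -f symbolicationWebService
# drop the MRU list so the cache isn't prefilled on the next startup
rm -f /tmp/snappy-mru-symbols.json
# start the server again with a genuinely empty cache, give it a moment, then time the first request
python symbolicationWebService.py &
sleep 5
time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://localhost:8080/; echo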
For a more realistic workload here's a request body packet captured from the profiler extension:
https://gist.github.com/luser/7054086aac022ab0ac01
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> For a more realistic workload here's a request body packet captured from the
> profiler extension:
> https://gist.github.com/luser/7054086aac022ab0ac01

This is great, thanks! I'll use this to compare the current symbolapi.m.o against the instance types we're considering, I want to make sure this service stays fast.
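
Something along these lines should work as a repeatable benchmark (the gist raw URL and local filename are assumptions; swap in whichever server is being measured):

# save the captured request body locally, then replay it and time the whole round trip
curl -sL -o profiler-request.json https://gist.githubusercontent.com/luser/7054086aac022ab0ac01/raw
time curl -s -o /dev/null -d @profiler-request.json http://symbolapi.mozilla.org/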
So is Snappy on EC2 now?
(In reply to Vladan Djeric (:vladan) from comment #21)
> So is Snappy on EC2 now?

Not yet, we're working on building out our AWS infra this quarter.
Assignee: nobody → rhelmer
Blocks: 1118288
Status: NEW → ASSIGNED
When are the NFS-mounted symbols going away?
Flags: needinfo?(rhelmer)
(In reply to Vladan Djeric (:vladan) from comment #23)
> When are the NFS-mounted symbols going away?

I don't think we have a specific date yet, but we won't shut it down until the replacement in AWS is ready. We're trying to get all of symbols and all of crash-stats done this quarter.
Flags: needinfo?(rhelmer)
Right, once we fix all the deps of bug 1071724 we'll no longer be using the NFS mount for anything. Once we've switched everyone to using the upload API (bug 1085557, bug 1130138) and land bug 1085530 we'll no longer be storing new symbols in NFS so s3 will be the repository of record. Actually turning off the NFS mount is just a formality at that point, as we'll no longer be putting any new data into it.

Obviously we should finalize the Snappy migration before we stop putting new symbols in NFS.
:phrawzty packaged this up for us, and we will spin up an EC2 node for it next:

https://github.com/mozilla/Snappy-Symbolication-Server/commit/172344b7b141a3385ef2382678dbbbd5ab8b86d6
We have this working fine in staging:

http://symbolapi.mocotoolsstaging.net/

Going to bring up the prod version soon; the last step will be transitioning DNS.

We need to be done with our AWS move first, so I'm flipping the dependencies around.
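
As a quick smoke test, the toy request from comment 4 can be pointed at the staging host, e.g.:

curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' http://symbolapi.mocotoolsstaging.net/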
No longer blocks: 1118288
Depends on: 1118288
Prod infra is up as well:

http://symbolapi.mocotoolsprod.net/

We need to do some more testing of our AWS env; the last step here is going to be switching the symbolapi.mozilla.org DNS to point to the above.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED