Closed Bug 1244589 Opened 8 years ago Closed 7 years ago

Symbol API server (symbolapi.mozilla.org) is down

Categories

(Socorro :: Infra, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: xidorn, Unassigned)

References

Details

(Whiteboard: [fce-active-legacy])

It seems to me this is down frequently... or actually hasn't been up for quite a while?

It makes me pretty sad when I finish profiling and then find the server is down, which turns the profiling into completely no sense...
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2524]
Unfortunately, my group isn't responsible for the symbolapi site.  I'm trying to see *who* can help you with your issues with this site: the last person who was mainly responsible for the site has left and no one seems sure who has taken his place.
I would say this is a major issue (or probably even blocker) since without this server, the Gecko Profiler is almost useless, and consequently it is nearly impossible to debug performance issues in Gecko.
Severity: normal → major
Vladan used to take care of this server.

Peterbe, this is a python service hosted in the Socorro infra - the Socorro team should probably take this over.

It may be helpful to look at the git commit log to find someone else who has worked on this app:
https://github.com/mozilla/Snappy-Symbolication-Server/
Flags: needinfo?(peterbe)
FWIW, this service has been flaky for some time, so I wrote a replacement as an opportunity to learn Rust:
https://github.com/rhelmer/symbolapi

Needs a bit more work but is mostly complete at this point.

From my testing, it's fast enough on Heroku (we tried the python "snappy" service there first and it wasn't), this is such a simple service it's silly to manage EC2 nodes for it etc. Only needs access to the public symbol bucket.
(In reply to Robert Helmer [:rhelmer] from comment #3)
> Vladan used to take care of this server.
> 
> Peterbe, this is a python service hosted in the Socorro infra - the Socorro
> team should probably take this over.
> 
> It may be helpful to look at the git commit log to find someone else who has
> worked on this app:
> https://github.com/mozilla/Snappy-Symbolication-Server/

I do recall that when Vladan was managing this, he would investigate the log and restart when this service would get stuck... to resolve this bug a simple restart may be all that's needed.

This service is installed via RPM and has a systemd service etc. as you'd expect.
Flags: needinfo?(peterbe) → needinfo?(jschneider)
(In reply to Robert Helmer [:rhelmer] from comment #4)
> FWIW, this service has been flaky for some time, so I wrote a replacement as
> an opportunity to learn Rust:
> https://github.com/rhelmer/symbolapi
> 
> Needs a bit more work but is mostly complete at this point.
> 
> From my testing, it's fast enough on Heroku (we tried the python "snappy"
> service there first and it wasn't), this is such a simple service it's silly
> to manage EC2 nodes for it etc. Only needs access to the public symbol
> bucket.

Oh, that would be great. I guess if the symbol bucket is public accessible, I can use this to deploy a server for myself :)
(In reply to Xidorn Quan [:xidorn] (UTC+10) (less responsive until Feb 22) from comment #6)
> (In reply to Robert Helmer [:rhelmer] from comment #4)
> > FWIW, this service has been flaky for some time, so I wrote a replacement as
> > an opportunity to learn Rust:
> > https://github.com/rhelmer/symbolapi
> > 
> > Needs a bit more work but is mostly complete at this point.
> > 
> > From my testing, it's fast enough on Heroku (we tried the python "snappy"
> > service there first and it wasn't), this is such a simple service it's silly
> > to manage EC2 nodes for it etc. Only needs access to the public symbol
> > bucket.
> 
> Oh, that would be great. I guess if the symbol bucket is public accessible,
> I can use this to deploy a server for myself :)

Yes you could - symbols are in https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/ and the current server is https://github.com/mozilla/Snappy-Symbolication-Server/
Flags: needinfo?(jschneider)
:xidorn - Have you been able to deploy a symbol server for yourself?

If so, it is okay to close out this bug?
Flags: needinfo?(quanxunzhen)
I don't think whether this bug could be closed relates to whether I am able to deploy my own symbol server at all, as far as the profiler addon depends on the official server by default.

It seems this server has restored, so this bug can be closed.

I hope some team has taken the maintaince of this server. :rhelmer, do you know what team is now responsible for this server?
Flags: needinfo?(quanxunzhen) → needinfo?(rhelmer)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
(In reply to Xidorn Quan [:xidorn] (UTC+8) from comment #9)
> I don't think whether this bug could be closed relates to whether I am able
> to deploy my own symbol server at all, as far as the profiler addon depends
> on the official server by default.
> 
> It seems this server has restored, so this bug can be closed.
> 
> I hope some team has taken the maintaince of this server. :rhelmer, do you
> know what team is now responsible for this server?

It's in the Socorro infra but I don't think there's anyone tending to the software and the instance now that Vladan has left.

Peterbe, we need to at least make sure the node is getting system updates, and there should be some basic monitoring in place - here is a basic example of how to use the service, I can minimize this for you: https://gist.github.com/luser/7054086aac022ab0ac01
Status: RESOLVED → REOPENED
Flags: needinfo?(rhelmer) → needinfo?(peterbe)
Resolution: WORKSFORME → ---
Rob, 
When we spoke you talked of convincing Travis's team to take on this work. How did that go?
Flags: needinfo?(peterbe)
The server is down again.  Could somebody please restart it?
Assignee: server-ops-webops → nobody
Component: WebOps: Other → MOC: Problems
QA Contact: smani → lypulong
symbolapi is down again and Webops doesn't appear to have any access to help so tossing this back in the MOC's queue.  Created bug 1270813 to update the out of date documentation to reflect reality.
See Also: → 1270813
Hello,

We don't monitor this, and it seems like all owning parties are aware that this is down.

Leaving this to them and the documentation in bug 1270813. Feel free to reopen if there is anything we can assist with.
Severity: major → normal
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → INCOMPLETE
This should definitely be fixed, otherwise it would make analysing performance issue very hard and annoying.

Alternatively, could you try open a new service with the same function via using either the original Python impl or the Rust impl? It is not necessary to be in the same domain, but we need a stable service for this!
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Severity: normal → blocker
CC'ing other potential owners.
Severity: blocker → normal
Update,

PM'd Robert (rhelmer), said he had pinged those who work with this in #breakpad.
I took care of the immediate problem here by restarting the symbolapi service. I ran a quick strace first:

stat("/usr/lib/python2.7/site-packages/tornado-4.2.1-py2.7-linux-x86_64.egg/tornado/gen.py", {st_mode=S_IFREG|0644, st_size=37197, ...}) = 0
stat("/usr/lib/python2.7/site-packages/tornado-4.2.1-py2.7-linux-x86_64.egg/tornado/http1connection.py", {st_mode=S_IFREG|0644, st_size=30327, ...}) = 0
stat("/usr/lib/python2.7/site-packages/tornado-4.2.1-py2.7-linux-x86_64.egg/tornado/gen.py", {st_mode=S_IFREG|0644, st_size=37197, ...}) = 0
stat("/usr/lib/python2.7/site-packages/tornado-4.2.1-py2.7-linux-x86_64.egg/tornado/concurrent.py", {st_mode=S_IFREG|0644, st_size=18000, ...}) = 0
accept(8, 0x7fffd8446b90, [16])         = -1 EAGAIN (Resource temporarily unavailable)

There is a supervisor (well, systemd) but the app is hanging not crashing which is why it gets stuck like this.
Component: MOC: Problems → Infra
Product: Infrastructure & Operations → Socorro
QA Contact: lypulong
Oh, ugh, also - this isn't starting right from the systemd script :/ it was left running in a screen session with a debug.ini ... it also looks like it isn't working quite right with some test data I have lying around.

I'll get everything that's actually running checked in, but it's clear this service needs some work.
Aha, figured it out - the local cache area in `/tmp/snappy` appears to be corrupt, moving that out of the way helped. Now it works with the test data I loaded.

So what looks like happened here is that someone manually upgraded the software in-place, we need to fix it and rebuilt the RPM and re-deploy to clean it up.
Severity: normal → blocker
Hm looks like this is down again, error in the log file is:
Sun May  8 02:12:56 2016 MainThread   ERROR Error reading MRU symbols state file

strace doesn't show anything interesting, looks like there's some kind of corruption going on in the local file cache.
OK symbolapi.m.o is back up for the moment, going to keep this open to look into the problem though.
(In reply to Robert Helmer [:rhelmer] from comment #22)
> Hm looks like this is down again, error in the log file is:
> Sun May  8 02:12:56 2016 MainThread   ERROR Error reading MRU symbols state
> file
> 
> strace doesn't show anything interesting, looks like there's some kind of
> corruption going on in the local file cache.

FYI: during Sunday (till 23:00 CEST) I was able to work with it well.
jp, can we get some kind of monitoring for symbolapi.m.o? Here is a way I test it manually:

curl -d '{ "memoryMap": [ [ "mozalloc.pdb", "78C16E5E4E674550841DB69A8ADCE06F2" ] ], "stacks": [ [ [ 0, 5584 ] ] ], "version": 4 }' http://symbolapi.mozilla.org

That should respond in under 1s with:

{"symbolicatedStacks": [["__CppXcptFilter (in mozalloc.pdb)"]], "knownModules": [true]}

Do we have an existing monitoring service we're using that can handle something like this? I'd rather get pinged by a service than make people file bugs :)
Flags: needinfo?(jschneider)
Looks like it is down again... sad to find this every time I want to look into a performance issue...
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #26)
> Looks like it is down again... sad to find this every time I want to look
> into a performance issue...

Hm, ok well I've restarted it again - this doesn't seem like a very scalable way to go :)

We really need monitoring on this, and someone needs to spend some time debugging. I might be able to do that but not right away - AFAIK most of the people that worked on and set this up such as Vladan have moved on, so having a new owner here would be helpful.
Just chatted with xidorn, it stopped working almost immediately after restarting it. It was different this time, from the logs it was doing work (downloading and parsing symbols etc) but was just hanging.

I noticed that this was running a newer version of snappy symbolication server with a config to aid in debugging, so I reverted it to the latest RPM that we built the EC2 node with, since it seemed to be working better before.
Guys, I'm really sorry to raise my voice here, but it's been several days and I still can't work.  Is there at least a workaround to bypass the server somehow?

Maybe more a question for Benoit?
Flags: needinfo?(bgirard)
I don't have access to the server so there's not much I can do. We should really establish someone to maintain this more.

If want to a local build you can setup a local symbolserver:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler_and_Local_Symbols_on_Windows

Otherwise for now you can work on mac where we don't need the snappy symbol server.
Flags: needinfo?(bgirard)
I just restarted the symbolapi service once more, and JP got back to me regarding monitoring.

If somebody wants to own this (it's just running on an ec2 instance, can be moved easily and doesn't need any special access), please get in touch with me.
Flags: needinfo?(jschneider)
(In reply to Benoit Girard (:BenWa) from comment #30)
> I don't have access to the server so there's not much I can do. We should
> really establish someone to maintain this more.
> 
> If want to a local build you can setup a local symbolserver:
> https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> Profiling_with_the_Built-in_Profiler_and_Local_Symbols_on_Windows
> 
> Otherwise for now you can work on mac where we don't need the snappy symbol
> server.

Benoit, thanks!  Just made this work, something I actually had to do anyway.
geckoprofiler 1.16.16 does not wok...
it gets stuck while "Retrieving profile". 

https://hg.mozilla.org/mozilla-central/rev/16663eb3dcfa759f25b5e27b101bc79270c156f2
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0 ID:20160522030240
Server is down again...
Can someone get us some information on the server and services and MOC will take over monitoring and partner with you for triage and escalation steps
Flags: needinfo?(rhelmer)
(In reply to Linda Ypulong [:unixfairy] from comment #35)
> Can someone get us some information on the server and services and MOC will
> take over monitoring and partner with you for triage and escalation steps

A simple monitor that makes a request like in comment 25 would go a long way. Thanks!
Flags: needinfo?(rhelmer)
I've gone ahead and restarted the service, but we still need an owner.

This used to be owned by the perf team and then was moved into the Socorro AWS infra when the datacenter it was hosted in was being shut down. This wasn't really considered a production service which is why it doesn't have monitoring and SLA and so forth that one would expect (as we have for crash collection and processing for instance).

Getting monitoring stood up for this would be a great start. I probably will not have time to try to debug what's wrong with the server, but I will take a look at putting a watchdog service in place to at least automatically restart.
OK. I've put in place this really dumb watchdog script, running every 2 minutes:

```
#!/usr/bin/env bash

expected='{"symbolicatedStacks": [["__CppXcptFilter (in mozalloc.pdb)"]], "knownModules": [true]}'
actual=$(curl -s -m 30 -d '{ "memoryMap": [ [ "mozalloc.pdb", "78C16E5E4E674550841DB69A8ADCE06F2" ] ], "stacks": [ [ [ 0, 5584 ] ] ], "version": 4 }' http://symbolapi.mozilla.org)

if [ "$expected" != "$actual" ]
then
  sudo systemctl restart mozilla-snappy
  echo "snappy server restarted" | sendmail rhelmer@mozilla.com
fi
```

This test should take under 1s. If it takes more than 30 or responds with anything other than the expected response, the service will be restarted and I'll be notified.

I'll add some more diagnostic info to the email as time allows.
ryanc and fauweh can you work with this?  Thanks!
Flags: needinfo?(rchilds)
Flags: needinfo?(kferrando)
Sure.

Will get with Keegan and get this out in bug 1277332.
Flags: needinfo?(rchilds)
Flags: needinfo?(kferrando)
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2524]
I received an email notification from the watchdog script that symbolapi was restarted at 9 PM Pacific last night. I checked on it this morning (and just now) and it seems ok.
Chatted w/ mconley and ddurst this morning, the version of snappy was too old to work with the gecko profiler properly.

So in the meantime we agreed to run the latest version of snappy from github. It's running in screen, under the watchdog script from comment 38.

jp - we need to rebuild the latest version of this RPM and re-deploy this server (not right away but maybe after everyone is back from London):
https://github.com/mozilla/Snappy-Symbolication-Server/tree/master/RPM

We will presumably still have the problem that the server is hanging, I am happy to help make a cronjob version of the watchdog restart script, and ddurst will help with this.
Flags: needinfo?(jschneider)
rhelmer:  Let's actually see if we can rally up and work on it together, as I'm not really aware of where that rpm is being genned from, etc.  Also, looking forward to seeing you in London!
Thanks, everyone, for your help on this. I'm sorry it's been such a confusing transition (I wasn't aware my team owned this, and that the-team-formerly-known-as-perf managed it themselves previously). I'm all for having this either under the purview of MOC.

Adding gsvelto, as we've been discussing integrations with this service. He might be interested to see the Rust repo...
Flags: needinfo?(jschneider)
(In reply to David Durst [:ddurst] from comment #44)
> Thanks, everyone, for your help on this. I'm sorry it's been such a
> confusing transition (I wasn't aware my team owned this, and that
> the-team-formerly-known-as-perf managed it themselves previously). I'm all
> for having this either under the purview of MOC.
> 
> Adding gsvelto, as we've been discussing integrations with this service. He
> might be interested to see the Rust repo...

The rust symbolapi prototype I put together is at http://github.com/rhelmer/symbolapi - note the outstanding issues on the github repo. I've done a bit of stress and correctness testing with it and think the approach is solid, but needs some finishing up. It works on Heroku (and elsewhere, EC2 etc): http://symbolapi.herokuapp.com/

Just FYI it's currently not able to handle requests from the Gecko profiler on Windows, which seems to be sending invalid unicode (there's a github issue open for this), but you can see a basic symbolication request working:

curl -s -m 30 -d '{ "memoryMap": [ [ "mozalloc.pdb", "78C16E5E4E674550841DB69A8ADCE06F2" ] ], "stacks": [ [ [ 0, 5584 ] ] ], "version": 4 }' symbolapi.herokuapp.com

It uses the same source symbol data from the public S3 bucket that the python symbolapi uses.

Also, Ted's new Rust symbol parser is in use here (he's done a partial Rust breakpad rewrite), which is one big reason that it runs so much faster on Heroku than the current python+tornado app.
Hmmm, seems to be down again :/

But this time it fails pretty quick with
> Symbolication request to http://symbolapi.mozilla.org/gecko-profiler/ failed with error code HTTP 503

I guess there is something different than before.
I think the drive is full again.
It did indeed run out of disk space.

I just chatted w/ ddurst about this in IRC and then:
 * removed cache
 * `git pull`'d to pick up https://github.com/mozilla/Snappy-Symbolication-Server/pull/27
 * restarted
Sounds like there need to be some kind of cache management.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #49)
> Sounds like there need to be some kind of cache management.

Yes, it looks like there's an MRU with eviction here:
https://github.com/mozilla/Snappy-Symbolication-Server/blob/master/symCache.py

So that might need to be tweaked, or looked at if it's not working.

David, mind if we close this bug, and open new ones as needed? I think we're in a better place now with the perf team owning the code here, and cloud ops is getting ready to move this to a new instance with monitoring+metrics and such.
Flags: needinfo?(ddurst)
In the meantime... it seems the server is still down at this moment, with a different error:
> Symbolication request to http://symbolapi.mozilla.org/gecko-profiler/ failed with error code HTTP 504
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #51)
> In the meantime... it seems the server is still down at this moment, with a
> different error:
> > Symbolication request to http://symbolapi.mozilla.org/gecko-profiler/ failed with error code HTTP 504

Thanks - hm well it is logging this, and the watchdog script keeps killing it because the service is not responding:

2016-06-28 03:20:15,870 DEBUG   14678   Loading library ('xul.pdb', '2F48C44652644E3DA4EFF431CECD49EE2')
2016-06-28 03:20:17,529 DEBUG   14678   Loading library ('xul.pdb', '38B6B75DE29B4B059679EADF456867B82')
2016-06-28 03:20:19,393 DEBUG   14678   Loading library ('xul.pdb', '3EEBFDA53C804FAC9393EA0D9D582B3E2')
2016-06-28 03:20:21,064 DEBUG   14678   Loading library ('xul.pdb', '474039D4A04D469088C4774D0CE3F9622')
2016-06-28 03:20:22,894 DEBUG   14678   Loading library ('xul.pdb', '5A3AD8249FE040998956751BF1573D6D2')
2016-06-28 03:20:24,723 DEBUG   14678   Loading library ('xul.pdb', 'CDA9B302C0F44BE3ABF101A2B59FF3E92')

If I send in a request, it logs that:

2016-06-28 03:21:20,972 DEBUG   14667   Request body: { "memoryMap": [ [ "mozalloc.pdb", "78C16E5E4E674550841DB69A8ADCE06F2" ] ], "stacks": [ [ [ 0, 5584 ] ] ], "version": 4 } IP=52.27.4.196

But it doesn't seem to be responding.

I wiped the cache and restarted, let's see if it happens again... we can try backing out https://github.com/mozilla/Snappy-Symbolication-Server/commit/e38f3fdd7b49ab181a95bd76bbd3a5f4fea97276 (symbolapi was updated to that today) if so.
What's the watchdog interval?
Flags: needinfo?(ddurst) → needinfo?(rhelmer)
(In reply to David Durst [:ddurst] from comment #53)
> What's the watchdog interval?

Currently it is 600s (I bumped it up yesterday from 30s to see if maybe we just weren't giving enough time to populate a cold cache).

Also I've seen one restart email today, at 6:41 Pacific (can we send this to someone else too maybe? :)

It feels a lot slower at the moment, although I do see it getting a lot of traffic.
Flags: needinfo?(rhelmer)
Feel free to send those to me, too.
Since we've implemented cache file compression, the server stability has been good (no diskspace related hangs), but the service itself is still getting kicked by the watchdog script on a regular basis.

Until we have better insight into why via datadog, there has been a pattern over the past 6 days that the watchdog script kicks in between 2100 and 2130 (ish) every evening.

I don't know that there's anything forensic to look at right now, but maybe someone can observe the process in that window tonight? We could at least narrow the range of possibilities.
Flags: needinfo?(rhelmer)
Whiteboard: [fce-active]
(In reply to David Durst [:ddurst] from comment #56)
> Since we've implemented cache file compression, the server stability has
> been good (no diskspace related hangs), but the service itself is still
> getting kicked by the watchdog script on a regular basis.
> 
> Until we have better insight into why via datadog, there has been a pattern
> over the past 6 days that the watchdog script kicks in between 2100 and 2130
> (ish) every evening.
> 
> I don't know that there's anything forensic to look at right now, but maybe
> someone can observe the process in that window tonight? We could at least
> narrow the range of possibilities.

I wonder if it'll be more valuable to add some more diagnostic output to the watchdog script - strace might be useful as a start. If I was sitting at the console I might try gdb to see where it's stuck, so forcing a core dump would be nice.

:willkg, any other info you think would be useful for diagnosing the snappy server besides the above? I am not familiar enough with Tornado to know if there's something more specific we could be doing.

:peterbe has Tornado experience too IIRC, have you had to diagnose hangs before?
Flags: needinfo?(willkg)
Flags: needinfo?(rhelmer)
Flags: needinfo?(peterbe)
(Removing webops cc's since this product is outside our scope, but feel free to invite us back if anything comes up.)
I've never used Tornado before, so I don't have any ideas off the top of my head.

Out of curiosity, how do we know it's getting kicked by the watchdog script on a regular basis? Is that denoted in a log file or is someone getting notification emails or what?
Flags: needinfo?(willkg)
(In reply to Will Kahn-Greene [:willkg] from comment #59)
> I've never used Tornado before, so I don't have any ideas off the top of my
> head.
> 
> Out of curiosity, how do we know it's getting kicked by the watchdog script
> on a regular basis? Is that denoted in a log file or is someone getting
> notification emails or what?

Email notifications, they go to ddurst and myself at the moment.
I have done a fair amount of Tornado but never had to do any deep debugging so my guess is as good as anybody elses. Also, it's been years since I invested in tornado.
Flags: needinfo?(peterbe)
Having said that, I know enough about Tornado to say that the best thing might here be to simply replace tornado with flask or django. Tornado's power lies with high concurrency and lots of network I/O at play. Running the whole thing through uWSGI or gunicorn is likely to be much stronger.
Is this somehow down again? The curl command seems to work, but I cannot get symbols for my Firefox 50.0a2 (2016-09-03). It gets stuck on "Gathering unresolved symbols..."
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #63)
> Is this somehow down again? The curl command seems to work, but I cannot get
> symbols for my Firefox 50.0a2 (2016-09-03). It gets stuck on "Gathering
> unresolved symbols..."

seems so to me
I've escalated to :rhelmer tonight asking if he could take another look. I've also asked for an update in bug 1277332.
Something got better because it is working now for me in thunderbird. However, I cannot access bug 1277332 to see if there was an update.
(In reply to Ashlee Chavez [:ashlee] from comment #65)
> I've escalated to :rhelmer tonight asking if he could take another look.
> I've also asked for an update in bug 1277332.

I couldn't find anything wrong on the server side and I don't seem to have access to bug 1277332.

:ashlee could you cc: me please ^?
Flags: needinfo?(achavez)
(In reply to Robert Helmer [:rhelmer] from comment #67)
> (In reply to Ashlee Chavez [:ashlee] from comment #65)
> > I've escalated to :rhelmer tonight asking if he could take another look.
> > I've also asked for an update in bug 1277332.
> 
> I couldn't find anything wrong on the server side and I don't seem to have
> access to bug 1277332.
> 
> :ashlee could you cc: me please ^?

You'v been cc'd :)
Flags: needinfo?(achavez)
Is it somehow broken again? I cannot get symbols for Firefox Developer Edition 51 on Mac.
Never mind. It works now.
Please file new bugs if there are new service interruptions
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [fce-active] → [fce-active-legacy]
You need to log in before you can comment on or make changes to this bug.