Closed Bug 584365 Opened 14 years ago Closed 14 years ago

Tinderbox server doesn't work sometimes


(Graveyard :: Server Operations, task)

(Reporter: ehsan.akhgari, Assigned: fox2mike)


I tried to view some logs on the tinderbox this morning, and it failed after a *long* time with error 500 (Internal Server Error).  I just tested the same logs now and they worked.  This has happened in the past few days for me as well.
What exactly were you trying to view?
Assignee: server-ops → shyam
I don't see any issues with tinderbox (no nagios alerts etc) for today, so this could be a specific case or an issue at your end.
I've been hitting this on and off for the past few days as well.
(In reply to comment #1)
> What exactly were you trying to view?

Some logs from mozilla-central, like

Can you try to reproduce this now? (I have Apache logging in far more detail than before.) While I can confirm there were 500 errors in the access_log, there is nothing in the error_log, so I can't really tell you what caused the issue :|

Hopefully, if you can reproduce this, we'll get an error we can then work with.

Please post the URL on the bug as well.
Will do.  FWIW, I've starred 5 million oranges in the past hour, and the Tinderbox server didn't give me a single error back.  :(
Yeah. I've seen a few more, but nothing in the error logs, so this isn't going to be any use till I figure out how to get more meaningful errors.
Passing this to Jeremy, who'll be the next person on call; he can take a look if it happens again.
Assignee: shyam → jeremy.orem+bugs
Component: Server Operations: Tinderbox Maintenance → Server Operations
OS: Mac OS X → All
Hardware: x86 → All
Probably easier to open a new bug if this happens again.
Closed: 14 years ago
Resolution: --- → FIXED
I just tried starring the same build again and this time it was successful.
Failed to load
for me a couple of minutes ago. Possibly timing out due to load?
I got this again:

This is *really* hurting the developers, bumping the priority to blocker.
Severity: critical → blocker
Though the symptoms are pretty general, they look to me exactly like when in the past the oncall has said "yeah, somebody was spidering tinderbox, I just blocked them." at around 23:43, after a fairly long period of things working reasonably well.
Resetting assignee to draw attention
Assignee: jeremy.orem+bugs → server-ops
Assignee: server-ops → shyam
Who are the devs who work on tinderbox?

Without useful error logs, we're stuck. While I see 500s in the access logs, I don't see anything in the error logs. Without knowing what's causing the 500s, I can't help fix it.
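Even when the error log is empty, the access log can still show which scripts are returning the 500s. The following is a minimal sketch of that kind of tally, assuming the access_log uses the usual Apache common/combined layout; the sample lines, addresses, and query strings are hypothetical and are not taken from the Tinderbox server.

```python
import re
from collections import Counter

# Pulls the request URL and the status code out of a common/combined log line.
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) ')


def count_500s(lines):
    """Count 500 responses per request path (query string stripped)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "500":
            path = match.group("url").split("?", 1)[0]
            counts[path] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [04/Aug/2010:09:15:02 -0700] "GET /showlog.cgi?log=Example/1.gz HTTP/1.1" 500 312',
    '10.0.0.2 - - [04/Aug/2010:09:15:40 -0700] "GET /showbuilds.cgi?tree=Example HTTP/1.1" 200 20480',
    '10.0.0.3 - - [04/Aug/2010:09:16:11 -0700] "GET /showlog.cgi?log=Example/2.gz HTTP/1.1" 500 312',
    '10.0.0.1 - - [04/Aug/2010:09:17:05 -0700] "GET /showlog.cgi?log=Example/3.gz HTTP/1.1" 200 90112',
]

if __name__ == "__main__":
    for path, hits in count_500s(SAMPLE).most_common():
        print(path, hits)
```

A tally like this does not explain why a request failed, but it can narrow the search to one script and, with the timestamp added to the key, to the periods when the box was under load.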
I'm going to close mozilla-central until this issue is resolved.
Turns out we can't close mozilla-central because
won't load :(  (It "loads" for 4 or 5 minutes, and then stops loading on a blank page.)
FWIW, the current status is that no tinderbox page is accessible from MV, Toronto, and Europe (Italy?).
There are few devs who work on Tinderbox. bear might be able to help, but I suspect you might need to dig into this yourself.
The box was so loaded it was useless, so I've rebooted it.

I can't dig into something that doesn't make much sense :) I need to see why the application is throwing a 500. When the apache logs running in debug don't tell me anything, I'm as useless as the next person.
Seeing as showlog.cgi is one of the worst offenders, can you set up an HTTP cache (mod_cache, Varnish, Squid -- whatever) to cache hits to that CGI script?
If it's showlog.cgi that is stalling, then it's running out of resources while trying to generate the HTML page from the output of the error parsing and the log expansion.  It could also be triggering a memory swap if enough logs are requested that are all very large (but I say that not knowing what memory constraints are on that box).

The simplest short-term solution would be to put it behind a cache for showlog.cgi URLs only (as we are discussing in #ops now).
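The idea of caching showlog.cgi URLs only can be sketched in a few lines. This is an illustration of the technique rather than what was deployed: the `ShowlogCache` class, the `fetch` callable, and the 300-second lifetime are all assumptions made for the example, and a real setup would use mod_cache or a separate proxy instead.

```python
import time
from urllib.parse import urlsplit


def is_showlog(url):
    """True only for requests whose path is the showlog.cgi script."""
    return urlsplit(url).path.endswith("/showlog.cgi")


class ShowlogCache:
    """Caches showlog.cgi responses; every other URL goes straight through."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: url -> response body
        self._ttl = ttl_seconds  # assumed lifetime, not a value from the bug
        self._clock = clock
        self._entries = {}       # full URL (with query) -> (expiry, body)

    def get(self, url):
        if not is_showlog(url):
            return self._fetch(url)
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]      # hit: the expensive CGI run is skipped
        body = self._fetch(url)
        self._entries[url] = (now + self._ttl, body)
        return body
```

The full URL, query string included, is the cache key, so each log gets its own entry. Pages that must stay live, such as the tree status pages, are never stored.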
Okay, tinderbox should be back up now. I've disabled the debug logs as well, as they were fairly useless and adding more load to apache.
Current status:
* mod_cache is running but we're not sure that it's helping much
* Aravind is working on putting a different proxy in front of Apache, which we expect will work better than mod_cache

(details: mod_cache claims to be working, but apache may be gzip'ing files after they hit the cache, due to Accept-Encoding: gzip being sent. When in place, the cache will be caching the already gzip'ed version).
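As the note above reads, the concern is that a cached page still gets compressed on every hit, whereas caching the already gzip'ed bytes pays that cost once. A minimal sketch of the second arrangement follows; `VariantCache` and `render` are hypothetical names, and the Accept-Encoding check is deliberately naive (it ignores q-values).

```python
import gzip


class VariantCache:
    """Stores one ready-to-send body per (url, content-encoding) pair."""

    def __init__(self, render):
        self._render = render   # hypothetical callable: url -> uncompressed bytes
        self._store = {}
        self.compressions = 0   # counts how often gzip actually ran

    def get(self, url, accept_encoding=""):
        # Naive negotiation: any mention of gzip selects the gzip variant.
        encoding = "gzip" if "gzip" in accept_encoding.lower() else "identity"
        key = (url, encoding)
        if key not in self._store:
            body = self._render(url)
            if encoding == "gzip":
                body = gzip.compress(body)
                self.compressions += 1
            self._store[key] = body
        return encoding, self._store[key]
```

With this layout, repeated gzip requests for the same log are served from the stored compressed bytes, and clients that do not send Accept-Encoding: gzip get a separate uncompressed entry.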
bug 584920 is more up to date, duping forward.
Closed: 14 years ago
Resolution: --- → DUPLICATE
Product: → Graveyard