Closed Bug 702318 Opened 13 years ago Closed 11 years ago

Incorrectly generated crash report page (10 MB content, broken HTML code, server errors inlined)

Categories

(Socorro :: Webapp, task, P1)

x86_64
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mayhemer, Assigned: rhelmer)

References

()

Details

(Keywords: dogfood)

Attachments

(3 files)

Just happened with the URL.

The request generated a 10MB HTML, it almost killed Firefox on 4GHz i7 machine with 12GB.

Please search the html code for e.g.:
Warning: Invalid callback 1321291997.5047, no array or string given in Unknown on line 0

I had to stop the page load manually.

Something went wrong on the server, probably.
This might be the issue I've seen where Nightly ended up eating all my memory and becoming impossible to use until I either killed and restarted it or if possible managed to close the tab responsible.
Just happened again with https://crash-stats.mozilla.com/report/index/33675efd-678c-48a7-b705-ac3552111110

To note: I opened a lot of reports for that signature at ones with a middle click (exactly 4).
dolske just mentioned in IRC that he hit this, could not reproduce.

(In reply to Honza Bambas (:mayhemer) from comment #0)
> Warning: Invalid callback 1321291997.5047, no array or string given in
> Unknown on line 0

Brandon, have you seen anything like this with PHP before ^?

I don't know if this is new or not, the only change to the reports stuff recently is https://github.com/mozilla/socorro/commit/845964a6237f2d170c77e3afc87bee4ca5766473 (which was in the latest Socorro release, 2.3.2)
I can reproduce now, in a new profile. Seems to be enough to set Firefox to restore tab, open ~6 crash reports, and then restart Firefox. Sometimes it restores all the crash report tabs fine, other times one or two of the tabs gets stuck loading and memory skyrockets.

I was originally hitting this in a current OS X Nightly, and also reproduced in a stock debug build from the 18th. Nothing unusual in the output, other than a dozen or so "WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8053000C: file /Users/dolske/build/mozilla-central/content/base/src/nsGenericElement.cpp, line 5349".

I know we've had problems in the past with garbage input causing Gecko to fall over (eg loading a large binary file as text/plain), though I don't see anything obviously wrong with the attachment here.

So, there are 3 bugs here...
1) Socorro is generating abnormally large output (maybe?)
2) Firefox is consuming enormous amounts of memory as a result of #1
3) about:memory doesn't know what's up, it's all in heap-unclassified

CCing njn and bz to see how to split this up.
I did manage to repro this once, seems totally random as far as I can tell.

The only things I see in the range in the error logs, could be totally unrelated though:

2011-11-21 21:04:33 -08:00 --- error: Uncaught PHP Error: Undefined offset: 161 in file application/views/report/index.php on line 264
2011-11-21 21:04:33 -08:00 --- error: [5xx Error] File: application/views/report/index.php; Line: 264; Message: Undefined offset: 161
Looks like we're getting a *lot* of these error messages in the logs, see bug 707226.
Blocks: 707226
Honza's initial error message is indicative of incorrect definition of a user defined function or lambda function.  I couldn't find anywhere in recently touched code where we have call_user_func() (only in old code) but it's possible the params we are passing to something old have changed.  I'll keep digging, just commenting in case this jogs anyone's brain.
Next clue: all the errors appear to be for chromehangs
Assignee: nobody → rhelmer
No longer blocks: 707226
Target Milestone: --- → 2.4
(In reply to Laura Thomson :laura from comment #8)
> Next clue: all the errors appear to be for chromehangs

I'm not sure bug 707226 is directly related, looks like we've been getting those for some time (since the beginning of the year), although it's become a lot more frequent recently.

Looking again at comment 0:
(In reply to Honza Bambas (:mayhemer) from comment #0)
> Warning: Invalid callback 1321291997.5047, no array or string given in
> Unknown on line 0

One thing that jumps out at me is "1321291997.5047", that looks like a unix-style timestamp which makes me thing we're passing that to a function that expects array or string (presumably an anonymous function since it's "Unknown on line 0").

I'm not sure why this leads to the kind of output in attachment 574324 [details] though.
(In reply to Robert Helmer [:rhelmer] from comment #9)
> One thing that jumps out at me is "1321291997.5047", that looks like a
> unix-style timestamp which makes me thing we're passing that to a function
> that expects array or string (presumably an anonymous function since it's
> "Unknown on line 0").

Hmm actually I guess "1321291997.5047" is probably being passed in the function name, if I am reading that error correctly.

Anyway looking at the output more closely, we get this at the top:

Fatal error: print_r(): Cannot use output buffering in output buffering display handlers in /data/socorro/htdocs/system/core/Kohana.php on line 1563 

Then the page is repeated each time we see:

Warning: Invalid callback 6.912535529148E-310, no array or string given in Unknown on line 0 p�{�?E html> 

The page output is printed 6 times.

I am going to try to repro this on a dev environment, since I think I'll need to turn on dev mode to see what the real error is.
(In reply to Laura Thomson :laura from comment #8)
> Next clue: all the errors appear to be for chromehangs

I don't think that is true.  I experienced this long ago before chromehang had been introduced.
And another detail:  I believe the page would be generated infinite number of times until manually stopped by pressing the stop button.
(In reply to Honza Bambas (:mayhemer) from comment #11)
> (In reply to Laura Thomson :laura from comment #8)
> > Next clue: all the errors appear to be for chromehangs
> 
> I don't think that is true.  I experienced this long ago before chromehang
> had been introduced.

I agree, sorry for confusing the issue by conflating bug 707226.

(In reply to Honza Bambas (:mayhemer) from comment #12)
> And another detail:  I believe the page would be generated infinite number
> of times until manually stopped by pressing the stop button.

I have been running a few loops for the past 24 hours (one against production and one against a development environment) and haven't been able to reproduce this issue yet. 

Has anyone seen this recently? I wonder if something in the latest release (bug 706156) made this go away (possibly even Apache+mod_php service being restarted)?
(In reply to Robert Helmer [:rhelmer] from comment #13)
> Has anyone seen this recently?

I *think* I have seen it last week, but not completely sure. It usually happened to me when I opened a lot of crash reports at the same time by middle-clicking multiple entries from a reports/list.
I just tried to and failed to reproduce this (in order to investigate the heap-unclassified issue).

Based on comment 4... in a new profile, I opened six tabs, each to https://crash-stats.mozilla.com/report/index/7f539cb9-94b0-44da-b58c-449b52111018, plus about:memory in a 7th tab, then restarted the browser.

Each time I restarted all tabs loaded up fairly quickly and without problem.  "explicit" in about:memory was around 100MB each time, nothing surprising.  "heap-unclassified" was around 25--30%, also not unusual.
Good STR is IMO described well in comment 14.  It always happened to me while opening many (10+) reports using the middle-click in the list.  And also on OSX, but I simply never tried to open so many windows on Windows (..sounds funny :)).
And AFAIK, I had another case of that yesterday with exact the same STR as in comment #14, I opened some 8-10 crash reports in background tabs by middle-clicking the crash dates on a reports/list "Reports" tab in fast succession, and one of those new tabs did not stop loading while used RAM metrics went up and up - I closed that tab and RAM usage went down significantly.
This has been happening to me off and on for months.  Chrome doesn't seem to have the same problem with handling 5 or so crash reports in rapid succession.  Though I think I had trouble one time if I opened a dozen or so...
Blocks: 706931
Severity: normal → major
Keywords: dogfood
(In reply to Nicholas Nethercote [:njn] from comment #15)
> I just tried to and failed to reproduce this (in order to investigate the
> heap-unclassified issue).
> 
> Based on comment 4... in a new profile, I opened six tabs, each to
> https://crash-stats.mozilla.com/report/index/7f539cb9-94b0-44da-b58c-
> 449b52111018, plus about:memory in a 7th tab, then restarted the browser.

I just reproduced this on a current OS X nightly, with 4 tabs containing that URL (new profile, with only changed prefs to restore previous tabs and not check for default browser at start).

It took about 10 tries, but then I hit it and memory spiraled up.

I also just did it again, by launching the browser just once and using tab right-click, Reload All Tabs. Took 10-15 tries.

Also confirmed (3rd time) that it works with about:memory in a 5th tab.
I have been running curl in a while loop for over a week (one against prod and one against dev) and haven't been able to reproduce.

I just reproduced it in latest OS X nightly using steps from comment 19; given the error messages in the output I am thinking this is probably a very intermittent server-side bug, either it's triggered by parallel requests or it's just increasing the chance of hitting it (the former seems like it'd be something deeper than Socorro code, this is PHP so we're single-threaded etc.)

I'll try upping the number of concurrent connections in my curl test, and also try to repro using the steps in comment 19 in Chrome (just to eliminate the client side as a possibility).
(In reply to Robert Helmer [:rhelmer] from comment #20)
> I'll try upping the number of concurrent connections in my curl test, and
> also try to repro using the steps in comment 19 in Chrome (just to eliminate
> the client side as a possibility).

Repro'd in latest Chrome OSX release. It's pretty catastrophic there too, locks up the whole browser, had to force-quit etc.

Anyway, I am satisfied that this is a server-side bug, I expect it has to do with a callback function being returned that isn't actually a function, which is somehow causing one of the loops that build the page to run several times (possibly an infinite loop, or at least until the PHP timeout is hit and the process is killed).

I am concerned that this seems to have something to do with running the query in parallel; hopefully that's just a red herring. Socorro is load-balanced across many web servers so if it is some kind of concurrency bug, it would make sense why we're not always hitting the same server at the same time.

It is also possible that the test condition I was using before was wrong somehow; I was grepping the output for a fragment of one of the error messages. I've just changed this to look at page size instead, here it is for reference:

found=0; while true 
do 
  curl -s 'https://crash-stats.mozilla.com/report/index/7f539cb9-94b0-44da-b58c-449b52111018' > output-${found}-prod.log
  size=`stat -c "%s" output-0-prod.log`
  if [ $size -ne 408438 ]
  then
    (echo "found one: $found"; found=$((found+1)))
  fi
done

I am also running an equivalent query against one of our dev servers. If nothing shows up in the next day or so, I'll add some concurrency to these tests.
Status: NEW → ASSIGNED
(In reply to Robert Helmer [:rhelmer] from comment #21)
>   if [ $size -ne 408438 ]
>   then
>     (echo "found one: $found"; found=$((found+1)))

Oops, shouldn't have parens around this one ^
Can't repro using curl, even running 12 processes in parallel, but I can easily repro with the browser on both prod and dev.

I am starting to suspect it might be an XHR call that this page makes. The "Invalid callback" would make sense if the code isn't validating the input and passing some nonsense value instead of a function.

This doesn't explain the page looping behavior, could be something pathological in the error handler.
OK I spent some time bisecting that page, and discovered that it's not the reports page. I can easily make this happen for any page following steps in comment 19 (it's especially easy on crash-stats-dev.allizom.org since it's a single box).

So, this is most likely a bug in the (deprecated version of) the Kohana framework that we use (we can't easily upgrade since it's been completely rewritten in the meantime). Not sure why it doesn't happen with curl; based on the telltale "print_r" error at the top of the output, I think it may have something to do with the headers the browser is sending (possible compression, although it looks like we have that turned off in the config).

I'll look at the headers the browsers send and see if I can repro with a curl loop.

Let's continue investigating, but I strongly suggest that we move forward the UI reimplementation project and do the reports page first, ASAP. lonnen says it's most visited page by far, and it seems like the reports_list page is the most likely place someone would open up a bunch of tabs like this.
I can reliably reproduce using this script within a few runs (about 1 minute).

The problem seems to be cookies - specifically the "kohanasession" cookie, if it is set to the same value for several concurrent requests.

Looking at the config and the Kohana session docs, I noticed this:

http://docs.kohanaphp.com/libraries/session#regenerate

"""
This can be done automatically by setting the session.regenerate config value to an integer greater than 0 (default value is 0). However, automatic session regeneration isn't recommended because it can cause a race condition when you have multiple session requests while regenerating the session id (most commonly noticed with ajax requests). For security reasons it's recommended that you manually call regenerate() whenever a visitor's session privileges are escalated (e.g. they logged in, accessed a restricted area, etc). 
"""

For some reason, our config has been changed from the default of 0 to 3. 

I just changed this to 0 on crash-stats-dev and haven't reproduced the problem with this script (I can reproduce after flipping it back, tried 3 times of each just to make sure I'd isolated the right thing).

I'll leave this running overnight, and try reproducing with browsers in the morning.
Can't repro on crash-stats-dev with script or browser, with this config change.

I think we should make this change, and also a code change to call regenerate() when users log in (per suggestion in the docs).
Target Milestone: 2.4 → 2.3.5
If you want to have reliable steps to reproduce this issue manually, please see my bug 712197, which has been duped lately. Usually I don't see this problem, and it only manifests itself when I'm trying to d&d a tab with a report loading directly after the connection has been established but the content is still loading. I can reproduce it all the times.
Any news?  Reproduced with STR:
- open 9.0 top crashes
- open with middle click 6 signatures in background tabs, do it quickly
=> the 6th is broken, I'm attaching it...
(In reply to Honza Bambas (:mayhemer) from comment #29)
> Created attachment 583442 [details]
> saved HTML of the response, II
> 
> Any news?  Reproduced with STR:
> - open 9.0 top crashes
> - open with middle click 6 signatures in background tabs, do it quickly
> => the 6th is broken, I'm attaching it...

Per comment 25, we have the fix in hand but we've had a couple out-of-cycle releases dropped in, so 2.4 is delayed (in favor of 2.3.4.[1,2] and 2.3.5 https://wiki.mozilla.org/Socorro:Releases)

I'll get this up and running on our dev environment in the meantime.

I believe that this is a bug that has existed a little over a year, from when session support was initially enabled in Kohana.
Target Milestone: 2.3.5 → 2.4
Component: Socorro → General
Product: Webtools → Socorro
Assignee: rhelmer → nobody
Component: General → Webapp
QA Contact: socorro → webapp
laura r? https://github.com/mozilla/socorro/pull/243
Assignee: nobody → rhelmer
Priority: -- → P1
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/ee98598180ad9a84d63f83fb6d41eb27620b012a
Merge pull request #243 from rhelmer/bug702318-explicit-session-regen

bug 702318 - disable session auto-regen to avoid race condition. explici...
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I haven't been able to reproduce on crash-stats-dev.allizom.org since this landed, using the script in attachment 581199 [details] (I can still repro in under 1 minute on prod with the same script).

Notes for QA:

Automatic Kohana session regeneration (regenerating the ID stored in the cookie) is buggy and has a race condition http://docs.kohanaphp.com/libraries/session#regenerate

The change here is to not use automatic regeneration but instead only regenerate on admin login.

See attachment 581199 [details] and comment 19 for STR (also comment 28 and comment 29)

Expected behavior is that the content of the kohanasession cookie will only change when logging in, instead of the current production default of changing every 3 page loads.
Robert, when is the next push to production targeted for?
Verified on crash-stats.allizom.org, I'm unable to recreate the crashy behavior based on the steps to reproduce in comment 33.

Thanks for the excellent steps to reproduce and in particular the helpful script, attachment 581199 [details]
Status: RESOLVED → VERIFIED
[:rhelmer] Just ran into this one again on prod, this was the page:
https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/10.0b3/28/all
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
[:rhelmer] Just ran into this one again on prod, this was the page:
https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/10.0b3/28/all
(In reply to Schalk Neethling from comment #39)
> [:rhelmer] Just ran into this one again on prod, this was the page:
> https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/10.0b3/28/all

Huh, I just confirmed that the config is in place, trying the script in attachment 581199 [details] against prod to see if I can reproduce.
(In reply to Robert Helmer [:rhelmer] from comment #40)
> (In reply to Schalk Neethling from comment #39)
> > [:rhelmer] Just ran into this one again on prod, this was the page:
> > https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/10.0b3/28/all
> 
> Huh, I just confirmed that the config is in place, trying the script in
> attachment 581199 [details] against prod to see if I can reproduce.

I don't seem to be able to reproduce using the script, I will keep it running though.

From Schalk's description in irc, sounds like the same (or similar) bug (print_r at the top of the HTML output).

Unfortunately he had to restart his computer and was unable to capture a copy of the page, I'll keep trying to reproduce. If anyone sees this again and is able to get a copy of the page (or even a screenshot) please attach it, and also try to note the time it happened so we can look in the server logs.
I cannot reproduce this problem anymore. Loading a dozen of crash-reports while being logged in does not cause the reported issue.
(I reported bug 706931)
I am also of the opinion that something is still wrong - I crashed yesterday bp-88fc7420-a739-4fa1-aaac-564142120115 with 5 crash queries and 2 topcrash links open ... similar to past experience with crash-stats
bp-4d0eada8-ae73-4cb0-847d-8e7702120104
bp-dfce4f45-835d-4ae1-96ec-6c0992111212
bp-ebcd12af-dbb5-4367-a2e8-bc0af2111207
I've not attempted to reproduce but I have the sessionstore
Target Milestone: 2.4 → 2.4.2
Still haven't been able to reproduce, leaving this open and pushing it out to see if we get any more reports.
Target Milestone: 2.4.2 → 2.4.3
(In reply to Schalk Neethling from comment #39)
> [:rhelmer] Just ran into this one again on prod, this was the page:
Schalk, are you still able to reproduce? Neither rhelmer nor myself are able to.
[:mbrandt][:rhelmer] Nope have not seen this one again.
Going to mark this resolved, please reopen if you see it again.
Target Milestone: 2.4.3 → 2.4
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
(In reply to Schalk Neethling from comment #47)
> [:mbrandt][:rhelmer] Nope have not seen this one again.

Thanks schalk. Bumping to QA verified per comment 47.
Status: RESOLVED → VERIFIED
I seem to have his this again today, loaded a crash report and the tab kept loading and loading and loading, making the browser unresponsive. rhelmer asked me to comment here. :)
We deployed the Django app on https://crash-stats.mozilla.com today, see bug 749359

This bug was identified as specific to the obsolete version of the Kohana framework the previous version of Socorro used.

So, we don't expect to see anything like this :)
Status: VERIFIED → RESOLVED
Closed: 12 years ago11 years ago
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: