Closed
Bug 1342685
Opened 8 years ago
Closed 8 years ago
Intermittent ts_paint | Found crashes after test run, terminating test
Categories
(Testing :: Talos, defect)
Tracking
(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 fixed, firefox55 fixed)
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox52 | --- | unaffected |
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | fixed |
firefox55 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: blassey)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])
Attachments
(1 file)
5.01 KB,
patch
|
billm
:
review+
gchang
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
![]() |
||
Comment 1•8 years ago
|
||
This is a timeout that would have previously been reported as bug 1339594.
See Also: → 1339594
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
![]() |
||
Comment 4•8 years ago
|
||
There is some variation in the failure reports, but a great many are Windows pgo and e10s. Each of these failures looks like:
12:14:09 INFO - Running cycle XX/20 for ts_paint test...
12:14:09 INFO - TEST-INFO | started process 2488 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile http://localhost:49896/startup_test/tspaint_test.html)
12:16:39 INFO - Timeout waiting for test completion; killing browser...
12:16:40 INFO - Terminating psutil.Process(pid=2488, name=u'firefox.exe')
12:16:40 INFO - Unable to kill process
12:16:40 ERROR - Traceback (most recent call last):
12:16:40 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 144, in run_browser
12:16:40 INFO - return_code = context.kill_process()
12:16:40 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
12:16:40 INFO - procs = self.process.children()
12:16:40 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
12:16:40 INFO - raise NoSuchProcess(self.pid, self._name)
12:16:40 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=2488, name=u'firefox.exe')
12:16:40 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/cyZ3yyA1TsqBeWTL9McUPA/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
12:16:48 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile\minidumps\37b3d967-c6f8-4010-aad7-427594151d64.dmp c:\users\cltbld\appdata\local\temp\tmpsul3ty
12:16:56 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\37b3d967-c6f8-4010-aad7-427594151d64.dmp
12:16:56 INFO - PROCESS-CRASH | ts_paint | application crashed [unknown top frame]
...a timeout 90 seconds after "started process".
The crash reports look consistent across these failures, with thread 0's stack containing:
12:16:56 INFO - 22 xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:8d026c601510 : 1968 + 0x21]
12:16:56 INFO - eip = 0x60bcff69 esp = 0x001ad054 ebp = 0x001ad374
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 23 xul.dll!mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess(nsAString_internal const &,mozilla::hal::ProcessPriority,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 838 + 0xe]
12:16:56 INFO - eip = 0x60bcd554 esp = 0x001ad37c ebp = 0x001ad3b8
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 24 xul.dll!mozilla::dom::ContentParent::CreateBrowser(mozilla::dom::TabContext const &,mozilla::dom::Element *,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 1149 + 0x16]
12:16:56 INFO - eip = 0x60bcb372 esp = 0x001ad3c0 ebp = 0x001ad51c
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 25 xul.dll!nsFrameLoader::TryRemoteBrowser() [nsFrameLoader.cpp:8d026c601510 : 2944 + 0x12]
12:16:56 INFO - eip = 0x606022b4 esp = 0x001ad524 ebp = 0x001ad670
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 26 xul.dll!nsFrameLoader::ShowRemoteFrame(mozilla::gfx::IntSizeTyped<mozilla::ScreenPixel> const &,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1221 + 0xe]
12:16:56 INFO - eip = 0x605ffc48 esp = 0x001ad678 ebp = 0x001ad6a8
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 27 xul.dll!nsFrameLoader::Show(int,int,int,int,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1091 + 0xd]
12:16:56 INFO - eip = 0x6002a9ae esp = 0x001ad6b0 ebp = 0x001ad794
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 28 xul.dll!nsSubDocumentFrame::ShowViewer() [nsSubDocumentFrame.cpp:8d026c601510 : 191 + 0x12]
12:16:56 INFO - eip = 0x5fb5963d esp = 0x001ad79c ebp = 0x001ad7d0
12:16:56 INFO - Found by: previous frame's frame pointer
12:16:56 INFO - 29 xul.dll!AsyncFrameInit::Run() [nsSubDocumentFrame.cpp:8d026c601510 : 91 + 0x5]
12:16:56 INFO - eip = 0x5fb59592 esp = 0x001ad7d8 ebp = 0x001ad7dc
12:16:56 INFO - Found by: call frame info
12:16:56 INFO - 30 xul.dll!nsContentUtils::RemoveScriptBlocker() [nsContentUtils.cpp:8d026c601510 : 5271 + 0x9]
12:16:57 INFO - eip = 0x5fad0db3 esp = 0x001ad7e4 ebp = 0x001ad800
12:16:57 INFO - Found by: call frame info
![]() |
||
Comment 5•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #4)
> ...a timeout 90 seconds after "started process".
s/90 seconds/150 seconds/
(consistent with the ts_paint timeout value)
Updated•8 years ago
|
Whiteboard: [stockwell needswork]
Comment 6•8 years ago
|
||
odd why this is pgo and rarely on opt. I suspect we need to debug this more, and why is this talos specific, and not seen in other test harnesses? One explanation could be that we have different preferences. Another could be that the addons we are using are causing problems.
I see a list of talos exceptions, 2 of the 3 most problematic are this bug (ts_paint crash-could be timeout) and bug 934310 (sessionrestore - timeout).
both of these bugs are win7 pgo whereas the other bugs are primarily windows 8 pgo/opt mix. All seem to be e10s specific.
:gbrown, do you see value in trying to debug the stack a bit?
for example, based on the stack, I suspect this line is problematic:
https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp?q=LaunchSubprocess&redirect_type=direct#1968
possibly adding code in there to see what mChildID is and if it is valid, maybe try/catch around some lines there with MOZ_ASSERTS instead?
the other route would be to look at the preferences and environment variables used in talos and compare those to reftest/mochitest/web-platform-tests to see if there is oversights. Also since this is ts_paint/sessionrestore, there should only be the talos-powers extension (note, this is the only harness where we actually sign extensions)
thoughts on next steps?
Flags: needinfo?(gbrown)
![]() |
||
Comment 7•8 years ago
|
||
The most recent crash reports have a different line number,
xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:a4653bf66378 : 1983 + 0x21]
pointing to
https://hg.mozilla.org/mozilla-central/annotate/6d0ea5812180/dom/ipc/ContentParent.cpp#l1983
[1983] boolPrefs << i << ':' << Preferences::GetBool(ContentPrefs::GetContentPref(i)) << '|';
Looking up the stack further, there are several frames like:
__crt_seh_guarded_call<void>::operator()<<lambda_...
which I suppose are the << operations.
I would normally suspect Preferences::GetBool or ContentPrefs::GetContentPref of hanging before anything else on line [1983], but I don't see those in the crash reports, while I do often see WideCharToMultiByte, or similar.
:blassey - It looks like you wrote this code - https://hg.mozilla.org/mozilla-central/rev/5a43dbcb95ca, https://hg.mozilla.org/mozilla-central/diff/5a43dbcb95ca/dom/ipc/ContentParent.cpp . Can you help us make sense of these Talos-on-Windows hangs?
Flags: needinfo?(gbrown) → needinfo?(blassey.bugs)
![]() |
||
Comment 8•8 years ago
|
||
btw, these crash-due-to-timeout bugs are the new expression of the Talos NoSuchProcess failures. These crash reports are consistent with https://bugzilla.mozilla.org/show_bug.cgi?id=1236770#c33 - it all goes back to the same change.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 10•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=822c58c7144cb7f19d1dad70a6093e9ddad60be8&selectedJob=83228607
Bill, it looks like std::stringstream's << operator is causing hangs on windows. From this try push it looks like using nsAutoCString for a buffer and nsPrintfCString for formatting avoids this.
Assignee: nobody → blassey.bugs
Flags: needinfo?(blassey.bugs)
Attachment #8846432 -
Flags: review?(wmccloskey)
Comment hidden (Intermittent Failures Robot) |
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch
Review of attachment 8846432 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/ContentParent.cpp
@@ +1967,5 @@
> extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
>
> + nsAutoCString boolPrefs;
> + nsAutoCString intPrefs;
> + nsAutoCString stringPrefs;
Not sure the Auto part will do much good here since these strings are likely to be longer than 64 bytes.
@@ +1984,3 @@
> break;
> case nsIPrefBranch::PREF_STRING: {
> + nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());
What about:
nsAdoptingCString value = Preferences::GetCString(ContentPrefs::GetContentPref(i));
?
Attachment #8846432 -
Flags: review?(wmccloskey) → review+
Comment 13•8 years ago
|
||
:blassey, thanks for getting a patch up, let me know if you need help testing on try or getting this landed.
Assignee | ||
Comment 14•8 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #12)
> Comment on attachment 8846432 [details] [diff] [review]
> string_cat_hang.patch
>
> Review of attachment 8846432 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: dom/ipc/ContentParent.cpp
> @@ +1967,5 @@
> > extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
> >
> > + nsAutoCString boolPrefs;
> > + nsAutoCString intPrefs;
> > + nsAutoCString stringPrefs;
>
> Not sure the Auto part will do much good here since these strings are likely
> to be longer than 64 bytes.
what do you think of:
char boolBuf[1024];
nsFixedCString boolPref(boolBuf, 1024, 0)
...
> @@ +1984,3 @@
> > break;
> > case nsIPrefBranch::PREF_STRING: {
> > + nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());
>
> What about:
> nsAdoptingCString value =
> Preferences::GetCString(ContentPrefs::GetContentPref(i));
> ?
Yup, that'll avoid a copy
nsFixedCString seems fine.
Comment 16•8 years ago
|
||
Pushed by blassey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d58da9e3123f
Intermittent ts_paint | Found crashes after test run, terminating test r=billm
Comment 17•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 18•8 years ago
|
||
Please request Aurora approval on this when you get a chance.
Blocks: 1303096
status-firefox52:
--- → unaffected
status-firefox53:
--- → unaffected
status-firefox54:
--- → affected
status-firefox-esr52:
--- → unaffected
Flags: needinfo?(blassey.bugs)
Assignee | ||
Comment 19•8 years ago
|
||
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch
Approval Request Comment
[Feature/Bug causing the regression]: 1303096
[User impact if declined]: random hangs on win7 PGO in automation (no reports outside automation)
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: Not yet, need next orange factor report
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just using our string classes vs stdlib's
[String changes made/needed]: none
Flags: needinfo?(blassey.bugs)
Attachment #8846432 -
Flags: approval-mozilla-aurora?
Updated•8 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed]
Comment hidden (Intermittent Failures Robot) |
Comment 21•8 years ago
|
||
Hi :blassey,
According to comment #20, is that Ok?
Flags: needinfo?(blassey.bugs)
![]() |
||
Comment 22•8 years ago
|
||
I think there is only one report in comment 20 that happened after blassey's change: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=84349216&lineNumber=1891. That looks to me like a different, unrelated crash.
Comment 24•8 years ago
|
||
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch
Fix an intermittent failure. Aurora54+.
Attachment #8846432 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 25•8 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•