Closed Bug 1342685 Opened 3 years ago Closed 3 years ago

Intermittent ts_paint | Found crashes after test run, terminating test

Categories

(Testing :: Talos, defect)

Version 3
defect
Not set

Tracking

(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 fixed, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: blassey)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(1 file)

This is a timeout that would have previously been reported as bug 1339594.
See Also: → 1339594
There is some variation in the failure reports, but a great many are Windows pgo and e10s. Each of these failures looks like:

12:14:09     INFO -  Running cycle XX/20 for ts_paint test...
12:14:09     INFO -  TEST-INFO | started process 2488 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile http://localhost:49896/startup_test/tspaint_test.html)
12:16:39     INFO -  Timeout waiting for test completion; killing browser...
12:16:40     INFO -  Terminating psutil.Process(pid=2488, name=u'firefox.exe')
12:16:40     INFO -  Unable to kill process
12:16:40    ERROR -  Traceback (most recent call last):
12:16:40     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 144, in run_browser
12:16:40     INFO -      return_code = context.kill_process()
12:16:40     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
12:16:40     INFO -      procs = self.process.children()
12:16:40     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
12:16:40     INFO -      raise NoSuchProcess(self.pid, self._name)
12:16:40     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=2488, name=u'firefox.exe')
12:16:40     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/cyZ3yyA1TsqBeWTL9McUPA/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
12:16:48     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile\minidumps\37b3d967-c6f8-4010-aad7-427594151d64.dmp c:\users\cltbld\appdata\local\temp\tmpsul3ty
12:16:56     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\37b3d967-c6f8-4010-aad7-427594151d64.dmp
12:16:56     INFO -  PROCESS-CRASH | ts_paint | application crashed [unknown top frame]

...a timeout 90 seconds after "started process".

The crash reports look consistent across these failures, with thread 0's stack containing:

12:16:56     INFO -  22  xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:8d026c601510 : 1968 + 0x21]
12:16:56     INFO -      eip = 0x60bcff69   esp = 0x001ad054   ebp = 0x001ad374
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  23  xul.dll!mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess(nsAString_internal const &,mozilla::hal::ProcessPriority,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 838 + 0xe]
12:16:56     INFO -      eip = 0x60bcd554   esp = 0x001ad37c   ebp = 0x001ad3b8
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  24  xul.dll!mozilla::dom::ContentParent::CreateBrowser(mozilla::dom::TabContext const &,mozilla::dom::Element *,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 1149 + 0x16]
12:16:56     INFO -      eip = 0x60bcb372   esp = 0x001ad3c0   ebp = 0x001ad51c
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  25  xul.dll!nsFrameLoader::TryRemoteBrowser() [nsFrameLoader.cpp:8d026c601510 : 2944 + 0x12]
12:16:56     INFO -      eip = 0x606022b4   esp = 0x001ad524   ebp = 0x001ad670
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  26  xul.dll!nsFrameLoader::ShowRemoteFrame(mozilla::gfx::IntSizeTyped<mozilla::ScreenPixel> const &,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1221 + 0xe]
12:16:56     INFO -      eip = 0x605ffc48   esp = 0x001ad678   ebp = 0x001ad6a8
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  27  xul.dll!nsFrameLoader::Show(int,int,int,int,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1091 + 0xd]
12:16:56     INFO -      eip = 0x6002a9ae   esp = 0x001ad6b0   ebp = 0x001ad794
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  28  xul.dll!nsSubDocumentFrame::ShowViewer() [nsSubDocumentFrame.cpp:8d026c601510 : 191 + 0x12]
12:16:56     INFO -      eip = 0x5fb5963d   esp = 0x001ad79c   ebp = 0x001ad7d0
12:16:56     INFO -      Found by: previous frame's frame pointer
12:16:56     INFO -  29  xul.dll!AsyncFrameInit::Run() [nsSubDocumentFrame.cpp:8d026c601510 : 91 + 0x5]
12:16:56     INFO -      eip = 0x5fb59592   esp = 0x001ad7d8   ebp = 0x001ad7dc
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  30  xul.dll!nsContentUtils::RemoveScriptBlocker() [nsContentUtils.cpp:8d026c601510 : 5271 + 0x9]
12:16:57     INFO -      eip = 0x5fad0db3   esp = 0x001ad7e4   ebp = 0x001ad800
12:16:57     INFO -      Found by: call frame info
(In reply to Geoff Brown [:gbrown] from comment #4)
> ...a timeout 90 seconds after "started process".

s/90 seconds/150 seconds/ 

(consistent with the ts_paint timeout value)
Whiteboard: [stockwell needswork]
odd why this is pgo and rarely on opt.  I suspect we need to debug this more, and why is this talos specific, and not seen in other test harnesses?  One explanation could be that we have different preferences.  Another could be that the addons we are using are causing problems.

I see a list of talos exceptions, 2 of the 3 most problematic are this bug (ts_paint crash-could be timeout) and bug 934310 (sessionrestore - timeout).

both of these bugs are win7 pgo whereas the other bugs are primarily windows 8 pgo/opt mix.  All seem to be e10s specific.

:gbrown, do you see value in trying to debug the stack a bit?

for example, based on the stack, I suspect this line is problematic:
https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp?q=LaunchSubprocess&redirect_type=direct#1968

possibly adding code in there to see what mChildID is and if it is valid, maybe try/catch around some lines there with MOZ_ASSERTS instead?

the other route would be to look at the preferences and environment variables used in talos and compare those to reftest/mochitest/web-platform-tests to see if there is oversights.  Also since this is ts_paint/sessionrestore, there should only be the talos-powers extension (note, this is the only harness where we actually sign extensions)

thoughts on next steps?
Flags: needinfo?(gbrown)
The most recent crash reports have a different line number,

xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:a4653bf66378 : 1983 + 0x21]

pointing to

https://hg.mozilla.org/mozilla-central/annotate/6d0ea5812180/dom/ipc/ContentParent.cpp#l1983

  [1983]   boolPrefs << i << ':' << Preferences::GetBool(ContentPrefs::GetContentPref(i)) << '|';

Looking up the stack further, there are several frames like:

  __crt_seh_guarded_call<void>::operator()<<lambda_...

which I suppose are the << operations.

I would normally suspect Preferences::GetBool or ContentPrefs::GetContentPref of hanging before anything else on line [1983], but I don't see those in the crash reports, while I do often see WideCharToMultiByte, or similar.


:blassey - It looks like you wrote this code - https://hg.mozilla.org/mozilla-central/rev/5a43dbcb95ca, https://hg.mozilla.org/mozilla-central/diff/5a43dbcb95ca/dom/ipc/ContentParent.cpp . Can you help us make sense of these Talos-on-Windows hangs?
Flags: needinfo?(gbrown) → needinfo?(blassey.bugs)
btw, these crash-due-to-timeout bugs are the new expression of the Talos NoSuchProcess failures. These crash reports are consistent with https://bugzilla.mozilla.org/show_bug.cgi?id=1236770#c33 - it all goes back to the same change.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=822c58c7144cb7f19d1dad70a6093e9ddad60be8&selectedJob=83228607

Bill, it looks like std::stringstream's << operator is causing hangs on windows. From this try push it looks like using nsAutoCString for a buffer and nsPrintfCString for formatting avoids this.
Assignee: nobody → blassey.bugs
Flags: needinfo?(blassey.bugs)
Attachment #8846432 - Flags: review?(wmccloskey)
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Review of attachment 8846432 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +1967,5 @@
>    extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
>  
> +  nsAutoCString boolPrefs;
> +  nsAutoCString intPrefs;
> +  nsAutoCString stringPrefs;

Not sure the Auto part will do much good here since these strings are likely to be longer than 64 bytes.

@@ +1984,3 @@
>        break;
>      case nsIPrefBranch::PREF_STRING: {
> +      nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());

What about:
nsAdoptingCString value = Preferences::GetCString(ContentPrefs::GetContentPref(i));
?
Attachment #8846432 - Flags: review?(wmccloskey) → review+
:blassey, thanks for getting a patch up, let me know if you need help testing on try or getting this landed.
(In reply to Bill McCloskey (:billm) from comment #12)
> Comment on attachment 8846432 [details] [diff] [review]
> string_cat_hang.patch
> 
> Review of attachment 8846432 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/ipc/ContentParent.cpp
> @@ +1967,5 @@
> >    extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
> >  
> > +  nsAutoCString boolPrefs;
> > +  nsAutoCString intPrefs;
> > +  nsAutoCString stringPrefs;
> 
> Not sure the Auto part will do much good here since these strings are likely
> to be longer than 64 bytes.
what do you think of:

char boolBuf[1024];
nsFixedCString boolPref(boolBuf, 1024, 0)
... 
> @@ +1984,3 @@
> >        break;
> >      case nsIPrefBranch::PREF_STRING: {
> > +      nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());
> 
> What about:
> nsAdoptingCString value =
> Preferences::GetCString(ContentPrefs::GetContentPref(i));
> ?
Yup, that'll avoid a copy
Pushed by blassey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d58da9e3123f
Intermittent ts_paint | Found crashes after test run, terminating test r=billm
https://hg.mozilla.org/mozilla-central/rev/d58da9e3123f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Please request Aurora approval on this when you get a chance.
Blocks: 1303096
Flags: needinfo?(blassey.bugs)
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Approval Request Comment
[Feature/Bug causing the regression]: 1303096
[User impact if declined]: random hangs on win7 PGO in automation (no reports outside automation)
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: Not yet, need next orange factor report
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just using our string classes vs stdlib's
[String changes made/needed]: none
Flags: needinfo?(blassey.bugs)
Attachment #8846432 - Flags: approval-mozilla-aurora?
Whiteboard: [stockwell needswork] → [stockwell fixed]
Hi :blassey,
According to comment #20, is that Ok?
Flags: needinfo?(blassey.bugs)
I think there is only one report in comment 20 that happened after blassey's change: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=84349216&lineNumber=1891. That looks to me like a different, unrelated crash.
Yup, let's uplift it
Flags: needinfo?(blassey.bugs)
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Fix an intermittent failure. Aurora54+.
Attachment #8846432 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.