Intermittent ts_paint | Found crashes after test run, terminating test

RESOLVED FIXED in Firefox 54

Status

Testing
Talos
RESOLVED FIXED
8 months ago
3 months ago

People

(Reporter: Treeherder Bug Filer, Assigned: blassey)

Tracking

({intermittent-failure})

Version 3
mozilla55
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 fixed, firefox55 fixed)

Details

(Whiteboard: [stockwell fixed])

Attachments

(1 attachment)

(Reporter)

Description

8 months ago
treeherder
Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=80181485&repo=mozilla-central

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32-pgo/1488036766/mozilla-central_win7_ix_test-other-e10s-pgo-bm127-tests1-windows-build83.txt.gz

Comment 1

8 months ago
This is a timeout that would have previously been reported as bug 1339594.
See Also: → bug 1339594

Comment 2

8 months ago
8 failures in 812 pushes (0.01 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* autoland: 5
* mozilla-central: 3

Platform breakdown:
* windows7-32: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-02-20&endday=2017-02-26&tree=all

Comment 3

8 months ago
36 failures in 783 pushes (0.046 failures/push) were associated with this bug in the last 7 days. 

This is the #32 most frequent failure this week. 

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **

Repository breakdown:
* autoland: 18
* mozilla-inbound: 16
* mozilla-central: 2

Platform breakdown:
* windows7-32: 36

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-02-27&endday=2017-03-05&tree=all

Comment 4

8 months ago
There is some variation in the failure reports, but a great many are Windows pgo and e10s. Each of these failures looks like:

12:14:09     INFO -  Running cycle XX/20 for ts_paint test...
12:14:09     INFO -  TEST-INFO | started process 2488 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile http://localhost:49896/startup_test/tspaint_test.html)
12:16:39     INFO -  Timeout waiting for test completion; killing browser...
12:16:40     INFO -  Terminating psutil.Process(pid=2488, name=u'firefox.exe')
12:16:40     INFO -  Unable to kill process
12:16:40    ERROR -  Traceback (most recent call last):
12:16:40     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 144, in run_browser
12:16:40     INFO -      return_code = context.kill_process()
12:16:40     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
12:16:40     INFO -      procs = self.process.children()
12:16:40     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
12:16:40     INFO -      raise NoSuchProcess(self.pid, self._name)
12:16:40     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=2488, name=u'firefox.exe')
12:16:40     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/cyZ3yyA1TsqBeWTL9McUPA/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
12:16:48     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmpz68mdt\profile\minidumps\37b3d967-c6f8-4010-aad7-427594151d64.dmp c:\users\cltbld\appdata\local\temp\tmpsul3ty
12:16:56     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\37b3d967-c6f8-4010-aad7-427594151d64.dmp
12:16:56     INFO -  PROCESS-CRASH | ts_paint | application crashed [unknown top frame]

...a timeout 90 seconds after "started process".

The crash reports look consistent across these failures, with thread 0's stack containing:

12:16:56     INFO -  22  xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:8d026c601510 : 1968 + 0x21]
12:16:56     INFO -      eip = 0x60bcff69   esp = 0x001ad054   ebp = 0x001ad374
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  23  xul.dll!mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess(nsAString_internal const &,mozilla::hal::ProcessPriority,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 838 + 0xe]
12:16:56     INFO -      eip = 0x60bcd554   esp = 0x001ad37c   ebp = 0x001ad3b8
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  24  xul.dll!mozilla::dom::ContentParent::CreateBrowser(mozilla::dom::TabContext const &,mozilla::dom::Element *,mozilla::dom::ContentParent *) [ContentParent.cpp:8d026c601510 : 1149 + 0x16]
12:16:56     INFO -      eip = 0x60bcb372   esp = 0x001ad3c0   ebp = 0x001ad51c
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  25  xul.dll!nsFrameLoader::TryRemoteBrowser() [nsFrameLoader.cpp:8d026c601510 : 2944 + 0x12]
12:16:56     INFO -      eip = 0x606022b4   esp = 0x001ad524   ebp = 0x001ad670
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  26  xul.dll!nsFrameLoader::ShowRemoteFrame(mozilla::gfx::IntSizeTyped<mozilla::ScreenPixel> const &,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1221 + 0xe]
12:16:56     INFO -      eip = 0x605ffc48   esp = 0x001ad678   ebp = 0x001ad6a8
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  27  xul.dll!nsFrameLoader::Show(int,int,int,int,nsSubDocumentFrame *) [nsFrameLoader.cpp:8d026c601510 : 1091 + 0xd]
12:16:56     INFO -      eip = 0x6002a9ae   esp = 0x001ad6b0   ebp = 0x001ad794
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  28  xul.dll!nsSubDocumentFrame::ShowViewer() [nsSubDocumentFrame.cpp:8d026c601510 : 191 + 0x12]
12:16:56     INFO -      eip = 0x5fb5963d   esp = 0x001ad79c   ebp = 0x001ad7d0
12:16:56     INFO -      Found by: previous frame's frame pointer
12:16:56     INFO -  29  xul.dll!AsyncFrameInit::Run() [nsSubDocumentFrame.cpp:8d026c601510 : 91 + 0x5]
12:16:56     INFO -      eip = 0x5fb59592   esp = 0x001ad7d8   ebp = 0x001ad7dc
12:16:56     INFO -      Found by: call frame info
12:16:56     INFO -  30  xul.dll!nsContentUtils::RemoveScriptBlocker() [nsContentUtils.cpp:8d026c601510 : 5271 + 0x9]
12:16:57     INFO -      eip = 0x5fad0db3   esp = 0x001ad7e4   ebp = 0x001ad800
12:16:57     INFO -      Found by: call frame info

Comment 5

8 months ago
(In reply to Geoff Brown [:gbrown] from comment #4)
> ...a timeout 90 seconds after "started process".

s/90 seconds/150 seconds/ 

(consistent with the ts_paint timeout value)
Whiteboard: [stockwell needswork]
odd why this is pgo and rarely on opt.  I suspect we need to debug this more, and why is this talos specific, and not seen in other test harnesses?  One explanation could be that we have different preferences.  Another could be that the addons we are using are causing problems.

I see a list of talos exceptions, 2 of the 3 most problematic are this bug (ts_paint crash-could be timeout) and bug 934310 (sessionrestore - timeout).

both of these bugs are win7 pgo whereas the other bugs are primarily windows 8 pgo/opt mix.  All seem to be e10s specific.

:gbrown, do you see value in trying to debug the stack a bit?

for example, based on the stack, I suspect this line is problematic:
https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp?q=LaunchSubprocess&redirect_type=direct#1968

possibly adding code in there to see what mChildID is and if it is valid, maybe try/catch around some lines there with MOZ_ASSERTS instead?

the other route would be to look at the preferences and environment variables used in talos and compare those to reftest/mochitest/web-platform-tests to see if there is oversights.  Also since this is ts_paint/sessionrestore, there should only be the talos-powers extension (note, this is the only harness where we actually sign extensions)

thoughts on next steps?
Flags: needinfo?(gbrown)

Comment 7

7 months ago
The most recent crash reports have a different line number,

xul.dll!mozilla::dom::ContentParent::LaunchSubprocess(mozilla::hal::ProcessPriority) [ContentParent.cpp:a4653bf66378 : 1983 + 0x21]

pointing to

https://hg.mozilla.org/mozilla-central/annotate/6d0ea5812180/dom/ipc/ContentParent.cpp#l1983

  [1983]   boolPrefs << i << ':' << Preferences::GetBool(ContentPrefs::GetContentPref(i)) << '|';

Looking up the stack further, there are several frames like:

  __crt_seh_guarded_call<void>::operator()<<lambda_...

which I suppose are the << operations.

I would normally suspect Preferences::GetBool or ContentPrefs::GetContentPref of hanging before anything else on line [1983], but I don't see those in the crash reports, while I do often see WideCharToMultiByte, or similar.


:blassey - It looks like you wrote this code - https://hg.mozilla.org/mozilla-central/rev/5a43dbcb95ca, https://hg.mozilla.org/mozilla-central/diff/5a43dbcb95ca/dom/ipc/ContentParent.cpp . Can you help us make sense of these Talos-on-Windows hangs?
Flags: needinfo?(gbrown) → needinfo?(blassey.bugs)

Comment 8

7 months ago
btw, these crash-due-to-timeout bugs are the new expression of the Talos NoSuchProcess failures. These crash reports are consistent with https://bugzilla.mozilla.org/show_bug.cgi?id=1236770#c33 - it all goes back to the same change.

Comment 9

7 months ago
18 failures in 137 pushes (0.131 failures/push) were associated with this bug yesterday.  
Repository breakdown:
* mozilla-inbound: 9
* autoland: 5
* mozilla-central: 4

Platform breakdown:
* windows7-32: 18

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-03-09&endday=2017-03-09&tree=all
Created attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

https://treeherder.mozilla.org/#/jobs?repo=try&revision=822c58c7144cb7f19d1dad70a6093e9ddad60be8&selectedJob=83228607

Bill, it looks like std::stringstream's << operator is causing hangs on windows. From this try push it looks like using nsAutoCString for a buffer and nsPrintfCString for formatting avoids this.
Assignee: nobody → blassey.bugs
Flags: needinfo?(blassey.bugs)
Attachment #8846432 - Flags: review?(wmccloskey)

Comment 11

7 months ago
75 failures in 790 pushes (0.095 failures/push) were associated with this bug in the last 7 days. 

This is the #23 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **  

Repository breakdown:
* mozilla-inbound: 34
* autoland: 27
* mozilla-central: 13
* mozilla-aurora: 1

Platform breakdown:
* windows7-32: 72
* windows8-64: 2
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-03-06&endday=2017-03-12&tree=all
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Review of attachment 8846432 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +1967,5 @@
>    extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
>  
> +  nsAutoCString boolPrefs;
> +  nsAutoCString intPrefs;
> +  nsAutoCString stringPrefs;

Not sure the Auto part will do much good here since these strings are likely to be longer than 64 bytes.

@@ +1984,3 @@
>        break;
>      case nsIPrefBranch::PREF_STRING: {
> +      nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());

What about:
nsAdoptingCString value = Preferences::GetCString(ContentPrefs::GetContentPref(i));
?
Attachment #8846432 - Flags: review?(wmccloskey) → review+
:blassey, thanks for getting a patch up, let me know if you need help testing on try or getting this landed.
(In reply to Bill McCloskey (:billm) from comment #12)
> Comment on attachment 8846432 [details] [diff] [review]
> string_cat_hang.patch
> 
> Review of attachment 8846432 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/ipc/ContentParent.cpp
> @@ +1967,5 @@
> >    extraArgs.push_back(IsForBrowser() ? "-isForBrowser" : "-notForBrowser");
> >  
> > +  nsAutoCString boolPrefs;
> > +  nsAutoCString intPrefs;
> > +  nsAutoCString stringPrefs;
> 
> Not sure the Auto part will do much good here since these strings are likely
> to be longer than 64 bytes.
what do you think of:

char boolBuf[1024];
nsFixedCString boolPref(boolBuf, 1024, 0)
... 
> @@ +1984,3 @@
> >        break;
> >      case nsIPrefBranch::PREF_STRING: {
> > +      nsCString value(Preferences::GetCString(ContentPrefs::GetContentPref(i)).get());
> 
> What about:
> nsAdoptingCString value =
> Preferences::GetCString(ContentPrefs::GetContentPref(i));
> ?
Yup, that'll avoid a copy
nsFixedCString seems fine.

Comment 16

7 months ago
Pushed by blassey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d58da9e3123f
Intermittent ts_paint | Found crashes after test run, terminating test r=billm

Comment 17

7 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d58da9e3123f
Status: NEW → RESOLVED
Last Resolved: 7 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Please request Aurora approval on this when you get a chance.
Blocks: 1303096
status-firefox52: --- → unaffected
status-firefox53: --- → unaffected
status-firefox54: --- → affected
status-firefox-esr52: --- → unaffected
Flags: needinfo?(blassey.bugs)
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Approval Request Comment
[Feature/Bug causing the regression]: 1303096
[User impact if declined]: random hangs on win7 PGO in automation (no reports outside automation)
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: Not yet, need next orange factor report
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just using our string classes vs stdlib's
[String changes made/needed]: none
Flags: needinfo?(blassey.bugs)
Attachment #8846432 - Flags: approval-mozilla-aurora?
Whiteboard: [stockwell needswork] → [stockwell fixed]

Comment 20

7 months ago
11 failures in 777 pushes (0.014 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 6
* autoland: 4
* mozilla-central: 1

Platform breakdown:
* windows7-32: 9
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-03-13&endday=2017-03-19&tree=all
Hi :blassey,
According to comment #20, is that Ok?
Flags: needinfo?(blassey.bugs)
I think there is only one report in comment 20 that happened after blassey's change: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=84349216&lineNumber=1891. That looks to me like a different, unrelated crash.
Yup, let's uplift it
Flags: needinfo?(blassey.bugs)
Comment on attachment 8846432 [details] [diff] [review]
string_cat_hang.patch

Fix an intermittent failure. Aurora54+.
Attachment #8846432 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 25

7 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/f264eb5dd1eb
status-firefox54: affected → fixed

Comment 26

5 months ago
5 failures in 777 pushes (0.006 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5

Platform breakdown:
* windows8-64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-05-15&endday=2017-05-21&tree=all

Comment 27

3 months ago
10 failures in 720 pushes (0.014 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 7
* autoland: 2
* mozilla-central: 1

Platform breakdown:
* windows7-32: 5
* windows10-64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1342685&startday=2017-07-10&endday=2017-07-16&tree=all
You need to log in before you can comment on or make changes to this bug.