Closed Bug 1598924 Opened 4 years ago Closed 4 years ago

Windows Talos sessionrestore_no_auto_restore | timeout

Categories

(Testing :: Talos, defect, P1)

Version 3
defect

Tracking

(firefox-esr68 unaffected, firefox74 wontfix, firefox75 wontfix, firefox76 fixed)

RESOLVED FIXED
mozilla76
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- fixed

People

(Reporter: RaulG, Assigned: Gijs)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed:backout])

Attachments

(1 file)

Summary: TEST-UNEXPECTED-ERROR | sessionrestore_no_auto_restore | timeout → Windows Talos sessionrestore_no_auto_restore | timeout

This seems to be the important part of the log:

[task 2019-11-23T22:56:20.222Z] 22:56:20     INFO -  TEST-INFO | started process 4604 (C:\Users\task_1574539317\build\application\firefox\firefox -wait-for-browser -no-deelevate -profile c:\users\task_1574539317\appdata\local\temp\tmp2svst3\profile about:home)
[task 2019-11-23T22:56:26.965Z] 22:56:26     INFO -  PID 8064 | console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.jsm", 433))
[task 2019-11-23T22:56:42.356Z] 22:56:42     INFO -  PID 8064 |
[task 2019-11-23T22:56:42.356Z] 22:56:42     INFO -  PID 8064 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2019-11-23T22:56:42.356Z] 22:56:42     INFO -  PID 8064 |
[task 2019-11-23T22:56:42.358Z] 22:56:42     INFO -  PID 8064 |
[task 2019-11-23T22:56:42.358Z] 22:56:42     INFO -  PID 8064 | ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
[task 2019-11-23T22:56:42.358Z] 22:56:42     INFO -  PID 8064 |
[task 2019-11-23T23:11:20.216Z] 23:11:20     INFO -  Timeout waiting for test completion; killing browser...
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -  TEST-UNEXPECTED-ERROR | sessionrestore_no_auto_restore | timeout
[task 2019-11-23T23:11:22.265Z] 23:11:22    ERROR -  Traceback (most recent call last):
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -    File "C:\Users\task_1574539317\build\tests\talos\talos\run_tests.py", line 290, in run_tests
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -      talos_results.add(mytest.runTest(browser_config, test))
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -    File "C:\Users\task_1574539317\build\tests\talos\talos\ttest.py", line 64, in runTest
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -      return self._runTest(browser_config, test_config, setup)
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -    File "C:\Users\task_1574539317\build\tests\talos\talos\ttest.py", line 216, in _runTest
[task 2019-11-23T23:11:22.265Z] 23:11:22     INFO -      debugger_args=browser_config['debugger_args']
[task 2019-11-23T23:11:22.266Z] 23:11:22     INFO -    File "C:\Users\task_1574539317\build\tests\talos\talos\talos_process.py", line 143, in run_browser
[task 2019-11-23T23:11:22.266Z] 23:11:22     INFO -      raise TalosError("timeout")
[task 2019-11-23T23:11:22.266Z] 23:11:22     INFO -  TalosError: timeout
[task 2019-11-23T23:11:22.266Z] 23:11:22     INFO -  TEST-INFO took 943588ms
[task 2019-11-23T23:11:22.266Z] 23:11:22     INFO -  SUITE-END | took 1190s
[task 2019-11-23T23:14:25.284Z] 23:14:25     INFO -  WARNING | IO Completion Port failed to signal process shutdown
[task 2019-11-23T23:14:25.284Z] 23:14:25     INFO -  Parent process 4604 exited with children alive:
[task 2019-11-23T23:14:25.284Z] 23:14:25     INFO -  PIDS: 8132, 6828
[task 2019-11-23T23:14:25.284Z] 23:14:25     INFO -  Attempting to kill them, but no guarantee of success
[task 2019-11-23T23:14:25.328Z] 23:14:25    ERROR - Return code: 2
[task 2019-11-23T23:14:25.328Z] 23:14:25  WARNING - setting return code to 2
[task 2019-11-23T23:14:25.328Z] 23:14:25    ERROR - # TBPL FAILURE #

I'm going to add some additional instrumentation to see if I can find out why the SessionFile is closed.

Assignee: nobody → mconley
See Also: → 1589796

Looks like the SessionFile stuff is a red herring. Even if we don't throw there, the test still times out. Adding more logging.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0125395d774fe0f776fe22f17a9c0524fb768f26

Okay, I think I have a sense of what's happening.

Before the run where the test is hanging, we're forcibly killing the parent process after 5 seconds of inactivity. Presumably, we're taking too long to shutdown in the parent process (likely due to bug 1597312).

Killing the parent process probably creates a busted session in the profile, or puts it in the "crash recovery state", which is why the test times out - the session isn't actually being restored.

I suspect this is bug 1597312 or similar due to the error message during the shutdown hang:

[task 2019-11-26T20:46:20.503Z] 20:46:20     INFO -  PID 5948 | console.warn: services.settings: Error retrieving the getLastModified timestamp from blocklists/addons RemoteSettingClient Message: Error: IndexedDB getLastModified() The operation failed for reasons unrelated to the database itself and not covered by any other error code.

Hey asuth, since I might have a fairly straight-forward of recreating bug 1597312 in automation (on Windows-only though :/), is there anything useful I can provide that might help solve bug 1597312? Or do you have any suggestions on how I can work around it, presuming this is the issue?

Flags: needinfo?(mconley) → needinfo?(bugmail)

I'm not sure I understand enough context about what's happening here but...

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #6)

Before the run where the test is hanging, we're forcibly killing the parent process after 5 seconds of inactivity. Presumably, we're taking too long to shutdown in the parent process (likely due to bug 1597312).

Shouldn't this trigger a test failure at this point?

Killing the parent process probably creates a busted session in the profile, or puts it in the "crash recovery state", which is why the test times out - the session isn't actually being restored.

Shouldn't this also cause a test failure?

I'm not trying to avoid fixing bugs in our components, but there are a ton of reasons shutdown can timeout and bug 1597312 should not be an issue given that bug 1594521 should mean that remote settings should stop doing things at shutdown. (Activity stream could perhaps be ignoring shutdown, though?)

It seems like maybe the sessionstore base test class at https://searchfox.org/mozilla-central/rev/4df8821c1b824db5f40f381f48432f219d99ae36/testing/talos/talos/test.py#207 should:

  1. Set a flag or something so that if Firefox exits with an error code or reports a crash that the test fails. The crash should then be symbolicated and the minidump artifact exposed. This should help immediately demonstrate what's actually causing the hang and make it directly actionable instead of a timeout.
  2. Have a hook that checks that the reinstall files are present and they don't indicate a crash happened.
Flags: needinfo?(bugmail)

Mike is there a way to disable this test on windows until you have a fix?

Flags: needinfo?(mconley)

Just a little update here - I have crash reports being (apparently) generated on try during shutdown hangs, but for some reason they're not being collected / dumped / put into the MOZ_UPLOAD_DIR. Still trying to figure out why.

Give me until the end of the day to sort this out. If I can't get it, the best idea would probably be to back out bug 1545438 for now until I sort this out and re-land.

Flags: needinfo?(mconley)

Thank you, will do.

Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][Comment 15]

Hey apavel, can we please back out bug 1545438 (b4b3dc473bc4) for mozilla-central and mozilla-beta until I can get this figured out?

Flags: needinfo?(apavel)

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #19)

Hey apavel, can we please back out bug 1545438 (b4b3dc473bc4) for mozilla-central and mozilla-beta until I can get this figured out?

Backed out on both trees

Central backout: https://hg.mozilla.org/mozilla-central/rev/32f84899fc000d95337a25c14a27470d0c52e3ee
Beta backout: https://hg.mozilla.org/releases/mozilla-beta/rev/283b4b57c72dad92198af43fea25e42bd15c5047

Flags: needinfo?(apavel)
Whiteboard: [stockwell disable-recommended][Comment 15] → [stockwell fixed:backout]

The priority flag is not set for this bug.
:rwood, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rwood)
Flags: needinfo?(rwood)
Priority: -- → P2

This is a timeout where we have comparatively little info in general. For a specific case that seems to make it worse, I'm investigating in bug 1545438, but this is happening on current central even without the patch from that bug -- but there's very little information on what causes it. What seems to happen is:

  1. we hang on shutdown (or at least, don't respond / output for 5 seconds? Loop is at https://searchfox.org/mozilla-central/rev/2fd8ffcf087bc59a8e5c962965bbb7bf230bcd28/testing/talos/talos/talos_process.py#150 ).
  2. talos_process terminates the launcher process
  3. the main process stays up
  4. we start another run on the same profile
  5. we hit the "this profile is in use" dialog on startup, and then end up hanging and hitting the main talos job timeout

I think there are 2 issues here:

  1. when terminating things, we should terminate the launcher process as well as the main process
  2. when we terminate things, we should be attaching a crash report to the job so we can debug what happened.

:aklotz, for the former, how can we update talos_process.py to do the Right Thing and also end the "main" process -- ideally while...
:gsvelto: how do we make sure we get a crash report for these crashes and/or shutdown hangs ?

Flags: needinfo?(gsvelto)
Flags: needinfo?(aklotz)

(In reply to :Gijs (back Thu 12; he/him) from comment #26)

:gsvelto: how do we make sure we get a crash report for these crashes and/or shutdown hangs ?

It seems to me that the code is already doing the right thing by calling mozcrash.kill_and_get_minidump() (see here). For that to work however an exception handler must be installed and the launcher process doesn't have one which is why I think we're not getting a crash report. On the other hand killing the browser process that way should always yield a crash report.

Flags: needinfo?(gsvelto)

(In reply to :Gijs (he/him) from comment #26)

:aklotz, for the former, how can we update talos_process.py to do the Right Thing and also end the "main" process -- ideally while...

I seem to remember having a discussion with Henrik on how to do this with job objects for a different harness. Can we adapt that code to Talos?

Flags: needinfo?(aklotz) → needinfo?(hskupin)

Hm, the only thing I can remember was for bug 1493796 but that is to keep a handle to the job process after a restart. Sorry, but I cannot remember anything else, and also my email archive doesn't contain something related to that.

Flags: needinfo?(hskupin)
See Also: → 1622257

I've filed bug 1622257 on fixing talos to shut down the launcher process, and on making it report those shutdowns as oranges on treeherder (instead of the timeout, or nothing (!) on macos/linux).

(oops)

Assignee: mconley → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Priority: P2 → P1

By and large, this change accomplishes two things:

  1. Run db.close() in finally clauses so that even if db access fails, we close
    our connections. It also tries to avoid waiting on other, non-DB operations
    before calling close, to avoid the DB connection needlessly hanging around.
  2. Intercept all async database operations from the remote settings client to
    kinto and ensuring they complete before the end of profile-before-change.
    Any operations started after Services.startup.isShuttingDown (so after
    quit/restart is initiated by the user) will throw. Operations started
    beforehand are put in a set of operations, and remove themselves once
    complete. We AsyncShutdown block on that set of operations completing.
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3b7ccffc0546
make remote settings handle shutdown gracefully, r=leplatrem,asuth
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/354489628b39
make remote settings handle shutdown gracefully, r=leplatrem,asuth
Flags: needinfo?(gijskruitbosch+bugs)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76

(In reply to Intermittent Failures Robot from comment #40)

4 failures in 4773 pushes (0.001 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:

  • autoland: 1

Hrmpf. I filed bug 1625888.

Blocks: 1629005
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: