Closed Bug 1610712 Opened 6 years ago Closed 6 years ago

Intermittent arsenic.errors.SessionStartError: unknown error: invalid argument: can't kill an exited process

Categories

(Testing :: Condprofile, defect, P5)

defect

Tracking

(firefox75 fixed)

RESOLVED FIXED
mozilla75
Tracking Status
firefox75 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: tarek)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: malexandru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=285821066&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JeNzNbtgSsiCW7ziINiAVQ/runs/1/artifacts/public/logs/live_backing.log


[task 2020-01-21T19:46:32.843Z] Successfully built pyyaml
[task 2020-01-21T19:46:32.901Z] Installing collected packages: pyyaml
[task 2020-01-21T19:46:32.939Z] Successfully installed pyyaml-5.1.2
[task 2020-01-21T19:46:33.753Z] WARNING: You are using pip version 19.2.3, however version 20.0.1 is available.
[task 2020-01-21T19:46:33.753Z] You should consider upgrading via the 'pip install --upgrade pip' command.
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.015283] Verifying Desktop Firefox binary
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.036114] Working with Firefox 74.0a1
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.036203] environ({'SHELL': '/bin/bash', 'MOZ_FETCHES': '[{"artifact": "public/build/target.common.tests.tar.gz", "extract": true, "task": "ax8g1FP5RIOdJmeHcMlduw"}, {"artifact": "public/build/target.condprof.tests.tar.gz", "extract": true, "task": "ax8g1FP5RIOdJmeHcMlduw"}, {"artifact": "public/build/target.tar.bz2", "extract": true, "task": "ax8g1FP5RIOdJmeHcMlduw"}]', 'HOSTNAME': 'taskcluster-worker', 'MOZ_AUTOMATION': '1', 'TASKCLUSTER_ROOT_URL': 'https://firefox-ci-tc.services.mozilla.com', 'PWD': '/builds/worker/fetches/condprofile', 'LOGNAME': 'worker', 'CONDPROF_ROOT': 'fetches/condprofile', 'TASKCLUSTER_PORT_80_TCP_PORT': '80', 'HOME': '/builds/worker', 'TASKCLUSTER_WORKER_GROUP': 'aws', 'LANG': 'en_US.UTF-8', 'TASKCLUSTER_NAME': '/festive_carson/taskcluster', 'TASKCLUSTER_INSTANCE_TYPE': 'c5.4xlarge', 'TASKCLUSTER_PORT': 'tcp://172.17.0.2:80', 'TASKCLUSTER_WORKER_LOCATION': '{"availabilityZone":"us-west-2c","cloud":"aws","region":"us-west-2"}', 'TERM': 'xterm', 'TOOLTOOL_CACHE': '/builds/worker/tooltool-cache', 'TASKCLUSTER_PUBLIC_IP': '34.208.51.100', 'USER': 'worker', 'TASKCLUSTER_PORT_80_TCP_PROTO': 'tcp', 'MOZ_SCM_LEVEL': '3', 'DISPLAY': ':0', 'SHLVL': '0', 'TASKCLUSTER_PROXY_URL': 'http://taskcluster', 'SCCACHE_DISABLE': '1', 'LC_ALL': 'en_US.UTF-8', 'CONDPROF_ARCHIVES': '/builds/worker/archive', 'MOZ_FETCHES_DIR': '/builds/worker/fetches', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/builds/worker/bin:/builds/worker/fetches/bin', 'RUN_ID': '1', 'TASK_ID': 'JeNzNbtgSsiCW7ziINiAVQ', 'TASKCLUSTER_WORKER_TYPE': 'b-linux', 'TASKCLUSTER_PORT_80_TCP_ADDR': '172.17.0.2', 'TASKCLUSTER_PORT_80_TCP': 'tcp://172.17.0.2:80', 'OLDPWD': '/builds/worker', '_': 'bin/python3.7'})
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.036326] Archives directory is /builds/worker/archive
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.036419] Verifying Geckodriver binary presence
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.038975] Getting https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-linux64/artifacts/public/condprof/changelog.json
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.398612] Downloading https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-linux64/artifacts/public/condprof/changelog.json
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.649147] Got the changelog from TaskCluster
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.649616] Customization condprof/../condprof/customization/intermediate.json
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.649798] Building heavy x intermediate
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.649860] Loaded options for that scenario {'max_urls': 60}
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.649954] Getting https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-linux64/artifacts/public/condprof/profile-linux64-heavy-intermediate.tgz
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:34.841470] Downloading https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-linux64/artifacts/public/condprof/profile-linux64-heavy-intermediate.tgz
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:46:50.151958] Extracting the tarball content in /tmp/tmpznr_td9z/profile
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.532888] Success, we have a profile to work with
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.532959] Updating profile located at '/tmp/tmpznr_td9z/profile'
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.532978] Reading existing metadata at /tmp/tmpznr_td9z/profile/condprofile.json
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.533071] Starting the Gecko app...
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.533098] Writing geckodriver logs in /builds/worker/archive/geckodriver-heavy-intermediate.log
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.537042] Running Webdriver on port 50536
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.537110] Running Marionette on port 2828
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.646356] heavy scenario broke!
[task 2020-01-21T19:47:05.647Z] Traceback (most recent call last):
[task 2020-01-21T19:47:05.647Z] File "condprof/../condprof/creator.py", line 130, in build_profile
[task 2020-01-21T19:47:05.647Z] async with get_session(geckodriver, firefox_instance) as session:
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 16, in aenter
[task 2020-01-21T19:47:05.647Z] self.session = await start_session(self.service, self.browser, self.bind)
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 29, in start_session
[task 2020-01-21T19:47:05.647Z] return await driver.new_session(browser, bind=bind)
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/webdriver.py", line 56, in new_session
[task 2020-01-21T19:47:05.647Z] err_resp["error"], err_resp.get("message", ""), original_response
[task 2020-01-21T19:47:05.647Z] arsenic.errors.SessionStartError: unknown error: invalid argument: can't kill an exited process
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.646790] Could not start the session, check /builds/worker/archive/geckodriver-heavy-intermediate.log first
[task 2020-01-21T19:47:05.647Z] [2020-01-21T19:47:05.646852] Something went wrong on this one.
[task 2020-01-21T19:47:05.647Z] Traceback (most recent call last):
[task 2020-01-21T19:47:05.647Z] File "condprof/runner.py", line 175, in run_all
[task 2020-01-21T19:47:05.647Z] res.append(await one_run(scenario, customization))
[task 2020-01-21T19:47:05.647Z] File "condprof/runner.py", line 150, in one_run
[task 2020-01-21T19:47:05.647Z] ).run(not args.visible)
[task 2020-01-21T19:47:05.647Z] File "condprof/../condprof/creator.py", line 65, in run
[task 2020-01-21T19:47:05.647Z] metadata = await self.build_profile(device, headless)
[task 2020-01-21T19:47:05.647Z] File "condprof/../condprof/creator.py", line 130, in build_profile
[task 2020-01-21T19:47:05.647Z] async with get_session(geckodriver, firefox_instance) as session:
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 16, in aenter
[task 2020-01-21T19:47:05.647Z] self.session = await start_session(self.service, self.browser, self.bind)
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 29, in start_session
[task 2020-01-21T19:47:05.647Z] return await driver.new_session(browser, bind=bind)
[task 2020-01-21T19:47:05.647Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/webdriver.py", line 56, in new_session
[task 2020-01-21T19:47:05.647Z] err_resp["error"], err_resp.get("message", ""), original_response
[task 2020-01-21T19:47:05.647Z] arsenic.errors.SessionStartError: unknown error: invalid argument: can't kill an exited process
[task 2020-01-21T19:47:05.647Z] Traceback (most recent call last):
[task 2020-01-21T19:47:05.648Z] File "condprof/runner.py", line 194, in <module>
[task 2020-01-21T19:47:05.648Z] main()
[task 2020-01-21T19:47:05.648Z] File "condprof/runner.py", line 184, in main
[task 2020-01-21T19:47:05.648Z] failures, results = loop.run_until_complete(run_all(args))
[task 2020-01-21T19:47:05.648Z] File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
[task 2020-01-21T19:47:05.648Z] return future.result()
[task 2020-01-21T19:47:05.648Z] File "condprof/runner.py", line 175, in run_all
[task 2020-01-21T19:47:05.648Z] res.append(await one_run(scenario, customization))
[task 2020-01-21T19:47:05.648Z] File "condprof/runner.py", line 150, in one_run
[task 2020-01-21T19:47:05.648Z] ).run(not args.visible)
[task 2020-01-21T19:47:05.648Z] File "condprof/../condprof/creator.py", line 65, in run
[task 2020-01-21T19:47:05.648Z] metadata = await self.build_profile(device, headless)
[task 2020-01-21T19:47:05.648Z] File "condprof/../condprof/creator.py", line 130, in build_profile
[task 2020-01-21T19:47:05.648Z] async with get_session(geckodriver, firefox_instance) as session:
[task 2020-01-21T19:47:05.648Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 16, in aenter
[task 2020-01-21T19:47:05.648Z] self.session = await start_session(self.service, self.browser, self.bind)
[task 2020-01-21T19:47:05.648Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/init.py", line 29, in start_session
[task 2020-01-21T19:47:05.648Z] return await driver.new_session(browser, bind=bind)
[task 2020-01-21T19:47:05.648Z] File "/builds/worker/fetches/condprofile/lib/python3.7/site-packages/arsenic/webdriver.py", line 56, in new_session
[task 2020-01-21T19:47:05.648Z] err_resp["error"], err_resp.get("message", ""), original_response
[task 2020-01-21T19:47:05.648Z] arsenic.errors.SessionStartError: unknown error: invalid argument: can't kill an exited process
[task 2020-01-21T19:47:05.709Z] Unclosed client session
[task 2020-01-21T19:47:05.709Z] client_session: <aiohttp.client.ClientSession object at 0x7fa7ab1744a8>
[task 2020-01-21T19:47:05.709Z] Unclosed connector
[task 2020-01-21T19:47:05.709Z] connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7fa7ab18f4c8>, 5830.749148792)]']
[task 2020-01-21T19:47:05.709Z] connector: <aiohttp.connector.TCPConnector object at 0x7fa7ab174320>
[fetches 2020-01-21T19:47:05.729Z] removing /builds/worker/fetches
[fetches 2020-01-21T19:47:06.614Z] finished
[taskcluster 2020-01-21 19:47:08.601Z] === Task Finished ===
[taskcluster 2020-01-21 19:47:08.852Z] Unsuccessful task run with exit code: 1 completed in 262.881 seconds

Assignee: nobody → tarek
Component: Task Configuration → Condprofile
Product: Firefox Build System → Testing

This happens because when geckodriver starts, we get this error:

"""
This profile was last used with a newer version of this application. Please create a new profile.
"""

We need to figure out why there's a mismatch between the profile version and the firefox that is running here

pass the --allow-downgrade option

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

(In reply to Tarek Ziadé (:tarek) from comment #1)

We need to figure out why there's a mismatch between the profile version and the firefox that is running here

Is there a follow-up where this is being figured out? Passing "--allow-downgrade" defeats the primary mechanism that avoids bug 1246615 situations and may be resulting in a class of test automation failures like bug 1609295 where Firefox runs in a fundamentally broken state because it can't handle a profile from the future (or results in data from the profile being silently purged as a failsafe to workaround the profile from the future which might result in other confusing failures).

Flags: needinfo?(tarek)

No, it's in my pile -- not a P1 because it happened only in a few occasions (probably a bad timing when the Firefox nightly version has just changed) and has no impact to my knowledge.

Your comment seems to imply two things:

  1. is --allow-downgrade completely broken ? are we planning to remove it ?

  2. using that option would be the cause of failures in bug 1609295 ?

I am wondering why it would be the case because once the profile is updated by the cron, it "replaces" the previous artifact,
(we don't create new profiles every day, we update the profile from the previous day) -- so if --allow-downgrade would put the profile
in a completely broken state, things would be broken on every run and every day since the last time the profile was updated, which is
not the case.

Flags: needinfo?(tarek) → needinfo?(bugmail)

(In reply to Tarek Ziadé (:tarek) from comment #10)

  1. is --allow-downgrade completely broken ? are we planning to remove it ?

https://groups.google.com/d/msg/mozilla.dev.platform/hVeJSYjSnO4/_LCs6LUuBAAJ provides more context, but I'd characterize it as an "I know what I'm doing and I know I'm voiding my warranty" flag similar in nature to a lot of dangerous preferences. It probably should have a much scarier description under "--help".

  1. using that option would be the cause of failures in bug 1609295 ?

I am wondering why it would be the case because once the profile is updated by the cron, it "replaces" the previous artifact,
(we don't create new profiles every day, we update the profile from the previous day) -- so if --allow-downgrade would put the profile
in a completely broken state, things would be broken on every run and every day since the last time the profile was updated, which is
not the case.

In terms of version-checking and breakage, the situation is this:

  • nsAppRunner checks the contents of a compatibility.ini file in the profile directory at startup in CheckCompatibility which uses CompareCompatVersions which checks for version number changes and then buildid revisions on the app and platform.
[Compatibility]
LastVersion=76.0a1_20200313095322/20200313095322
LastOSABI=Linux_x86_64-gcc3
LastPlatformDir=/PATH/TO/FIREFOX/BINARY/DIR
LastAppDir=/PATH/TO/FIREFOX/BINARY/DIR/browser
  • This is really a hygiene/canary/warranty check. The fact that the build number is from the future on its own doesn't mean there will be problems. The problem comes when a system that stores data in the profile revs its schema and the older Firefox version is confronted with version (N+1) and it only understands version (N).
  • This is intended primarily to make sure that the user's profile is not bouncing between channels/Firefox release versions, but it also verifies that the user isn't going back to previous builds of the same revision which can result in the same problems, but with decreased probability of problems.
  • Any breakage from downgrading is going to depend on what subsystems, if any, changed their formats and potentially what changed. Some subsystems like Places perform a naive soft downgrade where they just mark that they ran with an older schema version and hope for the best. (This lets upgrade logic run again.) Some subsystems like QuotaManager end up also having to do harder version cliffs because they expose data to web content and a downgrade can result in loss of semantics. (QuotaManager/clients also do minor version upgrades that can have soft downgrade semantics. QuotaManager is also planning to start wiping out data from the future when encountered, which turns failure of the "things using this subsystem no longer work" into "system/user data gets thrown away and things have to work from scratch".

So passing the flag won't result in breakage 100% of the time. But the big question is why the flag is ever needed. It suggests that a single profile is being used for different channels/versions and/or try builds where someone can be pushing an older version or something like that. If the time-travelling encountered is only a small delta it may appear like there's just some weird intermittent failures that go away after rebasing or something like that.

If we want tests to have fidelity (other than a test that's explicitly about checking what happens in downgrades for some kind of release management heads-up or some kind of lazy bisection), then we should never be using "--allow-downgrade".

I know in some Slack thread it sounded like raptor tests at least were using a profile generated with a nightly profile against multiple channels to speed up the test running or something. In cases like this it would be appropriate to use some combination of:

  • Have a profile per channel.
  • Make sure when picking a pre-built profile that instead of always using "latest" from taskcluster that some mechanism is used to walk the hg revision chains so that it's "find the first artifact ARTY that's an ancestor of revision REV".
Flags: needinfo?(bugmail)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: