Closed Bug 1699423 Opened 3 years ago Closed 3 years ago

osx tier-3 browsertime chrome/chromium pageload tests are failing to run on new osx 10.15 machines

Categories

(Testing :: Raptor, defect, P2)

defect

Tracking

(firefox-esr78 unaffected, firefox-esr91 unaffected, firefox86 unaffected, firefox87 unaffected, firefox88 wontfix, firefox92 wontfix, firefox93 wontfix, firefox94 fixed)

RESOLVED FIXED
94 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox86 --- unaffected
firefox87 --- unaffected
firefox88 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- fixed

People

(Reporter: jmaher, Assigned: dhouse)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

pageload tests for chrome/chromium browsers via browsertime are failing on osx:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=8ce6e030833655e82d0e576cd937d442b4178b12

these are tier-3, but are expected to run. I have a patch to turn these on.

A few errors from the log:

[task 2021-03-18T05:21:27.486Z] 05:21:27     INFO -  raptor-perftest Info: main raptor init, config is: {'verbose': False, 'e10s': True, 'app': 'chromium', 'symbols_path': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/btBYMy-jR4eSH7-ATzWB1A/artifacts/public/build/target.crashreporter-symbols.zip', 'power_test': False, 'run_local': False, 'chimera': True, 'cold': True, 'gecko_profile_interval': None, 'binary': '/opt/worker/tasks/task_161604478481098/fetches/chrome-mac/Chromium.app/Contents/MacOS/Chromium', 'enable_control_server_wait': False, 'no_conditioned_profile': True, 'environment': {'MOZ_WEBRENDER': '1'}, 'platform': 'mac', 'conditioned_profile_scenario': 'settled', 'extra_prefs': {}, 'gecko_profile': False, 'gecko_profile_entries': None, 'disable_perf_tuning': False, 'host': '127.0.0.1', 'enable_fission': False, 'memory_test': False, 'enable_webrender': True, 'obj_path': None, 'live_sites': False, 'device_name': None, 'project': 'mozilla-central', 'cpu_test': False, 'is_release_build': False, 'using_condprof': False, 'processor': 'x86_64'}
[task 2021-03-18T05:21:27.488Z] 05:21:27     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_161604478481098/build/tests/raptor/raptor/profile_data/base
[task 2021-03-18T05:21:27.490Z] 05:21:27     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_161604478481098/build/tests/raptor/raptor/profile_data/common
[task 2021-03-18T05:21:27.490Z] 05:21:27     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_161604478481098/build/tests/raptor/raptor/profile_data/perf
[task 2021-03-18T05:21:27.490Z] 05:21:27     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_161604478481098/build/tests/raptor/raptor/profile_data/raptor
[task 2021-03-18T05:21:27.490Z] 05:21:27     INFO -  raptor-perftest Info: Local browser profile: /var/folders/l1/qh5x5zxd1j36jvq5yg7nn4r0000014/T/tmpoBU9KC.mozrunner/Default
[task 2021-03-18T05:21:27.491Z] 05:21:27     INFO -  raptor-browsertime Info: Removing mozprofile delimiters from browser profile
[task 2021-03-18T05:21:27.491Z] 05:21:27 CRITICAL -  raptor-browsertime Critical: [chromium] Exception [Errno 2] No such file or directory: '/var/folders/l1/qh5x5zxd1j36jvq5yg7nn4r0000014/T/tmpoBU9KC.mozrunner/Default/user.js' while removing mozprofile delimiters
[task 2021-03-18T05:21:27.491Z] 05:21:27     INFO -  raptor-browsertime Info: cwd: '/opt/worker/tasks/task_161604478481098/build'

I think ^ is not a real error, but this next one looks more problematic is:

task 2021-03-18T05:21:28.242Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - File mitmproxy-5.1.1-osx.tar.gz fetched from https://tooltool.mozilla-releng.net/ as /opt/worker/tasks/task_161604478481098/testing/mozproxy/mitmdump-5.1.1/tmp8K2VXn
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - File integrity verified, renaming tmp8K2VXn to mitmproxy-5.1.1-osx.tar.gz
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - Creating cache in /builds/tooltool_cache...
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info: WARNING - Impossible to add file mitmproxy-5.1.1-osx.tar.gz to cache folder /builds/tooltool_cache
[task 2021-03-18T05:21:28.268Z] 05:21:28    ERROR -  raptor-mitmproxy Info: Traceback (most recent call last):
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info:   File "/opt/worker/tasks/task_161604478481098/mozharness/external_tools/tooltool.py", line 1159, in fetch_files
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info:     os.makedirs(cache_folder, 0o0700)
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info:   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 150, in makedirs
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info:     makedirs(head, mode)
[task 2021-03-18T05:21:28.268Z] 05:21:28     INFO -  raptor-mitmproxy Info:   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
[task 2021-03-18T05:21:28.269Z] 05:21:28     INFO -  raptor-mitmproxy Info:     mkdir(name, mode)
[task 2021-03-18T05:21:28.269Z] 05:21:28     INFO -  raptor-mitmproxy Info: OSError: [Errno 30] Read-only file system: '/builds'
[task 2021-03-18T05:21:28.269Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - untarring "mitmproxy-5.1.1-osx.tar.gz"
[task 2021-03-18T05:21:28.423Z] 05:21:28     INFO -  raptor-mitmproxy Info: downloading mitmproxy pageset
[task 2021-03-18T05:21:28.528Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - File mitm5-linux-firefox-amazon.zip not present in local cache folder /builds/tooltool_cache
[task 2021-03-18T05:21:28.528Z] 05:21:28     INFO -  raptor-mitmproxy Info: INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
[task 2021-03-18T05:21:29.060Z] 05:21:29     INFO -  raptor-mitmproxy Info: INFO - File mitm5-linux-firefox-amazon.zip fetched from https://tooltool.mozilla-releng.net/ as /opt/worker/tasks/task_161604478481098/testing/mozproxy/tmpfSOszU
[task 2021-03-18T05:21:29.061Z] 05:21:29     INFO -  raptor-mitmproxy Info: INFO - File integrity verified, renaming tmpfSOszU to mitm5-linux-firefox-amazon.zip
[task 2021-03-18T05:21:29.062Z] 05:21:29     INFO -  raptor-mitmproxy Info: INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-18T05:21:29.062Z] 05:21:29     INFO -  raptor-mitmproxy Info: INFO - Creating cache in /builds/tooltool_cache...
[task 2021-03-18T05:21:29.064Z] 05:21:29     INFO -  raptor-mitmproxy Info: WARNING - Impossible to add file mitm5-linux-firefox-amazon.zip to cache folder /builds/tooltool_cache
[task 2021-03-18T05:21:29.065Z] 05:21:29    ERROR -  raptor-mitmproxy Info: Traceback (most recent call last):
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:   File "/opt/worker/tasks/task_161604478481098/mozharness/external_tools/tooltool.py", line 1159, in fetch_files
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:     os.makedirs(cache_folder, 0o0700)
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 150, in makedirs
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:     makedirs(head, mode)
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info:     mkdir(name, mode)
[task 2021-03-18T05:21:29.065Z] 05:21:29     INFO -  raptor-mitmproxy Info: OSError: [Errno 30] Read-only file system: '/builds'
[task 2021-03-18T05:21:29.121Z] 05:21:29     INFO -  raptor-mitmproxy Info: Adding /opt/worker/tasks/task_161604478481098/testing/mozproxy/mitm5-linux-firefox-amazon.zip to recording list

^ shows /builds as readonly, but we have configured 'tooltool_cache': '/builds/tooltool_cache', so it is using that- but then it self recovers and uses the temp folder location in /opt

Lastly we fail to get results:

[task 2021-03-18T05:21:31.621Z] 05:21:31     INFO -  raptor-browsertime Info: PATH: /opt/worker/tasks/task_161604478481098/fetches/ffmpeg-4.1.1-macos64-static/bin:/opt/worker/tasks/task_161604478481098/build/venv/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
[task 2021-03-18T05:21:32.739Z] 05:21:32     INFO -  raptor-browsertime Info: Running tests using Chrome - 25 iteration(s)
[task 2021-03-18T05:21:34.380Z] 05:21:34     INFO -  raptor-browsertime Info: Starting a browsertime pageload
[task 2021-03-18T05:21:34.380Z] 05:21:34     INFO -  raptor-browsertime Info: Waiting for 30000 ms (post_startup_delay)
[task 2021-03-18T05:22:04.432Z] 05:22:04     INFO -  raptor-browsertime Info: Navigating to about:blank
[task 2021-03-18T05:22:04.433Z] 05:22:04     INFO -  raptor-browsertime Info: Navigating to url about:blank iteration 1
[task 2021-03-18T05:22:09.527Z] 05:22:09     INFO -  raptor-browsertime Info: Cycle 0, waiting for 1000 ms
[task 2021-03-18T05:22:10.570Z] 05:22:10     INFO -  raptor-browsertime Info: Cycle 0, starting the measure
[task 2021-03-18T05:22:10.570Z] 05:22:10     INFO -  raptor-browsertime Info: Testing url https://www.amazon.com/s?k=laptop&ref=nb_sb_noss_1 iteration 1
[task 2021-03-18T05:24:10.638Z] 05:24:10     INFO -  raptor-mitmproxy Info: MitmproxyDesktop stop!!
[task 2021-03-18T05:24:10.638Z] 05:24:10     INFO -  raptor-mitmproxy Info: Mitmproxy stop!!
[task 2021-03-18T05:24:10.638Z] 05:24:10     INFO -  raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 916
[task 2021-03-18T05:24:10.804Z] 05:24:10     INFO -  raptor-mitmproxy Info: Successfully killed the mitmproxy playback process
[task 2021-03-18T05:24:10.804Z] 05:24:10     INFO -  perftest-results-handler Info: retrieving browsertime test results
[task 2021-03-18T05:24:10.804Z] 05:24:10     INFO -  perftest-results-handler Critical: [chromium] unable to find browsertime results at /opt/worker/tasks/task_161604478481098/build/blobber_upload_dir/browsertime-results/amazon/browsertime.json
[task 2021-03-18T05:24:10.804Z] 05:24:10    ERROR -  raptor-main Critical: [chromium] TEST-UNEXPECTED-FAIL: no raptor test results were found for amazon
[task 2021-03-18T05:24:10.918Z] 05:24:10    ERROR - Return code: 1

I don't know why we fail to get results. Pageload tests work on firefox browser in the referenced try push

as a side note, these were accidentally turned off on Monday March 8th, but prior to that they were running in automation with the same failures.

We know that chrome/chromium works due to raptor results, we know that pageload works due to firefox results, I am not sure what is broken with browsertime chrome/chromium pageload. These also worked well on 10.14 machines.

:sparky, I am not sure of how to interpret all of this, do you have a recommendation for where to hack on this and/or debug this?

Flags: needinfo?(gmierz2)

Hmm, it looks like browsertime is having trouble making the url request based on the mitmproxy log which shows that the amazon url is never requested (https://firefoxci.taskcluster-artifacts.net/Fd9JAxMJT9ezQYhI4ZF_8g/0/public/test_info/mitmproxy.log). You could try to get more information from it by providing the --verbose argument to the tasks. Also, you can remove this line so we don't immediately kill browsertime to get more information: https://searchfox.org/mozilla-central/source/testing/raptor/raptor/browsertime/base.py#425

Flags: needinfo?(gmierz2)

I have a log from a try run

[task 2021-03-18T22:09:26.779Z] 22:09:26     INFO -  raptor-browsertime Info: [browsertime] PageCompleteCheck returned true
[task 2021-03-18T22:09:26.779Z] 22:09:26     INFO -  raptor-browsertime Info: [browsertime] Cycle 0, waiting for 1000 ms
[task 2021-03-18T22:09:27.785Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime] Cycle 0, starting the measure
[task 2021-03-18T22:09:27.786Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.command.measure] Testing url https://www.amazon.com/s?k=laptop&ref=nb_sb_noss_1 iteration 1
[task 2021-03-18T22:09:27.786Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.video] Add orange color
[task 2021-03-18T22:09:27.807Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.video] Start FFMPEG with ["-hide_banner","-video_size","1024x768","-f","x11grab","-framerate",30,"-probesize","10M","-y","-draw_mouse","0","-i",":0.0+0,0","-codec:v","libx264rgb","-crf",0,"-preset","ultrafast","-vf","pad=ceil(iw/2)*2:ceil(ih/2)*2","/home/cltbld/tasks/task_161610527768212/build/blobber_upload_dir/browsertime-results/amazon/x11-tmp.mp4"]
[task 2021-03-18T22:09:27.956Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.video] Input #0, x11grab, from ':0.0+0,0':
[task 2021-03-18T22:09:27.956Z] 22:09:27     INFO -  raptor-browsertime Info:   Duration: N/A, start: 1616105367.841895, bitrate: N/A
[task 2021-03-18T22:09:27.957Z] 22:09:27     INFO -  raptor-browsertime Info:
[task 2021-03-18T22:09:27.957Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.readableStream.on.data (/home/cltbld/tasks/task_161610527768212/fetches/browsertime/node_modules/browsertime/lib/video/screenRecording/desktop/ffmpegRecorder.js:76:13)
[task 2021-03-18T22:09:27.957Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.emit (events.js:203:15)
[task 2021-03-18T22:09:27.958Z] 22:09:27     INFO -  raptor-browsertime Info:     at addChunk (_stream_readable.js:288:12)
[task 2021-03-18T22:09:27.958Z] 22:09:27     INFO -  raptor-browsertime Info:     at readableAddChunk (_stream_readable.js:269:11)
[task 2021-03-18T22:09:27.958Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.Readable.push (_stream_readable.js:224:10)
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:     at Pipe.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.video]     Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 1024x768, 30 fps, 30 tbr, 1000k tbn, 1000k tbc
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info: Stream mapping:
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:   Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264rgb))
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info: Press [q] to stop, [?] for help
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.readableStream.on.data (/home/cltbld/tasks/task_161610527768212/fetches/browsertime/node_modules/browsertime/lib/video/screenRecording/desktop/ffmpegRecorder.js:76:13)
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.emit (events.js:203:15)
[task 2021-03-18T22:09:27.959Z] 22:09:27     INFO -  raptor-browsertime Info:     at addChunk (_stream_readable.js:288:12)
[task 2021-03-18T22:09:27.960Z] 22:09:27     INFO -  raptor-browsertime Info:     at readableAddChunk (_stream_readable.js:269:11)
[task 2021-03-18T22:09:27.960Z] 22:09:27     INFO -  raptor-browsertime Info:     at Socket.Readable.push (_stream_readable.js:224:10)
[task 2021-03-18T22:09:27.960Z] 22:09:27     INFO -  raptor-browsertime Info:     at Pipe.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
[task 2021-03-18T22:09:27.961Z] 22:09:27     INFO -  raptor-browsertime Info: [browsertime.video] Started ffmpeg with -hide_banner -video_size 1024x768 -f x11grab -framerate 30 -probesize 10M -y -draw_mouse 0 -i :0.0+0,0 -codec:v libx264rgb -crf 0 -preset ultrafast -vf pad=ceil(iw/2)*2:ceil(ih/2)*2 /home/cltbld/tasks/task_161610527768212/build/blobber_upload_dir/browsertime-results/amazon/x11-tmp.mp4

could ffmpeg need some library or dependency which isn't there on the new 10.15 machines? browsertime pageload seems to work for firefox, just not chrome/chromium, so I suspect software works, just some config/compatibility.

:sparky, any further ideas?

Flags: needinfo?(gmierz2)

Set release status flags based on info from the regressing bug 1694185

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #3)

could ffmpeg need some library or dependency which isn't there on the new 10.15 machines? browsertime pageload seems to work for firefox, just not chrome/chromium, so I suspect software works, just some config/compatibility.

:sparky, any further ideas?

Hmm, it would be good to check it ffmpeg works on Firefox, you can do this by passing --browsertime-no-ffwindowrecorder. Based on the artifacts that I'm seeing in the chrome run, it should work fine.

That said, I think the issue is coming from here:

[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info: [browsertime.video] Writing to /home/cltbld/tasks/task_161610527768212/build/blobber_upload_dir/browsertime-results/amazon/pages/www_amazon_com/s/query-79238515/data/video/1.mp4
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info: events.js:174
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info:       throw er; // Unhandled 'error' event
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info:       ^
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info:
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info: Error: ENOENT: no such file or directory, open '/home/cltbld/tasks/task_161610527768212/build/blobber_upload_dir/browsertime-results/amazon/chromedriver.log'
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info: Emitted 'error' event at:
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info:     at lazyFs.open (internal/fs/streams.js:115:12)
[task 2021-03-18T22:09:52.979Z] 22:09:52     INFO -  raptor-browsertime Info:     at FSReqWrap.args [as oncomplete] (fs.js:140:20)

Browsertime is searching for a chromedriver.log file, but it doesn't exist. It's actually called chromedriver-1.log.gz now, though it's possible that browsertime renamed it to this so maybe this is a race condition. That said, when I try to download the chromedriver.log artifact from here, I am hit with an access denied error, maybe this is related to the /buildsread-only issue? https://firefoxci.taskcluster-artifacts.net/A86xu0mzRLO4N_sQOaC_bg/0/public/test_info/browsertime-results/amazon/chromedriver-1.log.gz

Flags: needinfo?(gmierz2)
See Also: → 1701205

I am not sure how to debug this. Can you show me where in the code chromedriver.log is created?

Flags: needinfo?(gmierz2)

The chromedriver.log file is created by chromedriver, but browsertime modifies it here: https://github.com/sitespeedio/browsertime/blob/38dbc1c1285773ef571938ea94de2b62243c64d8/lib/chrome/webdriver/chromium.js#L162-L163

Flags: needinfo?(gmierz2)
See Also: → 1704092

:sparky could you provide an update? there's interest in having macOS firefox/chrome comparisons ahead of MR2 so we may need to increase the priority here. Alternatively, I know :jmaher is restoring some of the older Apple hardware for fission so we could use that, but I'd prefer to get this fixed and avoid taking that step back.

Flags: needinfo?(gmierz2)

Increasing the priority as we've been without Chrome/Chromium results on macOS for a couple of months now. I haven't had time to access the mac mini loaner, and I'm also not entirely sure what I'd do to troubleshoot this.

:jmaher are there any OS updates we could apply to see if they help? Are there any macOS experts that might be able to help us here?

Flags: needinfo?(jmaher)
Priority: P5 → P2

I'm pretty sure the issue is that we need to accept some screen capture permissions for FFMPEG. This is the command we need to run to enable that permission programatically:

sudo sqlite3 "/Library/Application Support/com.apple.TCC/TCC.db" "insert into access (service, client, client_type, allowed, prompt_count, indirect_object_identifier, flags) values ('kTCCServiceScreenCapture', '/bin/bash', 1, 1, 1, 'UNUSED', 0)"
Flags: needinfo?(gmierz2)

:dividehex, can you run the sqlite command from comment 12 on the osx R8 pool?

Flags: needinfo?(jmaher) → needinfo?(jwatkins)

This is a no-go on the r8s

sudo sqlite3 "/Library/Application Support/com.apple.TCC/TCC.db" "insert into access (service, client, client_type, allowed, prompt_count, indirect_object_identifier, flags) values ('kTCCServiceScreenCapture', '/bin/bash', 1, 1, 1, 'UNUSED', 0)"
Error: attempt to write a readonly database
Flags: needinfo?(jwatkins)

:sparky- this isn't working on our machines- should we do more investigation, or do you have a backup technique

Flags: needinfo?(gmierz2)

:dividehex, I think that command might need to be run in the recovery mode terminal.

I'm not sure what else we could do, I'd have to look around some more. Alternatively, if someone knows of something else we could try, what we're trying to do here with that command is make sure that we don't need to obtain user permission to capture/record the screen.

Flags: needinfo?(gmierz2) → needinfo?(jwatkins)

I'll do a little more research into this but the short of it is, Apple requires an actually Person to allow application control over various things like screen capture, mic, camera, etc. This command might actually work if I login through the GUI and execute it
There is currently no way to whitelist apps preemptively via puppet or MDM. Apple only allows you to administratively deny apps preemptively.
It is very frustrating and we aren't the only ones that suffer from this Apple enforced security policy.
I did find a script that has a little more info on it and we might be able to do something on initial deployment in order to get this to work. I'll continue to look into this but mind you I am pulled in many directions these days.
https://github.com/bp88/JSS-Scripts/blob/master/TCC.db%20Modifier.sh

(In reply to Greg Mierzwinski [:sparky] from comment #16)

:dividehex, I think that command might need to be run in the recovery mode terminal.

I'm not sure what else we could do, I'd have to look around some more. Alternatively, if someone knows of something else we could try, what we're trying to do here with that command is make sure that we don't need to obtain user permission to capture/record the screen.

I don't think this will work in recovery because the filesystem this lives on is still read-only. We can still try it though.

Flags: needinfo?(jwatkins)

(In reply to Jake Watkins [:dividehex] from comment #18)

(In reply to Greg Mierzwinski [:sparky] from comment #16)

:dividehex, I think that command might need to be run in the recovery mode terminal.

I'm not sure what else we could do, I'd have to look around some more. Alternatively, if someone knows of something else we could try, what we're trying to do here with that command is make sure that we don't need to obtain user permission to capture/record the screen.

I don't think this will work in recovery because the filesystem this lives on is still read-only. We can still try it though.

:dividehex do you have an update?

Flags: needinfo?(jwatkins)
Assignee: nobody → jwatkins
Status: NEW → ASSIGNED

:sparky could we temporarily run the tests without visual metrics on macOS?

Flags: needinfo?(gmierz2)

Yup, we can definitely do that.

Flags: needinfo?(gmierz2)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚GMT from comment #19)

:dividehex do you have an update?

I haven't had a chance to work on this but I'll pick it back up on Friday. Sorry for the delay.

Flags: needinfo?(jwatkins)

I've done some testing with changing SIP and executing the direct TCC.db sql injection. Here are my findings:

  • TCC.db is protected from direct writes. It gets updated through a protected OS api. This is how mdm profiles are able to make changes within Apples limits and restrictions.
  • Preemptively whitelisting application in TCC with MDM is still not allowed for Camera, Microphone, and Screencapture per Apple's profile documentation
  • Apple really wants a human user to approve, via the GUI, applications trying to access the Camera, Microphone or Screencapture
  • The TCC.db can be directly written while booted in Recovery mode
  • It can also be written to if SIP is disabled. We do not disable SIP on Apple Mac Minis since this significantly reduces the security of the host.

With that said, it is possible for us to preemptively inject the application whitelist using the recovery mode. Since this is going to require worker downtime and coordination with datacenter remote hands, I want to verify this actually achieves the desired result and subsequent solution.

Is /bin/bash the correct application to allow?

If I setup a loaner mac mini with this TCC.db change done, is someone available to test it works before rolling out to the rest of the workers?

:dividehex, I can work with :davehunt/etc. to test on a loaner. could you instead do this and create a pool t-osx-r8-qa or -qa1, etc. Then I could test on try and if needed get in via vnc/ssh to fix stuff.

I'm going to allocate macmini-r8-126.test.releng.mdc1.mozilla.com for testing. I'll let you know the pool id as soon as QTS has finished the recovery mode/TCC.db insert.

QTS has disabled SIP on the loaner and I've handed it over to jmaher to test on.
Per slack convo with jmaher:

anyway,  I set the TCC.db allow for the ScreenCapture service
10:17
-----------------------------------------------------------------------
 Service                             | Client
-----------------------------------------------------------------------
 kTCCServiceAccessibility            | /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/AE.framework/Versions/A/Support/AEServer
 kTCCServiceDeveloperTool            | com.apple.Terminal
 kTCCServicePostEvent                | com.apple.screensharing.agent
 kTCCServiceScreenCapture            | /bin/bash
 kTCCServiceScreenCapture            | com.apple.screensharing.agent
 kTCCServiceSystemPolicyAllFiles     | /usr/libexec/sshd-keygen-wrapper
 kTCCServiceSystemPolicyAllFiles     | /usr/sbin/sshd

10:18
and I had qts disable SIP
[jwatkins@macmini-r8-126.test.releng.mdc1.mozilla.com ~]$ csrutil status
System Integrity Protection status: disabled.
10:19
That will allow you to manipulate the TCC.db directly without having to go into the recovery mode
10:20
I don't want to make disabling SIP the solution here but if you find an tcc exception that works, we can have qts go into recovery for the rest of the r8s and set it without having to completely disable SIP

I have verified via vnc and taskcluster that:

sudo sqlite3 "/Library/Application Support/com.apple.TCC/TCC.db" "select * from access where service='kTCCServiceScreenCapture'"
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629156426
kTCCServiceScreenCapture|com.apple.Terminal|0|1|1||||UNUSED||0|1629130209
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613524400

will allow chrome/chromium tests to run in CI. :dividehex will work with QTS to get this rolled out to our pool of R8's. I don't know the timeline, but I would expect this to take a week or two.

I'm picking this up since Jake will be out. I read through the backlog, and I understand the needs and where we're at.

I agree disabling SIP will be the cleanest solution from the possible hacks I can imagine here*, and I'd like to follow-up, after we get it working and into production, with testing turning SIP back on after we've enabled the screen capture. That might preserve the safety of SIP, and simulate the typical user environment.

I'll ask QTS to disable SIP in steps across the R8 pool.

  • (scripted vnc would be complicated, and a long workaround with full disk access could be brittle at future upgrades)

I'm asking QTS first to disable SIP on macmini-r8-{120..127} and I'm putting these into the r8-staging pool. Then we could run some tests to make sure it works (and how we're applying the change and get that into the configuration management bootstrap or puppet).

QTS disabled SIP on the r8-staging pool, macmini-r8-{120..127}: https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-osx-1015-r8-staging

I applied the direct tcc.db insert and verified the current state (macmini-r8-126 was the previously manually tested loaner and also has access granted to Terminal):

sudo sqlite3 "/Library/Application Support/com.apple.TCC/TCC.db" "insert into access (service, client, client_type, allowed, prompt_count, indirect_object_identifier, flags) values (\"kTCCServiceScreenCapture\", \"/bin/bash\", 1, 1, 1, \"UNUSED\", 0)"
[...]
$ hn_tail=".test.releng.mdc1.mozilla.com"; for I in {120..127}; do hn=macmini-r8-$I; ( ssh -o LogLevel=ERROR -o StrictHostKeyChecking=no -o ConnectTimeout=120 -o UserKnownHostsFile=/dev/null ${hn}${hn_tail} 'printf "%s " $(hostname) "$(csrutil status)" "$(sudo grep workerType\" /opt/worker/generic-worker.conf.yaml)"; echo; sudo sqlite3 "/Library/Application Support/com.apple.TCC/TCC.db" "select * from access where service=\"kTCCServiceScreenCapture\""                                                                                    ');  done
macmini-r8-120.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935852
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613096512
macmini-r8-121.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935886
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613097599
macmini-r8-122.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935886
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613094865
macmini-r8-123.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935886
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613150494
macmini-r8-124.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935886
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613095646
macmini-r8-125.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935886
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613095642
macmini-r8-126.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1|úÞ
                                           |||UNUSED||0|1629156482
kTCCServiceScreenCapture|com.apple.Terminal|0|1|1|úÞ
                                                    |||UNUSED||0|1629130209
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613524400
macmini-r8-127.test.releng.mdc1.mozilla.com System Integrity Protection status: disabled.   "workerType": "gecko-t-osx-1015-r8-staging", 
kTCCServiceScreenCapture|/bin/bash|1|1|1||||UNUSED||0|1629935892
kTCCServiceScreenCapture|com.apple.screensharing.agent|0|1|1||||UNUSED||0|1613095648

:jmaher could you run the chromedriver tests against the releng-hardware/gecko-t-osx-1015-r8-staging pool to verify the tcc.db change? If it works, then I'll ask QTS to disable SIP across half of the macmini r8 production pool (and we can give it a few days to make sure it is sane for everything else).

Flags: needinfo?(jmaher)

this is failing for me, but looking at your query from comment 30, I don't see com.apple.Terminal except on macmini-r8-126. Is it possible to add that in to these machines and I can try again?

Flags: needinfo?(jmaher) → needinfo?(dhouse)

:jmaher did you test through a try push or through vnc? I think terminal is only required when running it from terminal (from the gui/vnc). If it is running from terminal in the task, then I'll add terminal also.

Flags: needinfo?(dhouse) → needinfo?(jmaher)

I did the tests via try server- quite likely terminal is only needed when in with VNC. But as it stands these tests are not running successfully whereas they did on try (~2 weeks ago - code could have changed, had the terminal for screencapture permissions, maybe other adjustments to allow for vnc/loaner/etc.?)

Flags: needinfo?(jmaher)

Thanks! I'll compare with the tcc.db contents in the loaner to make sure I'm not missing something there. And I may try testing through vnc on one of these and the loaner to reproduce it (and to see if it works through vnc on these new ones but not in a task)

I copied the full /opt/worker/tasks/jmaher tree over to macmini-r8-125 and ran it there (and I re-ran it on macmini-r8-126, the loaner) and the results show ScreenCapture blocked for sshd (since I'd ssh'd in).

I'll set the tcc.db to allow from ssh and Terminal and retry.

loaner (!26):

2021-08-26 18:01:16.284575+0000 0x13c0c    Error       0xf4c5               162    0    tccd: [com.apple.TCC:access] notifyUserOfDeniedAccessBy: for  /usr/libexec/sshd-keygen-wrapper fails when requestor has UID 0
2021-08-26 18:01:16.284577+0000 0x13c0c    Default     0xf4c5               162    0    tccd: [com.apple.TCC:access] Service kTCCServiceScreenCapture does not allow prompting; returning denied.

non-loaner (125):

2021-08-26 17:48:38.276112+0000 0x2e386    Error       0x1fbb5              160    0    tccd: [com.apple.TCC:access] notifyUserOfDeniedAccessBy: for  /usr/libexec/sshd-keygen-wrapper fails when requestor has UID 0
2021-08-26 17:48:38.276113+0000 0x2e386    Default     0x1fbb5              160    0    tccd: [com.apple.TCC:access] Service kTCCServiceScreenCapture does not allow prompting; returning denied.

It worked for me when manually running from ssh (once I had added ssh as allowed for screen capture).

https://treeherder.mozilla.org/jobs?repo=try&revision=aae6d4dbf64ba32e814f5ab11845163c12976e53 I think this is the last push you'd had for chromium/browsertime on the r8's, and it went to the production pool. Maybe I'm missing the tasks that went to -staging. I'll try re-running on the r8-staging pool.

Could you try against r8-staging again?

My re-runs of task, d6FB7W9dQXiRiPxzU2IZLQ (https://treeherder.mozilla.org/jobs?repo=try&revision=aae6d4dbf64ba32e814f5ab11845163c12976e53&selectedTaskRun=d6FB7W9dQXiRiPxzU2IZLQ.0), complete (like https://firefox-ci-tc.services.mozilla.com/tasks/HWhQ3bXlQ4SOmOrmMPWbWg/runs/0) on the -staging pool.

So I think it will work without adding access for Terminal (or ssh).

Flags: needinfo?(jmaher)

Great! Then I'll ask QTS to disable SIP on a block of the R8 production pool (and I'll apply the tcc.db change to those when they're done). And I'll put the block back into the pool and we can review in a few days to may sure they're okay (for other tests).

since there isn't a rush on this, I'll ask QTS to disable SIP later this afternoon -- I got auto-banned again yesterday (and it took me a few hours to realize it combined with my cable outage. I really need to sort the auto-ban out with security), and since today is quiet/well I'll wait to ping anyone (the ban should clear this afternoon).

I asked QTS to disable SIP on the rest of the 72 r8 macminis in the IT41 rack (where the test set are). So that will apply to macmini-r8-{73..144}. That leaves macmini-r8-{{1..72},{145..288}} (about 208 to go, skipping builders).
When they're done, I'll apply the tcc.db insert for bash to screencapture and then make sure they are taking production tasks. If there are no problems, next week I'll step through the rest of the pool with qts.

QTS disabled SIP on the range of production mac workers, macmini-r8-{73..144}. I confirmed it is disabled this morning. It looks like the workers in this range are completing tasks (green), https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-osx-1015-r8.
If there are no surprise failures by Monday, then I'll ask QTS to disable SIP on another batch.

Flags: needinfo?(dhouse)

I asked QTS to disable SIP on macmini-r8-{{145..194},{201..216}} (top half of racks IT42,IT44).
Once set on those, we'll be at about half.

I've asked QTS to disable SIP on the remainder (2 racks, 144) of the r8 mac minis: macmini-r8-{{1..72},{217..288}}

QTS was faster than expected (much faster than the sla) on the other requests, but hasn't disabled sip on this last batch yet. I checked over all of the r8's, and these last two racks still have sip enabled. There are a few scattered ones that also have sip enabled (maybe the disable was incorrect or these flipped back to enabled after a crash or something. I will get them re-disabled and the tcc.db change applied).
I want to set up monitoring of the SIP status and tcc.db contents. -- or rather I should monitor for tcc blocks in the logs (will work with sip enabled, and could show new problems).

QTS finished disabling SIP across all of the R8's yesterday morning. I confirmed all but 3:
macmini-r8-78.test.releng.mdc1.mozilla.com System Integrity Protection status: enabled. "workerType": "gecko-t-osx-1015-r8",
macmini-r8-75.test.releng.mdc1.mozilla.com System Integrity Protection status: enabled. "workerType": "gecko-t-osx-1015-r8",
macmini-r8-273.test.releng.mdc1.mozilla.com System Integrity Protection status: enabled. "workerType": "gecko-t-osx-1015-r8",

I'll ask them to re-disable these and I'll apply the tcc.db change across the entire pool.

I applied the tcc.db insert across the full pool after verifying they all have sip disabled now.

Assignee: jwatkins → dhouse
Flags: needinfo?(dhouse)
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/37656bdc24fc
enable osx 10.15 tp6 tests on chrome/chromium. r=perftest-reviewers,AlexandruIonescu
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: