Closed Bug 1606981 Opened 2 years ago Closed 2 years ago

Crash in [@ core::result::unwrap_failed] with crash reason "can determine file rights: 28" in graphite font code

Categories

(Core :: Graphics: Text, defect)

Unspecified
Linux
defect
Not set
critical

Tracking


VERIFIED FIXED
mozilla74
Tracking Status
firefox-esr68 --- unaffected
firefox72 --- unaffected
firefox73 --- disabled
firefox74 + verified

People

(Reporter: dbaron, Assigned: shravanrn)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-308aef5d-8bbf-4ed8-b6a3-da0a80200104 and similar. Most of the crashes seem to occur on either wikipedia or google. My dad (who runs nightly on Linux) is seeing this crash basically 100% of the time when going to any Google page. (The above crash isn't one of my dad's, though.) Flipping the gfx.font_rendering.graphite.enabled pref to false made the crash go away.

The crashes appear to have started in this regression range between nightly builds, although I don't see anything obvious.

The top of the stack looks like this:

0 	libxul.so 	RustMozCrash 	mozglue/static/rust/wrappers.cpp:17 	context
1 	libxul.so 	mozglue_static::panic_hook 	mozglue/static/rust/lib.rs:89 	cfi
2 	libxul.so 	core::ops::function::Fn::call 	src/libcore/ops/function.rs:69 	cfi
3 	libxul.so 	std::panicking::rust_panic_with_hook 	src/libstd/panicking.rs:477 	cfi
4 	libxul.so 	std::panicking::continue_panic_fmt 	src/libstd/panicking.rs:380 	cfi
5 	libxul.so 	rust_begin_unwind 	src/libstd/panicking.rs:307 	cfi
6 	libxul.so 	core::panicking::panic_fmt 	src/libcore/panicking.rs:85 	cfi
7 	libxul.so 	core::result::unwrap_failed 	src/libcore/result.rs:1165 	cfi
8 	libxul.so 	<lucet_wasi::fdentry::FdEntry as std::sys::unix::ext::io::FromRawFd>::from_raw_fd 	third_party/rust/lucet-wasi/src/fdentry.rs:24 	cfi
9 	libxul.so 	lucet_wasi::ctx::WasiCtxBuilder::raw_fd 	third_party/rust/lucet-wasi/src/ctx.rs:124 	cfi
10 	libxul.so 	lucet_wasi::ctx::WasiCtxBuilder::inherit_stdio 	third_party/rust/lucet-wasi/src/ctx.rs:62 	cfi
11 	libxul.so 	lucet_load_module 	third_party/rust/rlbox_lucet_sandbox/src/create.rs:28 	cfi
12 	libxul.so 	rlbox::rlbox_lucet_sandbox::impl_create_sandbox(char const*, bool) 	third_party/rust/rlbox_lucet_sandbox/include/rlbox_lucet_sandbox.hpp:481 	cfi
13 	libxul.so 	auto rlbox::rlbox_sandbox<rlbox::rlbox_lucet_sandbox>::create_sandbox<char const*, bool>(char const*, bool) 	third_party/rlbox/include/rlbox_sandbox.hpp:354 	cfi
14 	libxul.so 	gfxFontEntry::GrSandboxData::GrSandboxData() 	gfx/thebes/gfxFontEntry.cpp:619 	cfi
15 	libxul.so 	gfxFontEntry::GetGrFace() 	gfx/thebes/gfxFontEntry.cpp:703 	cfi
16 	libxul.so 	gfxFontEntry::HasGraphiteSpaceContextuals() 	gfx/thebes/gfxFontEntry.cpp:777 	cfi
17 	libxul.so 	bool gfxFont::SplitAndInitTextRun<char16_t>(mozilla::gfx::DrawTarget*, gfxTextRun*, char16_t const*, unsigned int, unsigned int, mozilla::unicode::Script, mozilla::gfx::ShapedTextFlags) 	gfx/thebes/gfxFont.cpp:3024 	cfi
18 	libxul.so 	void gfxFontGroup::InitScriptRun<char16_t>(mozilla::gfx::DrawTarget*, gfxTextRun*, char16_t const*, unsigned int, unsigned int, mozilla::unicode::Script, gfxMissingFontRecorder*) 	gfx/thebes/gfxTextRun.cpp:2511 	cfi
19 	libxul.so 	gfxFontGroup::MakeTextRun(char16_t const*, unsigned int, gfxTextRunFactory::Parameters const*, mozilla::gfx::ShapedTextFlags, nsTextFrameUtils::Flags, gfxMissingFontRecorder*) 	gfx/thebes/gfxTextRun.cpp:2290 	cfi
20 	libxul.so 	BuildTextRunsScanner::FlushFrames(bool, bool) 	layout/generic/nsTextFrame.cpp:1646 	cfi
21 	libxul.so 	nsTextFrame::EnsureTextRun(nsTextFrame::TextRunType, mozilla::gfx::DrawTarget*, nsIFrame*, nsLineList_iterator const*, unsigned int*) 	layout/generic/nsTextFrame.cpp:2979 	cfi
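Frame 7 (core::result::unwrap_failed) is the shared panic path Rust takes when .unwrap() or .expect() is called on an Err value, and the crash reason "can determine file rights: 28" has exactly the shape such a panic produces: the expect message followed by the Debug form of the error. A minimal sketch of that mechanism (the function below is a hypothetical stand-in, not the actual lucet-wasi code):

    // Hypothetical stand-in for the failing probe in lucet-wasi's
    // FdEntry::from_raw_fd; the real code inspects the descriptor, this
    // just returns an error value chosen to mirror the reported reason.
    fn determine_file_rights(_fd: i32) -> Result<(), i32> {
        Err(28)
    }

    fn main() {
        // On Err, .expect() routes through core::result::unwrap_failed and
        // panics with the message "can determine file rights: 28".
        determine_file_rights(0).expect("can determine file rights");
    }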

This is by far the most common crash on today's Linux nightlies.

Bug 1569370 seems like the most likely regressing bug, but that's solely conjecture.

Regressed by: 1569370

We backed out the graphite change because it seemed like a bad thing to have landed during the soft freeze, though this is something else that would be important to fix before we reland.

Shravan, can you look at this? I can send you, offline, the particular URLs of pages that crash, if dbaron's STR don't produce anything for you.

Flags: needinfo?(shravanrn)

Yup, looking at this now. @Nathan - if you can send me some URLs that crashed this, that would help as well.

Flags: needinfo?(shravanrn)
Assignee: nobody → shravanrn

So I just checked the nightly release; it works fine on my machine, which suggests this is causing issues in some environments but not others. The crux of the issue is the step that redirects the sandbox's stdin, stdout and stderr to the host's standard streams... We have 2 options here:

  1. A simple fix - bypass the issue by not redirecting the io. Any printfs inside graphite would then no-op (effectively be redirected to /dev/null). This fix should be ok for the graphite use case; however, we may decide to sandbox some library in the future that needs to print to stdout or stderr.

  2. Fix the io redirection errors - For this, I think we would need to find an environment where we can recreate the errors. I am not able to repro the issue on try or my machine.
    @Nathan - if you have a machine where you are able to repro this, let me know

@Nathan - I am leaning towards the simple fix as the easiest option for now, and can investigate better fixes should the need arise. Thoughts?
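For reference, a rough illustration of the /dev/null idea from option 1 above. This is only a sketch of the concept; the actual rlbox/lucet changes wire this up inside the sandbox's WASI context rather than at the process level:

    use std::fs::OpenOptions;
    use std::os::unix::io::AsRawFd;

    fn main() -> std::io::Result<()> {
        // Instead of handing the sandbox the process's real fds 0/1/2 (which
        // requires probing them with stat/fcntl during setup), point the
        // sandbox's stdout/stderr at /dev/null so any printf inside the
        // sandboxed library is discarded.
        let null_out = OpenOptions::new().write(true).open("/dev/null")?;
        let null_err = OpenOptions::new().write(true).open("/dev/null")?;

        // These raw fds are what would be installed into the sandbox's
        // stdio slots in place of the inherited descriptors.
        println!("sandbox stdout -> fd {}", null_out.as_raw_fd());
        println!("sandbox stderr -> fd {}", null_err.as_raw_fd());
        Ok(())
    }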

Flags: needinfo?(nfroyd)

(In reply to Shravan Narayan from comment #5)

  1. Fix the io redirection errors - For this, I think we would need to find an environment where we can recreate the errors. I am not able to repro the issue on try or my machine.

You could try installing a graphite font and selecting it as your default Firefox font -- that might improve the chance of reproducing the issue. E.g. install the Charis SIL font family (available from https://software.sil.org/charis/download/ if your distro doesn't provide it as a package), and configure the Firefox font preferences to use it as the default.

@Jonathan - Unfortunately, this error is not graphite-specific... rather, it is an error in the wasm sandbox setup in some environments. It occurs during sandbox creation, before we load the graphite code. I tested the sandbox creation code path on my machine by visiting the graphite micro-benchmark you had sent earlier... My best guess right now (based on the fact that we are seeing different behavior in different environments) is that the wasm sandbox runtime is assuming behavior that certain OS/kernel combinations may not provide, or perhaps there is some interaction of the Linux sandbox, wasm runtime and OS kernel versions... I will look into this further. For now, I have submitted a fix that just disables IO redirection.

Fwiw, I will also try your suggested approach to see what happens.

Duplicate of this bug: 1606977

@Jonathan - setting Charis as the default does not repro the issue on my machine.

Summarizing and updating all my previous posts.

Problem Description

  • After analyzing the crashes, it appears that the failure occurs in the sandbox creation step, during the process of redirecting the sandboxed code's io to the host's stdio.
  • The redirection involves syscalls to statx or stat64 and subsequently fcntl on the file descriptors 0, 1 and 2.
  • The crash itself occurs in one of the calls to either statx or stat64.
  • These crashes seem to occur selectively in some environments but not others. For instance, I have been unable to repro this issue on my machine after multiple repeated attempts, including with Jonathan's earlier suggestion.
  • After looking at some of the crashes, at first glance there is no clear, consistent OS or kernel version that produces this crash. Thus, I suspect the most likely reason is either (a) these syscalls are disallowed by the seccomp sandbox, but a race between enabling seccomp and creating the wasm sandbox permits them on some machines, or (b) it does have something to do with kernel versions, and some kernels implement seccomp sandboxes in a way that does not permit these syscalls. (See the sketch just below for the kind of probing involved.)
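The sketch below approximates that probing step: it checks each of fds 0, 1 and 2 with stat and fcntl and reports the errno on failure (this assumes the libc crate; depending on platform and libc version, the underlying syscall may be statx, stat64 or fstat). Under a seccomp policy that blocks these syscalls, the calls would fail, typically with EPERM, or the policy could kill the process outright.

    // Diagnostic approximation of the stdio probing done during sandbox
    // creation; not the actual lucet-wasi code.
    fn main() {
        for fd in 0..3 {
            let mut st: libc::stat = unsafe { std::mem::zeroed() };
            if unsafe { libc::fstat(fd, &mut st) } != 0 {
                eprintln!("fd {}: fstat failed: {}", fd, std::io::Error::last_os_error());
                continue;
            }
            let flags = unsafe { libc::fcntl(fd, libc::F_GETFL) };
            if flags < 0 {
                eprintln!("fd {}: fcntl(F_GETFL) failed: {}", fd, std::io::Error::last_os_error());
            } else {
                eprintln!("fd {}: mode {:o}, status flags {:#x}", fd, st.st_mode, flags);
            }
        }
    }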

Possible Fixes

  • As described earlier, there are 2 fixes possible here --- a simple fix that just disables stdio redirection (which means printf inside the wasm sandbox will no-op), and a more complicated fix that changes the sandbox creation code to tread more carefully around potentially disallowed syscalls (see the sketch after this list).
  • I have updated the upstream rlbox library to support applying either the simple or the complicated fix by passing a parameter during sandbox creation.
  • For the moment, the phab commit I have submitted uses the simple approach of disabling stdio redirection, as this seems the safest path going forward. I believe this should be sufficient to fix the current set of crashes.
  • Unfortunately, until we have a clear way of testing the complicated fix, there is no immediate way to know whether it operates correctly in all environments. I think we may have to investigate more approaches to repro this issue in a test environment.
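A sketch of the direction the complicated fix could take --- this is an assumption about the approach, not the landed patch: treat a blocked rights probe as "leave that descriptor detached" instead of unwrapping the error and crashing. It again assumes the libc crate, and probe_fd_rights / try_inherit_fd are hypothetical names:

    use std::io;

    // Hypothetical probe standing in for the stat/fcntl-based rights check
    // the sandbox setup performs on a descriptor.
    fn probe_fd_rights(fd: i32) -> io::Result<i32> {
        let flags = unsafe { libc::fcntl(fd, libc::F_GETFL) };
        if flags < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(flags)
        }
    }

    // If the probe is disallowed (e.g. by seccomp), degrade gracefully by
    // not wiring up that descriptor instead of panicking.
    fn try_inherit_fd(fd: i32) -> Option<i32> {
        match probe_fd_rights(fd) {
            Ok(flags) => Some(flags),
            Err(e) => {
                eprintln!("fd {}: rights probe failed ({}); leaving it detached", fd, e);
                None
            }
        }
    }

    fn main() {
        for fd in 0..3 {
            let _ = try_inherit_fd(fd);
        }
    }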

@Nathan - If the above writeup makes sense, and the phab submission looks ok, could we look into landing the above fixes? Also, any thoughts on repro'ing the crash so we can test the complicated fix are welcome.

In case it's relevant, the machine where my dad saw this is running Fedora 30.

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁰󠁡󠁿 ⌚UTC-5 from comment #12)

In case it's relevant, the machine where my dad saw this is running Fedora 30.

Oh interesting. This may be useful to test. Will follow up after testing in the next 24 hours or so.

I'm seeing the issue on my computer, using Ubuntu 18.04 (Linux 4.15.0-72-generic).

(Figured I'd share my info, if OS versions are interesting.)

Thanks for sharing. I have successfully repro'd the issue on Fedora 31. I will test my fixes here, and follow up shortly.

Just wanted to post a quick update. It turns out that some of the crashes here may have been due to 2 different bugs.

  1. A crash with the message "can determine file rights: 28"
  2. A crash without the above message (Bug 1606977 and Bug 1607278)

We have successfully repro'd and resolved the 2nd issue above, and the fix will be applied shortly. However, we are still having some trouble repro'ing issue 1 above, and I am continuing to investigate it.

@David Baron, @Dan Wolff: If possible, could you please confirm whether the crashes you are seeing on your various machines (@David: if possible, including your dad's machine) print the message "can determine file rights: 28"? This would be tremendously useful for my efforts.
You would need to get the relevant nightly with this issue from here. Please watch out --- the nightly will try to auto-update immediately, and may update to a version with this commit backed out upon restart.

Flags: needinfo?(dbaron)
Flags: needinfo?(danweiss)

(In reply to Shravan Narayan from comment #16)

Just wanted to post a quick update. It turns out that some of the crashes here may have been due to 2 different bugs.

  1. A crash with the message "can determine file rights: 28"
  2. A crash without the above message (Bug 1606977 and Bug 1607278)

We have successfully repro'd and resolved the 2nd issue above, and the fix will be applied shortly. However, we are still having some trouble repro'ing issue 1 above, and I am continuing to investigate it.

@David Baron, @Dan Wolff: If possible, could you please confirm whether the crashes you are seeing on your various machines (@David: if possible, including your dad's machine) print the message "can determine file rights: 28"? This would be tremendously useful for my efforts.
You would need to get the relevant nightly with this issue from here. Please watch out --- the nightly will try to auto-update immediately, and may update to a version with this commit backed out upon restart.

I am David's dad. I just tested the version you described, and it crashed when I tried to access gmail.com. (This was what caused the crash initially.) Thus, this version is not fully fixed. (David cannot test this. He was visiting us in Philadelphia when he reported the bug but has now returned to California.)

I assume you will let me know if a crash report would help. I might have it.

Flags: needinfo?(dbaron)

Hi Jonathan Baron,
Thanks a bunch for testing this! Yes, the link I provided is the original nightly, so it is definitely expected to crash. However, I am looking for one piece of information about the crash --- in particular, if you launch Firefox from a terminal and visit Gmail, does it crash with the message "can determine file rights: 28" or does it crash without an error message? Please let me know. Alternately, a crash report would be fine too, as I can determine this info from there as well.

Flags: needinfo?(baron)

(In reply to Jonathan Baron from comment #17)

I am David's dad. I just tested the version you described, and it crashed when I tried to access gmail.com. (This was what caused the crash initially.) Thus, this version is not fully fixed. (David cannot test this. He was visiting us in Philadelphia when he reported the bug but has now returned to California.)

Thank you for testing this!

I assume you will let me know if a crash report would help. I might have it.

If you go to about:crashes, you'll see a record of all the crashes Firefox has submitted to date. Ideally, your most recent crash (which presumably is the attempt to run Shravan's build) will be in that list. You should also be able to click the "View" button associated with the crash, which will open a crash-stats.mozilla.org page that you can then link here.

Hello, my email was added to this bug by accident.

Flags: needinfo?(danweiss)

Sorry about that! Have fixed the needinfo.

Flags: needinfo?(326374)

(In reply to Nathan Froyd [:froydnj] from comment #19)

(In reply to Jonathan Baron from comment #17)

I am David's dad. I just tested the version you described, and it crashed when I tried to access gmail.com. (This was what caused the crash initially.) Thus, this version is not fully fixed. (David cannot test this. He was visiting us in Philadelphia when he reported the bug but has now returned to California.)

Thank you for testing this!

I assume you will let me know if a crash report would help. I might have it.

If you go to about:crashes, you'll see a record of all the crashes Firefox has submitted to date. Ideally, your most recent crash (which presumably is the attempt to run Shravan's build) will be in that list. You should also be able to click the "View" button associated with the crash, which will open a crash-stats.mozilla.org page that you can then link here.

I submitted both crash reports. And, indeed, the reason for the crash is "can determine file rights: 28".

Sorry for not answering that question before.

The last report had this label:
ID: 5c83b606-10d7-47d3-a666-e13c90200107
Signature: core::result::unwrap_failed

Thanks a bunch @Jonathan Baron! Investigating further now.

Flags: needinfo?(nfroyd)

(In reply to Shravan Narayan from comment #16)

Just wanted to post a quick update. It turns out that some of the crashes here may have been due to 2 different bugs.

  1. A crash with the message "can determine file rights: 28"
  2. A crash without the above message (Bug 1606977 and Bug 1607278)

We have successfully repro'd and resolved the 2nd issue above, and the fix will be applied shortly. However, we are still having some trouble repro'ing issue 1 above, and I am continuing to investigate it.

@David Baron, @Dan Wolff: If possible, could you please confirm whether the crashes you are seeing on your various machines (@David: if possible, including your dad's machine) print the message "can determine file rights: 28"? This would be tremendously useful for my efforts.
You would need to get the relevant nightly with this issue from here. Please watch out --- the nightly will try to auto-update immediately, and may update to a version with this commit backed out upon restart.

The bug that I've seen previously is Bug 1606977. All my crashes seem to contain the message "can determine file rights: 28".

Using the linked binary (20200103202240), I'm unable to reproduce the issue. I'm also unable to reproduce it with the current nightly (20200107095722).

In case it's helpful, here's a list of all tab crashes that I've encountered, in date order:
[links to 50 crash reports omitted]

Flags: needinfo?(326374)

Thanks, Dan, for the detailed report! This is definitely useful.

Using the linked binary (20200103202240), I'm unable to reproduce the issue.

Oh, that is very curious, because that was the original unmodified nightly that caused the issue. It should have had identical behavior to before. This does indicate that it is challenging to repro this issue.

I'm also unable to reproduce it with the current nightly (20200107095722).

Yup, this change has been backed out temporarily.

Separately, I have failed to repro this issue on stock:

  • Ubuntu 19.10
  • Ubuntu 18.04 LTS
  • Fedora 31
  • Debian 9 "stretch" with a backports kernel
  • openSUSE 15.1
  • CentOS 7.7

Continuing to investigate...

Hi all,
I think I need your help again. Unfortunately, despite testing in multiple environments, I am unable to repro this issue and therefore unable to test my fix.

Therefore, I have built release versions of Firefox with two different fixes for this issue. It would be extremely useful if anyone seeing this issue could test them. @Jonathan Baron, @Dan Wolff, any help would be much appreciated here!

To test, simply download the two Firefox builds listed below and browse the same websites that caused the crash.

As before, please watch out --- the nightly will try to auto-update immediately, and may update to a version with this commit backed out upon restart.

Build 1 (NoStdio) - Download here

Build 2 (WithStdio) - Download here

Please let me know whether both these builds are ok, crash, etc.

Flags: needinfo?(baron) → needinfo?(326374)
Flags: needinfo?(baron)

(In reply to Shravan Narayan from comment #26)

Hi all,
I think I need your help again. Unfortunately, despite testing in multiple environments, I am unable to repro this issue and therefore unable to test my fix.

Therefore, I have built release versions of Firefox with two different fixes for this issue. It would be extremely useful if anyone seeing this issue could test them. @Jonathan Baron, @Dan Wolff, any help would be much appreciated here!

To test, simply download the two Firefox builds listed below and browse the same websites that caused the crash.

As before, please watch out --- the nightly will try to auto-update immediately, and may update to a version with this commit backed out upon restart.

Build 1 (NoStdio) - Download here

Build 2 (WithStdio) - Download here

Please let me know whether both these builds are ok, crash, etc.

I just tested both of these and neither of them replicates the crash with gmail.com. I cannot promise that nothing else has changed on my computer, as I did update a few RPMs since the last time I tried this, but it seems likely that both of these succeed in fixing the original problem, if that is possible at all.

Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/be971bbc036c
Fix crash in IO redirection in wasm sandbox libs r=froydnj

On 2020-01-03 I could easily reproduce this crash on Ubuntu 19.10 with build 20200103094738 in a new profile when visiting Wikipedia and was able to avoid it by setting gfx.font_rendering.graphite.enabled = false. Today I can no longer reproduce it at all with the same build in a new profile on the same system.

I tried visiting a snapshot of the Wikipedia front page from that day, switching back to the kernel at the time (5.3.0-24-generic) and rolling back an OpenSymbol font update from 6.3.4 to 6.3.3 but still no crash. I cannot explain why it is working now. I tried the two supplied builds in Comment 26 and there were no crashes.

@Kestrel: thanks for that information. That is very helpful, as it clearly indicates we have to do some additional investigation or monitoring when we eventually roll out this build.

Out of curiosity, for people who are affected by this bug, what does ls -l /dev/null say on the machine that's exhibiting these crashes?

(In reply to Nathan Froyd [:froydnj] from comment #31)

Out of curiosity, for people who are affected by this bug, what does ls -l /dev/null say on the machine that's exhibiting these crashes?

I don't know if I still count as "affected by this bug". But it says "/dev/null", without the quotes.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74

(In reply to Nathan Froyd [:froydnj] from comment #31)

what does ls -l /dev/null say on the machine that's exhibiting these crashes?

crw-rw-rw- 1 root root 1, 3 Jan 10 13:30 /dev/null
Flags: qe-verify-
Flags: qe-verify- → qe-verify+

Reproduced the initial crash using an old Nightly on Ubuntu 18.04 64-bit: https://crash-stats.mozilla.org/report/index/a2945c96-0838-490b-b355-4b4230200331
Verified that the crash does not occur anymore on Fx 74.0 and Fx 75.0b11 on Ubuntu 18.04 64-bit, Windows 10 64-bit and macOS 10.15.4.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(326374)
Flags: needinfo?(jonathanbaron7)