many wpt reftests fail on new android 14.0 emulator
Categories
(Testing :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: jmaher, Assigned: jmaher)
References
(Depends on 4 open bugs)
Details
Attachments
(4 files, 1 obsolete file)
here is a try push showing many wpt reftest failures. These seem to have small differences (nothing visually different).
The big change here is we run a new version of the android emulator, from android os version 7.0 -> 14.0.
We did change the resolution as well.
I would like to know if there is anything to change on the emulator, or if we should just add a few hundred fuzzy conditions.
Updated•10 months ago
|
Comment 1•10 months ago
|
||
They appear to mostly (all? Treeherder stopped working for me halfway through looking...) be the "g" character in the text. Perhaps some font smoothing setting change in newer android versions, or some such. Seems to me like updating the fuzzy annotations is fine
| Assignee | ||
Comment 2•10 months ago
|
||
I had :jfkthame look at the failures originally- there were no specific issues with the fonts (antialiasing, smoothing, etc.). I will leave this open for a day or two in case there are other thoughts- then work to get the fuzzy annotations in a patch.
Updated•10 months ago
|
| Assignee | ||
Comment 3•10 months ago
|
||
following up here I have annotated most of the failures. This was really frustrating despite tools to do the work, it required >100 try pushes. The pattern I see is that we render tests correctly, then after some number of tests, we fail to render a few pixels. The browser restarts and the next test runs green and so forth.
After adjusting the failing test, it doesn't cause the browser to restart, so the "next test" that ran green as the first test in the browser is now failing.
This gets even more frustrating because the majority of the fuzzy is 14 pixels, but if i have a bunch of them in a single browser session many start rendering not only the g, but the , and sometimes the y incorrectly- this increases the fuzzy pixels (21, 29, 31, 34, 39, 56, 64). In some cases with >1 g in the text, it would default to more pixels on the first time it failed.
either way, I have 26 failures that seem to be more problematic than just a misrendered character. You can see this on a try push (note: some are simple g characters, but most are more involved)
:jnicol/:gw, mind taking a look at ^ try push and seeing if anything stands out?
| Assignee | ||
Comment 4•10 months ago
|
||
Updated•10 months ago
|
Comment 5•9 months ago
|
||
I agree this seems quite problematic. Both the remaining failures, but also the behaviour you describe where restarting the browser affects which tests reproduce the bug. I'll have a think if there's anything I can think of to try, but I'm pretty stumped to be honest.
| Assignee | ||
Comment 6•9 months ago
|
||
I can accept the other failures- it is frustrating, but very green with the patch (thanks to jgraham for the r+). I can land it, but it will be a pain to cleanup if we determine there is a fix to the emulator or browser to fix this.
The newer failures, I wanted an eye on them since they are not just rending part of a letter and potentially more involved. Maybe this is ok, the number is smaller, but some are way off.
I will wait a day before landing the 1000+ files that adjust the dangling g and related text rendering. Once landed I can repush and get the new failures- this buys some time for other eyes on the problem.
Comment 7•9 months ago
|
||
No real ideas from me, sorry. Were you able to confirm if it was related to the emulator version change or the resolution change you mentioned?
Comment 8•9 months ago
|
||
One notable datapoint is that this seems to only affect wpt reftests, not gecko reftests. Those are supposed to be doing rather similar things, but any differences in the configuration or reftest window setup might be useful points of investigation.
Comment 9•9 months ago
|
||
I've been digging into this a little but haven't found any answers yet. I bisected the emulator version running these tests locally, and at least when running css/CSS2/backgrounds folder, from android 12 onwards the css/CSS2/backgrounds/background-repeat-applies-to-* tests have the issue with the "g", whereas 11 and earlier do not.
I started reducing the list of tests running in the css/CSS2/backgrounds dir, and managed to get it down to just these:
css/CSS2/backgrounds/background-image-003.xht
css/CSS2/backgrounds/background-position-001.xht
css/CSS2/backgrounds/background-position-002.xht
css/CSS2/backgrounds/background-position-applies-to-001a.xht
css/CSS2/backgrounds/background-repeat-applies-to-001.xht
running those tests is enough to reproduce the "g" issue in background-repeat-applies-to-*. removing any of the earlier ones makes those tests pass. This doesn't necessarily mean these are the only set of tests I could have reduced it to, perhaps if I'd started removing tests in a different order I would have got a different result.
background-image-003.xht uses the same reference test as background-repeat-applies-to-* (ref-filled-green-100px-square.xht). removing the <strong> from these avoids the bug.
Strangely removing the contents of the background-position-00[12].xht tests, ie an empty <body>, does not avoid the bug. but skipping the tests entirely does. removing the contents of background-position-applies-to-001a.xht does however avoid the bug.
very strange
Comment 10•9 months ago
|
||
I can reduce the background-repeat-applies-to-001.xht test and its reference to just have the "g" character and it still reproduces.
Running the above list of tests, we first load the green square reference. we rasterize a g in webrender.
the next few tests do something that means when we get around to the final test we are using a new fontkey with webrender. which means we must rerasterize, and there is a slight discrepancy in the glyph.
I'm not sure whether slight discrepancies are to be expected, or whether this is a bug perhaps in the freetype on more recent android versions. In Android 11, we also get a new font key for the final test, but it appears to rasterize identically.
One thing I noticed looking at this is we only appear to load the reference test ref-filled-green-100px-square.xht once, for the first test. But we don't load it again for the last test. Does WPT therefore cache the result of the first load? Perhaps that would explain why we only see this in WPT reftests and not gecko reftests? Even if we periodically re-rasterize and get a different result, as long as the test and the reference page use the same cached glyph we wouldn't notice... James, does WPT do that?
| Assignee | ||
Comment 11•9 months ago
|
||
I am holding off on landing anything. I had tried some other work with just adding a blanket fuzzy statement in __dir__.ini, but this was not working as expected- many new tests were failing which was confusing.
thanks :jnicol for looking into this. If there are tweaks to make to the OS image let me know- if we need to uninstall/install fonts, adjust resolution or run with other flags to invoke different graphics modes. Likewise if we suspect there are resource issues, I can try running on a beefier instance.
For reference there are 4 test suites that are problematic:
- wpt reftest
- wpt wdspec
- xpcshell
- junit
the original reftest harness runs just fine, wpt regular tests run just fine, mochitest is just fine (when I say just fine, only a few failures overall). Not sure if this additional information is useful or not, but it is good to know some stuff works as expected, and this isn't jus wpt, xpcshell has a lot of issues.
Comment 12•9 months ago
|
||
I assume the other failures you refer to I can see in the initial try push you linked to, and removing the wpt filter. I imagine they have a different root cause to the wpt reftests which mostly seem font related..
I pulled the /system/fonts/Roboto-Regular.ttf file from an Android 11 image, and replaced the file on the Android 12 or later emulator emulator which we're seeing the bug on. And the tests now pass! There's definitely more left to figure out here, but in theory could we do something like that for our CI emulators?
Comment 13•9 months ago
|
||
Lee, we wouldn't expect rasterizing the same glyph from the same TTF file (though loaded as a separate font) with the same font instance options to produce different pixels, would we? Does that sound like a bug in freetype?
Comment 14•9 months ago
|
||
(In reply to Jamie Nicol [:jnicol] from comment #13)
Lee, we wouldn't expect rasterizing the same glyph from the same TTF file (though loaded as a separate font) with the same font instance options to produce different pixels, would we? Does that sound like a bug in freetype?
We wouldn't expect that, no, nor would we even expect such a bug in FreeType. It would be more likely to be some sort of latched state we're setting, or state that is leaking through in options elsewhere, like maybe a variation config?
Jonathan, does this sound maybe variation related?
Comment 15•9 months ago
|
||
The discrepancy in the tail of the "g" in the descriptive text on so many of those tests is such a consistent type of failure that there must be something systematic underlying it.
Note that many of these WPT reftests share a common "reference" file (such as the 100px green square with the line of descriptive text above it), so what may be happening is that the "wrong" rasterization is happening to the reference, and then this results in a whole bunch of tests that compare against the same screenshot reporting the same "Maximum difference per channel 4, 14 pixels differ" failure; i.e. because of the shared reference file, a single bad rasterization operation could result in many reported failures.
It seems unlikely IMO that this is related to variations or other font instance options; if there were something like that, I think we'd see much more widespread rasterization differences, not limited to just one small edge of one glyph.
Jamie, your comment 12 is interesting. Could you please attach the two versions of Roboto-Regular.ttf involved there -- one which results in this kind of failure, and one that doesn't? I'd be interested to see what kinds of changes there have been in the font itself.
| Assignee | ||
Comment 16•9 months ago
|
||
| Assignee | ||
Comment 17•9 months ago
|
||
| Assignee | ||
Comment 18•9 months ago
|
||
:jfkthame, I have attached the old android 7 roboto .ttf as well as the new android 14 version as attachments to this bug. Hopefully this can help.
:jnicol, I am trying to figure out how to include this new file on the emulators. It might as simple as a hack during setup of the test, but I would rather find something less hacky.
Comment 19•9 months ago
|
||
Note that many of these WPT reftests share a common "reference" file (such as the 100px green square with the line of descriptive text above it), so what may be happening is that the "wrong" rasterization is happening to the reference, and then this results in a whole bunch of tests that compare against the same screenshot reporting the same "Maximum difference per channel 4, 14 pixels differ" failure; i.e. because of the shared reference file, a single bad rasterization operation could result in many reported failures.
I think it's the other way around: the glyphs for reference file get rasterized correctly, and the tests pass to start with. then waves hands something happens and we start rasterizing the glyph incorrectly and a test fails. we restart the browser in response to the failure and the cycle repeats. this explains the behaviour Joel described in comment 3. (though I guess "correct" and "incorrect" are just labels and the effect would be identical if they start out rasterizing "incorrectly" and then change to "correctly" mid run.).
Comment 20•9 months ago
|
||
Jamie, your comment 12 is interesting. Could you please attach the two versions of Roboto-Regular.ttf involved there -- one which results in this kind of failure, and one that doesn't? I'd be interested to see what kinds of changes there have been in the font itself.
I don't currently have access to the computer which had these, but I'll post these Monday, I seem to recall the filesize was significantly larger in the android 12 version,
Comment 21•9 months ago
|
||
Comment 22•9 months ago
|
||
Here are the android 11 (good) and android 12 (bad) versions of Roboto-Regular.ttf
Comment 23•9 months ago
|
||
Thanks. OK, so that confirms we're looking at a major update of Roboto: this is when they switched from the static font (presumably android 11 also has a bunch of other styles like Roboto-Bold, Roboto-Condensed, etc., etc) to a variable font with width, weight, and italic axes. It's the variable-font version that is exhibiting this apparent unreliability in rasterization.
My current hunch, though it may be hard to confirm, is that this may be a FreeType bug: maybe something like the FT auto-hinting code is behaving slightly unpredictably, perhaps depending on a not-reliably-initialized value somewhere.
If someone has the cycles to investigate further, something I'd be inclined to try would be to install this version of Roboto on a Linux system, configure it as the default font, and run tests under valgrind to see if it flags anything. (I suppose it would also be important to ensure that the same version of FreeType is on the test system as we're using on android.) I don't have time to pursue this at the moment, unfortunately.
Another thing perhaps worth trying would be to pull the latest FreeType trunk code, as it looks like there have been a number of recent commits that may have some relevance to TrueType rasterization.
Comment 24•9 months ago
|
||
Does WPT therefore cache the result of the first load? Perhaps that would explain why we only see this in WPT reftests and not gecko reftests?
Yes, wpt-reftests do do that. https://treeherder.mozilla.org/jobs?repo=try&revision=89c0ae862dfd618847ba30b26d9092435c0ba545 is a try push with that caching disabled for Android 14.
Comment 25•9 months ago
|
||
(In reply to Jonathan Kew [:jfkthame] from comment #23)
Another thing perhaps worth trying would be to pull the latest FreeType trunk code, as it looks like there have been a number of recent commits that may have some relevance to TrueType rasterization.
I gave this a shot. A try run with current FreeType master from upstream at https://treeherder.mozilla.org/jobs?repo=try&searchStr=reftest&revision=8111114af06c20157110e34a6d985bb8cccd55c6 doesn't appear to suffer from the "tail of g" discrepancy, whereas on https://treeherder.mozilla.org/jobs?repo=try&searchStr=reftest&revision=9d7c2ad09d33250d95fb525a102fe7662ae71b9c, which is the same thing but without the freetype update, there are lots of those failures.
Unfortunately, something else is broken with the freetype update, though: the debug build crashes with a fatal assertion arising inside webrender at wr_glyph_rasterizer::platform::unix::font::FontContext::load_glyph (I think in response to a freetype API call failing), and while the opt build doesn't crash, it has failures that look like font variations aren't being successfully applied.
So this isn't an immediate solution, but does suggest that when a stable new FreeType version is available, it might help resolve this.
| Assignee | ||
Comment 26•9 months ago
|
||
(In reply to James Graham [:jgraham] from comment #24)
Does WPT therefore cache the result of the first load? Perhaps that would explain why we only see this in WPT reftests and not gecko reftests?
Yes, wpt-reftests do do that. https://treeherder.mozilla.org/jobs?repo=try&revision=89c0ae862dfd618847ba30b26d9092435c0ba545 is a try push with that caching disabled for Android 14.
this try push looks to have failures that are not related to the g, and most of the failures look familiar for my "other set" of issues. :jgraham, is it realistic to add a cli option to wpt so we can --no-cache-ref-data or something like that?
Comment 27•9 months ago
|
||
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=aXBV8ZZxQDabzy11v3LnAg.0&revision=4274e2d590ec24d609c91af332b9502b617c4627 is a further try push with the metadata updated.
:jmaher - I think we could land this pretty much as-is with a coded in exception for Android 14, but I could also make it a command line option and let the task configuration control when to skip caching.
| Assignee | ||
Comment 28•9 months ago
|
||
I will leave it up to you :jgraham, if it is hardcoded then it will likely be on future android emulators, so android os_version >= 14.
if we go the taskcluster config, we could adjust the extra command line args to be by-test-platform and add something for android14.
thanks for finding this. I am happy to review any pieces of this.
Comment 29•9 months ago
|
||
So to sum up, the "g" issue appears it will be fixed by a future freetype update. and in the meantime we can disable caching of the WPT reftest reference images. this avoids needing to make hundreds of annotations, which would be hard to undo (eg when we update freetype).
so we have a path forward for the "g" issue. but there are still the other set of failures, eg from the push in comment 3, ignoring the few "g" failures that are still present there. and these also seem to be present in James' run with caching disabled, and (though perhaps a few more are fixed in this) Jonathan's run with updated freetype.
| Assignee | ||
Comment 30•9 months ago
|
||
thanks, that is right. half of the remaining failures are unexpected-pass :) the other half is a much smaller number- I would like eyes on it, but probably not critical to resolve before migrating. If it is more desirable to revisit after a freetype update in the [near] future, that is fine as well.
Comment 31•9 months ago
|
||
(In reply to Jamie Nicol [:jnicol] from comment #29)
So to sum up, the "g" issue appears it will be fixed by a future freetype update.
Just a note/warning -- it occurs to me that it's possible the reason I didn't see the "g" issue on my try push was precisely because there's something broken about font variation support there (manifesting as assertion-crashes in the debug build), and when that is fixed and freetype is loading variation data successfully, there may be a risk that the unreliable rendering of the "g" tail will resurface.
Hopefully not, but only time will tell. I don't have a lot of cycles right now to dig deeper into the freetype update, so we may just have to wait for upstream to cut a new release, and then check for stability with that. My try push was based on an unreleased tip-of-master-branch, so it was always kinda speculative!
Updated•9 months ago
|
| Assignee | ||
Updated•9 months ago
|
Description
•