Closed Bug 942893 Opened 8 years ago Closed 8 years ago

fxos 1.2 system fonts are slower than 1.1 fonts

Categories

(Firefox OS Graveyard :: GonkIntegration, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(Not tracked)

RESOLVED WONTFIX
1.3 C1/1.4 S1(20dec)

People

(Reporter: bkelly, Assigned: bkelly)

Details

(Keywords: perf, regression, Whiteboard: [c=progress p=5 s= u=1.3] [b2gperf_regression])

Attachments

(5 files, 1 obsolete file)

It appears we regressed on cold launch a number of apps on or around Nov 22:

  http://mzl.la/18jx0ei

The regression range is quite small on both gaia and gecko:

Gaia:  https://github.com/mozilla-b2g/gaia/compare/d1bea4612...e208cf03d7
Gecko: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f7ad046f783f&tochange=98aa428a392c
It seems like this could be due to bug 925970, but we need to verify that.
Ack, I messed up the git.m.o to mercurial revision translation for the gecko range.  It should be:

  http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=dbf94e314cde&tochange=98aa428a392c
Incidentally, it seems the fps_hwc shows an improvement over this same range for some of the apps effected:

  http://mzl.la/17UdXGO

So perhaps related to the scroll infrastructure?
Whiteboard: [c=automation p= s= u=] → [c=automation p= s= u=][b2gperf_regression]
Keywords: regression
Should we block on this, given this is a start up perf regression?
(In reply to Jason Smith [:jsmith] from comment #4)
> Should we block on this, given this is a start up perf regression?

Block for 1.3?  So far we don't see the regression in 1.2.  I'm not sure what blocking means for 1.3 at this point.
(In reply to Ben Kelly [:bkelly] from comment #5)
> (In reply to Jason Smith [:jsmith] from comment #4)
> > Should we block on this, given this is a start up perf regression?
> 
> Block for 1.3?  So far we don't see the regression in 1.2.  I'm not sure
> what blocking means for 1.3 at this point.

It means blocking the 1.3 release. I know startup perf is critical for our apps on our phone.
blocking-b2g: --- → 1.3?
(In reply to Ben Kelly [:bkelly] from comment #3)
> Incidentally, it seems the fps_hwc shows an improvement over this same range
> for some of the apps effected:
> 
>   http://mzl.la/17UdXGO
> 
> So perhaps related to the scroll infrastructure?

Note, the change in fps_hwc test here appears to be due to scrollbars being re-enabled on b2g in bug 876741.  We have to draw more frames during the test, so the hwc fires more often.

This is probably unrelated to the launch regression.
Mike, I'm currently working on 1.2 bugs.  Do we have anyone available to work this?  It would be good to investigate before the code gets stale.
Flags: needinfo?(mlee)
Kevin, need you to make this your highest priority.
Assignee: nobody → kgrandon
Status: NEW → ASSIGNED
Flags: needinfo?(mlee)
Priority: -- → P1
Whiteboard: [c=automation p= s= u=][b2gperf_regression] → [c=progress p= s= u=][b2gperf_regression]
Link for attached screenshot: http://mzl.la/IhEEtN
blocking-b2g: 1.3? → 1.3+
Whiteboard: [c=progress p= s= u=][b2gperf_regression] → [c=progress p= s= u=1.3] [b2gperf_regression]
Sorry, mozilla-central branches are having problems, and I was informed that we're closing them down soon. Here's updated github links.

Good revision: https://github.com/mozilla/gecko-dev/commit/0094801d34f0682670f827a1c1e4607dc8357d34
Bad revision: https://github.com/mozilla/gecko-dev/commit/e7f839bda7e25cbb303359de185cb25e689b2296
I'm having a hard time mimicking these numbers locally. There is a wide range of latency when I run them, so I don't really trust it. Going to start looking through these profiles.

Profile with 'good commit': http://people.mozilla.org/~bgirard/cleopatra/#report=38aaa4b99879ff6a79ae9f827d8497e608fa059a

Profile with 'bad commit': http://people.mozilla.org/~bgirard/cleopatra/#report=5bb82a1fa98b700b7aab7fef122312573d27fc5e
So I'm not entirely sure what's causing this, but it looks like the b2g process is just doing more now. We have Main Thread IO five times, instead of four now, and also WebApps.jsm is doing some work early in the startup process that wasn't there before.
Hi Fabrice - 

Investigating this performance issue and noticed that this new call happens during application startup time. It's really tiny, but I was wondering if you had any idea off the top of your head why this is now in the applciation startup path, and if we should avoid it. Thanks!
Flags: needinfo?(fabrice)
Kevin, I would recommend running b2gperf locally to see if you can reproduce the results.  It performs 30 launches which helps average out any variance.

If you can reproduce it with b2gperf, then you can try to bisect the range down.  (It will take a long time to do a full bisect, though.)
I believe the new event and work I'm seeing in the b2g thread was likely the work done here: https://github.com/mozilla-b2g/gaia/commit/b975c561398b46e16368d6d02761ac3ec4d9c678#diff-293e71c28c7a1b418148d8af71d58804R183

So there is a *tiny* performance impact from that I think, but it shouldn't be anything major.
Flags: needinfo?(fabrice)
We can continue to investigate this while bug 943611 is being worked on. It does not stop investigation of the issue, but is a tool which would allow us to investigate this much faster.
Depends on: 943611
B2gperf isn't telling me too much locally with this one unfortunately. I'm currently seeing < 10 ms difference between the good and bad commits locally for several apps. This isn't really enough to go on for bisecting over this range of commits. Also the fix for bug 936201 is potentially accounting for the slight regression I'm seeing locally.

While I'm hoping we might have something like bug 943611 quickly, the next steps I see are:

1 - See if someone can reproduce these results locally.
2 - Dig into profiler differences, and start fixing up gecko to be faster.

I'm hoping I can do the some combination of the above at the same time. If anyone is able to reproduce this locally please let me know.
I saw that in bug 943594 we added a regressionwindow-wanted and someone looked into this as well. I'm adding that here to see if someone is able to reproduce the results we're seeing in datazilla.

I'm also slightly convinced that this may have been environmental, or not really an issue at all. From the screenshot (https://bug942893.bugzilla.mozilla.org/attachment.cgi?id=8337921), it looks like we gain some time on the 21st, then regress back to 'normal' levels shortly after on the 22nd. I haven't heard/seen anything as to why we gained this performance boost in the first place.
Taking this since I'm a glutton for punishment.

Dropping the regressionwindow-wanted keyword for now since I am going to try to narrow the range.  QA, if you are already actively working this, please let me know.
Assignee: kgrandon → bkelly
I am also unable to reproduce this given the gaia revs in comment 0 and the gecko revs in comment 2.
Dave, is there any way to tell if the jenkins config changed on or around Nov 21/22?
Flags: needinfo?(dave.hunt)
There was no configuration changes on 21-22nd. There was a change on the 15th, which was the increase of the log level to debug, and switching of the order of the Contacts and Phone apps. This was to investigate bug 942840, and the order was subsequently reverted on the 26th, but the log level is still at debug.
Flags: needinfo?(dave.hunt)
Thanks Dave.

It occurred to me that I do not have a simcard in my test device at the moment.  I will try reproducing again tomorrow with a simcard to see if it makes a difference.
(In reply to Ben Kelly [:bkelly] from comment #26)
> It occurred to me that I do not have a simcard in my test device at the
> moment.  I will try reproducing again tomorrow with a simcard to see if it
> makes a difference.

FWIW, when I tried this on my device I had a sim card in it, and was unable to reproduce the results. Wifi was turned on, but it was not connected to a network.
Whiteboard: [c=progress p= s= u=1.3] [b2gperf_regression] → [c=progress p=4 s= u=1.3] [b2gperf_regression]
So it looks like this has to do with the OEM firmware build on the device.

Prior to Nov 23 we were running the US_20131104_root image on the devices.  From talking with Stephen and Naoki it appears that we flashed the devices to V1.2_US_20131115 on Nov 23/24.  This effectively took us from v1.1 firmware to v1.2 firmware.

To determine if this was the cause of the regression I ran a number of launch tests at different revisions:

A) Nov 21 prior to improvement - gaia:6d5e83922853bc, gecko:597287004ff5
B) Nov 21 after improvement    - gaia:f203faf80da3e,  gecko:cf378dddfac8 
C) Nov 22 prior to regression  - gaia:d1bea461219,    gecko:dbf94e314cde
D) Nov 23 after regression     - gaia:e208cf03d7c7,   gecko:98aa428a392c

I ran the Settings launch test with each of these revisions on each version of the firmware.  Here are median results for each run:

Firmware US_20131104_root:
  A) 685ms
  B) 629ms
  C) 612ms
  D) 626ms

Firmware V1.2_US_20131115:
  A) 720ms
  B) 663ms
  C) 644ms
  D) 654ms

This shows pretty well that the change on Nov 21 was a legitimate improvement in launch time due to gecko or gaia changes.  It also shows that the newer firmware is consistently slower than the older firmware.

I spoke with mwu on IRC and he wanted me to check a few more things:

1) Are the fonts the same or different between images?
2) Does disabling the commercial RIL have any effect.

I will investigate these tomorrow.  Based on the pattern of regression across apps, the fonts seem like a probably candidate.
Attached file hamachi-font-diffs
So there are differences in the fonts between the 11/04 and 11/15 OEM firmware images.  This attachment shows the diff.  It appears that there are some fonts missing from the 11/15 image (FeuraSans-Bold, FeuraSans-Regular, etc).

To test if this was significant I re-ran the b2gperf launch test on the V1.2_US_20131115 image.  I then re-ran the tests after performing the following steps to upload the fonts from the older US_20131104_root image.

  adb shell stop b2g
  adb shell rm -r /system/fonts
  adb push backup-hamachi-20131104/system/fonts /system/fonts
  adb shell push
  adb shell start b2g

The median launch times for these tests were:

  Old fonts: 615ms
  New fonts: 653ms

This seems to explain most, if not all, of the regression.

Michael, can you look at the font differences here and tell me if they are expected or not?

Naoki and Vance, do you know if it would be possible roll a new firmware image with different fonts if that is Michael's recommendation?
Flags: needinfo?(vchen)
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(mwu)
I will check to see if we can get a build with the different fonts. And one thing I don't quite understand is, how come the fonts will cause the cold launch regression?
Flags: needinfo?(vchen)
(In reply to Vance Chen [:vchen][vchen@mozilla.com] from comment #30)
> I will check to see if we can get a build with the different fonts. And one
> thing I don't quite understand is, how come the fonts will cause the cold
> launch regression?

Yea, I'd like to get Michael's take on that.  I'm not familiar with the font system on b2g.  My guess, though, is that gaia is asking for a font that is missing, its falling back to another font, and perhaps that is causing some extra delay on startup.
I spoke with mwu on IRC today.  The 11/15 image has the correct fonts.  We're trying to understand why the new fonts are slower.

I will re-run the tests without the MTL*.ttf files as they are quite large.  Michael also is thinking that perhaps the new fonts are just more complex to render based on the fact that text heavy apps are impacted more.
Flags: needinfo?(mwu)
Updating the description to reflect the current situation.  Also nom'ing for koi since it appears that the system fonts in 1.2 are slower than 1.1 fonts.
blocking-b2g: 1.3+ → koi?
Component: Gaia → GonkIntegration
No longer depends on: 943611
See Also: 943594
Summary: datazilla cold launch regression in multiple apps around Nov 22 → fxos 1.2 system fonts are slower than 1.1 fonts
Whiteboard: [c=progress p=4 s= u=1.3] [b2gperf_regression] → [c=progress p=5 s= u=1.3] [b2gperf_regression]
In addition to testing without the MTL*.ttf files, I will also test using the ttf FiraSans files instead of the otf files.

Notes on how to do this:

1) Replace the otf files on device with the ttf files in b2g/external/moztt/FiraSans-2.001 and FiraMono-2.001.
2) Rename "Fira Sans OT" to "Fira Sans" and "Fira Mono OT" to "Fira Mono" in modules/libpref/src/init/all.js
3) Rename fonts in gfx/thebes/gfxFT2FontList.cpp
4) rebuild/reflash
Flags: needinfo?(nhirata.bugzilla)
I also want to re-run my firmware tests with a b2g26 gecko.  I can't explain why I don't see a regression bump on the 1.2 datazilla graph.  Perhaps those devices were flashed at a different time or perhaps there is something in 1.3/1.4 gecko that interacts poorly with the new fonts.
Jonathon also recommend we run the textbench before making any font changes here.  Even if we are paying a penalty at startup, the new fonts may be faster at run time which would help scrolling, etc.  He pointed me at this benchmark:

  http://people.mozilla.org/~jdaggett/textbench/
It's also possible that there's a startup cost simply because the new fonts are more complete than the ones we had previously (more character coverage, more styles, ...). If that's the case, it may be that switching between .otf and .ttf doesn't make much difference in itself.
Flags: needinfo?(nhirata.bugzilla)
Triage Notes:

I'd probably block on this. Fonts is common to the entire phone, so taking a perf regression here will hit us across the board everywhere on the phone.
(In reply to Ben Kelly [:bkelly] from comment #36)
> Jonathon also recommend we run the textbench before making any font changes
> here.  Even if we are paying a penalty at startup, the new fonts may be
> faster at run time which would help scrolling, etc.  He pointed me at this
> benchmark:
> 
>   http://people.mozilla.org/~jdaggett/textbench/

It occurs to me that this tests text layout and shaping, but does not test the actual -rasterization- of the glyphs, as it lays out and measures the text, but does not draw it. As such, if there's a significant difference between the .otf and .ttf fonts in rasterization performance, this test won't show it. :(
So the slowdown being discussed here involves first-run font data loading? Or the actual content rendering performance of the new fonts?  Having a reduced testcase would be really helpful.  We need to categorize this into (1) startup cost associated with iterating over available fonts (2) first-time costs associated with using a particular font (or set of fonts) or (3) a text rendering slowdown when using the new fonts.  I can definitely help set up a test for (3) but for (1) and (2) some sort of perf logging with different default font settings in the prefs would make it easier to track down the problem.
I removed the MTL*.ttf files with no effect.  The regression remained.

I will next test switching from .otf to .ttf files.
So, when I went to compile in the ttf font names I noticed I built the wrong tree earlier when I tried to test v1.2.  Sorry for the pilot error here.

I just compiled a new clean v1.2 and I do NOT see a regression related to fonts.  So this appears to be an interaction between the new fonts and a change in gecko in 1.3 or newer.
blocking-b2g: koi? → 1.3?
Summary: fxos 1.2 system fonts are slower than 1.1 fonts → fxos 1.2 system fonts are slower than 1.1 fonts on 1.3/1.4 gecko
Attached image Screenshot with new fonts (obsolete) —
I believe what is going on here is that our gecko was simply misconfigured to work with the fonts provided in the US_20131104_root OEM firmware image.  This resulted in fewer fonts being loaded which improved our launch time.

Updating to the V1.2_US_20131115 simply brought the system fonts into alignment with the gecko config which allowed us to load the correct fonts.  Again, loading more data takes more time, so our launch time got worse.

If you compare the screenshot with the old fonts (attachment 8348981 [details]) with a screenshot of the new fonts (attachment 8348982 [details]) you can see that there are visual differences.  This implies that Fira Sans/Mono are not being used correctly with the old fonts.

I also added debug in gfxFT2FontList::GetDefaultFont() which showed that gecko was returning NULL for the default font with the US_20131104_root firmware.

In addition, I realized I had again mis-managed my trees while trying to do the v1.2 test.  I accidentally had part of the font switch patch in place when I ran the test in comment 43.  So this is also consistent with the conclusion that the font config was simply out of place.

Finally, I ran the test using ttf files instead of otf files.  They were marginally faster (5ms), but that is within the bounds of the error I am seeing.  So I don't think there is any real change there.

Since we actually need to use Fira Sans for our visual requirements I am going to go ahead and mark this WONTFIX.  The slowdown is due to a legitimate product need.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Summary: fxos 1.2 system fonts are slower than 1.1 fonts on 1.3/1.4 gecko → fxos 1.2 system fonts are slower than 1.1 fonts
Jonathan, do you agree that a misconfigured font config resulting in no default font would be expected to be faster?  Again, from talking with mwu the thought is this would be faster because we're not actually loading the font.  Do you think we should continue investigating here?
Flags: needinfo?(jfkthame)
Another avenue of investigation we could take, if we think its worth it, is to verify the idea that we got an unexpected improvement in launch time when the gecko font config changed going from 1.1 to 1.2.  This would then correspond to the "regression" once the gonk system fonts caught up to this font config.
(In reply to Ben Kelly [:bkelly] from comment #46)

> If you compare the screenshot with the old fonts (attachment 8348981 [details]
> [details]) with a screenshot of the new fonts (attachment 8348982 [details])
> you can see that there are visual differences.

Actually, I can't; I think you accidentally attached the same (old?) screenshot twice, instead of the two different ones.


> Jonathan, do you agree that a misconfigured font config resulting in no
> default font would be expected to be faster?  Again, from talking with mwu
> the thought is this would be faster because we're not actually loading the
> font.  Do you think we should continue investigating here?

I don't think (offhand, without having attempted to instrument and measure it) that just having the "default font", as coded in gfxFT2FontList, misconfigured would make very much difference; mostly, it'd mean that in cases where no valid font is explicitly chosen, we might not end up using the most appropriate default, as we'll just pick the first we find.

I think it's more likely that the main factor is simply that we have considerably more font faces than we used to have (multiple weights, true italic faces; we ship 8 faces of Fira Sans, plus two Fira Mono; it looks like in the Feura days, we had just 4 faces of Feura Sans), and so the process of scanning the fonts directory and building the list of installed fonts - which we have to do if we're going to style text appropriately - will be taking longer.

(We've also added a bunch of Indic fonts, IIRC, though I'm not sure exactly where that falls in the timeline here...)

Anyhow, I don't there's much more we can do here; the main point to take away is simply that if we want to ship a device with a rich collection of fonts, for the sake of a polished visual experience, there's going to be a cost (in both space and time) to that.
Flags: needinfo?(jfkthame)
Fix screenshot of "new fonts".
Attachment #8348982 - Attachment is obsolete: true
(In reply to Jonathan Kew (:jfkthame) from comment #49)
> (In reply to Ben Kelly [:bkelly] from comment #46)
> 
> > If you compare the screenshot with the old fonts (attachment 8348981 [details]
> > [details]) with a screenshot of the new fonts (attachment 8348982 [details])
> > you can see that there are visual differences.
> 
> Actually, I can't; I think you accidentally attached the same (old?)
> screenshot twice, instead of the two different ones.

Fixed.  Sorry about that.

> Anyhow, I don't there's much more we can do here; the main point to take
> away is simply that if we want to ship a device with a rich collection of
> fonts, for the sake of a polished visual experience, there's going to be a
> cost (in both space and time) to that.

So, different reasons, but same conclusion.  I'm going to move on to other bugs.  Thanks for your help here!
blocking-b2g: 1.3? → ---
Flags: needinfo?(nhirata.bugzilla)
Target Milestone: --- → 1.3 C1/1.4 S1(20dec)
You need to log in before you can comment on or make changes to this bug.