1206485 - Boot loop after first boot on some devices (Xperia M2, ...)

Reporter

Description

•

9 years ago

After a first boot properly completed, the next reboot ends up in a segfault. Reproduced on Xperia Eagle and Tianchi at least.

STR:
0. Build and flash everything (including userdata)
1. Boot, complete FTU
2. Reboot

Expected: Device boots properly
Actual: Device dies with segfault.

I have dug a little bit and the regression would have occurred during the week: gecko 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc does not expose the issue.

:gerard-majax

Reporter

Comment 1

•

9 years ago

Attached file: gdb backtrace on Xperia M2

Flags: needinfo?(nicolas.b.pierron)

:gerard-majax

Reporter

Comment 2

•

9 years ago

Maybe a long shot, but it's in the range of regression AND it matches the js/src/gc/Heap.h file. So I'll try a local revert of 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1:

$ git log 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc..mozillaorg/master js/src/gc/Heap.h
commit 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1
Author: Terrence Cole <terrence@mozilla.com>
Date:   Wed Sep 16 11:19:44 2015 -0700

    Bug 1205054 - Remove isNullLike and other imprecise null checks; r=sfink

commit bdd0fc968bc607fe892538d97047202694b82485
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:13:51 2015 +0800

    Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

commit 00a45d37f4e08418790dbd48cf64f4557eac5ffb
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:05:08 2015 +0800

    Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

:gerard-majax

Reporter

Comment 3

•

9 years ago

So after testing, still reproducing with bug 1205054 reverted. Maybe it's bug 1123237: it's a big one that landed recently also, and that already got backed out :)

:gerard-majax

Reporter

Comment 4

•

9 years ago

Latest commit just before bug 1123237 (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Checking the one that includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83). I don't reproduce the issue on some devices (Z3c, Flame), but I saw reports yesterday night of people on Flame with device going into boot loop since a couple of hours/days. So this might be worse than it looks.

:gerard-majax

Reporter

Comment 5

•

9 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Latest commit just before bug 1123237
> (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Checking the one that
> includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83).
>
> I don't reproduce the issue on some devices (Z3c, Flame), but I saw reports
> yesterday night of people on Flame with device going into boot loop since a
> couple of hours/days. So this might be worse than it looks.

Right, so I can confirm the regression is within the range 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83:

$ git log --oneline 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83
16cae3c Bug 1123237 - Part 12. Fix GC hazards. r=terrence
4efb6ef Bug 1123237 - Part 11. Don't use STL in memory-profiler. r=BenWa,cervantes
28c419d Bug 1123237 - Part 10. Expose SwapElements from nsBaseHashtable. r=nfroyd
c2a6c6f Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb
5c8dec0 Bug 1123237 - Part 8. Tracking the memory events. r=BenWa,terrence
008c01b Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
b4ef7f3 Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
22e4adb Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
75ad5ad Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sfink
bdd0fc9 Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
00a45d3 Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Can we backout only one of those or do we absolutely need to keep them all together?

Sorry for the mass needinfo :)

Blocks: 1123237

Flags: needinfo?(terrence)

Flags: needinfo?(kchen)

Flags: needinfo?(cyu)

:gerard-majax

Reporter

Comment 6

•

9 years ago

008c01bd6c9cf6ed6831d9ad3663f54b7b427484 is the first bad commit
commit 008c01bd6c9cf6ed6831d9ad3663f54b7b427484
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:22:38 2015 +0800

    Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

:040000 040000 ed6a258195e2c7f780ec8b06e751df38a2a69c65 2cde680a8dad5555679bd7e28d132871636d9dbe M b2g
:040000 040000 dd426c61bf5f8225bca94c9f0d922e448648806a 33f8e1fce98ce0bf539df2065f5bb53e9ee7bd7e M browser
:040000 040000 c0f1c5b2d67565f4d5fc816a698487d4afb8ae41 f648cbd70baee3cf41267d6f73805b1aa515aeec M mobile
:040000 040000 0474479facca6017da670b573a9fe99511795e3e 3152a76d0d8e002889a25b13b822f2d2699c7ec1 M toolkit
:040000 040000 8093074ae509c6d8b843079815d27facc1accc0a a8e8a6319e4fdb7d9bf198d32e04a01d647a8c2d M tools

Adam Farden [:adfad666]

Comment 7

•

9 years ago

I'm not sure what the above has to do with this, but here are my experiments after reverting to 5d8728423441575dc81c6c38de69fbc7ca35f163:

Experiment 1:
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot loop.

Experiment 2:
Boot, go through FTU, do not connect to WiFi
Open browser, open settings
reboot --> Boot normal.

Experiment 3:
Disconnect WiFi router from Internet
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot normal.

Experiment 4:
Disconnect WiFi router from Internet
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot normal.
Reconnect WiFi router to Internet
Open browser, open settings
reboot --> Boot normal.
Open browser, open settings
reboot --> Boot loop.

What is common to Experiment 1 and Experiment 4? In both cases there is an eventual boot to the home screen with a functioning WiFi connection. This triggers a check for updates. It is this check that is causing the boot loop; probably a database is being corrupted.

[:fabrice] Fabrice Desré

Updated

•

9 years ago

Summary: Boot loot after first boot on some devices (Xperia M2, ...) → Boot loop after first boot on some devices (Xperia M2, ...)

:gerard-majax

Reporter

Comment 8

•

9 years ago

Ok, comment 6 might be wrong but the range is still the proper one. I have checked out Part 6 and pushed that Gecko on a device already in a bad state. Device is still boot looping.

:gerard-majax

Reporter

Comment 9

•

9 years ago

Potential dupes are: bug 1206031, bug 1206094, bug 1206092, bug 1206455

Comment 10

•

9 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Can we backout only one of those or do we absolutely need to keep them
> all together?

If needed they have to be backed out together. Note I can't reproduce this on Flame and I'm not sure how a disabled feature could affect booting. Is the segfault always at the same place?

Flags: needinfo?(kchen)

:gerard-majax

Reporter

Comment 11

•

9 years ago

(In reply to Kan-Ru Chen [:kanru] from comment #10)
> (In reply to Alexandre LISSY :gerard-majax from comment #5)
> > Can we backout only one of those or do we absolutely need to keep them
> > all together?
>
> If needed they have to be backed out together. Note I can't reproduce this
> on Flame and I'm not sure how a disabled feature could affect booting. Is
> the segfault always at the same place?

Always. Before the regression range, nothing. After, constantly under the described conditions.

:gerard-majax

Reporter

Comment 12

•

9 years ago

Right, now after a repo sync of Gecko and Gaia, I don't have the issue anymore. Only spurious thing I could notice was a crash report on the very first boot, before I begin FTU. And homescreen seemed to be broken after finishing FTU. It's all okay after a reboot.

:gerard-majax

Reporter

Comment 13

•

9 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #12)
> Right, now after a repo sync of Gecko and Gaia, I don't have the issue
> anymore.
>
> Only spurious thing I could notice was a crash report on the very first
> boot, before I begin FTU. And homescreen seemed to be broken after finishing
> FTU.
>
> It's all okay after a reboot.

False hope: after a shutdown and a startup, it is crashing again.

Adam Farden [:adfad666]

Comment 14

•

9 years ago

I've noticed that too, _sometimes_ it doesn't start crashing, which made my bisect attempt a fruitless waste of time.

Nicolas B. Pierron [:nbp — off until 29-09]

Comment 15

•

9 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #1)
> Created attachment 8663388
> gdb backtrace on Xperia M2

I cannot spot anything obvious from the backtrace, but I am no expert in the GC. None of the patches from Bug 1123237 modify the parser, so I guess this issue might be related to some of the nursery patches.

Terrence might know better, but I think this kind of bug is hard to investigate, especially on devices, and our best hope might be to wait until fuzzers find a similar signature.

Flags: needinfo?(nicolas.b.pierron)

Terrence Cole [:terrence]

Comment 16

•

9 years ago

Crashes in the GC are usually heap corruption of some sort, rather than a direct consequence of GC changes. Generally the only way to track this sort of problem down is to bisect, which it seems you have done.

In this particular case, it looks like either the arena list or freespan head points into unmapped addresses. I'm not entirely sure how that squares with the bisection results, but that patch does add some members to the relevant structs, which could be bumping the addresses off-by-one if not everything is compiling with the same #defines?
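To make the hypothesis in this comment concrete, here is a minimal C++ sketch of how a member that exists only under some build configuration shifts the offset of every field declared after it. The struct and member names are hypothetical and are not the real SpiderMonkey layouts. In a real build the two variants would come from a single definition guarded by an #ifdef that is set differently in different translation units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout as seen by code compiled WITHOUT the extra #define.
struct ArenaListsWithoutProfiler {
    void* runtime;
    void* freeListHead;
};

// Hypothetical layout as seen by code compiled WITH the extra #define:
// one additional member sits in front of freeListHead.
struct ArenaListsWithProfiler {
    void* runtime;
    std::uint32_t profilerCounter;  // illustrative member, not a real field
    void* freeListHead;
};

// Returns how many bytes freeListHead moves between the two views.
// Code built against one view but handed an object laid out by the other
// would read freeListHead from the wrong address.
inline std::size_t freeListOffsetDelta() {
    const std::size_t with = offsetof(ArenaListsWithProfiler, freeListHead);
    const std::size_t without = offsetof(ArenaListsWithoutProfiler, freeListHead);
    assert(with >= without);
    return with - without;
}
```

A non-zero result from freeListOffsetDelta() is the kind of mismatch the comment speculates about. This sketch only illustrates the hypothesis and does not show that it happened in this bug.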

Flags: needinfo?(terrence)

Naoki Hirata :nhirata (please use needinfo instead of cc)

Updated

•

9 years ago

Flags: needinfo?(nhirata.bugzilla)

Chris Lord [:cwiiis]

Comment 17

•

9 years ago

I'm getting a boot loop on Z3C - don't know if it's related, but I bisected:

2d0398ffa709b2af2e5a1e588086a874479c67e6 is the first bad commit
commit 2d0398ffa709b2af2e5a1e588086a874479c67e6
Author: Josh Matthews <josh@joshmatthews.net>
Date:   Sun Sep 20 05:57:15 2015 -0400

    Bug 885982 - Part 4: Remove all traces of JS implementation. r=asuth

:040000 040000 e3e092e3fc55443ecb1ff1f635dbc68633ee90f6 87637d13278226cea38d380d14f5933d1d9bb5b3 M b2g
:040000 040000 580eb5cdb448408ad501c32ab3f895417d87000e f6cf9fd2663e05213aca86f893deede1d813f519 M browser
:040000 040000 67c3d9dd8c9afcff5cea25fbc997cbd9df99b9d2 feb0b53d300320ccacea09d06281670b0e11475e M dom
:040000 040000 4d5d2838010e6d1ebe5e036f831cccfa19f41199 56c90d1b6513a0038c87c8f793e1aaa3704f14d9 M mobile

[:fabrice] Fabrice Desré

Comment 18

•

9 years ago

fwiw, I'm facing the same issue on a z3c, with the same stack trace as the one initially reported

Kan-Ru Chen [:kanru] (UTC+9)

Comment 19

•

9 years ago

I'm investigating this.

Assignee: administration → kchen

Kan-Ru Chen [:kanru] (UTC+9)

Comment 20

•

9 years ago

ftr, I got a profile that always crashes when compiling ContactDB.jsm

Chris Lord [:cwiiis]

Comment 21

•

9 years ago

Reverting the commit I mention in comment #17 on top of master fixes the boot loop for me.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 22

•

9 years ago

(gdb) f
#2  js::TenuringTracer::moveToTenured (this=this@entry=0xbed9b248, src=0xb2033180) at /home/ting/w/fx/os/aries-kk/gecko/js/src/gc/Marking.cpp:2059
2059        TenuredCell* t = zone->arenas.allocateFromFreeList(dstKind, Arena::thingSize(dstKind));
(gdb) p zone->runtime
There is no member or method named runtime.
(gdb) p zone->runtime_
$9 = (JSRuntime * const) 0x904ff0e9
(gdb) p zone->runtime_ == zone->arenas.runtime_
$10 = false
(gdb) p zone->arenas.runtime_
$11 = (JSRuntime *) 0x1cf8cd93
(gdb) p *zone->arenas.runtime_
Cannot access memory at address 0x1cf8cd93
(gdb)

I think zone->runtime_ != zone->arenas.runtime_ is impossible.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 23

•

9 years ago

Setting javascript.options.ion to false prevents the crash.

Jon Coppeard (:jonco)

Assignee

Comment 24

•

9 years ago

The crash occurs during minor GC when we are marking the store buffers. With kanru's help debugging we found that we are marking what appears to be a FunctionBox object where we expect to see a nursery allocated JSObject.

Jon Coppeard (:jonco)

Assignee

Comment 25

•

9 years ago

Attached patch: bug1206485-function-box-aliasing

I wasn't able to reproduce the crash, but I found something that could cause it. JSFunction has a union containing a JSObject* and a FunctionBox*. To make barriers work on the object pointer, when assigning to this we cast its address to a HeapPtrObject*. This will create a store buffer entry in the right circumstances (JSFunction allocated in the tenured heap, JSObject allocated in the nursery). While parsing a function we swap out this object pointer and set the function box pointer instead. We don't do anything to remove the store buffer entry though.
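The aliasing described above can be modelled with a small, self-contained C++ sketch. Every name here (ToyObject, ToyFunctionBox, ToySlot, ToyStoreBuffer, ToyFunction) is a hypothetical stand-in and not the real SpiderMonkey API. The setBoxAndUnregister variant shows one possible remedy, dropping the store buffer entry before the slot is reused, and is not necessarily what the attached patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Toy stand-ins for a GC-managed object and a parser-owned function box.
struct ToyObject { bool inNursery; };
struct ToyFunctionBox { int parserState; };

// One storage slot that holds either kind of pointer, as in the comment.
union ToySlot {
    ToyObject* env;
    ToyFunctionBox* box;
};

// Remembers the addresses of tenured slots that may point into the nursery.
struct ToyStoreBuffer {
    std::unordered_set<ToyObject**> edges;
    void put(ToyObject** slot) { assert(slot != nullptr); edges.insert(slot); }
    void unput(ToyObject** slot) { edges.erase(slot); }
};

struct ToyFunction {
    ToySlot u{nullptr};
    bool holdsBox = false;  // tracks which union member is currently live

    // Barriered write: record the slot when it receives a nursery pointer.
    void setEnv(ToyObject* obj, ToyStoreBuffer& sb) {
        u.env = obj;
        holdsBox = false;
        if (obj != nullptr && obj->inNursery) sb.put(&u.env);
    }

    // Problematic path: the slot is reused for a box, but the store buffer
    // still holds its address and would treat the box as a nursery object.
    void setBoxUnbarriered(ToyFunctionBox* b) { u.box = b; holdsBox = true; }

    // Possible remedy (an assumption): remove the entry before reusing the slot.
    void setBoxAndUnregister(ToyFunctionBox* b, ToyStoreBuffer& sb) {
        sb.unput(&u.env);
        u.box = b;
        holdsBox = true;
    }
};

// Counts entries a minor GC would misinterpret for this function (0 or 1).
inline std::size_t countStaleEdges(const ToyStoreBuffer& sb, ToyFunction& fn) {
    return fn.holdsBox ? sb.edges.count(&fn.u.env) : 0;
}
```

In this model, calling setEnv with a nursery object and then setBoxUnbarriered leaves one stale edge, which matches the symptom from comment 24 where marking the store buffers finds what appears to be a FunctionBox. Using setBoxAndUnregister instead leaves none.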

Attachment #8664187 - Flags: review?(terrence)

:gerard-majax

Reporter

Comment 26

•

9 years ago

Comment on attachment 8664187
bug1206485-function-box-aliasing

Ship it! It looks perfect!

Tested on Xperia M2:
- pushing a gecko with the fix, doing ~8-10 reboots, no problem
- pushing a gecko without the fix, crash after one or two reboots, crash report at the first boot during FTU, homescreen broken
- pushing a gecko with the fix on top of a broken profile, revived

Thanks for the quick patch!

Attachment #8664187 - Flags: feedback+

Adam Farden [:adfad666]

Comment 27

•

9 years ago

Comment on attachment 8664187
bug1206485-function-box-aliasing

This looks like it fixes the problem. I've done several reboots on both devices I had problems with, both rebooted several times without bootloop.

Chris Lord [:cwiiis]

Comment 28

•

9 years ago

This fixes it for me too :)

[:fabrice] Fabrice Desré

Comment 30

•

9 years ago

That also fixed it on my z3c. SHIP IT!

Terrence Cole [:terrence]

Comment 31

•

9 years ago

Comment on attachment 8664187
bug1206485-function-box-aliasing

Review of attachment 8664187:

Great find!

Attachment #8664187 - Flags: review?(terrence) → review+

Pulsebot

Comment 32

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/40bea2b40e5c

Terrence Cole [:terrence]

Comment 33

•

9 years ago

(In reply to [:fabrice] Fabrice Desré from comment #30) > That also fixed it on my z3c. SHIP IT! Shipped to m-i.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 34

•

9 years ago

Nice!

Assignee: kchen → jcoppeard

Flags: needinfo?(nhirata.bugzilla)

Flags: needinfo?(cyu)

Naoki Hirata :nhirata (please use needinfo instead of cc)

Updated

•

9 years ago

Whiteboard: [dogfood-blocker]

Chih-Hsuan Yen [:yan12125]

Comment 35

•

9 years ago

The patch is confirmed to work on Flame. PS. I'm the author of Bug 1207213.

Carsten Book [:Tomcat]

Comment 36

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/40bea2b40e5c

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox44: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → FxOS-S8 (02Oct)

gdb backtrace on Xperia M2 (9 years ago, :gerard-majax, 21.05 KB, text/plain)
bug1206485-function-box-aliasing (9 years ago, Jon Coppeard (:jonco), 2.27 KB, patch; terrence: review+, gerard-majax: feedback+)