Closed Bug 806096 Opened 12 years ago Closed 12 years ago

Pandaboard builds with new kernel don't always boot

Product/Component: Firefox OS Graveyard :: GonkIntegration
Type: defect
Platform: ARM, Gonk (Firefox OS)
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: cmtalbert, Assigned: tzimmermann
Attachments: 7 files

So, the new kernel that Thomas created is not working on all panda boards, and we cannot determine why. We have the exact same board revisions, with the same connectors connected in the same order, yet when we put an sdcard flashed with this kernel into one board it boots, and in another board it doesn't.  Out of 4 identical boards, 2 booted and 2 didn't.  The boards that don't boot get to the "booting the kernel" line in the serial console and then hang indefinitely.  Here is the output from one such failed attempt:

Texas Instruments X-Loader 1.41 (Oct 21 2011 - 09:28:33)
OMAP4460: 1.2 GHz capable SOM
Starting OS Bootloader from MMC/SD1 ...


U-Boot 1.1.4-gedeced79 (Feb  6 2012 - 09:27:11)

Load address: 0x80e80000
DRAM:  1024 MB
Flash:  0 kB
Using default environment

In:    serial
Out:   serial
Err:   serial

efi partition table:
     256     128K xloader
     512     256K bootloader
    2048       8M recovery
   18432       8M boot
   34816     512M system
 1083392     256M cache
 1607680   14408M userdata
Net:   KS8851SNL
Hit any key to stop autoboot:  0 
kernel   @ 80008000 (3786640)
ramdisk  @ 81000000 (164989)
I2C read: I/O error

Starting kernel ...

Uncompressing Linux... done, booting the kernel.

And it just hangs there ^ forever.  When the build does work, it shows the same output, stops on the "booting the kernel" line for about one second and then starts booting the kernel and outputting information. Malini and I bisected the commits in https://github.com/mozilla-b2g/android-device-panda/commits/master and we found that the first build that has a problem is the first commit with the new kernel - commit: 2042c6165da208f88b16d77cd5fd36d3d14938ec

If we flash an sdcard with the old kernel (before this commit), then we can put that sdcard into any panda and that panda will boot. However, about 50% of the pandas we try with the new kernel will not boot.

When the kernel does not boot on a panda, it is very reproducible. So, I have a board here that exhibits this behavior and I can mail it to you if you want to take a look at it.
I'm just starting the process of getting my pandaboard up and running.

These types of problems often mean that something failed early on in the kernel. There is a feature called EARLY PRINTK which can be enabled to see the failure messages.

Currently, printk messages are buffered, and the kernel only dumps the printk buffer once it reaches the point where it can.

EARLY printk causes the messages to go immediately to the serial port instead of just being buffered. It typically requires a few lines of support code (the I/O memory map for the serial port needs to be set up and the code which writes to the serial port needs to be written). I'd imagine that this is already done for the pandaboard.
Another trick that can be employed is to reboot after the failure but immediately drop into u-boot, and use u-boot to dump the memory where the printk buffer is found.

To do this you need to determine the virtual address of the symbol __log_buf (this should be in the System.map file). Then you need to figure out the physical address that the kernel is loaded at.

Then you can compute the physical address of __log_buf (va of __log_buf - 0xc0000000 + phys addr kernel loaded to) and dump that memory in uboot.
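The address arithmetic above can be sketched in a few lines of Python. This is only an illustration: the System.map excerpt and the __log_buf address in it are hypothetical, and the physical base of 0x80000000 is an assumption (the physical address that corresponds to virtual 0xc0000000, guessed from the "kernel @ 80008000" line in the boot log). Substitute the values from your own build.

```python
PAGE_OFFSET = 0xC0000000  # virtual base used in the formula above


def find_symbol(system_map_text, name):
    """Return the virtual address of `name` from System.map-style text."""
    for line in system_map_text.splitlines():
        parts = line.split()
        # System.map lines look like: "<hex address> <type> <symbol>"
        if len(parts) == 3 and parts[2] == name:
            return int(parts[0], 16)
    raise KeyError(name)


def log_buf_phys(log_buf_va, phys_base):
    """va of __log_buf - 0xc0000000 + physical base address."""
    return log_buf_va - PAGE_OFFSET + phys_base


# Hypothetical System.map excerpt; the addresses are made up for illustration.
SAMPLE_MAP = """\
c0008000 T _text
c0765a10 b __log_buf
"""

va = find_symbol(SAMPLE_MAP, "__log_buf")
pa = log_buf_phys(va, phys_base=0x80000000)  # assumed physical base
print(hex(va), "->", hex(pa))
```

The resulting physical address is what you would hand to u-boot's memory dump command (`md`, if your u-boot build includes it).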
Assignee: nobody → tzimmermann
Clint, do you have PandaBoards or PandaBoards ES?

I never experienced these boot problems, so I cannot reproduce them. Could you run 'cat /proc/cpuinfo' on a working and non-working board and print the output here?

I hope this is just a configuration problem. I created several kernel binaries, which I attached to the bug report. Please try each and see if any of them boots or what error it reports. The first is the new earlyprintk-enabled kernel. Hopefully you get some more output from this. The second kernel contains some workarounds for bugs in the CPU. The third kernel is the old AOSP binary, but again with earlyprintk enabled.
Status: NEW → ASSIGNED
> When the kernel does not boot on a panda, it is very reproducible. So, I
> have a board here that exhibits this behaviour and I can mail it to you if
> you want to take a look at it.

There is a work week in San Francisco from 5th to 9th of November.
I'll be at that work week from the 5th to the 9th, and I'll have this board with me. In the meantime, let me see if I can get any further output from a working/nonworking board using these binaries.  Thanks Thomas.
Now that my pandaboard is booting, I'm seeing the hang at

Uncompressing Linux... done, booting the kernel.

as well. My /proc/cpuinfo says:

[  289.445404] init: untracked pid 1552 exited
shell@android:/ $ cat /proc/cpuinfo
Processor	: ARMv7 Processor rev 2 (v7l)
processor	: 0
BogoMIPS	: 597.27

processor	: 1
BogoMIPS	: 599.77

Features	: swp half thumb fastmult vfp edsp thumbee neon vfpv3 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x1
CPU part	: 0xc09
CPU revision	: 2

Hardware	: OMAP4 Panda board
Revision	: 0010
Serial		: 0000000000000000

I found that if I press and hold the RETURN key in the serial console around the time I see uboot starting then it will boot the kernel. If I don't press and hold RETURN then it hangs after uncompressing.
So, here is the first earlyprintk enabled kernel (attachment 676129). With this kernel, both panda boards booted fine into a working state. I did not do multiple boots, so I don't know if I could potentially recreate the hang with this kernel image.

However, with the next kernel (CPU fixes, attachment 676130) I hit a problem with the problematic panda board and it did not complete booting. I think this might be our issue. I'll attach the output from that next.
While some binary stuff got spewed into the output, the kernel did boot and load b2g.

This is from the CPU fixes kernel (attachment 676130).
So this is the pandaboard that seems to have the problems when it was booting the CPU fixes kernel (attachment 676130). This issue is reproducible but only on this pandaboard. The working pandaboard continues to boot this kernel fine.

I think this is our issue that we're trying to track down.

I'll do the AOSP kernel tomorrow as well as multiple boots of the first early-printk kernel to see if that one can repro this problem.
I discovered that if I press RETURN, just once, very shortly after the Uncompressing message then my kernel will boot up fine.
(In reply to Clint Talbert ( :ctalbert ) from comment #10)
> Created attachment 676356
> Minicom output from two boots showing both work
> 
> So, here is the first earlyprintk enabled kernel (attachment 676129). With
> this kernel, both panda boards booted fine into a working state. I did not
> do multiple boots, so I don't know if I could potentially recreate the hang
> with this kernel image.
> 
> However, with the next kernel (CPU fixes, attachment 676130) I hit
> a problem with the problematic panda board and it did not complete booting.
> I think this might be our issue. I'll attach the output from that next.

Thank you for trying the kernels and providing the logs.

If I understand you correctly, the first earlyprintk kernel boots everywhere. From the logs I've seen, the kernel with the CPU fixes works better than the current kernel but fails later during boot. By the time it fails, earlyprintk has already become irrelevant. It can happen that one of the CPU fixes breaks something else. I've seen this with my board, where one of the CPU fixes prevented it from booting.

To summarize, it looks like enabling earlyprintk fixes the original problem. This could be a timing issue. Or some of the PandaBoards are broken.
BTW what partition layout do you use? The bootloader's output looks like the standard AOSP partitions, but later you mount Linaro's partitions. And what puzzles me the most is that it actually seems to work.

> <6>EXT4-fs (mmcblk0p5): mounted fiTesystem with ordered data mode. Opts: (null)
> [    6.967956] EXT4-fs (mmcblk0p5): mounted filesystem with ordered data mode. Opts: (null)
> <6>EXT4-fs (mmcblk0p7): mounted filesystem with ordered data mode. Opts: (null)
> [    7.008392] EXT4-fs (mmcblk0p7): mounted filesystem with ordered data mode. Opts: (null)
> <3>EXT4-fs (mmcblk0p6): VFS: Can't find ext4 filesystem
(In reply to Thomas Zimmermann from comment #15)
> BTW what partition layout do you use? The bootloader's output looks like the
> standard AOSP partitions, but later you mount Linaro's partitions. And what
> puzzles me the most is that it actually seems to work.
> 
> > <6>EXT4-fs (mmcblk0p5): mounted fiTesystem with ordered data mode. Opts: (null)
> > [    6.967956] EXT4-fs (mmcblk0p5): mounted filesystem with ordered data mode. Opts: (null)
> > <6>EXT4-fs (mmcblk0p7): mounted filesystem with ordered data mode. Opts: (null)
> > [    7.008392] EXT4-fs (mmcblk0p7): mounted filesystem with ordered data mode. Opts: (null)
> > <3>EXT4-fs (mmcblk0p6): VFS: Can't find ext4 filesystem

Never mind. I get the same messages. It's probably the internal name, which is used regardless of the mount parameters.
Attached file Refined kernel binary
Here is another kernel binary. Could you try it on the problematic board?

I extracted the exported linker symbols from the old kernel binary, and configured this kernel to export exactly the same symbols. So this kernel should be very similar to the old one.
Attachment #676784 - Flags: feedback?(ctalbert)
Clint, what does the printed label on the bottom of your pandaboards (that work and don't work) say?

Mine says PandaBoard Rev A3, and Thomas has PandaBoard ES Rev B1
(In reply to Dave Hylands [:dhylands] from comment #18)
> Clint, what does the printed label on the bottom of your pandaboards (that
> work and don't work) say?
> 
> Mine says PandaBoard Rev A3, and Thomas has PandaBoard ES Rev B1

Clint and I both have the PandaBoard ES Rev B1. The image that he raised this bug about works on my pandaboard, but doesn't on his.
If you have an old panda board, you might want to get a current version.

PandaBoard ES Rev B1 is the version we have purchased for production and therefore the version which we should be officially supporting.  There are differences between PandaBoard Rev A3 and PandaBoard ES Rev B1.  In fact, I have seen different behaviors with regard to the USB/Net using an identical Linux kernel.
I finally managed to boot one of the broken PandaBoards. The cpuinfo is shown below.

> [tdz@host-6-53 omap]$ adb shell
> root@android:/ # cat /proc/cpuinfo                                             
> Processor	: ARMv7 Processor rev 10 (v7l)
> processor	: 0
> BogoMIPS	: 597.12
> 
> processor	: 1
> BogoMIPS	: 597.12
> 
> Features	: swp half thumb fastmult vfp edsp thumbee neon vfpv3 
> CPU implementer	: 0x41
> CPU architecture: 7
> CPU variant	: 0x2
> CPU part	: 0xc09
> CPU revision	: 10
> 
> Hardware	: OMAP4 Panda board
> Revision	: 0010
> Serial		: 0000000000000000
A fix for this problem is available at [1]. The respective kernel binary is available at [2]. I want to do some more testing before this goes into the official repository.

[1] https://github.com/tdz/android-kernel-omap/tree/bug-806096
[2] https://github.com/tdz/android-device-panda/tree/bug-806096
Blocks: 802317
I don't understand what the fix is for this.  [1] and [2] above look to be full repositories for Android.  If I wanted to get this fix for android on panda boards, what would I need to do?
(In reply to Joel Maher (:jmaher) from comment #23)
> I don't understand what the fix is for this.  [1] and [2] above look to be
> full repositories for Android.  If I wanted to get this fix for android on
> panda boards, what would I need to do?

Replace the file 'device/ti/panda/kernel' by the respective file from [2] and rebuild b2g.
(In reply to Thomas Zimmermann from comment #24)
> (In reply to Joel Maher (:jmaher) from comment #23)
> > I don't understand what the fix is for this.  [1] and [2] above look to be
> > full repositories for Android.  If I wanted to get this fix for android on
> > panda boards, what would I need to do?
> 
> Replace the file 'device/ti/panda/kernel' by the respective file from [2]
> and rebuild b2g.

Aka this one:

  https://github.com/tdz/android-device-panda/blob/bug-806096/kernel?raw=true

Please let me know if it doesn't boot your PandaBoard.
Comment on attachment 676784
Refined kernel binary

Thomas and I verified that his latest kernel booted great on my problematic pandaboard yesterday.

Great job, Thomas. :)
Attachment #676784 - Flags: feedback?(ctalbert) → feedback+
(In reply to Joel Maher (:jmaher) from comment #23)
> I don't understand what the fix is for this.  [1] and [2] above look to be
> full repositories for Android.  If I wanted to get this fix for android on
> panda boards, what would I need to do?

Joel, this is for the B2G Panda kernel not the Android Panda kernel. I'm not sure we can take this kernel and run Android OS on top of it. I think we'd be better off continuing to run Android OS on top of the Linaro Android builds as we have been doing.

This kernel is specifically targeted toward B2G OS on panda.
Here is the cpuinfo of a PandaBoard that always worked. It has a higher revision number and CPU performance, even though it's labelled 'ES Rev B1' like one of the broken boards.

> root@android:/ # cat /proc/cpu                                                 
> cpu/    cpuinfo 
> root@android:/ # cat /proc/cpuinfo                                             
> Processor	: ARMv7 Processor rev 10 (v7l)
> processor	: 0
> BogoMIPS	: 696.37
> 
> processor	: 1
> BogoMIPS	: 699.04
> 
> Features	: swp half thumb fastmult vfp edsp thumbee neon vfpv3 
> CPU implementer	: 0x41
> CPU architecture: 7
> CPU variant	: 0x2
> CPU part	: 0xc09
> CPU revision	: 10
> 
> Hardware	: OMAP4 Panda board
> Revision	: 0020
> Serial		: 0000000000000000
:tzimmerman, I have the same cpu revision board as the "always worked" board but had the same problems with the previous kernels.  The only way to boot them was to hold down the enter key (as Dave suggested in comment 13).  This issue seems fixed with [1] :-)

Processor       : ARMv7 Processor rev 10 (v7l)
processor       : 0
BogoMIPS        : 696.37

processor       : 1
BogoMIPS        : 699.04

Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

Hardware        : OMAP4 Panda board
Revision        : 0020
Serial          : 0000000000000000



[1] https://github.com/tdz/android-device-panda/blob/bug-806096/kernel?raw=true

sha1: f8c81b8654c2dedf2da5612466f5dcd87ebf7c21  kernel.2012110701
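For anyone downloading the binary, a minimal sketch of checking it against the sha1 quoted above. The local file name passed to `matches_expected` is whatever you saved the download as; `kernel.2012110701` is simply the name shown in the checksum line, not something the script requires.

```python
import hashlib

# Checksum quoted in the comment above.
EXPECTED_SHA1 = "f8c81b8654c2dedf2da5612466f5dcd87ebf7c21"


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA1):
    """Return True if the file's SHA-1 equals the expected digest."""
    return sha1_of_file(path) == expected.lower()
```

Usage would be something like `matches_expected("kernel.2012110701")` before copying the file over 'device/ti/panda/kernel'.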
I just reimaged all the Panda boards in chassis 3 with the new B2G image provided by Thomas.
(In reply to Van Le [:van] from comment #30)
> I just reimaged all the Panda boards in chassis 3 with the new B2G image
> provided by Thomas.

I think Van meant for that to go to 807428
:dividehex, this is the bug that Thomas linked me to regarding the b2g images in chassis 3. He uploaded a new image for me this morning.
It's the bug report here.
All I really want to know is if these boards boot without major problems.
I noticed that the content of /proc/cpuinfo changes between AOSP and Linaro environments. Here is what I get on AOSP.

> root@android:/ # cat /proc/cpuinfo                                             
> Processor	: ARMv7 Processor rev 10 (v7l)
> processor	: 0
> BogoMIPS	: 597.12
> 
> processor	: 1
> BogoMIPS	: 597.12
> 
> Features	: swp half thumb fastmult vfp edsp thumbee neon vfpv3 
> CPU implementer	: 0x41
> CPU architecture: 7
> CPU variant	: 0x2
> CPU part	: 0xc09
> CPU revision	: 10
> 
> Hardware	: OMAP4 Panda board
> Revision	: 0010
> Serial		: 0000000000000000

And this is on Linaro.

> root@android:/ # cat /proc/cpuinfo                                             
> Processor	: ARMv7 Processor rev 10 (v7l)
> processor	: 0
> BogoMIPS	: 696.37
> 
> processor	: 1
> BogoMIPS	: 699.04
> 
> Features	: swp half thumb fastmult vfp edsp thumbee neon vfpv3 
> CPU implementer	: 0x41
> CPU architecture: 7
> CPU variant	: 0x2
> CPU part	: 0xc09
> CPU revision	: 10
> 
> Hardware	: OMAP4 Panda board
> Revision	: 0020
> Serial		: 0000000000000000

The only way I can explain this is that the boot loaders do some magic here and change the board revision, and/or that the kernel command-line parameters have a significant influence.
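To make the difference between the two dumps easier to see, here is a small sketch that parses cpuinfo text and reports only the fields that differ. The sample strings are shortened excerpts of the two dumps above; the function names are made up for this illustration.

```python
def parse_cpuinfo(text):
    """Map each key to the list of values seen (keys like 'processor' repeat)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def diff_cpuinfo(text_a, text_b):
    """Return {key: (values_a, values_b)} for keys whose values differ."""
    a, b = parse_cpuinfo(text_a), parse_cpuinfo(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Shortened excerpts of the AOSP and Linaro dumps quoted above.
AOSP = """processor\t: 0
BogoMIPS\t: 597.12
processor\t: 1
BogoMIPS\t: 597.12
CPU revision\t: 10
Revision\t: 0010
"""
LINARO = """processor\t: 0
BogoMIPS\t: 696.37
processor\t: 1
BogoMIPS\t: 699.04
CPU revision\t: 10
Revision\t: 0020
"""

for key, (aosp_val, linaro_val) in diff_cpuinfo(AOSP, LINARO).items():
    print(key, aosp_val, linaro_val)
```

On these excerpts only BogoMIPS and Revision differ; the CPU revision is the same in both environments.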

The AOSP kernel command line is

> console=ttyO2,115200n8 mem=1G androidboot.console=ttyO2 androidboot.serialno=22C4000200000001 androidboot.bootloader=U-Boot_1.1.4-gedeced79

and the Linaro command line is

> console=ttyO2,115200n8 mem=1G androidboot.console=ttyO2 console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2
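The two command lines are long enough that the differences are hard to spot by eye. A rough sketch of splitting them into parameters and comparing the parameter names (the helper name is made up for this illustration; the strings are copied from above):

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: [values]}; flags get ['']."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params.setdefault(name, []).append(value)
    return params


AOSP_CMDLINE = (
    "console=ttyO2,115200n8 mem=1G androidboot.console=ttyO2 "
    "androidboot.serialno=22C4000200000001 "
    "androidboot.bootloader=U-Boot_1.1.4-gedeced79"
)
LINARO_CMDLINE = (
    "console=ttyO2,115200n8 mem=1G androidboot.console=ttyO2 "
    "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache "
    "vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 "
    "mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
)

aosp = parse_cmdline(AOSP_CMDLINE)
linaro = parse_cmdline(LINARO_CMDLINE)

only_in_aosp = sorted(set(aosp) - set(linaro))
only_in_linaro = sorted(set(linaro) - set(aosp))

print("only in AOSP:  ", only_in_aosp)
print("only in Linaro:", only_in_linaro)
print("mem (AOSP):    ", aosp["mem"])
print("mem (Linaro):  ", linaro["mem"])
```

Among other things, this shows that earlyprintk appears only on the Linaro command line and that the two environments pass different mem= settings.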
It's also possible for the bootloader to pass information to the kernel through ATAGS, and there is also a register that's loaded with a machine type.

http://lxr.linux.no/#linux+v3.0.8/arch/arm/kernel/head.S#L57


The Revision seems to come from the system_rev field here (this struct also appears to contain the serial number):
http://lxr.linux.no/#linux+v3.0.8/arch/arm/kernel/compat.c#L32
Ah, thanks. The MAC is generated from a register. The commit is

  https://github.com/tdz/android-kernel-omap/commit/a858bc6d22d6f50ecbaa85f977e03aefa6da9084

If a boot-loader supplied value is relevant here, it must be written into this register. However, I haven't been able to find any code which does that.

I wish I knew the actual MAC address on my PandaBoard. Do you know how to find out the Ethernet adapter's MAC from sysfs? Ifconfig doesn't work for me.
cat /sys/class/net/eth0/address FYI
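The same sysfs lookup from a script, as a minimal sketch. The interface name eth0 comes from the command above; the `sysfs_root` parameter is only there so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_mac(interface="eth0", sysfs_root="/sys/class/net"):
    """Return the MAC address string that sysfs reports for an interface."""
    return (Path(sysfs_root) / interface / "address").read_text().strip()
```

Calling `read_mac()` on the board should return the same string as the cat command.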
Attachment #676130 - Flags: feedback?(ctalbert)
I'm going to go ahead and call this fixed.  Both Clint and I have verified the new kernel works and it's been merged into the mainline.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED