Closed Bug 810045 Opened 12 years ago Closed 12 years ago

Image the pandas in chassis 3 with a new kernel image for b2G

Categories

(Infrastructure & Operations :: RelOps: General, task)

ARM
Gonk (Firefox OS)
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
B2G C2 (20nov-10dec)

People

(Reporter: cmtalbert, Assigned: dividehex)

References

Details

(Whiteboard: [reit-b2g])

Attachments

(5 files)

We did this yesterday.  However, the image boots but does not get on the network.

We have been unable to get the image on the network or to determine what the actual cause is. If we could get the full serial log output from one of the pandas during its boot process we might be able to understand what is happening.

Right now, my theory is that the new kernel may calculate the static mac address differently and therefore the panda is coming up with a different mac address that DHCP isn't recognizing and therefore they aren't getting the IP addresses we would expect them to have.

If we could get the full serial output we could easily disprove this theory and maybe find some indication as to what is really happening.
The plan is to put the uboot image on these devices tomorrow (in fact, that was the plan yesterday -- I'm not clear on why the B2G image got installed instead).

Once that's installed and working, we can experiment with different B2G images.
Assignee: server-ops-releng → jwatkins
I've just seen that AOSP's boot loader passes the board's serial number via the kernel command line. The Linaro-based image I gave you does not. This is likely the reason for the network problem. I don't need this functionality, so i didn't think of it. Let's hope this is fixes problem.
I've also noticed the files under /system have clobbered file permissions.  Some files are not setuid that should be.

When building for the linaro setup, make sure to output tarball.  copying out/target/product/panda/system doesn't retain the correct file perms.  in fact, it looks like out/target/product/panda/system doesn't have the correct file perms at all.  Same goes for the userdata.

use:

build.sh boottarball systemtarball userdatatarball

and use tar with -ps when extracting to the sdcard to maintain the correct owners/perms
Thanks for pointing out the tarball parameters, I didn't know they exists. I have a script for copying all data to the respective partitions and fixing the worst problems.
Attached file serial output
I dropped the Nov9 build on a couple of pandas in chassis 3 and immediately noticed the same issue with the network.  From packet sniffing the dhcp server I can see they never make a dhcp request on eth0 during b2g startup.  The only request come from uboot when it exec the bootp command.  This leads to a brief time frame of ping replies but stops when the b2g kernel takes over.  As soon as the serial shell was up I could check netcfg and see that eth0 was up and had a unique mac (but identical to the U-boot mac).  When issued a 'netcfg eth0 dhcp', it gets an ip immediately and remains stable and ping able.  I experienced the same behavior running this build on my home network over the weekend.

Good news is it was stable all weekend.
Also note: the current B2G build doesn't request an IP correctly on startup.

This is noted here: https://bugzilla.mozilla.org/show_bug.cgi?id=810045#c5
Hi Jake,

So the good news is that it works in principle, but needs to be configured automatically.

I recently working on cleaning up the network setup in the PandaBoard's init.*.rc files. Do you have the change in

  https://github.com/tdz/android-device-panda/commit/3c89a64b65390ff3fac0fc07173e10e5746a171a

in your tree? We tried to verify that it doesn't break anything, but who knows.

Best regards
Thomas
> I recently working
'Recently I was working'
sorry, c#6 was for another bug
(In reply to Thomas Zimmermann from comment #7)

> Do you have the change in
> 
>  
> https://github.com/tdz/android-device-panda/commit/
> 3c89a64b65390ff3fac0fc07173e10e5746a171a
> 
> in your tree? We tried to verify that it doesn't break anything, but who
> knows.

I can't say with authority since Clint provided the build artifacts but from inspecting the init.omap4pandaboard.rc, it *does* look like that commit is applied to this build.
Attached file eth0 missing - dmesg
I just had the case where eth0 was not present after boot. The Ethernet NIC's driver produces messages like

> [    5.027954] smsc95xx v1.0.4

> [    5.121032] smsc95xx 1-1.1:1.0: eth0: register 'smsc95xx' at usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, 2e:60:d3:3e:4e:01

> [   62.174407] smsc95xx 1-1.1:1.0: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

in the dmesg output. From the attached dmesg you can see that the driver for the Ethernet NIC is not initialized.

I can't reproduce the problem. I booted the kernel from the older commit 13b71c564331e3d7c8a7509eff0f7bfd4fc17f13, which started the NIC. And now it seems to work even with the latest kernel binary.
Sounds like dirty initial device state.  Are you physically power-cycling your device, or command-line rebooting it?  Production pandas will almost always be started from a power-cycle.
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> Sounds like dirty initial device state.  Are you physically power-cycling
> your device, or command-line rebooting it?  Production pandas will almost
> always be started from a power-cycle.

Actually I tried both, but it doesn't make a difference: once the NIC was initialized it worked reliably. The best I can do is to check this tomorrow in the morning after I turned on the board for the first time.
I switched on my PandaBoard this morning and networking worked out-of-the-box.

Could you add logwrapper around netcfg in device/ti/panda/init.omap4pandaboard.rc? This will add netcfg's output to logcat. The respective patch is attached.

And could you set 'loglevel' in out/target/product/panda/root/init.rc to 8? This will make init produce a lot more logging on the console.
> Could you add logwrapper around netcfg in
> device/ti/panda/init.omap4pandaboard.rc? This will add netcfg's output to
> logcat. The respective patch is attached.
> 
> And could you set 'loglevel' in out/target/product/panda/root/init.rc to 8?
> This will make init produce a lot more logging on the console.

:Tzimmermann
I've updated the artifacts with these changes.  When I get to the DC today, I'll deploy the update and grab the logcat and serial dump for you to examine.
:tzimmermann

A tcpdump showed the dhcp server was not receiving any dhcp packets from this board while b2g was starting.  It did however successfully obtain an IP from the u-boot script before turning over to the b2g kernel on the sdcard.
ctalbert, jgriffin,
Can I assume that the problems in this bug affect the imaging to b2g pandas by bmm?
Should this be blocking bug 802317 as well?

According to dividehex: "bmm happily puts b2g on a board.  b2g just never comes up.  the issue is in the way b2g starts the netcfg_eth0 service. maybe a timing issue"
Blocks: 802317
In order to have any hope of getting gaia smoke tests in TBPL by Dec 10, this will need to be resolved by early next week.
Target Milestone: --- → B2G C2 (20nov-10dec)
(In reply to Jake Watkins [:dividehex] from comment #16)
> Created attachment 683827 [details]
> serial console dump lvl 8 w/logwrapper
> 
> :tzimmermann
> 
> A tcpdump showed the dhcp server was not receiving any dhcp packets from
> this board while b2g was starting.  It did however successfully obtain an IP
> from the u-boot script before turning over to the b2g kernel on the sdcard.

I looked through the logs you provided. Thanks.

> [    8.340515] init: command 'ifup' r=0
Maybe the NIC gets started here.

> [    8.706146] init: starting 'netcfg_eth0'
Netcfg gets started here.

> [   11.312377] smsc95xx 1-1.1:1.0: eth0: link up, 100Mbps, full-duplex, lpa 0xC5E1
This is when the NIC is ready.

Maybe netcfg runs to early. :/

> D/DHCP    ( 1273): ===== DHCP message:
> D/DHCP    ( 1273): op = BOOTREQUEST (1), htype = 1, hlen = 6, hops = 0
> D/DHCP    ( 1273): xid = 0x3e250000 secs = 0, flags = 0x8000 optlen = 14
> D/DHCP    ( 1273): ciaddr = 0.0.0.0
> D/DHCP    ( 1273): yiaddr = 0.0.0.0
> D/DHCP    ( 1273): siaddr = 0.0.0.0
> D/DHCP    ( 1273): giaddr = 0.0.0.0
> D/DHCP    ( 1273): chaddr = { 2e  60  cd  71  4e  01 }
> D/DHCP    ( 1273): sname = ''
> D/DHCP    ( 1273): file = ''
> D/DHCP    ( 1273): op 53 len 1 { 01 } discover
> D/DHCP    ( 1273): op 55 len 4 { 01  03  06  1c } 

This is the board's DHCP request. It's BOOTP actually, but this should not make much of a difference here. You should probably check if your DHCP server receives this message. Maybe the switches or routers are misconfigured.

> D/DHCP    ( 1273): ===== DHCP message:
> D/DHCP    ( 1273): op = BOOTREPLY (2), htype = 1, hlen = 6, hops = 0
> D/DHCP    ( 1273): xid = 0xa7df2ae0 secs = 0, flags = 0x8000 optlen = 64
> D/DHCP    ( 1273): ciaddr = 0.0.0.0
> D/DHCP    ( 1273): yiaddr = 10.12.50.92
> D/DHCP    ( 1273): siaddr = 10.12.48.10
> D/DHCP    ( 1273): giaddr = 0.0.0.0
> D/DHCP    ( 1273): chaddr = { 34  15  9e  18  17  bc }
> D/DHCP    ( 1273): sname = ''
> D/DHCP    ( 1273): file = 'linux-install/pxelinux.0'
> D/DHCP    ( 1273): op 53 len 1 { 05 } ack
> D/DHCP    ( 1273): op 54 len 4 { 0a  0c  30  0b } 
> D/DHCP    ( 1273): op 51 len 4 { 00  01  51  80 } 
> D/DHCP    ( 1273): op 1 len 4 { ff  ff  f8  00 } 
> D/DHCP    ( 1273): op 15 len 22 {build.scl1.mozilla.com} 
> D/DHCP    ( 1273): op 3 len 4 { 0a  0c  30  01 } 
> D/DHCP    ( 1273): op 6 len 4 { 0a  0c  30  13 } 
> D/DHCP    ( 1273): netcfg: Wrong Xid 0xa7df2ae0 != 0x3e250000
> D/DHCP    ( 1273): invalid reply

This is a reply from the DHCP server. The xid is different. According to the DHCP RFC [1], it's a random transaction number that is chosen by the client. So this is probably the reply for some other board. However the DHCP server seems to support BOOTP requests, and not discard them.

Regards
Thomas

[1] http://www.ietf.org/rfc/rfc2131.txt
 
> > [    8.340515] init: command 'ifup' r=0
> Maybe the NIC gets started here.
> 
> > [    8.706146] init: starting 'netcfg_eth0'
> Netcfg gets started here.
> 
> > [   11.312377] smsc95xx 1-1.1:1.0: eth0: link up, 100Mbps, full-duplex, lpa 0xC5E1
> This is when the NIC is ready.
> 
> Maybe netcfg runs to early. :/

The order is the same on my PandaBoard, but I got

> [   20.640808] init: process 'netcfg_eth0', pid 1268 exited

even later. Since I cannot see such a message in the logs you provided, I 'd suspect netcfg is still running because it didn't receive a valid response from the server. All the error messages about incorrect xids support this.
Whiteboard: [reit-b2g]
(In reply to Thomas Zimmermann from comment #20)

> The order is the same on my PandaBoard, but I got
> 
> > [   20.640808] init: process 'netcfg_eth0', pid 1268 exited
> 
> even later. Since I cannot see such a message in the logs you provided, I 'd
> suspect netcfg is still running because it didn't receive a valid response
> from the server. All the error messages about incorrect xids support this.

Thomas, if that is the case, what is our strategy for mitigating it?
Well, are you sure that the problem is really on the PandaBoard and not somewhere in the network?

Then we could try to run netcfg later during the boot process. Android's init is quite limited, but maybe I'll get something to work here.
We're fairly certain, yes -- the PXE boot process uses DHCP as well, and works just fine.  It's unlikely that anything in the network would allow *some* bootp packets but not others -- we don't do that level of filtering.  We do, however, use dhcp helpers, which may fail on an invalid packet.

It sounds like we need to get a dump of packets sent by the panda and received by dhcpd at the same time.  If we can also get logs from the firewalls (where dhcp helpers are running), that would be even better.
I don't see anything interesting in the firewall logs for the 20th (when Jake captured the serial output).
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> We're fairly certain, yes -- the PXE boot process uses DHCP as well, and
> works just fine.  It's unlikely that anything in the network would allow
> *some* bootp packets but not others -- we don't do that level of filtering. 
> We do, however, use dhcp helpers, which may fail on an invalid packet.
> 
> It sounds like we need to get a dump of packets sent by the panda and
> received by dhcpd at the same time.  If we can also get logs from the
> firewalls (where dhcp helpers are running), that would be even better.

Could you provide a tcpdump of the packets?
I can, but not synchronized with a serial dump.
Attached file panda-0045.tcpdump
From a mozpool-managed install of
  http://people.mozilla.com/~jwatkins/bmm/artifacts/b2g-2012112000/
on panda-0045.build.scl1.mozilla.com (10.12.52.147).  admin1a.build.scl1.mozilla.com is the DHCP server, and the host where I ran the dump.  It's also the NTP server.

Here are the mozpool logs for this process.  Timestamps are UTC.  There are two attempts to install here - one beginning at 15:37:34 and the next at 15:44:02.  In each case, the device enters state 'b2g_rebooting' just before it invokes 'reboot' locally to boot into the B2G environment.  During that boot process, the u-boot loader gets an IP and polls the TFTP server, but does not find any pxe configuration (it was cleared at 15:38:44), so it falls back to a local boot.

2012-11-27T15:37:34 statemachine writing pxe_config panda-b2g.1, boot config {"version":1,"b2gbase":"http://people.mozilla.com/~jwatkins/bmm/artifacts/b2g-2012112000/"} to db
2012-11-27T15:37:34 statemachine entering state start_pxe_boot
2012-11-27T15:37:34 statemachine entering state pxe_power_cycling
2012-11-27T15:37:34 bmm setting PXE config to 'panda-b2g.1'
2012-11-27T15:37:34 bmm initiating power cycle
2012-11-27T15:37:37 statemachine entering state pxe_booting
2012-11-27T15:38:44 bmm clearing PXE config
2012-11-27T15:38:44 statemachine entering state b2g_downloading
2012-11-27T15:39:15 statemachine entering state b2g_extracting
2012-11-27T15:40:15 statemachine entering state b2g_rebooting
2012-11-27T15:42:22 statemachine entering state b2g_pinging
2012-11-27T15:42:23 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:42:42 statemachine entering state b2g_pinging
2012-11-27T15:42:43 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:43:02 statemachine entering state b2g_pinging
2012-11-27T15:43:03 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:43:22 statemachine entering state b2g_pinging
2012-11-27T15:43:23 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:43:42 statemachine entering state b2g_pinging
2012-11-27T15:43:43 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:44:02 statemachine entering state pxe_power_cycling
2012-11-27T15:44:02 bmm setting PXE config to 'panda-b2g.1'
2012-11-27T15:44:02 bmm initiating power cycle
2012-11-27T15:44:05 statemachine entering state pxe_booting
2012-11-27T15:45:13 bmm clearing PXE config
2012-11-27T15:45:13 statemachine entering state b2g_downloading
2012-11-27T15:45:43 statemachine entering state b2g_extracting
2012-11-27T15:46:43 statemachine entering state b2g_rebooting
2012-11-27T15:48:53 statemachine entering state b2g_pinging
2012-11-27T15:48:53 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:49:13 statemachine entering state b2g_pinging
2012-11-27T15:49:13 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:49:33 statemachine entering state b2g_pinging
2012-11-27T15:49:33 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:49:53 statemachine entering state b2g_pinging
2012-11-27T15:49:53 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:50:13 statemachine entering state b2g_pinging
2012-11-27T15:50:13 bmm ping of panda-0045.build.scl1.mozilla.com complete: failed
2012-11-27T15:50:33 statemachine entering state pxe_power_cycling
2012-11-27T15:50:33 bmm setting PXE config to 'panda-b2g.1'
2012-11-27T15:50:33 bmm initiating power cycle
2012-11-27T15:50:36 statemachine entering state pxe_booting
2012-11-27T15:51:43 bmm clearing PXE config
2012-11-27T15:51:43 statemachine entering state b2g_downloading
Thanks!

And could you do a 

  cat /sys/class/net/eth0/operstate

On my board it returns 'unknown' after booting finished, but when I unplug and plug-in the Ethernet cable, it returns 'up'. I don't know if this is related, but I'd expect it to contain 'up' as soon as I activated the device.
I can't because I'm not onsite.  The serial console isn't remotely accessible.

This suggests a profitable direction to investigate here -- if netcfg is trying to send packets before link is up, or even just after link is up (but perhaps in the few ms before the switch port is fully operational), that might explain their disappearing into the ether.
This issue is caused by a bug in the way dhcpclient.c, which is called by netcfg, handles invalid reply packets.  When netcfg is called in init, it begins initializing the iface and then starts the dhcp request process before the link has enough time to come up.  It is able to send out a discovery packet which never reaches the switch while in this time window.  When the link comes up it begins polling for replies and processing them.  Since there is a lot of dhcp traffic on the vlans we have tested on, the client is consistently throwing out replies with mismatched xid and returning to poll for replies rather than properly timing out and retransmitting a discovery.

The fix for this would be to call 'ifconfig eth0 up' to begin the iface initialization, check for an up state in /sys/class/net/eth0/operstate and then run 'netcfg eth0 dhcp' when the up state condition is met.  The problem here, as Thomas pointed out, is after it has been initialized and the link is up it still doesn't report it correctly under /sys.  We could use a sleep() call pad the time but that would be a poor workaround.

The other fix is to use dhcpcd instead of netcfg.  dhcpcd does a far better job at handling iface initialization and handling time outs between discovery requests.

service dhcpcd_eth0 /system/bin/dhcpcd eth0
    class main
    oneshot

I've tested this in MTV and on vlan48 in SCL1 and it works.
I see that the code has landed:
https://github.com/mozilla-b2g/android-device-panda/commit/63196b48f479c30787863c8ca89456acd4b03910

When will the change be part of the panda builds?
https://tbpl.mozilla.org/?jobname=panda

I see this line:
08:24:02     INFO -    <project name="android-device-panda" path="device/ti/panda" remote="b2g" revision="599e736cb24dff10b949edcac3a6e5a76bce2144"/>
which seems to point to:
https://github.com/mozilla-b2g/android-device-panda/commit/599e736cb24dff10b949edcac3a6e5a76bce2144

That seems to be from Nov. 12th.
Is there a manifest that needs to be updated? Or merge a specific repo?

If this could land today we could have it running on releng's side either EOD or on Monday.

For now we can hard-code where to grab the artifacts from.
Blocks: 818264
(In reply to Armen Zambrano G. [:armenzg] from comment #32)
> I see that the code has landed:
> https://github.com/mozilla-b2g/android-device-panda/commit/
> 63196b48f479c30787863c8ca89456acd4b03910
> 
> When will the change be part of the panda builds?
> https://tbpl.mozilla.org/?jobname=panda
> 
> I see this line:
> 08:24:02     INFO -    <project name="android-device-panda"
> path="device/ti/panda" remote="b2g"
> revision="599e736cb24dff10b949edcac3a6e5a76bce2144"/>
> which seems to point to:
> https://github.com/mozilla-b2g/android-device-panda/commit/
> 599e736cb24dff10b949edcac3a6e5a76bce2144
> 
> That seems to be from Nov. 12th.
> Is there a manifest that needs to be updated? Or merge a specific repo?
> 
> If this could land today we could have it running on releng's side either
> EOD or on Monday.
> 
> For now we can hard-code where to grab the artifacts from.

I found out that for this to make it into the releng systems we have to update the b2g snapshot.

I filed bug 818264 to take care of it.
I'm going to go ahead and call this fixed since bug 818264 is on file to get the update taken care of.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
No longer blocks: 815785
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: