Closed Bug 1439570 Opened 6 years ago Closed 6 years ago

Run JS test suite on ARM64 hardware

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

ARM64
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lth, Unassigned)

References

Details

(Whiteboard: [geckoview:fxr:p2])

Currently we run the JS test suite on the ARM64 simulator.  The simulator is pretty good, but there are aspects of the hardware it does not simulate accurately (atomics; instruction cache non-coherence; fault handling; address space layout). The first run of the test suite on hardware found bugs (bug 1430743), and other bugs have been observed on hardware as well.

An ARM64 VM running on top of x64 hardware is not likely to be a substantial improvement over the simulator.

We should therefore run tests on actual ARM64 hardware; just running the JS shell tests (js/src/tests, js/src/jit-tests; jsapi-tests) would be a good start.  We should run multiple configurations but there's no Ion JIT so configs will be nonstandard (--no-baseline; --baseline-eager; defaults).  We should run debug and release builds.
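
For illustration only, the combinations described above could be enumerated along these lines. This is a rough sketch, not an existing harness: the suite names, build types, and flag spellings come from this comment, while `test_matrix` and the dictionary layout are hypothetical. Whether every flag applies to every suite (jsapi-tests in particular) is an open question here, not something the sketch settles.

```python
from itertools import product

# Suites, build types, and shell flags as named in the comment above.
SUITES = ["js/src/tests", "js/src/jit-tests", "jsapi-tests"]
BUILDS = ["debug", "release"]
# An empty list stands for the default configuration (no extra flags).
FLAG_SETS = [[], ["--no-baseline"], ["--baseline-eager"]]


def test_matrix():
    """Return every (suite, build, flags) combination as a list of dicts.

    Hypothetical helper: it only enumerates combinations, it does not run anything.
    """
    return [
        {"suite": suite, "build": build, "flags": flags}
        for suite, build, flags in product(SUITES, BUILDS, FLAG_SETS)
    ]


if __name__ == "__main__":
    for job in test_matrix():
        flags = " ".join(job["flags"]) or "(defaults)"
        print(job["build"], job["suite"], flags)
```

Three suites, two build types, and three flag sets give 18 jobs per push, which is a useful number to have in hand when sizing a statically provisioned worker pool.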

The platform doesn't need to be Android; Linux should be OK for now.  A Raspberry Pi 3 might be OK + is cheap but is slow; I use a SoftIron Overdrive 1000 dev system which is faster but more expensive and also runs OpenSUSE, which is somewhat painful.
This doesn't belong in the Build Config component, but I'm not sure where the right place to put it is, so I'm going to stick it in RelEng for now.

Any solution involving physical hardware is probably not going to be tractable--we had Pandaboards (TI arm dev boards) racked in a datacenter years ago and they were quite the headache, and the useful lifespan of physical hardware just isn't that great. Even if we could acquire the several hundred to 1000+ machines we'd need to run tests that keep up with our CI volume, it would take us months to a year to get them installed and supported, and they'd almost certainly be obsolete within 2 years.

That being said, there are several providers offering ARMv8 cloud computing:
https://www.scaleway.com/armv8-cloud-servers/
https://www.packet.net/bare-metal/servers/type-2a/

are two I've seen before. Those both use Cavium ThunderX CPUs: https://cavium.com/product-thunderx-arm-processors.html

If that would work for JS testing purposes then I think that's a thing we could do.
Component: Build Config → Platform Support
Product: Core → Release Engineering
QA Contact: catlee
Anything that runs on real hardware is fine; virtualization should not in itself be a hindrance, only emulation on top of another architecture.
Whiteboard: [geckoview:crow]
This feels like it will need Taskcluster platform support. Over to Coop for triage.
Flags: needinfo?(coop)
I'm talking to jmaher about our options for aarch64 cloud testing or real devices.
We've had two new requests in the past week for packet.net capacity. cc-ing Jonas who has been spearheading that effort.

I want to be clear that our work with packet.net is still very much at the prototype stage. We still need to design and implement docker-engine support for the tc-worker in order to run real workloads. We also don't have provisioner support for packet.net yet, so any worker pool would need to be provisioned statically.

As Joel (already cc-ed) will tell you, switching to a new machine or instance type is only one step in the process. If someone on the JS team is available to perform validation of the tests, the Taskcluster team (Jonas or someone else) can help you get set up with an instance or two to verify that your tests will, in fact, run in packet.net. From there, you can start getting a baseline of results for fixing or disabling specific failing test cases.
Flags: needinfo?(coop)
I can certainly help out with anything from the JS team side, or corral suitable help for ditto.
See Also: → 1425322
This sounds like you want per-task docker containers on arm hardware.

We don't have a docker-engine for tc-worker quite ready yet. But ckousik (awesome contributor) has been working on one, and me + wcosta have plans to talk to him Friday and see if we can figure something out. It's also possible that we can deploy docker-worker on packet, or that we simply use a tc-worker configuration without any task isolation.

This assumes that you guys are happy with running a command within a docker image, having task fail/succeed depending on exit code, and logs uploaded, but otherwise with no or very limited support for artifacts.
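
To make that contract concrete, here is a minimal sketch of "run a command, pass or fail on exit code, keep the log". It is not how tc-worker or docker-worker is implemented, and it leaves out the docker part entirely; `run_task` is a hypothetical name and the status strings are made up for illustration.

```python
import subprocess
import sys


def run_task(command):
    """Run a command, capture stdout and stderr together, and map the exit code to a status.

    Hypothetical helper: a real worker would also handle isolation, timeouts,
    and uploading the log somewhere.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same log stream
        text=True,
    )
    status = "succeeded" if result.returncode == 0 else "failed"
    return status, result.stdout


if __name__ == "__main__":
    status, log = run_task([sys.executable, "-c", "print('hello from the task')"])
    print(status)
    print(log, end="")
```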

We have no dynamic provisioning, but that might be okay, depending on the load.
Similarly, we have no caching of artifacts in packet yet either, which might incur notable bandwidth cost if test tasks are heavily chunked, as we would be paying 0.12 USD/GB for download. (Note: when we first went multi-region in EC2, cross-region transfer at 0.02 USD/GB dominated our EC2 bill pretty quickly, so this might be worth back-of-envelope math, just to be sure.)
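
The back-of-envelope math could look something like the sketch below. Only the 0.12 USD/GB rate comes from this comment; the task count and download size per task are placeholder assumptions to be replaced with real numbers, and `monthly_transfer_cost` is a hypothetical helper.

```python
def monthly_transfer_cost(tasks_per_day, gb_per_task, usd_per_gb=0.12, days=30):
    """Estimate the monthly download cost in USD when nothing is cached.

    Every task is assumed to download its full set of artifacts each time.
    """
    return tasks_per_day * gb_per_task * usd_per_gb * days


if __name__ == "__main__":
    # Placeholder assumptions, not measured values: 500 test tasks per day,
    # each downloading 0.5 GB of build artifacts.
    cost = monthly_transfer_cost(tasks_per_day=500, gb_per_task=0.5)
    print(f"approx. {cost:.0f} USD/month")
```

With those placeholder inputs the estimate comes out at roughly 900 USD per month. Because the cost scales linearly with the number of chunks, heavier chunking without caching raises the bill proportionally.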
Depends on: 1440330
No longer blocks: Rabaldr-ARM64
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Lars, we are standing up jittests (bug 1475648) to run on ARM64 builds on Google Pixel 2 devices in Bitbar's device farm. Will that be adequate to address this request for testing on real ARM64 hardware?

[geckoview:fxr:p2] because Firefox Reality 1.0 will not include ARM64 support.
Flags: needinfo?(lhansen)
Whiteboard: [geckoview:crow] → [geckoview:fxr:p2]
Testing on Pixel2 should be a major improvement over the current situation and it's a mainstream platform, so yes, I think that should satisfy the request for testing on real ARM64 hardware.
Flags: needinfo?(lhansen)
The jit tests have been running on Android 8.0 Pixel2 AArch64 for mozilla-central opt as a tier-3 job for some time.

<https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&superseded=&tier=1,2,3&searchStr=Android,8.0,Pixel2,AArch64,opt,jit>

We have long standing failures in jit5,jit6,jit10 that haven't been addressed.

Can we call this resolved and move on?
Flags: needinfo?(lhansen)
(In reply to Bob Clary [:bc:] from comment #10)

> Can we call this resolved and move on?

Works for me.  Chris?
Flags: needinfo?(lhansen) → needinfo?(cpeterson)
(In reply to Bob Clary [:bc:] from comment #10)
> We have long standing failures in jit5,jit6,jit10 that haven't been
> addressed.
> 
> Can we call this resolved and move on?

OK, since we have bug 1475648 on file for those jit5/6/10 test failures.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(cpeterson)
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard