Open Bug 1946605 Opened 16 days ago Updated 4 days ago

Poor translation quality with new lines in the text

Status:

NEW

People

(Reporter: epavlov, Unassigned)

References

(Blocks 1 open bug)

Details

Evgeny Pavlov

Reporter

Description

•

16 days ago

Translating the first paragraph of https://www.freebsd.org/ from English to Russian produces nonsense. When I copy the same text into about:translations, the translation is perfect. After I edited the HTML to remove the newlines from the text, the translation quality became good.

Would it make sense to remove newlines inside paragraphs before translating, to prevent such cases?

FreeBSD is an operating system used to power modern servers,
desktops, and embedded <a href="https://www.freebsd.org/platforms/">platforms.</a> A large
<a href="https://docs.FreeBSD.org/en/articles/contributors/#staff-committers">
community</a> has continually developed it for more than thirty
years. Its advanced networking, security, and storage features have
made FreeBSD the platform of choice for many of the <a href="https://docs.FreeBSD.org/en/books/handbook/introduction/#introduction-nutshell-users">
busiest web sites</a> and most pervasive embedded networking and
storage devices.

Flags: needinfo?(enordin)

Erik Nordin [:nordzilla]

Comment 1

•

15 days ago

This is a good callout.

Here is the raw HTML as it is written on the site, with exaggerated spacing between the lines:

<p>FreeBSD is an operating system used to power modern servers,

desktops, and embedded <a href="https://www.freebsd.org/platforms/">platforms.</a> A large

<a href="https://docs.FreeBSD.org/en/articles/contributors/#staff-committers">

community</a> has continually developed it for more than thirty

years. Its advanced networking, security, and storage features have

made FreeBSD the platform of choice for many of the <a href="https://docs.FreeBSD.org/en/books/handbook/introduction/#introduction-nutshell-users">

busiest web sites</a> and most pervasive embedded networking and

storage devices.</p>

I also see degraded quality in the Full-Page Translation from English to Spanish, though not complete nonsense. It is likewise fixed by removing the explicit newlines from the <p> element within the HTML.

Below is a diff where the "removed" items are the Full-Page translation sentences and the "added" items are the about:translations sentences from the copied text.

-FreeBSD es un sistema operativo utilizado para alimentar servidores modernos, de escritorio y plataformas integradas.
-Un gran comunidad lo ha desarrollado continuamente durante más de treinta años años.
-Sus avanzadas funciones de redes, seguridad y almacenamiento tienen hizo de FreeBSD la plataforma de elección para muchos de los los sitios web más ocupados y la red integrada más omnipresente y dispositivos de almacenamiento.

+FreeBSD es un sistema operativo utilizado para alimentar servidores modernos, escritorios y plataformas integradas.
+Una gran comunidad lo ha desarrollado continuamente durante más de treinta años.
+Sus avanzadas funciones de networking, seguridad y almacenamiento han convertido a FreeBSD en la plataforma de elección para muchos de los sitios web más ocupados y dispositivos de red y almacenamiento integrados más omnipresente.

My first thought is that I don't want to add more arbitrary text cleaning rules to translations-document, and this appears to me to be another point in favor of prioritizing separating out the HTML concerns from the inference engine.

However, even when I pull the textContent string from the <p> element I get this:

"FreeBSD is an operating system used to power modern servers,
desktops, and embedded platforms. A large

community has continually developed it for more than thirty
years. Its advanced networking, security, and storage features have
made FreeBSD the platform of choice for many of the 
busiest web sites and most pervasive embedded networking and
storage devices."

I'm still hesitant to add more arbitrary text cleaning rules to the algorithm; I would rather fix this with the full HTML separation.

Greg, do you have any thoughts on this?

Flags: needinfo?(enordin) → needinfo?(gtatum)

Greg Tatum [:gregtatum]

Comment 2

•

11 days ago

This is a bug in the HTML parsing code inside of the inference engine. I'm guessing it stems from the sentence splitter treating each newline as a separate sentence. It might take some time to fully replace this with our own custom solution, so I think it's worth doing a quick fix. There's a distinction between the markup returned by innerHTML and what is presented to the end user. A block element that is being translated should ideally have its whitespace normalized in a similar fashion to how the user will see it.
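To illustrate the suspected failure mode, here is a minimal sketch. It assumes a toy splitter that treats every newline as a sentence boundary; `naiveSplit`, `collapseWhitespace`, and `splitOnPunctuation` are hypothetical helpers for illustration, not the engine's actual code. Using the textContent string from Comment 1, the newline-based split yields seven fragments, while collapsing whitespace first yields the three real sentences.

```javascript
// Hypothetical toy splitter: NOT the engine's real sentence splitter.
// It treats every newline as a hard boundary (the suspected failure mode).
function naiveSplit(text) {
  return text
    .split("\n")
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}

// Collapse runs of whitespace (including newlines) into single spaces,
// roughly how a block is presented with the default `white-space: normal`.
function collapseWhitespace(text) {
  return text.replace(/[ \t\r\n\f]+/g, " ").trim();
}

// Toy rule: split at a space that follows sentence-ending punctuation.
function splitOnPunctuation(text) {
  return text.split(/(?<=[.!?]) /);
}

// The textContent string quoted in Comment 1, one array entry per line.
const raw = [
  "FreeBSD is an operating system used to power modern servers,",
  "desktops, and embedded platforms. A large",
  "",
  "community has continually developed it for more than thirty",
  "years. Its advanced networking, security, and storage features have",
  "made FreeBSD the platform of choice for many of the ",
  "busiest web sites and most pervasive embedded networking and",
  "storage devices.",
].join("\n");

const fragments = naiveSplit(raw); // seven partial "sentences"
const sentences = splitOnPunctuation(collapseWhitespace(raw)); // three sentences

console.log(fragments.length, sentences.length);
```

Each fragment on its own lacks the context a translation model needs, which would be consistent with the degraded output seen in Full-Page Translation.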

This article talks more about how whitespace is handled when rendering the page: https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace

The only tag that I'm aware of where whitespace matters is the <pre> tag. If we're outside of the pre tag I could see two places to handle this normalization.

§1. Walk the block's subtree and normalize the whitespace of the TextNodes, following the same behavior as when the text is rendered. This would mean changing newlines to spaces and collapsing consecutive whitespace. If we did this, then the innerHTML would be correct.

https://searchfox.org/mozilla-central/rev/f89fa2093be1cedd708fb8c5538df98ba73f4456/toolkit/components/translations/content/translations-document.sys.mjs#1532

Or we could do:

§2. In the text normalization step inside of the engine, we could replace newlines with whitespace. We might need to pass in the outer tag as well to toggle this behavior on/off to support <pre> tags, as the engine doesn't know about the surrounding tag.

https://searchfox.org/mozilla-central/rev/f89fa2093be1cedd708fb8c5538df98ba73f4456/toolkit/components/translations/content/translations-engine.worker.js#95-120
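As a rough sketch of option §2, the engine-side cleaning step could take a flag from the caller. `cleanTextForEngine` and its `preserveWhitespace` option are hypothetical names, not the actual function or bindings in translations-engine.worker.js. In addition to replacing newlines, this version collapses the neighboring spaces so that no double spaces are left behind.

```javascript
// Hypothetical engine-side cleaning step (names are assumptions).
// `preserveWhitespace` would be set by the caller when the text comes from
// inside a <pre> element, since the engine cannot see the surrounding tag.
function cleanTextForEngine(sourceText, { preserveWhitespace = false } = {}) {
  if (preserveWhitespace) {
    // Leave <pre> content untouched: its newlines are meaningful.
    return sourceText;
  }
  // Turn each run of spaces, tabs, and newlines into a single space.
  return sourceText.replace(/[ \t\r\n\f]+/g, " ");
}

console.log(cleanTextForEngine("many of the \nbusiest web sites"));
console.log(cleanTextForEngine("line 1\n  line 2", { preserveWhitespace: true }));
```

The cost of this approach is the extra parameter that has to travel from the document code to the engine.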

Flags: needinfo?(gtatum)

Greg Tatum [:gregtatum]

Updated

•

11 days ago

Severity: -- → S3

Erik Nordin [:nordzilla]

Comment 3

•

10 days ago

•

Edited

I think my vote would be for §1 since our ultimate goal is to move the HTML parsing out of the engine entirely. Updating the bindings to include a tag so that we can handle it within the engine feels like a step in the other direction, though definitely a workable solution in and of itself.
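For reference, a minimal sketch of what §1 could look like. `normalizeTextNodes` is a hypothetical helper, not code from translations-document.sys.mjs, and it relies only on the standard DOM properties `nodeType`, `nodeName`, `nodeValue`, and `childNodes`. It skips <pre> subtrees by tag name only; elements styled with CSS `white-space: pre` would need separate handling.

```javascript
// Numeric values match Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: recursively rewrites the text nodes under `node`,
// turning newlines into spaces and collapsing runs of whitespace.
// Subtrees rooted at a <pre> element are left untouched.
function normalizeTextNodes(node) {
  if (node.nodeType === TEXT_NODE) {
    node.nodeValue = node.nodeValue.replace(/[ \t\r\n\f]+/g, " ");
    return;
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return; // comments and other node types carry no translatable text
  }
  if (node.nodeName.toUpperCase() === "PRE") {
    return; // whitespace is significant inside <pre>
  }
  for (const child of node.childNodes) {
    normalizeTextNodes(child);
  }
}
```

In a page context this could be called on the block element before its innerHTML is read. Because it mutates the nodes it visits, a real implementation might prefer to operate on a clone.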

Greg Tatum [:gregtatum]

Comment 4

•

4 days ago

§1 works for me!

Greg Tatum [:gregtatum]

Updated

•

4 days ago

Blocks: 1845772

Categories

(Firefox :: Translations, defect)
