Poor translation quality with new lines in the text
Categories: Firefox :: Translations, defect
People: Reporter: epavlov, Unassigned
References: Blocks 1 open bug
Translating the first paragraph of https://www.freebsd.org/ from English to Russian on the page produces nonsense. When I copy the same text into about:translations, the translation is perfect. After I edited the HTML to remove the new lines from the text, the translation quality became good.
Does it make sense to remove new lines inside paragraphs before translating to prevent such cases?
FreeBSD is an operating system used to power modern servers,
desktops, and embedded <a href="https://www.freebsd.org/platforms/">platforms.</a> A large
<a href="https://docs.FreeBSD.org/en/articles/contributors/#staff-committers">
community</a> has continually developed it for more than thirty
years. Its advanced networking, security, and storage features have
made FreeBSD the platform of choice for many of the <a href="https://docs.FreeBSD.org/en/books/handbook/introduction/#introduction-nutshell-users">
busiest web sites</a> and most pervasive embedded networking and
storage devices.
Comment 1•15 days ago
This is a good callout.
Here is the raw HTML, how it is written on the site, with exaggerated space between the lines:
<p>FreeBSD is an operating system used to power modern servers,
desktops, and embedded <a href="https://www.freebsd.org/platforms/">platforms.</a> A large
<a href="https://docs.FreeBSD.org/en/articles/contributors/#staff-committers">
community</a> has continually developed it for more than thirty
years. Its advanced networking, security, and storage features have
made FreeBSD the platform of choice for many of the <a href="https://docs.FreeBSD.org/en/books/handbook/introduction/#introduction-nutshell-users">
busiest web sites</a> and most pervasive embedded networking and
storage devices.</p>
I can also see degraded quality in the Full-Page Translation from English to Spanish, though not complete nonsense. I can also confirm that it is fixed by removing the explicit newlines from the <p> element within the HTML.
Below is a diff where the "removed" items are the Full-Page translation sentences and the "added" items are the about:translations sentences from the copied text.
-FreeBSD es un sistema operativo utilizado para alimentar servidores modernos, de escritorio y plataformas integradas.
-Un gran comunidad lo ha desarrollado continuamente durante más de treinta años años.
-Sus avanzadas funciones de redes, seguridad y almacenamiento tienen hizo de FreeBSD la plataforma de elección para muchos de los los sitios web más ocupados y la red integrada más omnipresente y dispositivos de almacenamiento.
+FreeBSD es un sistema operativo utilizado para alimentar servidores modernos, escritorios y plataformas integradas.
+Una gran comunidad lo ha desarrollado continuamente durante más de treinta años.
+Sus avanzadas funciones de networking, seguridad y almacenamiento han convertido a FreeBSD en la plataforma de elección para muchos de los sitios web más ocupados y dispositivos de red y almacenamiento integrados más omnipresente.
My first thought is that I don't want to add more arbitrary text cleaning rules to translations-document, and this appears to me to be another point in favor of prioritizing separating out the HTML concerns from the inference engine.
However, even when I pull the textContent string from the <p> element I get this:
"FreeBSD is an operating system used to power modern servers,
desktops, and embedded platforms. A large
community has continually developed it for more than thirty
years. Its advanced networking, security, and storage features have
made FreeBSD the platform of choice for many of the
busiest web sites and most pervasive embedded networking and
storage devices."
I'm still hesitant to add more arbitrary text cleaning rules to the algorithm, in favor of fixing this with the full HTML separation.
Greg, do you have any thoughts on this?
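For reference, here is a minimal sketch of what collapsing that textContent string could look like. The helper name collapseWhitespace is hypothetical and is not part of translations-document; it only approximates how a block with the default white-space handling is rendered, and it assumes the text does not come from a <pre> element.

```javascript
// Hypothetical helper, not part of the Translations code base.
// Collapses each run of ASCII whitespace (space, tab, LF, CR, FF) into a
// single space and trims both ends, roughly matching what the user sees
// for a block rendered with the default `white-space: normal`.
function collapseWhitespace(text) {
  return text.replace(/[ \t\n\r\f]+/g, " ").trim();
}

// The start of the textContent string from the FreeBSD paragraph,
// with the line breaks from the page source still in it.
const raw =
  "FreeBSD is an operating system used to power modern servers,\n" +
  "desktops, and embedded platforms. A large\n" +
  "community has continually developed it for more than thirty\n" +
  "years.";

// Logs the same text as one line with single spaces between words.
console.log(collapseWhitespace(raw));
```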
Comment 2•11 days ago
This is a bug in the HTML parsing code inside of the inference engine. I'm guessing it stems from the sentence splitter treating each new line as a separate sentence. It might take some time to fully replace this with our own custom solution, so I think it's worth doing a quick fix. There's a distinction between the code from innerHTML and what is presented to the end user. A block element that is being translated should ideally have its whitespace normalized in a similar fashion to how the user will see it.
This article talks more about how whitespace is handled when rendering the page: https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace
The only tag that I'm aware of where whitespace matters is the <pre> tag. If we're outside of the <pre> tag I could see two places to handle this normalization.
§1. Walk the block's subtree and normalize the whitespace of the TextNodes, following the same behavior as what happens when the text is rendered. This would mean changing newlines to spaces and collapsing runs of consecutive whitespace. If we did this, then the innerHTML would be correct.
Or we could do:
§2. In the text normalization step inside of the engine, we could replace newlines with whitespace. We might need to pass in the outer tag as well to toggle this behavior on/off to support <pre> tags, as the engine doesn't know about the surrounding tag.
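The two options above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual translations-document or engine code: the function names normalizeBlockWhitespace and normalizeForEngine are hypothetical, only the <pre> tag is treated as whitespace-sensitive (CSS white-space rules are ignored), and the tree walk relies only on the standard nodeType, nodeName, nodeValue, and childNodes properties so that it can run against plain-object stand-ins as well as real DOM nodes.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Option 1 (hypothetical sketch): walk the block's subtree and rewrite each
// text node in place. `state.endsWithSpace` tracks whether the text emitted
// so far ends in a space, so that a run of whitespace spanning two adjacent
// text nodes still collapses to one space. Starting at `true` drops leading
// whitespace at the start of the block.
function normalizeBlockWhitespace(node, state = { endsWithSpace: true }) {
  if (node.nodeType === TEXT_NODE) {
    let value = node.nodeValue.replace(/[ \t\n\r\f]+/g, " ");
    if (state.endsWithSpace && value.startsWith(" ")) {
      value = value.slice(1);
    }
    if (value.length > 0) {
      state.endsWithSpace = value.endsWith(" ");
    }
    node.nodeValue = value;
    return;
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return; // skip comments and other non-element nodes
  }
  if (node.nodeName.toUpperCase() === "PRE") {
    return; // whitespace is significant inside <pre>, leave it untouched
  }
  for (const child of node.childNodes) {
    normalizeBlockWhitespace(child, state);
  }
}

// Option 2 (hypothetical sketch): string-level normalization where the
// caller passes the outer tag name, because the engine cannot see it.
function normalizeForEngine(text, outerTagName) {
  if (outerTagName.toUpperCase() === "PRE") {
    return text;
  }
  return text.replace(/\r?\n/g, " ");
}

// Plain-object stand-ins for DOM nodes so the sketch also runs outside a browser.
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const el = (name, ...children) => ({ nodeType: ELEMENT_NODE, nodeName: name, childNodes: children });

const paragraph = el(
  "P",
  text("FreeBSD is\ndesktops, and embedded "),
  el("A", text("platforms.")),
  text(" A large\n"),
  el("A", text("\ncommunity")),
  text(" has\n  developed it.")
);
normalizeBlockWhitespace(paragraph);
```

With option 1 the markup handed to the engine is already clean, which is why innerHTML would then be correct without any change to the engine bindings. Option 2 keeps the DOM untouched but needs the extra tag argument.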
Comment 3•10 days ago
I think my vote would be for §1 since our ultimate goal is to move the HTML parsing out of the engine entirely. Updating the bindings to include a tag so that we can handle it within the engine feels like a step in the other direction, though definitely a workable solution in and of itself.
Comment 4•4 days ago
§1 works for me!