Researchers have made a startling discovery: a significant portion of online content is badly machine translated. This is especially true for languages spoken in Africa and the Global South.
The study, conducted by an Amazon Web Services AI lab, raises serious concerns about the data used to train large language models. The researchers found that over 57% of the sentences on the web had been translated, often poorly, into three or more languages, leaving a significant amount of machine-translated garbage in circulation.
To conduct the study, the researchers assembled a corpus of 6.38 billion sentences from the web. In it they observed patterns of multi-way parallelism, in which sets of sentences are translations of one another across three or more languages. Strikingly, translated content made up a majority of the corpus: 57.1% of its sentences were multi-way parallel.
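The measurement described above can be sketched in a few lines. This is an illustrative toy example, not the study's actual pipeline: it assumes sentences have already been aligned into translation groups (the hypothetical `group_id` stands in for that alignment step) and simply computes what fraction of sentences belong to a group spanning three or more languages.

```python
# Minimal sketch, assuming translation alignment is already done:
# each sentence is tagged with a group_id linking its translations.
from collections import defaultdict

def multiway_fraction(aligned_sentences):
    """aligned_sentences: list of (group_id, language) tuples.
    Returns the fraction of sentences whose translation group
    covers three or more languages (multi-way parallelism)."""
    languages_per_group = defaultdict(set)
    for group_id, lang in aligned_sentences:
        languages_per_group[group_id].add(lang)
    total = len(aligned_sentences)
    multiway = sum(1 for gid, _ in aligned_sentences
                   if len(languages_per_group[gid]) >= 3)
    return multiway / total if total else 0.0

# Toy corpus: group "g1" appears in en/fr/de (multi-way parallel),
# group "g2" only in en/es (two-way, so it does not count).
pairs = [("g1", "en"), ("g1", "fr"), ("g1", "de"),
         ("g2", "en"), ("g2", "es")]
print(multiway_fraction(pairs))  # 3 of 5 sentences -> 0.6
```

At web scale the hard part is the alignment itself, which the sketch takes as given; the counting step, however, is exactly this simple.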
The quality of these translations varies widely. Because machine learning systems inherit human biases, they tend to favor languages spoken in the Western world and the Global North. That poses challenges for "low-resource" languages spoken in places like Africa, where too little training data exists to produce accurate translations.
- CyberBeat
CyberBeat is a grassroots initiative from a team of producers and subject matter experts. Born of frustration at the lack of media coverage, it responds to an urgent need for a clear, concise, informative and educational approach to the growing fields of Cybersecurity and Digital Privacy.
If you have a story of interest, a comment, a concern, or if you'd just like to say hi, please contact us.
We couldn't do this without the support of our sponsors and contributors.