University students develop automatic translators for indigenous languages

Machine translators will help preserve indigenous languages. Experts from IIMAS are working on translation procedures for Wixárika and Ayuuk. A major challenge is feeding the computer system with enough native-language phrases paired with their Spanish equivalents.

Specialists from UNAM's Institute for Research in Applied Mathematics and Systems (IIMAS) are developing a program that automatically translates Wixárika (Nayarit), Ayuuk (Oaxaca), Nahuatl (classical and modern), Mexicanero (Durango), and Yorinoqui (State of Mexico) into Spanish, much as existing tools do for English or French. Iván Vladimir Meza Ruiz, of the Department of Computer Science at IIMAS and head of the project, noted that people are used to the translators that large international companies offer for Spanish-English and other language pairs.

According to the catalog of the National Institute of Indigenous Languages, Mexico has 68 linguistic groupings with 364 variants. Until recently, only Microsoft had developed translation software for any of them: Otomi and Mayan, in collaboration with universities in Queretaro and Yucatan, as part of its Heritage program. "How do you help when there is a language of which there are very few speakers left, such as Ayapaneco? There are few records of it, so the technology probably arrives late for some of them and we cannot make 68 official ones, but there are others that do have speakers and are flourishing," said Meza Ruiz.

The artificial intelligence specialist explained that he started the work in 2014 thanks to a student who had ties to the Wixárika community, known to most as Huicholes, and wanted to support it. Little by little, volunteers joined the effort, mainly students in technical degree programs who have ties to native communities and work with Nahuatl, Mexicanero, and Yoem Noki. For example, the IIMAS researcher is advising his undergraduate student César Cruz in documenting the intelligent system for Mazahua (or J ñatio, as its speakers call themselves), which the student developed as a cell phone application called MazahuApp, available through GoogleApps.

Another case is that of his master's student Delfino Zacarías Márquez Cruz, an Ayuuk (Mixe) speaker, who is working on a translation method; several members of his home community took part in the data collection. "The idea arose because I had wanted to design a translator for some time, but I didn't know how to make my idea concrete, so I approached Dr. Iván, who suggested that I build the neural network. But it required fieldwork, because when I started there were no resources to train the model, and something called a corpus was needed: texts in Spanish paired with the language you want to work with. The challenge was to work on them, find someone to translate them, and find people willing to share," said Zacarías Márquez.

This work uses neural networks, computational models that learn to mimic a process, in this case translation from one language to another, and that therefore require examples: sentences already translated between the two languages. The underlying mathematics is common and, to some extent, basic, such as matrix operations and vector calculus. The complexity arises in calibrating the models, that is, finding the specific values for each of the operations the system performs, so that a sentence in one language is transformed into the other without getting confused. Fortunately, several algorithms do this well, but because today's so-called deep models have numerous modules and values to process, specialized computing equipment is needed.
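To make the idea of "calibrating" more concrete, here is a minimal sketch in Python. It is not the IIMAS system: it assumes a toy model with a single matrix of weights and made-up numeric examples, and it uses plain gradient descent as one common calibration algorithm. All function names and values are hypothetical.

```python
# Illustrative only: a one-layer linear "model" y = W x, fitted by gradient
# descent. The vectors are made-up numbers, not real sentence encodings.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def mean_squared_error(W, pairs):
    """Average squared distance between the prediction W x and the target y."""
    total = 0.0
    for x, y in pairs:
        pred = matvec(W, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(pairs)

def calibrate(pairs, dim_in, dim_out, learning_rate=0.1, steps=500):
    """Find weights W by repeatedly stepping against the gradient of the error."""
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(steps):
        # Accumulate the gradient of the mean squared error over all examples.
        grad = [[0.0] * dim_in for _ in range(dim_out)]
        for x, y in pairs:
            pred = matvec(W, x)
            for i in range(dim_out):
                err = pred[i] - y[i]
                for j in range(dim_in):
                    grad[i][j] += 2 * err * x[j] / len(pairs)
        # Move every weight a small step in the direction that lowers the error.
        for i in range(dim_out):
            for j in range(dim_in):
                W[i][j] -= learning_rate * grad[i][j]
    return W

# Hypothetical (input, target) pairs produced by the map [[2, 0], [1, 1]].
EXAMPLES = [
    ([1.0, 0.0], [2.0, 1.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [2.0, 2.0]),
]

weights = calibrate(EXAMPLES, dim_in=2, dim_out=2)
final_error = mean_squared_error(weights, EXAMPLES)
```

Here only four values need to be found. A deep translation model has vastly more of them, which is why the procedure calls for specialized hardware.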

Feeding the database

The translators developed so far, including Microsoft's, fall short because this type of technology works best with a large corpus, that is, millions of examples of equivalent phrases in both languages from which the program learns. "For native languages, the largest corpora are close to 10,000 examples, compared to millions for commercial systems. We are very far from having an experience similar to what we have when using a normal translator because we have very little data. That is part of our battle right now: to get more data and increase our examples," said Zacarías Márquez.
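A parallel corpus of this kind can be pictured as a list of aligned sentence pairs. The Python sketch below is an assumption about how such data might be organized, not the project's actual format. The sentences are placeholders, and the `split_corpus` helper is hypothetical. It sets aside part of the pairs to check the model on sentences it has not seen, a common practice that matters more when examples are scarce.

```python
import random

# Placeholder pairs only; a real corpus would hold actual sentences
# collected and translated by speakers of the language.
parallel_corpus = [
    (f"<source sentence {i}>", f"<Spanish sentence {i}>") for i in range(10)
]

def split_corpus(pairs, validation_fraction=0.2, seed=0):
    """Shuffle aligned pairs and split them into training and validation sets."""
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train_pairs, validation_pairs = split_corpus(parallel_corpus)
```

Each pair stays intact through the shuffle, so a source sentence never loses its Spanish counterpart.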

He added that Mexico's native languages are predominantly oral, so the standardization of their writing is recent, and in several cases it has not yet been decided how to write words, concepts, and even complete phrases. Wixárika, for example, builds words from numerous morphological particles, so what is a phrase for Spanish speakers may be a single word for them, which is difficult for neural networks to process.

Some loss in translation must also be expected, because in Huichol the structure of a sentence depends on how many people are listening and on whether someone present ranks higher in the hierarchy than the speaker. Spanish does not usually mark this, which leaves some translated texts incomplete. For example, the phrase m'k'pa:pa ya p'-ta-ti-u-ti-wawi-ri-wa indicates, among other things, that the speaker witnessed the event described, which Spanish does not mark; the closest translation would be: She always asks us for tortillas. The Wixárika work can be consulted on a website, and another is in progress for Ayuuk.
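The Wixárika phrase quoted above gives a sense of how much one word can carry. The short Python sketch below assumes that the hyphens in the phrase as printed mark boundaries between morphological particles; that is an assumption for illustration only, since real segmentation requires linguistic analysis. The `segment` function is a hypothetical helper.

```python
# The phrase and its translation exactly as they appear in the article.
PHRASE = "m'k'pa:pa ya p'-ta-ti-u-ti-wawi-ri-wa"
TRANSLATION = "She always asks us for tortillas"

def segment(phrase):
    """Split a phrase into words, and each word into its hyphen-marked pieces."""
    return [word.split("-") for word in phrase.split()]

segments = segment(PHRASE)
word_count = len(segments)                               # words in the phrase
piece_count = sum(len(pieces) for pieces in segments)    # pieces across all words
longest_word = max(segments, key=len)                    # word with the most pieces
translation_word_count = len(TRANSLATION.split())
```

Under that assumption, the phrase has three words, but the last one alone splits into eight pieces, more than the six words of the translation. A system that treats each whole word as a single unit would rarely see the same word twice in a small corpus.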

More support is needed

The researcher emphasized that there is a lack of support for developing this type of technology to rescue indigenous languages, since they have traditionally been studied and documented through linguistics or anthropology. There is also debate about how much these communities need the tools, whether the tools benefit them, and how they would use them, given that they have other priorities. "What we have detected is that there is a recognition by the inhabitants of Mexico that we should support their preservation, promote their use, and having an automatic translator could help this and facilitate this situation," said Zacarías Márquez.