Mar 20, 2024
As we close the second phase of the Awal project, it’s a moment of reflection, celebration, and anticipation for what’s to come. Awal has embarked on a crucial mission: to preserve and promote the Tamazight language in the digital space. Through the development of innovative tools, we aim to facilitate the use and dissemination of this 3,000-year-old language of North Africa, bridging the digital divide and ensuring its speakers are not left behind in the rapidly evolving technological landscape.
Our journey has already seen significant milestones. With the enthusiastic participation of the Tamazight-speaking community, we’ve collected over 3,500 translated sentences and 2 hours of speech data, thanks to the efforts of a newly founded data community spanning Catalunya and Morocco. This push was facilitated by the development of a dedicated web portal and the organization of the first Awal datathon. Awal has also raised awareness of technology for this marginalized language, which ranks third in number of speakers in Catalunya, through presentations at various initiatives, including the Amazigh New Year celebrations and a talk in “Barcelona, the city of 300 languages.”
These efforts are not just about numbers; they’re about bringing people together, fostering a sense of pride, and creating tools that reflect the rich diversity of Tamazight. By doing so, we’re not only preserving a language but also empowering a community to see their cultural identity represented and respected in the digital age.
In a groundbreaking effort led by CIEMEN under the SomPart initiative, Col·lectivaT has proudly introduced the first website dedicated to crowdsourcing language data for Tamazight. The platform represents all Tamazight dialects and scripts, with a design inspired by Amazigh customs, thanks to the creative efforts of La Clara Comunicació.
The website features an open-source machine translation app, which is still at an early stage but is expected to improve as the collected data is used to train it. Contributors can add and translate sentences using a simple interface, contributing to our database of translated sentences in languages such as Catalan, Spanish, English, French, and Arabic. Apart from translation, the website also promotes speech data collection, integrating with Mozilla’s Common Voice.
Community validation is a cornerstone of our approach, ensuring that all data meets quality standards. Contributors are recognized for their invaluable input, earning points for each contribution and validation, which fosters a friendly competitive spirit through a leaderboard system.
The Awal website’s functionality was designed by Alp Öktem, tech lead of Col·lectivaT, and built by full-stack developer Yuxuan Peng, who currently maintains and enhances it on a volunteer basis. The website serves the community in five languages (Catalan, Spanish, French, English, and Tamazight), with the Tamazight translations made possible by the exceptional volunteer work of Brahim Essaidi and Yassine Aït-El-Mouden.
The Awal Datathon was the first of its kind and the landmark event of this phase of the project. Over a weekend, this linguistic marathon successfully gathered more than 1,000 translated sentences and an hour of voice recordings in Tamazight. Hosted both virtually and in person, the event saw Tamazight speakers of various dialects, ages, and genders come together, enriching the linguistic contributions with diverse pronunciations, vocabularies, and accents.
This collective effort showcased the community’s dedication to preserving their language. Participants like Mohamed and Nasseur worked together, navigating the challenges of writing in Tifinagh, the traditional alphabet of Tamazight, and of fitting into a standardized model of the language that has been promoted in Morocco only since 2011. Their efforts underscored a poignant truth: Tamazight is not promoted in schools and is learned primarily at home and in the streets, through lived experience. The event also served as a reminder of the language’s vitality, with participants like Zaina expressing a fierce pride in their heritage and a commitment to its preservation after spending five hours making contributions.
The datathon, held on 17 February 2024, between Amazigh New Year and International Mother Language Day, marked a significant step towards building a participatory dataset that is shared under open access. The Awal project website continues to welcome contributions from all who wish to participate.
As we move forward, our focus remains clear: to continue promoting participation through our social media channels, to enhance machine translation models with high-quality data, and to expand our network with new participants, linguists, NLP experts, technology developers, and language rights activists.
Excitingly, our journey is already fostering fruitful collaborations. Notably, one of our active collaborators, Mohamed Aymane Farhi, has developed a spell-checker to help standardize contributions. Additionally, a company is exploring our data to develop a Tamazight-speaking chatbot aimed at assisting Moroccan farmers.
We believe that by working together, we can overcome the challenges of digital exclusion and create a future where every language, no matter how challenging or marginalized, has its rightful place in the digital arena.
The journey of Awal is far from over. As we look to the future, we invite collaborators interested in developing language technology for Tamazight or other marginalized languages to join us. For more information or to get involved, please contact us at awal@collectivat.cat and connect through our social media channels on Twitter, Instagram, Facebook, and Telegram.
Together, let’s ensure that the voices of all communities are heard, respected, and celebrated in the digital world.