Etenat Awol
Addis Ababa, Ethiopia
A comprehensive dataset on five Ethiopian languages poised for use in training artificial intelligence systems like Large Language Models (LLMs) has been developed by iCog. The company, formerly known as iCog Anyone Can Code, has debuted Leyu Ai, an open-source voice dataset that includes the Amharic, Afaan Oromo, Tigrinya, Af-Somali, and Sidama languages. Leyu, which means ‘to identify’ in Amharic, incorporates refined dialects in its dataset to accommodate linguistic nuances.
“This approach incorporates local linguistic nuances into AI and Natural Language Processing (NLP) applications, helping businesses and organizations develop more inclusive and effective digital solutions for Ethiopia and beyond,” says Betelhem Dessie, CEO of iCog.
The data collected through crowdsourcing goes through a comprehensive validation process, in which experts review it for accuracy, contextual relevance, and cultural propriety. The collected data is weighed against established linguistic standards to account for variations in language use and context sensitivity. Data quality is further assessed for coherence, consistency, and overall suitability for the intended use.
“The multi-layered review process is essential to guarantee the integrity and reliability of the dataset, positioning it as a valuable resource in comparison to other benchmark datasets,” Betelhem told Shega.
Leyu takes an open-source approach, allowing AI researchers and the public to access the dataset for free, while companies with commercial interests can pay for access.
The platform also plans to provide loyalty for contributors. However, according to Betelhem, the company's core business model is not the platform itself but the services built upon it.
“Our initial focus is on speech data, with plans to expand to video and image datasets. By leveraging the widespread use of smartphones, Leyu democratizes the data creation process and provides micro-work opportunities for Ethiopians to contribute to AI development,” she told Shega.
While LLMs like OpenAI’s ChatGPT have transformed the global economic landscape, their capabilities skew heavily toward English and other high-resource languages. African languages, which number over 3,000, make up just 0.1% of online content and are often excluded from training datasets due to scarce digital resources and complex scripts.
“This stark disparity isn’t just a matter of convenience; it’s a barrier that hinders democratization of access and participation in the AI age,” according to the CEO.
While it supports open-source collaboration and AI development in Ethiopia, Leyu is a for-profit platform that sells datasets and plans to develop proprietary models. Its long-term goal is to build a comprehensive, ethically sourced language resource, prioritizing privacy and fair compensation. Leyu aims to drive innovation, create jobs, and accelerate AI adoption by democratizing data creation and incorporating sector-specific data to maximize impact.
In November, Leyu partnered with Karya, an India-based organization that works to provide marginalized communities with dignified jobs in the AI ecosystem.
Etenat Awol
Etenat holds a degree in Journalism and a master's in Public Relations. Previously, she served as a university lecturer, and she has five years of experience in communications, media, digital marketing, and consulting.