
References

Works cited across the AfriPlaybook, in one place. Every entry has a stable anchor, so an in-text citation links straight to its exact position here. To add a reference, give its heading an author-year id (### Author (year) {#author-year}) and cite it from the text as [Author (year)](../references.md#author-year).
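As an illustration of the convention above, an entry and its in-text citation might look like the sketch below. The id `bird-2020` is an assumption: it applies the `#author-year` pattern in lowercase with a hyphen, a detail this page does not spell out, so check an existing entry before copying it.

```markdown
<!-- In references.md: the heading carries the stable anchor (id form assumed) -->
### Bird, S. (2020) {#bird-2020}

Decolonising Speech and Language Technology. Proceedings of COLING 2020.

<!-- In a chapter file: the citation links straight to that anchor -->
... as argued in [Bird (2020)](../references.md#bird-2020).
```

Keeping the id identical in the heading and in every citation is what makes the link land on the exact entry rather than the top of the page.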

Abdulmumin, I., et al. (2026)

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora. arXiv:2605.27239. A Setswana sentiment dataset (3,565 tweets, three native-speaker annotators across eight batches): overall agreement was κ = 0.76, but items labelled within a minute of each other reached κ = 0.98 while items labelled more than a day apart fell to κ = 0.65, revealing annotator drift over time. arxiv.org/abs/2605.27239

Adelani, D. I., et al. (2022)

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. Proceedings of EMNLP 2022. A human-annotated NER dataset spanning 20 African languages, built by the Masakhane community. aclanthology.org/2022.emnlp-main.298

Adelani, D. I., et al. (2022b)

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation (MAFAND-MT). Proceedings of NAACL 2022. Human-translated news parallel data for ~16–21 African languages from the Masakhane community. aclanthology.org/2022.naacl-main.223

Adelani, D. I., et al. (2024)

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models. NAACL 2025; arXiv:2406.03368. Human-translated benchmark for 17 African languages: AfriXNLI (NLI), AfriMGSM (math reasoning), and AfriMMLU (knowledge QA). arxiv.org/abs/2406.03368

African Next Voices (2025)

African Next Voices. A Gates Foundation–funded initiative (about US$2.2 million) that recorded roughly 9,000 hours of everyday speech in 18 African languages across Kenya, Nigeria, and South Africa, collected by local centres such as the Maseno Centre for Applied AI and Data Science Nigeria. One of the largest African voice datasets of its kind. up.ac.za/afridsai

AfriCaption (2025)

AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages. arXiv:2510.17405. Native (non-translated) image captions across linguistically diverse African languages (Igbo, Hausa, Yoruba, Ewe, Luganda, Kinyarwanda, and more). arxiv.org/abs/2510.17405

AfriDoc-MT (2025)

AFRIDOC-MT: Document-level MT Corpus for African Languages. Masakhane, Lacuna-funded; EMNLP 2025; arXiv:2501.06374. Document-level translation from English into Amharic, Hausa, Swahili, Yorùbá, and Zulu (health and IT news); finds that sentence-trained models generalise poorly to documents. arxiv.org/abs/2501.06374

AfriScience-MT (2026)

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation. arXiv:2605.29741. A scientific/STEM parallel corpus for six African languages across 11 domains, with translators coining new terminology where none existed. arxiv.org/abs/2605.29741

AfriSign (2025)

AfriSign: African sign languages machine translation. Discover Artificial Intelligence (Springer), 2025. A video-to-text translation dataset of sign-language renderings of Bible verses across six African countries. doi.org/10.1007/s44163-025-00227-7

AfroBench

AfroBench: How Good are Large Language Models on African Languages? McGill-NLP; arXiv:2311.07978. A large-scale LLM benchmark spanning 64 African languages, 15 tasks, and 22 datasets, showing a persistent gap between English and African languages. arxiv.org/abs/2311.07978

Ajami HTR (2025)

A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa. 2025. Manually segmented and transcribed Arabic-script (Ajami) manuscripts with ALTO formatting, bringing a previously digitally invisible script into reach. link.springer.com

Alabi, J. O., et al. (2025)

AfriHuBERT: A self-supervised speech representation model for African languages. Interspeech 2025; arXiv:2409.20201. Pretrains on raw African-language audio so that little labelled data goes further. arxiv.org/abs/2409.20201

Belay, T. D., et al. (2025)

The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis. arXiv:2509.25477. arxiv.org/abs/2509.25477

Belay, T. D., et al. (2026)

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks. arXiv:2605.09955. Clusters annotators by agreement to model disagreement across 40 datasets and 18 typologically diverse languages (sentiment, emotion, hate speech), outperforming majority voting. arxiv.org/abs/2605.09955

Bender, E. M., & Friedman, B. (2018)

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. TACL 6. A standard for documenting the linguistic and demographic context of NLP datasets. aclanthology.org/Q18-1041

Bird, S. (2020)

Decolonising Speech and Language Technology. Proceedings of COLING 2020. Argues for centring the communities whose languages are at stake rather than treating them as data sources. aclanthology.org/2020.coling-main.313

BRIGHTER / AfriEmo (2025)

BRIGHTER: Bridging the Gap in Text-Based Emotion Detection (SemEval-2025 Task 11). Multilingual emotion-in-text datasets covering many African languages; a useful taxonomy reference for emotion work. arxiv.org/abs/2502.11926

Carnegie Endowment for International Peace (2024)

How African NLP Experts Are Navigating the Challenges of Copyright, Innovation, and Access. April 2024. carnegieendowment.org

Carroll, S. R., Garba, I., et al. (2020)

The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. Collective benefit, Authority to control, Responsibility, Ethics — the community's right to govern data about it. doi.org/10.5334/dsj-2020-043

Deep Learning Indaba

Deep Learning Indaba and IndabaX. The annual gathering of Africa's machine-learning community, founded in 2017, with a mission for Africans to be owners and shapers of AI. Its local IndabaX events grew from 13 countries in 2018 to 47 by 2024–2025, with the aim of reaching every African country by 2027. deeplearningindaba.com

Emezue, C. C., et al. (2025)

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages. Interspeech 2025 / arXiv:2505.20564. An 1,800-hour Igbo, Hausa, and Yorùbá speech corpus built from 5,000+ voice donors via trained community facilitators and a reciprocal "data farming" model. arxiv.org/abs/2505.20564

Esethu Framework (2025)

The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages. arXiv:2502.15916. A community-centric dataset-governance framework and the Esethu license (Lelapa AI, Way With Words, Data Science for Social Impact). arxiv.org/abs/2502.15916

Ethnologue (2024)

Ethnologue: Languages of the World (27th ed.). SIL International. ethnologue.com/insights/how-many-languages

Fidel (2025)

Fidel: a large-scale sentence-level Amharic OCR dataset. IJDAR, 2025. 40,000 handwritten and 28,000 typed line images from 411 native writers, spanning handwritten, typed, and synthetic Ge'ez-script text. doi.org/10.1007/s10032-026-00593-7

Gauthier, E., et al. (2024)

Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal. arXiv:2404.01991. 100+ hours of spontaneous agricultural speech in Wolof, Pulaar, and Sereer. arxiv.org/abs/2404.01991

Gebru, T., et al. (2021)

Datasheets for Datasets. Communications of the ACM, 64(12). A standard questionnaire documenting a dataset's motivation, composition, collection, recommended uses, and limitations. doi.org/10.1145/3458723

Hasan, T., et al. (2021)

XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Findings of ACL 2021. BBC-derived article and summary pairs covering many African languages (Amharic, Hausa, Igbo, Oromo, Somali, Swahili, Tigrinya, Yorùbá, and more). aclanthology.org/2021.findings-acl.413

HaVQA (2023)

HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language. Findings of ACL 2023. 6,022 English question-answer pairs over 1,555 Visual Genome images, human-translated to Hausa. aclanthology.org/2023.findings-acl.646

Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020)

The State and Fate of Linguistic Diversity and Inclusion in the NLP World. Proceedings of ACL 2020. aclanthology.org/2020.acl-main.560

Kreutzer, J., et al. (2022)

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics (TACL), 10. aclanthology.org/2022.tacl-1.4

Lacuna Fund (n.d.)

Language — Lacuna Fund. Co-founded in 2020 by The Rockefeller Foundation, Google.org, and Canada's International Development Research Centre (IDRC) to fund the creation and labelling of machine-learning datasets in low-resource settings, including a dedicated stream for sub-Saharan African languages. lacunafund.org/language

Lanfrica

Lanfrica and the African AI Atlas. A continually updated map of African AI resources (datasets, models, papers, policies) that tackles the discoverability problem of work scattered across repositories, PDFs, and dead project pages, and also creates bespoke datasets through "data farming." lanfrica.com

Lanfrica — Licensing as a Barrier (n.d.)

Licensing as a Barrier to the Usability of African Language Datasets. Lanfrica Blog. Many African datasets carry missing or unclear licences, which leaves them legally unusable for reusers. lanfrica.com/blog

LLM Data Generation (2025)

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages. arXiv:2506.12158. Finds careful generation (real demonstrations plus self-revision) can approach real data, while naive synthetic generation often fails for the lowest-resource languages. arxiv.org/abs/2506.12158

Malabo Convention (2023)

African Union Convention on Cyber Security and Personal Data Protection (Malabo Convention). Adopted by the African Union in 2014; entered into force June 2023 — the continent's first comprehensive treaty on personal-data protection and cybersecurity. au.int/en/treaties

Masakhane

Masakhane — a grassroots NLP community for Africa (isiZulu: "we build together"). Holds that Africans should decide what data represents their communities, retain ownership of it, and know how it is used. masakhane.io

Meyer, J., et al. (2022)

BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus. Interspeech 2022; arXiv:2207.03546. Up to 86 hours of clean single-speaker TTS audio per language under CC-BY-SA. arxiv.org/abs/2207.03546

Mitchell, M., et al. (2019)

Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* 2019, since renamed FAccT). A standard for reporting a model's intended use, performance across groups, and limitations. doi.org/10.1145/3287560.3287596

Muhammad, S. H., et al. (2023)

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages. EMNLP 2023; arXiv:2302.08956. Sentiment datasets across 14 African languages (110k+ tweets), annotated by native speakers with inter-annotator agreement above 0.70. arxiv.org/abs/2302.08956

Muhammad, S. H., et al. (2025)

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages. arXiv:2501.08284. Inter-annotator agreement (Randolph's kappa) ranged 0.46–0.81 across languages. arxiv.org/abs/2501.08284

Nekoto, W., et al. (2020)

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. Findings of EMNLP 2020. A continent-spanning participatory effort, co-authored by 49 researchers, treating native speakers as co-creators of data and models. aclanthology.org/2020.findings-emnlp.195

Nigatu, H. H., Tonja, A. L., Rosman, B., Solorio, T., & Choudhury, M. (2024)

The Zeno's Paradox of 'Low-Resource' Languages. Proceedings of EMNLP 2024. aclanthology.org/2024.emnlp-main.983

NLLB Team (2022)

No Language Left Behind: Scaling Human-Centered Machine Translation (with the FLORES-200 benchmark). Meta AI; arXiv:2207.04672. Translation and evaluation across 200 languages, including many African ones. arxiv.org/abs/2207.04672

Nwulite Obodo Open Data License (NOODL) (2025)

The Nwulite Obodo Open Data License: A New Licence for Sharing African Datasets (Igbo: "raising, reviving, and building the community"). Data Science Law Lab / Centre for Intellectual Property and Information Technology Law (CIPIT), Strathmore University. A tiered, community-centred data licence — free, share-alike reuse within Africa and developing nations; royalties / benefit-sharing required of users elsewhere. datasciencelawlab.africa

Ogundepo, O., et al. (2023)

AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages. Findings of EMNLP 2023. The first cross-lingual open-retrieval QA dataset for African languages: 12,000+ questions across 10 languages, with answers retrieved from English or French passages. arxiv.org/abs/2305.06897

Olatunji, T., et al. (2023)

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR. TACL 2023; arXiv:2310.00274. 200 hours of accented English from 120 accents across 13 African countries. arxiv.org/abs/2310.00274

Plank, B. (2022)

The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. Proceedings of EMNLP 2022. Argues that annotator disagreement often reflects genuine, informative variation rather than noise. aclanthology.org/2022.emnlp-main.731

Popović, M. (2015)

chrF: character n-gram F-score for automatic MT evaluation. Proceedings of WMT 2015. A character-level metric better suited to morphologically rich languages than word-based BLEU. aclanthology.org/W15-3049

Ranathunga, S., & de Silva, N. (2022)

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World. arXiv:2210.08523. arxiv.org/abs/2210.08523

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. (2021)

"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Based on 53 practitioners across India, East and West Africa, and the US; 92% reported at least one data cascade. doi.org/10.1145/3411764.3445518

Seeing, Signing, and Saying (2025)

Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media. arXiv:2510.25413. arxiv.org/abs/2510.25413

Sikasote, C., et al. (2023)

Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages. Interspeech 2023. 79 hours labelled across Bemba, Nyanja, Tonga, and Lozi, plus 525 hours of unlabelled radio audio. arxiv.org/abs/2306.04428

Siminyu, K., et al. (2021)

AI4D – African Language Program. arXiv:2104.02516. A three-part program that crowd-sourced and curated 9+ open African-language datasets through community challenges and short research fellowships. arxiv.org/abs/2104.02516

SSA-COMET (2025)

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? arXiv:2506.04557. A learned MT-evaluation metric adapted to sub-Saharan African languages. arxiv.org/abs/2506.04557

State of Computer Vision in Africa (2024)

The State of Computer Vision Research in Africa. arXiv:2401.11617. A survey finding African computer vision concentrated in agriculture and health and constrained by data scarcity. arxiv.org/abs/2401.11617

Te Hiku Media — Kaitiakitanga License

Data Sovereignty and the Kaitiakitanga License. Te Hiku Media (Papa Reo), Aotearoa New Zealand. Treats data as cared-for under guardianship rather than owned; benefit flows to the source community; forbids surveillance and unconsented corpus-building. tehiku.nz

Tech In Africa (2026)

AI Regulation in Africa 2026: New Laws, Compliance Risks, and Startup Opportunities. Reports that by 2026, 44 African countries had enacted data-protection legislation. techinafrica.com

Thiomi Dataset (2026)

The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages. arXiv:2603.29244. Maintained Fleiss' kappa above 0.82 between peer moderators during validation. arxiv.org/abs/2603.29244

WAXAL (2026)

WAXAL: A Multilingual African Speech Dataset (Google AI, 2026). A large multilingual African speech dataset for training ASR and TTS models across many African languages. marktechpost.com

Wilkinson, M. D., et al. (2016)

The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. Findable, Accessible, Interoperable, Reusable. doi.org/10.1038/sdata.2016.18

Contributor
@abumafrim
