Text

This section covers building text datasets for African languages. Every task below shares the same groundwork; the task pages differ only in what is labeled and how it is evaluated.

Shared across text data

Source selection (news, web, social media, books, religious & educational texts)
Text cleaning & normalization (encoding, scripts, diacritics, punctuation)
Language identification & code-switching handling
Annotation guidelines & label schemas
Annotator recruitment & inter-annotator agreement
Metadata, dataset formatting & splits
Licensing & toxic-content filtering
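
As one illustration of the cleaning and normalization step listed above, the following is a minimal sketch, not a prescribed pipeline. It assumes NFC is the target Unicode form and that collapsing whitespace is acceptable for the corpus; the function name `normalize_text` is hypothetical. Diacritics can be encoded either as precomposed characters or as a base letter plus combining marks, and mixing the two makes identical-looking words compare as different.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form with whitespace runs collapsed to one space."""
    # NFC composes base letters and combining marks where a precomposed
    # code point exists, so visually identical strings compare equal.
    composed = unicodedata.normalize("NFC", text)
    # str.split() with no arguments splits on any run of Unicode whitespace.
    return " ".join(composed.split())


decomposed = "cafe\u0301  \n ok"   # 'e' followed by a combining acute accent
precomposed = "caf\u00e9 ok"       # single precomposed character
print(normalize_text(decomposed) == precomposed)
```

Whether NFC or NFD is the better target depends on the language and the downstream tokenizer, so the choice of form is worth recording in the dataset metadata.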

Tasks

Text classification & labeling (sentiment, emotion, hate speech, topic, intent)
Text generation, summarization & QA (open generation, summarization, question answering)
Machine translation & parallel corpora (parallel mining, alignment, human translation)
Sequence labeling (named-entity recognition, part-of-speech, chunking)
