Text
This section covers building text datasets for African languages. Every task below shares the same groundwork; the task pages differ only in what is labelled and how it is evaluated.
Shared across text data
- Source selection (news, web, social media, books, religious & educational texts)
- Text cleaning & normalization (encoding, scripts, diacritics, punctuation)
- Language identification & code-switching handling
- Annotation guidelines & label schemas
- Annotator recruitment & inter-annotator agreement
- Metadata, dataset formatting & splits
- Licensing & toxic-content filtering
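Two of the shared steps above, text normalization and inter-annotator agreement, can be illustrated with a minimal sketch. It uses only the Python standard library. The function names `normalize_text` and `cohen_kappa` are hypothetical, chosen for this example; they are not part of any pipeline described on these pages. The choice of NFC as the normalization form is also an assumption, and a real project may need a different form or extra language-specific rules.

```python
import re
import unicodedata
from collections import Counter


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    NFC puts combining marks (tone marks, dots below) into one canonical
    form, so visually identical strings compare equal.
    """
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# The same letter typed with combining marks in two different orders.
variant_1 = "e\u0301\u0323"  # e + acute + dot below
variant_2 = "e\u0323\u0301"  # e + dot below + acute
same_after_nfc = normalize_text(variant_1) == normalize_text(variant_2)

# Toy sentiment labels from two hypothetical annotators.
annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy example the two annotators agree on three of four items, while chance agreement is 0.5, so `kappa` comes out at 0.5. Normalizing before deduplication or agreement computation keeps encoding differences from being counted as real differences in the text.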
Tasks
- Text classification & labeling (sentiment, emotion, hate speech, topic, intent)
- Text generation, summarization & QA (open generation, summarization, question answering)
- Machine translation & parallel corpora (parallel mining, alignment, human translation)
- Sequence labeling (named-entity recognition, part-of-speech tagging, chunking)