Text

Building text datasets for African languages. Every task below shares the same groundwork; the task pages differ only in what is labelled and how it is evaluated.

Shared across text data

  • Source selection (news, web, social media, books, religious & educational texts)
  • Text cleaning & normalization (encoding, scripts, diacritics, punctuation)
  • Language identification & code-switching handling
  • Annotation guidelines & label schemas
  • Annotator recruitment & inter-annotator agreement
  • Metadata, dataset formatting & splits
  • Licensing & toxic-content filtering
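
Two of the shared steps above, text normalization and inter-annotator agreement, can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names `normalize_text` and `cohen_kappa` are hypothetical, NFC is assumed as the target Unicode form, and Cohen's kappa is assumed as the agreement measure for two annotators. A real project may well choose differently.

```python
import unicodedata
from collections import Counter


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    NFC is an assumption here: it composes base letters with combining
    diacritics where a precomposed code point exists, so visually identical
    strings compare equal.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# "e" followed by a combining dot below (U+0323) composes to U+1EB9 under NFC.
cleaned = normalize_text("e\u0323   ni\n")

annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the two annotators agree on 3 of 4 items (0.75) against a chance agreement of 0.5, which gives a kappa of 0.5. With more than two annotators, a multi-rater measure would be needed instead.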

Tasks

  • Text classification & labeling (sentiment, emotion, hate speech, topic, intent)
  • Text generation, summarization & QA (open generation, summarization, question answering)
  • Machine translation & parallel corpora (parallel mining, alignment, human translation)
  • Sequence labeling (named-entity recognition, part-of-speech, chunking)

Contributor
@abumafrim