How to read this playbook

The playbook runs end-to-end through the dataset lifecycle. Here is the whole book at a glance: every chapter, grouped into the four phases of building a dataset. Click any chapter to jump straight to it.

You don't have to read it in order. Pick the path that fits where you are:

New to dataset design. Start here, then read chapters 2–4 in order: Data Collection, Annotation Design, Data Quality. They build on each other and cover the foundations everyone needs.
You already have raw data and want help annotating it. Go to chapter 3 (Annotation Design and Workforce Management), then chapter 4 (Data Quality Assurance and Validation).
You're working with a specific modality (speech, multimodal, low-resource scripts). Skip to chapter 5 (Modality-Specific Task Design).
You're using LLMs to generate or augment data. Read chapter 7 (LLM-Assisted and Synthetic Data Generation) for the trade-offs and safeguards.
You're preparing a dataset for release. Read chapter 6 (Documentation, Data Release, and Governance) and chapter 9 (Dataset Lifecycle Management and Release Checklist).
You're offline or on a slow connection. Use Download PDF in the navbar. The whole playbook is bundled into a single file that is rebuilt on every release.
You'd rather read in another language. Use the language switcher at the top-right. Translations are community-maintained and grow over time.

Throughout, you'll find practical templates (consent forms, annotation guidelines, governance checklists), worked examples from real African-language projects, and links to datasets and tools you can reuse. New terms are defined in the glossary.
