Licensing and Compliance
A licence states what others may and may not do with your dataset. The reasoning behind choosing one, including the African community licences such as NOODL and Esethu, is in the data governance chapter. This page is the practical companion, covering the common licence types and a template for documenting your choice.
Common licence types for NLP datasets
For open data, the Creative Commons family is the usual starting point: CC0 dedicates data to the public domain, CC BY requires attribution, and CC BY-SA adds share-alike. Where community rights and benefit need protecting, the African community licences keep control and benefit with the source community. Do not apply a software licence such as MIT or Apache to data: these licences are written for code and leave data rights unclear.
Choosing, attributing, and complying
Choose the licence that matches how open you actually want to be, balanced against the community's wishes and any obligations from source data. Whatever you choose, require attribution and record provenance, and check that your licence is compatible with the terms of any data you built on, since you cannot grant rights you do not hold. Comply with the data-protection law of every country your contributors are in (see Data Governance).
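The compatibility check described above can be sketched in a few lines of Python. This is a simplified illustration, not a legal tool: it assumes the 4.0 versions of the three Creative Commons licences named earlier, written as SPDX-style identifiers, and the `ALLOWED_RELEASE` table and `incompatible_sources` function are hypothetical names introduced here. Any source licence missing from the table, including the community licences, is flagged for manual review rather than guessed at.

```python
# Illustrative table (an assumption for this sketch): for each source licence,
# the licences under which a dataset built on that source may be released.
ALLOWED_RELEASE = {
    "CC0-1.0": {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"},  # public domain: no constraint
    "CC-BY-4.0": {"CC-BY-4.0", "CC-BY-SA-4.0"},           # attribution must be kept
    "CC-BY-SA-4.0": {"CC-BY-SA-4.0"},                     # share-alike must be kept
}


def incompatible_sources(release_licence, source_licences):
    """Return the source licences that do not permit release_licence.

    Licences absent from ALLOWED_RELEASE are returned as well, so that
    they are reviewed by hand instead of being assumed compatible.
    """
    problems = []
    for source in source_licences:
        allowed = ALLOWED_RELEASE.get(source)
        if allowed is None or release_licence not in allowed:
            problems.append(source)
    return problems


if __name__ == "__main__":
    # A CC BY release cannot be built on share-alike source data.
    print(incompatible_sources("CC-BY-4.0", ["CC0-1.0", "CC-BY-SA-4.0"]))
```

An empty result means that none of the listed sources blocks the chosen licence under this simplified table. It does not replace reading the actual terms of each source.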
Licensing agreement template
For a formal release, a short agreement records the terms:
- Part 1. Dataset and licensor details: [dataset, version, licensor].
- Part 2. Grant of rights: the licensor grants [rights] under [licence].
- Part 3. Restrictions: prohibited uses are [list].
- Part 4. Attribution requirements: reusers must credit [citation].
- Part 5. Warranty and liability: provided "as is", with [limitations].
- Part 6. Termination: the terms under which the licence ends.
- Part 7. Signature: [names, dates].
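If you prepare agreements for several releases, one possible way to fill in the seven parts is with Python's standard `string.Template`. The sketch below is an assumption about how you might do this, not a prescribed workflow. The field names (`dataset`, `rights`, `termination`, and so on) and the `render_agreement` helper are hypothetical, and the values in the example are placeholders.

```python
from string import Template

# The seven parts of the agreement, with $-placeholders for the bracketed fields.
AGREEMENT = Template(
    "Part 1. Dataset and licensor details: $dataset, $version, $licensor.\n"
    "Part 2. Grant of rights: the licensor grants $rights under $licence.\n"
    "Part 3. Restrictions: prohibited uses are $restrictions.\n"
    "Part 4. Attribution requirements: reusers must credit $citation.\n"
    'Part 5. Warranty and liability: provided "as is", with $limitations.\n'
    "Part 6. Termination: $termination.\n"
    "Part 7. Signature: $names, $dates."
)


def render_agreement(fields):
    """Fill in the template; raises KeyError if any field is missing."""
    return AGREEMENT.substitute(fields)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {
        "dataset": "<dataset name>",
        "version": "<version>",
        "licensor": "<licensor>",
        "rights": "<rights>",
        "licence": "<licence>",
        "restrictions": "<prohibited uses>",
        "citation": "<citation>",
        "limitations": "<limitations>",
        "termination": "<termination terms>",
        "names": "<names>",
        "dates": "<dates>",
    }
    print(render_agreement(example))
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field raises an error instead of leaving an unfilled placeholder in a signed document.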