Video

Video adds time to vision: the data is moving images, and the tasks read meaning from motion. This chapter groups three such tasks under one roof because they share the demanding mechanics of working with video, even though they serve very different ends. The headline task for African languages is sign language: the visual languages of Deaf communities have been almost entirely left out of language technology and are only now getting their first datasets.

This chapter covers:

Sign language: recognising and translating the visual languages of Deaf communities.
Gesture: recognising hand and body gestures, which vary by culture.
Video: general video understanding, such as classification and captioning.

Video data is heavy, and working with it is laborious. Clips must be captured or sourced with rights and consent, stored and standardised at a manageable resolution and frame rate, and annotated over time as well as space, which makes labelling far slower than for still images. The Vision overview and Data Governance chapters cover the shared groundwork. Consent matters more for video than almost anywhere else, because a person on film is fully identifiable, and for sign language the signer's face and body are the language itself. The defining African fact across these tasks is novelty: where the sign languages of high-income countries now have substantial datasets, African sign languages had almost none until very recently, so much of the work here is first-of-its-kind collection done with the communities concerned.
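As a rough illustration of what standardising clips at a manageable resolution and frame rate can involve, the sketch below picks which source frames to keep when lowering the frame rate and computes a reduced frame size. The function names, the `max_short_side` parameter, and the example values are hypothetical, chosen only for illustration, and rounding sizes to even numbers is an assumption based on what many common video encoders expect. A real pipeline would normally hand the actual decoding and resizing to a dedicated video tool.

```python
def frames_to_keep(n_frames: int, src_fps: float, dst_fps: float) -> list[int]:
    """Return indices of source frames to keep when resampling to dst_fps.

    Hypothetical helper: it only selects frame indices, it does not decode video.
    """
    if n_frames < 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame count must be >= 0 and frame rates must be > 0")
    if dst_fps >= src_fps:
        # Never invent frames: keep every source frame as it is.
        return list(range(n_frames))
    n_out = round(n_frames * dst_fps / src_fps)  # frames in the resampled clip
    step = src_fps / dst_fps                     # source frames per output frame
    return [min(n_frames - 1, int(i * step)) for i in range(n_out)]


def fit_resolution(width: int, height: int, max_short_side: int) -> tuple[int, int]:
    """Shrink (width, height) so the shorter side is at most max_short_side.

    Aspect ratio is preserved as closely as even-numbered sizes allow.
    Even sizes are an assumption about the downstream encoder.
    """
    short = min(width, height)
    if short <= max_short_side:
        return width, height  # already small enough, leave untouched
    new_w = max(2, round(width * max_short_side / short / 2) * 2)
    new_h = max(2, round(height * max_short_side / short / 2) * 2)
    return new_w, new_h


# Example: a 2-second clip at 30 fps reduced to 15 fps, and a 1080p frame to 360p.
kept = frames_to_keep(60, 30, 15)
size = fit_resolution(1920, 1080, 360)
```

Fixing both values once, before annotation starts, keeps time-based labels aligned with the frames that annotators and models will actually see.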

What the three tasks share
