Image-Text

Image-text tasks join a picture to language: answering a question about an image, or describing one in words. They sit at the meeting point of the vision and text chapters, and they are among the newest tasks to reach African languages, because they need an image, a language, and the alignment between them, each scarce on its own and scarcer together.

This chapter covers two image-text tasks: visual question answering and image captioning.

What the two tasks share, and what makes them African

Both pair images with text, and both depend on a tight match between the two: the answer or caption has to be true to what the image actually shows. The shared groundwork is aligning the modalities, annotating them jointly, and quality-controlling both at once (see the Multimodal overview). The African difficulty is twofold. First, the images themselves are usually drawn from Western datasets and show Western scenes, so a model learns to describe a world that is not the user's. Second, the cheapest way to get African-language data, translating English questions and captions, produces text that is grammatical in the target language but culturally foreign. The strongest recent work pushes back on both: HaVQA built Hausa visual question answering with careful human translation tied to the images (HaVQA, 2023), and AfriCaption set out a new paradigm for image captioning across several African languages rather than translating from English (AfriCaption, 2025).

Contributor
@abumafrim
