Multimodal

This section covers datasets that combine modalities, as well as data produced with the help of large models. These tasks share the challenge of aligning and jointly labelling more than one signal.

Shared across multimodal data

Aligning modalities (image–text, audio–text)
Joint annotation guidelines across modalities
Quality control spanning every modality
Metadata & combined-format handling
Licensing across combined sources

Tasks

Image–text (visual question answering, image captioning)
LLM-assisted & synthetic data (generation, augmentation, distillation)
