Multimodal
This section covers datasets that combine two or more modalities, as well as data produced with the help of large models. The tasks below share one challenge: aligning and jointly labelling more than one signal.
Shared across multimodal data
- Aligning modalities (image–text, audio–text)
- Joint annotation guidelines across modalities
- Quality control spanning every modality
- Metadata & combined-format handling
- Licensing across combined sources
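As a minimal sketch of how the shared concerns above might come together in practice, the following Python example defines one aligned image-text record and runs a quality check that covers each modality plus the licence field. The schema (`ImageTextRecord`), the helper `check_record`, the accepted file extensions, and the licence allow-list are all hypothetical choices made for illustration, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class ImageTextRecord:
    """One aligned image-text pair (hypothetical schema, for illustration only)."""
    record_id: str
    image_path: str  # reference to the image modality
    caption: str     # text aligned to that image
    license: str     # licence identifier of the combined source


# Assumed allow-list; a real project would define its own.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}


def check_record(rec: ImageTextRecord) -> list[str]:
    """Return quality-control problems for a record; an empty list means it passes."""
    problems = []
    # Image modality: only a file-extension check here, not a decode of the file.
    if not rec.image_path.lower().endswith((".jpg", ".jpeg", ".png")):
        problems.append("image: unsupported file extension")
    # Text modality: the caption must contain non-whitespace characters.
    if not rec.caption.strip():
        problems.append("text: empty caption")
    # Licensing: the source licence must be on the allow-list.
    if rec.license not in ALLOWED_LICENSES:
        problems.append("license: not in allow-list")
    return problems


records = [
    ImageTextRecord("r1", "img/cat.jpg", "A cat on a sofa.", "CC-BY-4.0"),
    ImageTextRecord("r2", "img/dog.gif", "   ", "unknown"),
]

# Map each record id to its list of problems.
report = {rec.record_id: check_record(rec) for rec in records}
```

Keeping both modalities and the licence in a single record means one pass can flag a problem in any of them, which is the point of quality control that spans every modality.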
Tasks
- Image–text (visual question answering, image captioning)
- LLM-assisted & synthetic data (generation, augmentation, distillation)