Researchers have introduced TIGER, an inference-time framework designed to mitigate hallucinations in multimodal generative models. Unlike free-form feedback methods, TIGER independently extracts observation graphs from inputs and claim graphs from outputs to assign fact-level risk scores.
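To make the graph comparison concrete, here is a minimal sketch, not TIGER's actual implementation. It assumes both graphs can be flattened into (subject, relation, object) triples. The `Fact` type, the `risk_scores` function, and the 0.0 / 0.5 / 1.0 scoring rule are all hypothetical choices made for illustration; the summary does not say how risk is computed.

```python
from __future__ import annotations
from typing import NamedTuple


class Fact(NamedTuple):
    """One edge of a graph, flattened to a triple (an assumed representation)."""
    subject: str
    relation: str
    obj: str


def risk_scores(observations: set[Fact], claims: set[Fact]) -> dict[Fact, float]:
    """Score each claimed fact against the observation graph.

    Hypothetical rule: 0.0 = supported, 0.5 = unsupported, 1.0 = contradicted.
    """
    # Index observed objects by (subject, relation) so contradictions are easy to spot.
    observed: dict[tuple[str, str], set[str]] = {}
    for fact in observations:
        observed.setdefault((fact.subject, fact.relation), set()).add(fact.obj)

    scores: dict[Fact, float] = {}
    for claim in claims:
        seen = observed.get((claim.subject, claim.relation))
        if seen is None:
            scores[claim] = 0.5  # nothing in the input speaks to this claim
        elif claim.obj in seen:
            scores[claim] = 0.0  # the input directly supports the claim
        else:
            scores[claim] = 1.0  # the input says something different
    return scores


# Toy example: facts observed in an image vs. facts asserted in a caption.
observations = {Fact("dog", "color", "brown"), Fact("dog", "on", "sofa")}
claims = {
    Fact("dog", "color", "black"),
    Fact("dog", "on", "sofa"),
    Fact("cat", "on", "floor"),
}
scores = risk_scores(observations, claims)
```

Because the observation graph is built from the input alone, the scoring step never sees the model's own wording, which is the independence the framework relies on.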
Context
Multimodal models often produce fluent text containing claims that the source image, audio, or video does not support. Existing repair methods tend to be biased by the hallucinated claims themselves, because the model critiques its output with that output already in view, which makes it hard to spot and correct contradictions with the input.
Why it matters
TIGER repairs only the claims it scores as high-risk and keeps the underlying model's backbone frozen, so no retraining is required. Experiments across four cross-modal paths spanning image, audio, and video inputs show that the framework significantly reduces unsupported content without compromising task performance. Because each repair is tied to a specific fact-level score, the approach also improves the reliability and traceability of multimodal AI systems in critical applications.
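Selective repair can be sketched as a simple threshold gate. This is an illustrative sketch under stated assumptions, not the paper's procedure: `selective_repair`, `toy_repair`, and the 0.7 threshold are hypothetical, and in practice the repair step would be an inference-only call to the frozen model rather than a string substitution.

```python
from typing import Callable


def selective_repair(
    scored_claims: list[tuple[str, float]],
    repair: Callable[[str], str],
    threshold: float = 0.7,  # assumed cutoff; the summary gives no value
) -> list[str]:
    """Rewrite only claims whose risk meets the threshold; pass the rest through."""
    return [
        repair(text) if risk >= threshold else text
        for text, risk in scored_claims
    ]


def toy_repair(text: str) -> str:
    """Stand-in for a model call that rewrites a claim using the observed facts."""
    return text.replace("black", "brown")


scored = [("The dog is black.", 1.0), ("The dog is on the sofa.", 0.0)]
repaired = selective_repair(scored, toy_repair)
```

Low-risk sentences are returned unchanged, which is one way a system like this could avoid degrading output that was already correct.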