In a personal blog post, AI expert Lilian Weng discussed a newer approach to building Vision-Language Models (VLMs). Instead of chaining together separate, specialized components, the current trend is to extend pre-trained Large Language Models (LLMs) so they can directly ingest and process visual signals.
Background
According to Lilian Weng, processing images for text generation, in tasks such as image captioning or visual question answering, has been studied for years. Earlier systems typically relied on an object detection network acting as a vision encoder to extract image features, which were then passed to a text decoder. While effective for certain tasks, this approach is inflexible and hard to scale to more complex multimodal data formats.
Recent Developments
The new method focuses on integrating visual understanding directly into pre-trained large language models. This reuses the vast textual knowledge the LLM has already accumulated, so training can concentrate on "teaching" the model to align visual representations with the language embedding space. Research centers on converting visual signals into tokens the language model can interpret, so that image and text inputs are processed together.
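The idea of turning an image into tokens can be illustrated with a minimal sketch. It is not taken from Weng's post: it assumes the image is cut into fixed-size patches and that a single linear projection maps each patch into the language model's embedding space. The names `patchify`, `build_multimodal_sequence`, `projection`, and `token_embeddings` are hypothetical, the weights are random rather than trained, and real systems usually pass patches through a pre-trained vision encoder before projecting them.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def build_multimodal_sequence(image, text_token_ids, token_embeddings,
                              projection, patch_size):
    """Return one sequence of embeddings: visual tokens followed by text tokens."""
    patches = patchify(image, patch_size)            # (num_patches, patch_dim)
    visual_tokens = patches @ projection             # (num_patches, d_model)
    text_tokens = token_embeddings[text_token_ids]   # (num_text, d_model)
    return np.concatenate([visual_tokens, text_tokens], axis=0)


# Toy setup with random, untrained weights (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
d_model, vocab_size, patch_size = 64, 1000, 8

image = rng.random((32, 32, 3))                              # stand-in for a real image
token_embeddings = rng.normal(size=(vocab_size, d_model))    # stand-in for the LLM embedding table
projection = rng.normal(size=(patch_size * patch_size * 3, d_model)) * 0.02
text_token_ids = np.array([5, 42, 7])                        # stand-in for a tokenized prompt

sequence = build_multimodal_sequence(
    image, text_token_ids, token_embeddings, projection, patch_size
)
print(sequence.shape)  # (19, 64): 16 visual tokens + 3 text tokens
```

In this sketch the projection matrix is the only component that would need to learn the alignment between visual features and the language space; the embedding table stands in for the knowledge the pre-trained LLM already has.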
Why It Matters
For the AI development community in Vietnam, the shift toward vision-language models is an opportunity to use hardware resources more efficiently. Instead of maintaining several bulky, specialized models in parallel, engineers can fine-tune a single LLM to handle both visual and linguistic tasks. This is considered a crucial stepping stone toward multimodal AI agents capable of natural interaction and a deeper understanding of the surrounding physical world.