Today we are introducing LLaVA-MORE, a family of models that enhances the well-known LLaVA architecture by integrating, for the first time, LLaMA 3.1 as the language model. These models improve visual understanding, generation, and reasoning, and excel across a wide range of multimodal benchmarks.
The first 8B model is available to download now directly from Hugging Face, with more releases coming soon. To help the research community advance Multimodal LLM performance, we are also releasing the training code and scripts for distributed training.
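If you want to try the released checkpoint locally, the snippet below is a minimal sketch of pulling the weights with the `huggingface_hub` client; the repository ID shown is illustrative, so replace it with the exact model ID from our Hugging Face collection.

```python
# Minimal sketch: download the LLaVA-MORE 8B checkpoint from Hugging Face.
# NOTE: the repo_id below is a placeholder; use the exact model ID listed
# in the LLaVA-MORE Hugging Face collection.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="aimagelab/LLaVA-MORE-8B",  # hypothetical ID, check the collection
    local_dir="./llava-more-8b",        # where to store the weights locally
)
print(f"Model files downloaded to: {local_path}")
```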
This work is part of the PNRR-M4C2 project FAIR - Future Artificial Intelligence Research, Transversal Project on Vision, Language and Multimodal Challenges. A special thanks to CINECA for providing the high-performance computing resources that made training LLaVA-MORE possible.
Download the LLaVA-MORE models from our GitHub repository and our Hugging Face collection!