Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

1Mohamed bin Zayed University of Artificial Intelligence, UAE
2Center for Secure Cyber-Physical Systems, Khalifa University, UAE
3Michigan State University
4Linköping University
5Australian National University
(Under Review)

Current multi-modal large language models (MLLMs) struggle to achieve high adversarial robustness while maintaining strong vision-language reasoning. Methods such as TeCoA, FARE4, and Sim-CLIP4 perform constrained adversarial fine-tuning of CLIP to preserve the generalization capabilities of the pre-trained model. However, this limited adversarial training results in only modest robustness gains when the model is integrated into an MLLM framework. Moreover, the misalignment between adversarial CLIP training objectives and MLLMs' generative understanding creates a semantic alignment gap, impairing MLLMs' ability to perform complex visual reasoning. This leads us to explore whether current large-scale adversarially pre-trained vision encoders, which contain rich robust representations, can exhibit strong semantic alignment within the MLLM framework.

Left: We investigate the multimodal alignment of robust encoders by mapping their feature space, through a single linear layer, onto that of the pre-trained CLIP model, which provides a strong multimodal feature representation. The aligned robust encoders are then paired with CLIP's text encoder and evaluated on robust zero-shot classification, which measures how well they preserve robust multimodal alignment.
Right: The results demonstrate a strong correlation between model scale, training strategy, and robustness preservation during CLIP alignment. Small-scale models (e.g., ViT-B and ResNet-101) suffer significant robustness degradation post-alignment, with accuracy dropping below 60% across all datasets. In contrast, large-scale models (ViT-H and ViT-G) successfully retain their robustness while acquiring robust zero-shot capabilities. Leveraging this insight, we integrate these robust encoders into the LLaVA framework, achieving strong adversarial robustness and semantic alignment in MLLMs without additional specialized adversarial training.
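A minimal PyTorch sketch of this alignment probe is given below. The encoder stand-ins, feature dimensions, and the MSE regression loss are illustrative assumptions rather than the paper's exact training recipe: only the linear layer is trained, and zero-shot prediction is done by cosine similarity against pre-computed CLIP text embeddings.

import torch
import torch.nn as nn

# Assumed dimensions; the real encoders are large ViTs, lightweight stand-ins
# are used here so the sketch runs end to end.
ROBUST_DIM, CLIP_DIM, NUM_CLASSES = 1280, 768, 1000

robust_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, ROBUST_DIM)).eval()
clip_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, CLIP_DIM)).eval()
clip_text_features = torch.randn(NUM_CLASSES, CLIP_DIM)  # pre-computed prompt embeddings

# Only the linear alignment layer is optimized; both encoders stay frozen.
align = nn.Linear(ROBUST_DIM, CLIP_DIM)
opt = torch.optim.AdamW(align.parameters(), lr=1e-3)

def alignment_step(images):
    """Regress projected robust features onto frozen CLIP image features (assumed MSE loss)."""
    with torch.no_grad():
        feats = robust_encoder(images)
        target = clip_image_encoder(images)
    loss = nn.functional.mse_loss(align(feats), target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def zero_shot_logits(images):
    """Robust zero-shot classification via cosine similarity to CLIP text embeddings."""
    feats = align(robust_encoder(images))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    text = clip_text_features / clip_text_features.norm(dim=-1, keepdim=True)
    return feats @ text.t()

images = torch.randn(8, 3, 224, 224)  # dummy batch
print(alignment_step(images), zero_shot_logits(images).shape)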

Robustness score of Robust-LLaVA4 on downstream vision-language tasks with adversarial examples crafted at ε = 4/255: The original CLIP exhibits minimal robustness. Our proposed Robust-LLaVA4 outperforms the state-of-the-art FARE4 and Sim-CLIP4 in robustness score across all tasks and diverse datasets, while maintaining high clean performance. (Accuracy is reported for VQAv2 and TextVQA, while the CIDEr score is reported for Flickr30k and COCO.)

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language tasks, but their reliance on visual processing introduces critical security vulnerabilities. Their vision encoders remain susceptible to adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms while maintaining coherent language generation. Current approaches attempt to address this by adversarially fine-tuning CLIP vision encoders on ImageNet-scale data, but exhibit inherent limitations in both robustness and generalization due to the restricted scale and diversity of adversarial training. In this work, we present an alternative approach by leveraging vision encoders adversarially pre-trained on billion-scale image-text pairs. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these encoders to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, and (2) end-to-end MLLM optimization with these robust encoders facilitates enhanced adaptation of language components to robust visual features, substantially outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust encoders achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against advanced jailbreaking attacks compared to state-of-the-art methods.

Untargeted Attack on Image Captioning Task

Targeted Attack on Image Captioning Task

Untargeted Attack on Visual Question Answering (VQA) Task

Common Corruptions on Image Captioning Task

Increasing Corruption Severity

Quantitative Results


On untargeted attacks, across six datasets covering image captioning and visual question answering tasks, both Robust-LLaVA4G and Robust-LLaVA4H maintain reasonable clean performance while achieving substantial robustness improvements over FARE4 and Sim-CLIP4, striking the right balance between clean and adversarial generalization.
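The untargeted threat model behind these results perturbs each image within an l_inf budget so as to maximize the MLLM's loss on the ground-truth caption or answer. A minimal PGD sketch of this attack follows; the loss-function stand-in, step size, and iteration count are assumptions, and the actual evaluation may rely on stronger optimizers.

import torch

def pgd_attack(image, loss_fn, eps=4/255, alpha=1/255, steps=100):
    """Untargeted l_inf PGD: maximize loss_fn within an eps-ball around image."""
    adv = (image + torch.empty_like(image).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(adv), adv)
        adv = adv.detach() + alpha * grad.sign()       # ascend on the loss
        adv = image + (adv - image).clamp(-eps, eps)   # project back to the eps-ball
        adv = adv.clamp(0, 1).detach()                 # keep a valid image
    return adv

# Usage with a dummy differentiable loss standing in for the MLLM's caption loss.
img = torch.rand(1, 3, 336, 336)
adv = pgd_attack(img, loss_fn=lambda x: (x - 0.5).pow(2).sum())
assert (adv - img).abs().max() <= 4/255 + 1e-6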


Both FARE4 and Sim-CLIP4 show robustness against targeted attacks, but break in a few cases at high perturbation budgets (ε = 8/255). In contrast, Robust-LLaVA4G and Robust-LLaVA4H remain fully robust to these attacks even at high perturbation budgets. This indicates a strong resistance to generating the attacker's targeted output. The robustness of Robust-LLaVA4G stands out further as it continues to generate high-quality captions for adversarial examples, maintaining a strong CIDEr score.
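Conceptually, the targeted attack flips this objective: rather than maximizing the captioning loss, the adversary minimizes the model's negative log-likelihood of a chosen target string so that the attacker's desired caption becomes the most likely output. The sketch below illustrates only this objective; caption_nll is a hypothetical callable, not the authors' attack implementation.

import torch

def targeted_pgd(image, caption_nll, target_text, eps=8/255, alpha=1/255, steps=100):
    """Targeted l_inf PGD: make the MLLM's target-string NLL as small as possible.

    caption_nll(image, text) is assumed to return a differentiable negative
    log-likelihood of `text` given `image` under the attacked MLLM.
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad, = torch.autograd.grad(caption_nll(adv, target_text), adv)
        adv = adv.detach() - alpha * grad.sign()       # descend: make the target likely
        adv = image + (adv - image).clamp(-eps, eps)
        adv = adv.clamp(0, 1).detach()
    return adv

# An attack is typically counted as successful if the generated caption contains
# target_text; the results above show Robust-LLaVA resisting this even at eps = 8/255.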


Robustness evaluation of MLLMs against ensemble-based SSA-CWA transfer attacks using the MultiTrust benchmarking framework. Adversarial examples are crafted using an ensemble of diverse vision models, with perturbations designed to mislead object recognition. Model performance is assessed on 100 relabeled NIPS17 images, with GPT-4 determining the correctness of generated descriptions. The figure illustrates that Robust-LLaVA4G and Robust-LLaVA4H achieve 10-12% higher accuracy than their closest robust counterparts, demonstrating superior resilience against highly transferable adversarial attacks.
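As a simplified illustration of such transfer attacks, the sketch below averages a classification loss over several surrogate models and perturbs the image against that ensemble before handing it to the target MLLM. It deliberately omits the spectrum-simulation augmentation and common-weakness terms of the full SSA-CWA attack, and the surrogate models shown are lightweight placeholders.

import torch
import torch.nn as nn

# Lightweight placeholder surrogates; in practice these would be diverse
# pre-trained vision models (ViTs, ConvNets, ...).
surrogates = [
    nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 1000))
    for _ in range(3)
]
labels = torch.tensor([281])  # label the perturbation should move away from

def ensemble_transfer_attack(image, eps=8/255, alpha=1/255, steps=50):
    """Craft a perturbation against the average surrogate loss, then transfer it."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = sum(nn.functional.cross_entropy(m(adv), labels) for m in surrogates) / len(surrogates)
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)
        adv = adv.clamp(0, 1).detach()
    return adv

adv = ensemble_transfer_attack(torch.rand(1, 3, 224, 224))  # then fed to the target MLLM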


Comparison of various vision encoders integrated with LLaVA against white-box (VisualAdv) and black-box (HADES) jailbreak attacks. The white-box results (Table 3) show that LLaVA with the original CLIP encoder is the most vulnerable, producing the highest number of toxic outputs. In contrast, our Robust-LLaVA4G and Robust-LLaVA4H models significantly reduce toxic content generation. The black-box results (Table 4) highlight the effectiveness of different models against HADES attacks, with the original CLIP encoder exhibiting the highest Attack Success Rate (ASR). In contrast, our Robust-LLaVA models achieve the lowest ASR, demonstrating superior resilience across multiple adversarial scenarios.


Evaluation of vision encoder ensembles within the MLLM framework, assessing their robustness across multiple benchmarks. Our analysis reveals that an ensemble’s robustness is limited by its weakest vision encoder. Across all configurations, we observe that the most vulnerable component dictates the overall robustness, highlighting the importance of reinforcing individual vision encoders to strengthen ensemble resilience.
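A minimal sketch of how such an encoder ensemble can be wired into the MLLM's vision tower is given below: each encoder's patch tokens are projected to the LLM embedding width and concatenated before being passed to the language model. Module names, dimensions, and the concatenation-style fusion are illustrative assumptions rather than the exact configuration evaluated in the paper.

import torch
import torch.nn as nn

class DummyPatchEncoder(nn.Module):
    """Stand-in for a ViT-style encoder emitting (batch, num_patches, dim) tokens."""
    def __init__(self, dim, patch=14):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):
        return self.proj(images).flatten(2).transpose(1, 2)

class EnsembleVisionTower(nn.Module):
    """Project each encoder's tokens to the LLM width and concatenate them."""
    def __init__(self, encoders, dims, llm_dim=4096):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.projectors = nn.ModuleList([nn.Linear(d, llm_dim) for d in dims])

    def forward(self, images):
        tokens = [proj(enc(images)) for enc, proj in zip(self.encoders, self.projectors)]
        # The LLM attends over all concatenated tokens, so adversarial features
        # injected through a single weak encoder can still steer the response.
        return torch.cat(tokens, dim=1)

tower = EnsembleVisionTower([DummyPatchEncoder(1024), DummyPatchEncoder(1280)], dims=[1024, 1280])
print(tower(torch.randn(2, 3, 336, 336)).shape)  # (2, 2 * 576, 4096)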


Assessment of prompt formatting strategies during inference to enhance model robustness against adversarial examples in the image captioning task. Results reveal that strategic prompt modifications can improve robustness; however, this approach remains susceptible to adaptive attacks, where adversaries incorporate the modified prompts when crafting adversarial examples and thereby bypass the defense (see the sketch below).
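The snippet below makes this concrete with a hypothetical defensive prompt: since the defense only changes the text fed to the model, an adaptive adversary simply optimizes the perturbation against that same prompt, and the protection disappears. All prompts and callables here are illustrative assumptions.

# Undefended baseline prompt, shown for contrast with the reformulated one.
BASE_PROMPT = "Provide a short caption for the image."
ROBUST_PROMPT = (
    "The image may contain adversarial noise. Describe only the objects you can "
    "clearly identify, then give a short caption."
)

def defended_query(mllm_generate, image):
    """Inference-time defense: swap in the reformulated prompt."""
    return mllm_generate(image, ROBUST_PROMPT)

def adaptive_attack(mllm_loss, image, reference_caption, attack_fn):
    """Adaptive adversary: craft the perturbation against the defended prompt itself."""
    return attack_fn(image, loss_fn=lambda x: mllm_loss(x, ROBUST_PROMPT, reference_caption))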


Analysis of hallucination behavior in vision-language models on the POPE dataset. Results indicate that ensembling robust models with CLIP enhances hallucination mitigation. However, this trend does not hold for ensembles incorporating the adversarially fine-tuned CLIP variant, FARE4, which exhibits reduced generalization. Among robust models, Robust-LLaVA4G and Robust-LLaVA4H demonstrate the best performance in mitigating object hallucinations, showcasing superior reliability in object recognition while maintaining robustness.

BibTeX

@article{malik2025robust,
  title={Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models},
  author={Malik, Hashmat Shadab and Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik and Khan, Fahad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.01576},
  year={2025}
}