Vision Models

Vision Language Models (VLMs) let you analyze images, screenshots, diagrams, and documents directly in your conversations. The AI can "see" and understand visual content.

Supported Models

On Device AI supports vision models through both inference engines:

Vision models are identified by a [VLM] tag in the model picker.
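
For those curious how that tag comes about, here is a minimal Swift sketch of how a model entry might carry a vision-capability flag that drives the picker label. The type and property names are illustrative assumptions, not the app's actual code.

```swift
// Illustrative sketch only: how a model entry might record vision capability
// so the picker can append a [VLM] badge. Names are assumptions, not the app's API.
struct ModelEntry {
    let name: String
    let supportsVision: Bool   // read from the model's metadata at load time

    /// Label shown in the model picker, e.g. "Qwen2-VL-2B [VLM]".
    var pickerLabel: String {
        supportsVision ? "\(name) [VLM]" : name
    }
}

// Filtering a catalog down to vision-capable models:
let models = [
    ModelEntry(name: "Llama-3.2-3B", supportsVision: false),
    ModelEntry(name: "Qwen2-VL-2B", supportsVision: true),
]
let visionModels = models.filter(\.supportsVision)
```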

Using Vision Models

  1. Select a vision model

    Choose a vision-capable model from the model picker (look for the [VLM] tag).

  2. Attach an image

    Use the camera button, photo library, or paste an image into the chat.

  3. Ask about the image

    Type your question about the image. Examples: "What's in this image?", "Read the text in this screenshot", "Describe this diagram".
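
Behind those three steps, the request the app ultimately makes pairs your question with the prepared image. The sketch below shows the general shape, assuming a hypothetical `VLMSession` type; on-device vision models usually benefit from downscaling large photos before inference.

```swift
import UIKit

// Hypothetical session type; the app's real inference API is not shown in this guide.
protocol VLMSession {
    func respond(to prompt: String, image: UIImage) async throws -> String
}

// Sketch of steps 2–3: prepare the attached image, then send it with your question.
func askAboutImage(_ question: String, image: UIImage, session: VLMSession) async throws -> String {
    // On-device vision models work best with modest image sizes, so downscale large photos.
    let maxDimension: CGFloat = 1024
    let scale = min(1, maxDimension / max(image.size.width, image.size.height))
    let targetSize = CGSize(width: image.size.width * scale, height: image.size.height * scale)
    let prepared = UIGraphicsImageRenderer(size: targetSize).image { _ in
        image.draw(in: CGRect(origin: .zero, size: targetSize))
    }
    // The question and the image travel together to the active vision model.
    return try await session.respond(to: question, image: prepared)
}
```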

VLM vs OCR Processing

The app handles images differently depending on whether you're using a vision model:

Image Sharing and Ingestion

You can bring images into the app from outside the conversation:

Shared images are routed based on the active model: a vision-capable model receives the image directly for vision inference, while a text-only model falls back to OCR and receives the extracted text instead.
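
In other words, the routing decision is essentially a capability check on the active model. Here is a minimal sketch of that idea, using illustrative names rather than the app's internal types:

```swift
import UIKit

enum IngestedContent {
    case image(UIImage)   // vision model: the model sees the image itself
    case text(String)     // text-only model: the model sees OCR output instead
}

// `modelSupportsVision` and the `ocr` helper are stand-ins for the app's internals.
func route(_ image: UIImage, modelSupportsVision: Bool,
           ocr: (UIImage) throws -> String) rethrows -> IngestedContent {
    if modelSupportsVision {
        return .image(image)          // vision inference path
    } else {
        return .text(try ocr(image))  // OCR fallback path
    }
}
```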

💡 Tip

For analyzing charts, diagrams, or complex visual layouts, use a vision model. For simple text extraction from screenshots, either approach works well.
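
For reference, the "simple text extraction" path is plain OCR. On iOS that is typically done with Apple's Vision framework; whether the app uses exactly this API is an assumption, but the sketch below shows the general shape:

```swift
import UIKit
import Vision

// Recognize text in a screenshot with the Vision framework (assumed OCR backend).
func recognizeText(in image: UIImage) throws -> String {
    guard let cgImage = image.cgImage else { return "" }
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // screenshots favor accuracy over speed
    try VNImageRequestHandler(cgImage: cgImage).perform([request])
    let lines = request.results?.compactMap { $0.topCandidates(1).first?.string } ?? []
    return lines.joined(separator: "\n")
}
```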

Camera Integration

On iOS, you can use the camera directly within the app to capture images for analysis. This is great for:
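
If you're wondering what in-app capture looks like at the code level, here is a hedged UIKit sketch using the system image picker; the app's actual camera UI may be implemented differently.

```swift
import UIKit

// Illustrative only: present the system camera and hand the captured photo onward.
final class CameraCapture: NSObject, UIImagePickerControllerDelegate, UINavigationControllerDelegate {
    var onCapture: ((UIImage) -> Void)?

    func present(from viewController: UIViewController) {
        guard UIImagePickerController.isSourceTypeAvailable(.camera) else { return }
        let picker = UIImagePickerController()
        picker.sourceType = .camera
        picker.delegate = self
        viewController.present(picker, animated: true)
    }

    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        picker.dismiss(animated: true)
        if let photo = info[.originalImage] as? UIImage {
            onCapture?(photo)   // e.g. attach the photo to the conversation for analysis
        }
    }
}
```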

Tips for Best Results