AI Chat
Brainstorm ideas, write code, and solve complex problems instantly. AI Chat connects you with powerful language models running entirely on your device—giving you full privacy and 100% offline access to your own personal AI assistant.
Model Selection
Tap the model selector at the top of the chat screen to choose from downloaded models. The app supports two inference engines:
- GGUF (llama.cpp): Quantized models in GGUF format. Broad model support, efficient memory usage. Works on all devices.
- MLX: Apple's MLX framework optimized for Apple Silicon. Fastest performance on Macs, also available on newer iPhones and iPads.
You can import custom GGUF models from Hugging Face by providing the direct download URL in Settings → Model Management.
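Hugging Face serves raw repository files through its standard `/resolve/<revision>/` path, so a direct download URL can be assembled from the repo ID and filename. A minimal sketch (the repo and filename below are examples; substitute the model you actually want):

```python
def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Hugging Face exposes raw files at /resolve/<revision>/<filename>.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# Example: a quantized Mistral build (placeholder values; use your own repo/file).
url = gguf_download_url("TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
                        "mistral-7b-instruct-v0.2.Q4_K_M.gguf")
print(url)
```

Paste the resulting URL into Settings → Model Management to start the import.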
Conversation Modes
On Device AI supports two conversation modes:
- Standard Mode: Direct conversation with one AI model. Best for general chat, Q&A, and coding tasks.
- Chat Flow Mode: Multi-agent conversation with multiple AI participants. See the Chat Flows guide for details.
Context Window
The context window determines how much conversation history the AI can see. Larger context windows allow for longer, more coherent conversations but use more memory.
You can adjust the context size in Settings → Chat. The default is optimized for your device's available RAM.
Setting the context window too large may cause the app to run out of memory, especially on devices with limited RAM. Stick to the recommended default if unsure.
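To see why a large context window costs memory, here is a rough back-of-the-envelope estimate of the KV cache a transformer keeps per token. This is a generic formula, not the app's exact accounting; the model figures below are an assumed Llama-3-8B-style configuration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed config: 32 layers, 8 KV heads (GQA), head dim 128, fp16 cache.
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.1f} GiB")  # → 1.0 GiB at an 8192-token context
```

Doubling the context doubles this cache, on top of the model weights themselves, which is why oversized settings can exhaust RAM on smaller devices.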
Attachments & Sharing
You can bring content into your conversations in several ways:
- Images: Photos from your camera or library. Text is extracted via OCR, or the image is analyzed directly when using a vision-language (VLM) model.
- Documents: PDFs and text files. Content is extracted and indexed for AI reference.
- URLs: Web pages are fetched, converted to markdown, and indexed. A thumbnail screenshot is captured for reference.
Sharing from Outside the App
- iOS Share Sheet: Highlight text, select URLs, or pick images in any app and share them directly to On Device AI. Shared text automatically becomes your chat prompt, and images are imported as attachments.
- Drag-and-Drop / Pasteboard: On macOS and visionOS, you can drag images or text directly into the chat window, or paste them from your clipboard.
Reasoning & Thinking
Models that support reasoning (such as DeepSeek or Qwen 3 with thinking enabled) can show their chain-of-thought process in a collapsible "Thinking" section above the response.
You can control the default expansion behavior in Settings → Chat → "Show reasoning by default".
Tool Calling
Compatible models can use built-in tools during conversation:
- Web Search: Search the web for current information and synthesize results
- Web Fetch: Fetch and read web page content as markdown
- Calculator: Perform mathematical calculations
- Global Memory: Store and recall personal preferences and key facts across conversations
- HTTP Request: Make GET, POST, PUT, and PATCH requests to any endpoint — with custom headers, body, timeout, and optional file download mode
Tool calling is automatic when the AI determines it needs external information. You can enable/disable individual tools and configure per-tool default parameters in Settings → Tool Calling.
Customizing Tool Order: You can rearrange the order of tools in Settings → Tool Calling by dragging and dropping them.
See Tool Calling for a full guide to each tool and its configuration options.
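The app's internal tool-call format isn't documented here, so the sketch below uses the common OpenAI-style function-call shape that most tool-capable open models emit; the tool name and arguments are illustrative of what an HTTP Request call might look like:

```python
import json

# Hypothetical tool-call payload (names and fields are assumptions,
# modeled on the widely used OpenAI-style function-call format).
tool_call = {
    "name": "http_request",
    "arguments": {
        "method": "GET",
        "url": "https://api.example.com/status",
        "headers": {"Accept": "application/json"},
        "timeout": 10,
    },
}
print(json.dumps(tool_call, indent=2))
```

When a compatible model emits a structure like this, the app executes the tool and feeds the result back into the conversation.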
Chat Settings
Customize your chat experience in Settings → Chat:
- Temperature: Controls response creativity (lower = more deterministic, higher = more creative)
- System Prompt: Set a default instruction for all conversations
- Context Size: Adjust the maximum context window
- Show Reasoning: Default expansion state for thinking blocks
- Auto-play Voice: Automatically speak AI responses
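Temperature works by scaling the model's logits before the softmax that turns them into token probabilities. A minimal sketch of the math behind the setting:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature before applying softmax:
    # T < 1 sharpens the distribution (more deterministic),
    # T > 1 flattens it (more varied / creative).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on the top token
print(softmax_with_temperature(logits, 2.0))  # probabilities spread more evenly
```

This is why a low temperature makes the model repeat its most likely answer and a high one encourages variety.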
Advanced Settings
For power users, the Advanced Chat Settings section provides deeper control over model behavior:
Performance & System
- Force Load to RAM (MLock): By default, the OS may compress inactive memory to save space. Enabling this forces the model to stay in active RAM, preventing "warm-up" delays when you return to the app. Use this if you have plenty of RAM and want instant responses.
- Flash Attention: An optimized computation technique that speeds up processing and reduces memory usage for compatible models (mostly standard Transformers). Leave on "Auto" for best performance.
- Batch Size: Controls how many tokens of the prompt are processed at once. Higher values are faster but use more memory. Lower this if the app crashes while processing long prompts.
Generation Parameters
- Mirostat 2.0: An alternative to standard Temperature sampling. Instead of sampling with a fixed randomness, it continuously adjusts the "surprise" level of the output (measured via perplexity) to keep it coherent yet interesting. Great for creative writing where standard temperature feels too chaotic or too boring.
- Top-P & Top-K: These filter the vocabulary the AI samples from. Top-K limits choices to the K most likely tokens. Top-P (nucleus sampling) keeps the smallest set of top tokens whose cumulative probability reaches P. Tweaking these can reduce repetition or gibberish.
- Attention Type: Causal is standard for chat models. Non-Causal is rarely used for specific architectures. Generally, keep this on "Unspecified" or "Causal".
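The Top-K and Top-P filters described above can be sketched in a few lines. This is the standard technique, not the app's exact implementation, and it assumes an already-normalized probability list:

```python
def top_k_top_p_filter(probs, k=40, p=0.9):
    # 1) Top-K: keep only the k most likely tokens.
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    # 2) Top-P (nucleus): of those, keep the smallest set whose
    #    cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens before sampling from them.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With p=0.9, the first three tokens (0.5 + 0.3 + 0.1) survive the filter.
print(top_k_top_p_filter([0.5, 0.3, 0.1, 0.05, 0.05], k=4, p=0.9))
```

Lowering either value makes output safer and more repetitive; raising them admits rarer tokens at the risk of gibberish.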
Context Management
- Smart Context Truncation: When the conversation gets too long, this feature intelligently removes the oldest messages while preserving your System Prompt and key instructions. Ensures the AI doesn't "forget" its role even in long chats.
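A simplified sketch of the idea behind truncation (not the app's actual algorithm): drop the oldest user/assistant turns first while always preserving the system prompt. The word-count token estimate here is a stand-in for a real tokenizer:

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda m: len(m["content"].split())):
    # Keep the system prompt; drop oldest chat turns until within budget.
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    while chat and sum(map(count_tokens, system + chat)) > max_tokens:
        chat.pop(0)  # remove the oldest user/assistant turn
    return system + chat

history = [
    {"role": "system", "content": "You are a concise assistant"},
    {"role": "user", "content": "first question about apples"},
    {"role": "assistant", "content": "an answer about apples"},
    {"role": "user", "content": "latest question"},
]
print([m["content"] for m in truncate_history(history, max_tokens=10)])
# → ['You are a concise assistant', 'latest question']
```

The system prompt survives no matter how long the chat grows, which is what keeps the AI from forgetting its role.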
Web Content
- Web Readability: When enabled, the app uses Mozilla's Readability algorithm to strip ads, navigation, and clutter from web pages before the AI reads them. Disable this if you need the AI to see raw HTML structure or hidden content.
UI Customization
- History Display Mode: Customize how conversation history is presented. Choose between different styles to match your reading preference.