AI Chat
Brainstorm ideas, write code, and solve complex problems instantly. AI Chat connects you with powerful language models running entirely on your device—giving you full privacy and 100% offline access to your own personal AI assistant.
Model Selection
Tap the model selector at the top of the chat screen to choose from downloaded models. The app supports two inference engines:
- GGUF (llama.cpp): Quantized models in GGUF format. Broad model support, efficient memory usage. Works on all devices.
- MLX: Apple's MLX framework optimized for Apple Silicon. Fastest performance on Macs, also available on newer iPhones and iPads.
You can import custom GGUF models from Hugging Face by providing the direct download URL in Settings → Model Management.
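Hugging Face serves raw repository files through its standard `/resolve/<revision>/` path, so a direct download URL can be assembled from the repo ID and filename. A minimal sketch (the repo and filename below are examples; substitute the model you actually want):

```python
def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Hugging Face exposes raw files at /resolve/<revision>/<filename>.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# Example: a quantized Mistral build (placeholder values; use your own repo/file).
url = gguf_download_url("TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
                        "mistral-7b-instruct-v0.2.Q4_K_M.gguf")
print(url)
```

Paste the resulting URL into Settings → Model Management to start the import.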
Conversation Modes
On Device AI supports two conversation modes:
- Standard Mode: Direct conversation with one AI model. Best for general chat, Q&A, and coding tasks.
- Chat Flow Mode: Multi-agent conversation with multiple AI participants. See the Chat Flows guide for details.
Context Window
The context window determines how much conversation history the AI can see. Larger context windows allow for longer, more coherent conversations but use more memory.
You can adjust the context size in Settings → Chat. The default is optimized for your device's available RAM.
Setting the context window too large may cause the app to run out of memory, especially on devices with limited RAM. Stick to the recommended default if unsure.
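To see why a large context window costs memory, here is a rough back-of-the-envelope estimate of the KV cache a transformer keeps per token. This is a generic formula, not the app's exact accounting; the model figures below are an assumed Llama-3-8B-style configuration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed config: 32 layers, 8 KV heads (GQA), head dim 128, fp16 cache.
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.1f} GiB")  # → 1.0 GiB at an 8192-token context
```

Doubling the context doubles this cache, on top of the model weights themselves, which is why oversized settings can exhaust RAM on smaller devices.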
Attachments & Sharing
You can bring content into your conversations in several ways:
- Images: Photos from your camera or library. Text is extracted via OCR, or the image is analyzed directly when using a vision-language (VLM) model.
- Documents: PDFs and text files. Content is extracted and indexed for AI reference.
- URLs: Web pages are fetched, converted to markdown, and indexed. A thumbnail screenshot is captured for reference.
Sharing from Outside the App
- iOS Share Sheet: Highlight text, select URLs, or pick images in any app and share them directly to On Device AI. Shared text automatically becomes your chat prompt, and images are imported as attachments.
- Drag-and-Drop / Pasteboard: On macOS and visionOS, you can drag images or text directly into the chat window, or paste them from your clipboard.
Reasoning & Thinking
Models that support reasoning (such as DeepSeek or Qwen 3 with thinking enabled) can show their chain-of-thought process in a collapsible "Thinking" section above the response.
You can control the default expansion behavior in Settings → Chat → "Show reasoning by default".
Tool Calling
Compatible models can use built-in tools during conversation:
- Web Search: Search the web for current information and synthesize results
- Web Fetch: Fetch and read web page content as markdown
- Calculator: Perform mathematical calculations
- Global Memory: Store and recall personal preferences and key facts across conversations
- HTTP Request: Make GET, POST, PUT, and PATCH requests to any endpoint — with custom headers, body, timeout, and optional file download mode
Tool calling is automatic when the AI determines it needs external information. You can enable/disable individual tools and configure per-tool default parameters in Settings → Tool Calling.
Customizing Tool Order: You can rearrange the order of tools in Settings → Tool Calling by dragging and dropping them.
See Tool Calling for a full guide to each tool and its configuration options.
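The app's internal tool-call format isn't documented here, so the sketch below uses the common OpenAI-style function-call shape that most tool-capable open models emit; the tool name and arguments are illustrative of what an HTTP Request call might look like:

```python
import json

# Hypothetical tool-call payload (names and fields are assumptions,
# modeled on the widely used OpenAI-style function-call format).
tool_call = {
    "name": "http_request",
    "arguments": {
        "method": "GET",
        "url": "https://api.example.com/status",
        "headers": {"Accept": "application/json"},
        "timeout": 10,
    },
}
print(json.dumps(tool_call, indent=2))
```

When a compatible model emits a structure like this, the app executes the tool and feeds the result back into the conversation.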
Chat Settings
Customize your chat experience in Settings → Chat:
- Temperature: Controls response creativity (lower = more deterministic, higher = more creative)
- System Prompt: Set a default instruction for all conversations
- Context Size: Adjust the maximum context window
- Show Reasoning: Default expansion state for thinking blocks
- Auto-play Voice: Automatically speak AI responses
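Temperature works by scaling the model's logits before the softmax that turns them into token probabilities. A minimal sketch of the math behind the setting:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature before applying softmax:
    # T < 1 sharpens the distribution (more deterministic),
    # T > 1 flattens it (more varied / creative).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on the top token
print(softmax_with_temperature(logits, 2.0))  # probabilities spread more evenly
```

This is why a low temperature makes the model repeat its most likely answer and a high one encourages variety.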
Advanced Settings
For power users, the Advanced Chat Settings section provides deeper control over model behavior:
Performance & System
- Force Load to RAM (MLock): By default, the OS may compress inactive memory to save space. Enabling this forces the model to stay in active RAM, preventing "warm-up" delays when you return to the app. Use this if you have plenty of RAM and want instant responses.
- Flash Attention: An optimized computation technique that speeds up processing and reduces memory usage for compatible models (mostly standard Transformers). Leave on "Auto" for best performance.
- Batch Size: Controls how many tokens of the prompt are processed at once. Higher values are faster but use more memory. Lower this if the app crashes while processing long prompts.
Generation Parameters
- Mirostat 2.0: An alternative to standard Temperature sampling. Instead of sampling with a fixed randomness, it continuously adjusts the "surprise" level of the output (measured via perplexity) to keep it coherent yet interesting. Great for creative writing where standard temperature feels too chaotic or too boring.
- Top-P & Top-K: These filter the vocabulary the AI samples from. Top-K limits choices to the K most likely tokens. Top-P (nucleus sampling) keeps the smallest set of top tokens whose cumulative probability reaches P. Tweaking these can reduce repetition or gibberish.
- Attention Type: Causal is standard for chat models. Non-Causal is rarely used for specific architectures. Generally, keep this on "Unspecified" or "Causal".
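The Top-K and Top-P filters described above can be sketched in a few lines. This is the standard technique, not the app's exact implementation, and it assumes an already-normalized probability list:

```python
def top_k_top_p_filter(probs, k=40, p=0.9):
    # 1) Top-K: keep only the k most likely tokens.
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    # 2) Top-P (nucleus): of those, keep the smallest set whose
    #    cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens before sampling from them.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With p=0.9, the first three tokens (0.5 + 0.3 + 0.1) survive the filter.
print(top_k_top_p_filter([0.5, 0.3, 0.1, 0.05, 0.05], k=4, p=0.9))
```

Lowering either value makes output safer and more repetitive; raising them admits rarer tokens at the risk of gibberish.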
Context Management
- Smart Context Truncation: When the conversation gets too long, this feature intelligently removes the oldest messages while preserving your System Prompt and key instructions. Ensures the AI doesn't "forget" its role even in long chats.
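A simplified sketch of the idea behind truncation (not the app's actual algorithm): drop the oldest user/assistant turns first while always preserving the system prompt. The word-count token estimate here is a stand-in for a real tokenizer:

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda m: len(m["content"].split())):
    # Keep the system prompt; drop oldest chat turns until within budget.
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    while chat and sum(map(count_tokens, system + chat)) > max_tokens:
        chat.pop(0)  # remove the oldest user/assistant turn
    return system + chat

history = [
    {"role": "system", "content": "You are a concise assistant"},
    {"role": "user", "content": "first question about apples"},
    {"role": "assistant", "content": "an answer about apples"},
    {"role": "user", "content": "latest question"},
]
print([m["content"] for m in truncate_history(history, max_tokens=10)])
# → ['You are a concise assistant', 'latest question']
```

The system prompt survives no matter how long the chat grows, which is what keeps the AI from forgetting its role.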
Web Content
- Web Readability: When enabled, the app uses Mozilla's Readability algorithm to strip ads, navigation, and clutter from web pages before the AI reads them. Disable this if you need the AI to see raw HTML structure or hidden content.
UI Customization
- History Display Mode: Customize how conversation history is presented. Choose between different styles to match your reading preference.