Text to speech

On-device TTS, voice cloning, and cleaner reference audio

Text-to-speech gets much more useful when it is private, repeatable, and easy to control. On Device AI now gives users more local voice model choices, saved cloned voices for supported models, and optional cleanup for noisy reference audio.

Voice generation should not feel like renting a microphone from the cloud

People use TTS for different reasons. Some want AI responses read aloud. Some want to create audio files from notes or drafts. Some want a familiar voice for narration, study material, or personal workflows.

The problem is that many voice tools send the text, the audio, or both to a hosted service. That may be fine for public copy. It is a different question when the text is a private document, a client note, or a personal recording used as a reference voice.

On Device AI keeps the voice workflow close to the user. The app can generate speech locally, store voice choices locally, and offer heavier models only where the device can handle them.

More local TTS models, less one-size-fits-all thinking

The TTS side of On Device AI now covers a wider range of needs. Apple built-in voices are still the lightest path. Kokoro gives users a compact neural voice option. PocketTTS focuses on low-latency English speech. Newer model choices such as Qwen3TTS, CosyVoice3, and VibeVoice add more room for local neural speech and voice-reference workflows.

This is choice architecture, not a model-name contest. A quick read-aloud should not require the same setup as a saved voice workflow. A Mac with more memory can offer options that would be a poor fit for a smaller device. On Device AI handles those differences in the product instead of making every user sort through the whole catalog.

When a selected voice model needs a download, the app asks first. When resources are tight, it can fall back to a lighter Apple voice path. That keeps the feature usable instead of turning every voice request into an all-or-nothing bet.
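That download-then-fallback behavior boils down to a simple decision: use the requested engine only when it is already on disk and the device has headroom, otherwise route the request through the lighter Apple voice path. A minimal sketch of that logic follows; the engine names, field names, and RAM threshold are illustrative assumptions, not the app's actual API.

```python
from dataclasses import dataclass

# Stand-in identifier for the lightest built-in voice path.
APPLE_BUILT_IN = "apple-built-in"

@dataclass
class EngineRequirement:
    engine: str          # e.g. a hypothetical "qwen3-tts" identifier
    min_free_ram_mb: int # assumed memory headroom the engine needs
    is_downloaded: bool  # whether the model files are already local

def select_engine(requested: EngineRequirement, free_ram_mb: int) -> str:
    """Use the requested engine when it fits; otherwise fall back to
    the lightest built-in path so the request still succeeds."""
    if requested.is_downloaded and free_ram_mb >= requested.min_free_ram_mb:
        return requested.engine
    return APPLE_BUILT_IN
```

The point of the sketch is the shape of the decision: a voice request never dead-ends just because a heavy model is missing or the device is short on memory.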

How voice cloning works for users

For supported voice models, voice cloning starts with a reference sample. The user can record a short clip or import audio, preview the result, give the voice a name, and save it. Once saved, the cloned voice appears in the user's voice list and can be selected like any other voice.

Some workflows can also use a reference transcript when the model benefits from knowing exactly what was spoken in the sample. The app keeps the setup flow visible and direct: choose the source, review it, add the text if needed, then save the voice.
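The save step described above can be sketched as a small record plus a validate-then-save function. This is an illustrative data shape, not the app's real storage schema; every field and function name here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClonedVoice:
    name: str                                  # the user-chosen voice name
    reference_audio_path: str                  # the recorded or imported sample
    reference_transcript: Optional[str] = None # only when the model wants one

def save_voice(voices: list, name: str, audio_path: str,
               transcript: Optional[str] = None) -> bool:
    """Validate, then append: a voice needs a non-empty name and a sample.
    Returns False instead of saving an incomplete voice."""
    name = name.strip()
    if not name or not audio_path:
        return False
    voices.append(ClonedVoice(name, audio_path, transcript))
    return True
```

Keeping the transcript optional mirrors the flow in the text: it is added only when the model benefits from knowing what was spoken in the sample.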

The important part is control. A cloned voice is not a mystery setting buried behind a generic toggle. It is a saved voice the user can select, edit, rename, or remove.

Cleaner reference audio helps the voice setup

Voice cloning depends on the quality of the reference audio. A quiet room helps, but real users are not always in a quiet room. That is why On Device AI includes optional noisy-speech cleanup for user-provided audio in supported speech workflows.

For TTS, that cleanup applies to reference audio used during supported voice setup. It does not rewrite or post-process generated speech output. The scope is narrow on purpose: give the model a cleaner version of the user's sample when that path is available, while keeping the original recording untouched.

If cleanup fails or is unavailable, the app falls back to the original sample and keeps the user moving. No one should lose a voice setup because a quality pass had a bad day.
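That fallback is worth spelling out, because it is what makes the quality pass safe to offer at all. A minimal sketch, assuming a hypothetical cleanup hook standing in for the real denoiser:

```python
from typing import Callable, Optional

def reference_audio(original: bytes,
                    cleanup: Optional[Callable[[bytes], bytes]] = None) -> bytes:
    """Prefer the cleaned sample, but never let a failed or unavailable
    quality pass block voice setup: fall back to the untouched original."""
    if cleanup is None:          # cleanup unavailable on this path
        return original
    try:
        return cleanup(original)
    except Exception:            # cleanup failed: keep the user moving
        return original
```

The original bytes are never modified in place, which matches the promise in the text that the user's own recording stays untouched.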

RAM matters, so the app behaves like it matters

Local voice models are powerful, but they are not free. They need storage, memory, and time to prepare. On Device AI reduces the friction by showing suitable choices, loading voice models when they are needed, and avoiding unnecessary overlap between heavy voice tasks.

Users do not need the technical details to benefit from that work. On a smaller device, the app leans toward lighter paths. On a stronger Mac, it can expose more capable choices. If a model is not ready, the UI says so and gives the next step.

That is the difference between a demo and a daily tool. A daily tool needs to run on the device in front of you, not only on the developer's best machine.

Where TTS fits with the rest of On Device AI

The voice features are connected to the rest of the app. You can have AI responses read aloud, generate and save audio from text, use voice notes as source material, and ask AI to summarize or reshape transcripts before turning them into spoken output.

That loop is useful: record privately, transcribe locally, summarize with AI, then listen back or export speech. It feels less like a pile of tools and more like a voice workspace.

Try local speech generation

Read the Text-to-Speech documentation to compare voice engines, preview speech, and export audio. For the recording side, start with Voice Notes.