Quick Use
- Record a clean 10-30s audio sample of the target voice
- Call `client.voices.clone(clip=open('sample.wav','rb'), name='...', mode='stability')`
- Use the returned `voice['id']` in subsequent `client.tts.bytes(...)` calls
Intro
Cartesia's voice cloning creates a high-fidelity custom voice from a 5-30 second audio sample, preserving accent, timbre, and pacing. Cloned voices are saved to your account library, where they can be versioned and shared with team members. To guard against misuse, the platform requires a consent attestation before you clone a real person's voice. Best for: character voices in apps, branded customer support voices, and audiobook narration with custom narrators. Works with: REST upload and the Python/JS SDKs. Setup time: 5 minutes per voice.
Upload + clone a voice
```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# Upload the sample clip and create the cloned voice
with open("narrator-sample.wav", "rb") as f:
    voice = client.voices.clone(
        clip=f,
        name="Brand Narrator — Sarah",
        description="Warm mid-30s American female. Used for TokRepo product walkthrough videos.",
        mode="similarity",  # "similarity" (closer to source) | "stability" (more natural)
        enhance=True,  # auto-clean noise before training
    )

print(voice["id"])
```
Use the cloned voice
```python
# Reuses `client` and `voice` from the clone step
audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id=voice["id"],
    transcript="Welcome to TokRepo. Let's walk through what's new this week.",
    output_format={"container": "mp3"},
)
```
Voice library management
```python
# List all voices in your account
voices = client.voices.list()
for v in voices:
    print(v["id"], v["name"], v["is_owner"], v["is_starred"])

# Update metadata
client.voices.update(voice["id"], name="Brand Narrator — Sarah (v2)", description="...")

# Delete (cleanup unused)
client.voices.delete(voice["id"])
```
Best practices for source audio
| Aspect | Recommendation |
|---|---|
| Length | 10-30 seconds (under 10 → similarity drops; over 30 → no further gain) |
| Content | Cover varied prosody — questions, statements, exclamations |
| Background | Silent room or denoised ahead of time |
| Format | WAV 16-bit 24kHz+ (mp3 is OK but lossy artifacts can leak in) |
| Avoid | Music, multiple speakers in clip, heavy reverb, extreme audio compression |
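The length and format recommendations in the table can be checked locally before uploading. The sketch below is one possible pre-flight check using only Python's standard `wave` module. The function name `check_clip` and its default thresholds are illustrative choices taken from the table, not part of the Cartesia SDK, and the check cannot detect background noise, music, or multiple speakers.

```python
import wave


def check_clip(path, min_seconds=10.0, max_seconds=30.0, min_rate=24000):
    """Return a list of warnings for a WAV clip; an empty list means no issues found.

    Hypothetical helper: thresholds mirror the table above (10-30 s, 16-bit, 24 kHz+).
    """
    warnings = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        bits = wav.getsampwidth() * 8

    if duration < min_seconds:
        warnings.append(f"clip is {duration:.1f}s; under {min_seconds:.0f}s similarity may drop")
    elif duration > max_seconds:
        warnings.append(f"clip is {duration:.1f}s; over {max_seconds:.0f}s gives no further gain")
    if bits < 16:
        warnings.append(f"sample width is {bits}-bit; 16-bit is recommended")
    if rate < min_rate:
        warnings.append(f"sample rate is {rate} Hz; {min_rate} Hz or higher is recommended")
    return warnings
```

Calling `check_clip("narrator-sample.wav")` before `client.voices.clone(...)` gives a quick signal that a clip is too short or was recorded at a low sample rate. It only reads WAV files, so mp3 sources would need converting first.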
Consent and policy
Cartesia requires attestation that the source voice is yours OR you have written permission from the voice owner. The platform monitors for misuse — cloning public figures without consent is grounds for account termination. For commercial brand voices, document the talent release agreement with your legal team.
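On the application side, one way to keep the release paperwork tied to each clone is to refuse to call the clone endpoint unless a consent record exists. The sketch below is an assumption about how a team might structure this. `ConsentRecord` and `clone_with_consent` are hypothetical names, not Cartesia APIs, and the check does not replace the platform's own attestation step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of who was cloned and on what basis."""
    speaker_name: str
    is_own_voice: bool
    release_document: Optional[str] = None  # e.g. path or ID of the signed talent release

    def is_valid(self) -> bool:
        # Valid if the voice is your own, or a written release is on file.
        return self.is_own_voice or bool(self.release_document)


def clone_with_consent(consent: ConsentRecord, clone: Callable[[], object]) -> object:
    """Run the supplied clone callable only when the consent record is valid."""
    if not consent.is_valid():
        raise PermissionError(
            f"No written release on file for {consent.speaker_name}; refusing to clone."
        )
    return clone()
```

The `clone` argument is any zero-argument callable, for example a `lambda` wrapping the `client.voices.clone(...)` call from the upload step, so the gate stays independent of SDK details.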
FAQ
Q: similarity vs stability mode? A: Similarity stays closer to the source, which suits character voice work where resemblance to a specific speaker matters. Stability smooths out variation, which suits long-form narration where source artifacts would otherwise compound. Default to stability for production unless you specifically need source resemblance.
Q: Can I clone in a language different from the source? A: Yes — clones cross languages. A 10s English source clip can synthesize Spanish/French output retaining the speaker's vocal characteristics. Accent transfer accuracy varies; test on representative content.
Q: How big is my voice library quota? A: Free tier: 3 voices. Pro tier: 50. Scale tier: 500+. Cloned voices count toward the limit; pre-built voices do not. Delete unused voices to reclaim slots.
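The quota answer above can be turned into a small check over the output of `client.voices.list()`. The sketch below makes two assumptions that the source does not confirm: that `is_owner` being true marks a voice you cloned (and therefore one that counts toward the limit), and that unstarred voices are reasonable candidates for cleanup. The Scale tier is left out because its limit is only given as "500+".

```python
# Limits as quoted in the FAQ; the Scale tier is omitted because it is open-ended.
TIER_LIMITS = {"free": 3, "pro": 50}


def library_usage(voices, tier):
    """Count voices assumed to be user-owned clones against the tier limit."""
    limit = TIER_LIMITS[tier]
    used = sum(1 for v in voices if v.get("is_owner"))  # assumption: is_owner == counts toward quota
    return {"used": used, "limit": limit, "remaining": max(limit - used, 0)}


def deletion_candidates(voices):
    """Owned voices that are not starred; a heuristic, not a Cartesia rule."""
    return [v for v in voices if v.get("is_owner") and not v.get("is_starred")]
```

With the list returned by `client.voices.list()`, `library_usage(voices, "free")` reports how many slots remain, and each ID from `deletion_candidates(voices)` can be reviewed before passing it to `client.voices.delete(...)`.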
Source & Thanks
Built by Cartesia. Voice cloning docs at docs.cartesia.ai/voices/clone.