Cartesia Voice Cloning — Build a Voice Library from Audio

Quick Use

1. Record a clean 10-30 second audio sample of the target voice.
2. Clone it: client.voices.clone(clip=open('sample.wav','rb'), name='...', mode='stability')
3. Pass the returned voice['id'] as voice_id in subsequent client.tts.bytes(...) calls.

Intro

Cartesia's voice cloning creates a high-fidelity custom voice from a 5-30 second audio sample (10-30 seconds is recommended; see "Best practices for source audio"), preserving accent, timbre, and pacing. Voices are saved to your account library, where they can be versioned and shared across team members. Before you clone a real person's voice, the platform requires a consent attestation to protect against misuse.

Best for: character voices in apps, branded customer support voices, audiobook narration with custom narrators.
Works with: REST upload, Python/JS SDKs.
Setup time: 5 minutes per voice.

Upload + clone a voice

import os

from cartesia import Cartesia

# Read the API key from the environment rather than hard-coding it
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

with open("narrator-sample.wav", "rb") as f:
    voice = client.voices.clone(
        clip=f,
        name="Brand Narrator — Sarah",
        description="Warm mid-30s American female. Used for TokRepo product walkthrough videos.",
        mode="similarity",   # "similarity" (closer to source) | "stability" (more natural)
        enhance=True,        # auto-clean noise before training
    )

print(voice["id"])

Use the cloned voice

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id=voice["id"],
    transcript="Welcome to TokRepo. Let's walk through what's new this week.",
    output_format={"container": "mp3"},
)
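
To keep the generated speech, the audio has to be written to disk. The helper below is a minimal sketch, not part of the Cartesia SDK. It assumes that client.tts.bytes(...) returns either a single bytes object or an iterable of byte chunks, which may differ between SDK versions, and the name save_audio is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union


def save_audio(audio: Union[bytes, bytearray, Iterable[bytes]], path: str) -> int:
    """Write TTS output to `path` and return the number of bytes written.

    Assumption: `audio` is either one bytes object or an iterable of
    byte chunks; which one you get depends on the SDK version.
    """
    if isinstance(audio, (bytes, bytearray)):
        chunks = [bytes(audio)]  # wrap a single buffer so both cases share one loop
    else:
        chunks = audio

    written = 0
    with Path(path).open("wb") as out:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    return written
```

With the snippet above, this could be called as save_audio(audio, "welcome.mp3").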

Voice library management

# List all voices in your account
voices = client.voices.list()
for v in voices:
    print(v["id"], v["name"], v["is_owner"], v["is_starred"])

# Update metadata
client.voices.update(voice["id"], name="Brand Narrator — Sarah (v2)", description="...")

# Delete (cleanup unused)
client.voices.delete(voice["id"])

Best practices for source audio

Aspect	Recommendation
Length	10-30 seconds (under 10 → similarity drops; over 30 → no further gain)
Content	Cover varied prosody — questions, statements, exclamations
Background	Silent room or denoised ahead of time
Format	WAV 16-bit 24kHz+ (mp3 is OK but lossy artifacts can leak in)
Avoid	Music, multiple speakers in clip, heavy reverb, extreme audio compression
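
Some of these recommendations can be checked before upload. The sketch below uses Python's standard wave module to inspect a PCM WAV file against the length and format rows of the table. The thresholds come from the table; the function name check_source_wav is hypothetical, and the check cannot detect background noise, music, reverb, or multiple speakers.

```python
import wave
from typing import List

# Thresholds taken from the table above
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0
MIN_SAMPLE_RATE = 24_000   # Hz
REQUIRED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit


def check_source_wav(path: str) -> List[str]:
    """Return a list of warnings for a PCM WAV file; an empty list means it passes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"clip is {duration:.1f}s; under {MIN_SECONDS:.0f}s may reduce similarity")
    elif duration > MAX_SECONDS:
        warnings.append(f"clip is {duration:.1f}s; over {MAX_SECONDS:.0f}s gives no further gain")
    if width != REQUIRED_SAMPLE_WIDTH:
        warnings.append(f"sample width is {width * 8}-bit; 16-bit is recommended")
    if rate < MIN_SAMPLE_RATE:
        warnings.append(f"sample rate is {rate} Hz; 24 kHz or higher is recommended")
    return warnings
```

Running check_source_wav("narrator-sample.wav") before cloning gives a quick sanity check, but it does not replace listening to the clip.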

Consent and policy

Cartesia requires you to attest that the source voice is your own or that you have written permission from the voice owner. The platform monitors for misuse, and cloning public figures without consent is grounds for account termination. For commercial brand voices, document the talent release agreement with your legal team.

FAQ

Q: When should I use similarity vs. stability mode?
A: Similarity stays closer to the source, which suits character work that must closely match a specific voice (for example, a celebrity voice used with consent). Stability smooths out variation, which suits long-form narration where source artifacts would otherwise compound. Default to stability for production unless you specifically want close source resemblance.

Q: Can I clone in a language different from the source?
A: Yes, clones work across languages. A 10-second English source clip can synthesize Spanish or French output while retaining the speaker's vocal characteristics. Accent transfer accuracy varies, so test on representative content.

Q: How big is my voice library quota?
A: Free tier: 3 voices. Pro tier: 50. Scale tier: 500+. Cloned voices count toward the limit; pre-built voices do not. Delete unused voices to reclaim slots.
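
The quota arithmetic can be sketched as a small helper. This is an illustration under stated assumptions, not a Cartesia API: it assumes the items returned by client.voices.list() are dict-like, as in the examples on this page, and that is_owner marks the voices that count toward your limit. The name remaining_clone_slots is hypothetical, and the Scale tier's "500+" is treated as 500.

```python
from typing import Iterable, Mapping

# Limits as quoted in the FAQ; "500+" for the Scale tier is treated as 500 here.
TIER_LIMITS = {"free": 3, "pro": 50, "scale": 500}


def remaining_clone_slots(voices: Iterable[Mapping], tier: str) -> int:
    """Return how many more voices can be cloned on `tier` (never negative).

    Assumption: a voice counts toward the quota when its `is_owner`
    field is truthy; pre-built voices are expected to have it unset or False.
    """
    owned = sum(1 for v in voices if v.get("is_owner"))
    return max(TIER_LIMITS[tier] - owned, 0)
```

For example, remaining_clone_slots(client.voices.list(), "free") would report how many slots are left before older voices need to be deleted.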

Source & Thanks

Built by Cartesia. Voice cloning docs at docs.cartesia.ai/voices/clone.

cartesia-ai/cartesia-python
