Skills · May 11, 2026 · 4 min read

Cartesia Voice Cloning — Build a Voice Library from Audio

Cartesia voice cloning creates a custom voice from a 5-30 second audio sample. Upload a clip, save the voice, version it, and share it within your account. Consent attestation is built in.

Universal CLI install command
npx tokrepo install e5dd6c2d-fc3d-485a-842d-3338e266e5ed
Intro

Cartesia's voice cloning creates a high-fidelity custom voice from a 5-30 second audio sample, preserving accent, timbre, and pacing. Voices are saved to your account library, where they can be versioned and shared with team members. The platform requires a consent attestation before you clone a real person's voice, which guards against misuse.

Best for: character voices in apps, branded customer support voices, and audiobook narration with custom narrators. Works with: REST upload and the Python/JS SDKs. Setup time: about 5 minutes per voice.


Upload + clone a voice

import os

from cartesia import Cartesia

# Read the API key from the environment rather than hard-coding it
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

with open("narrator-sample.wav", "rb") as f:
    voice = client.voices.clone(
        clip=f,
        name="Brand Narrator — Sarah",
        description="Warm mid-30s American female. Used for TokRepo product walkthrough videos.",
        mode="similarity",   # "similarity" (closer to source) | "stability" (more natural)
        enhance=True,        # auto-clean noise before training
    )

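# Depending on the SDK release, the result may be a model object rather than a dict;
# in that case read the ID as voice.id instead.
print(voice["id"])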

Use the cloned voice

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id=voice["id"],
    transcript="Welcome to TokRepo. Let's walk through what's new this week.",
    output_format={"container": "mp3"},
)
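
The call above leaves the audio in memory. A minimal sketch of saving it to disk follows. It assumes the result is either a single bytes object or an iterable of byte chunks, since this can differ between SDK releases. The helper name write_audio is hypothetical and not part of the Cartesia SDK.

```python
from pathlib import Path
from typing import Iterable, Union


def write_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write TTS output to `path` and return the number of bytes written.

    Hypothetical helper: accepts either one bytes object or an iterable
    of byte chunks, because the return shape may vary by SDK release.
    """
    out = Path(path)
    if isinstance(audio, (bytes, bytearray)):
        out.write_bytes(audio)
        return len(audio)

    total = 0
    with out.open("wb") as f:
        for chunk in audio:
            f.write(chunk)
            total += len(chunk)
    return total
```

Usage would look like write_audio(audio, "welcome.mp3"), with the file extension matching the container requested in output_format.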

Voice library management

# List all voices in your account
voices = client.voices.list()
for v in voices:
    print(v["id"], v["name"], v["is_owner"], v["is_starred"])

# Update metadata
client.voices.update(voice["id"], name="Brand Narrator — Sarah (v2)", description="...")

# Delete (cleanup unused)
client.voices.delete(voice["id"])
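
If versions are tracked through a "(vN)" suffix in the voice name, as in the update example above, superseded versions can be found before deleting anything. The sketch below assumes that naming convention and assumes each voice is a dict with "name" and "is_owner" keys, as in the listing loop. The function names are hypothetical, and the code makes no API calls.

```python
import re
from typing import Dict, List, Tuple

# Matches names such as "Brand Narrator (v2)"; the "(vN)" suffix is an
# assumed naming convention, not something the platform enforces.
_VERSION = re.compile(r"^(?P<base>.*?)\s*\(v(?P<n>\d+)\)$")


def split_version(name: str) -> Tuple[str, int]:
    """Return (base name, version number); unsuffixed names count as v1."""
    m = _VERSION.match(name)
    if m:
        return m.group("base"), int(m.group("n"))
    return name, 1


def stale_versions(voices: List[dict]) -> List[dict]:
    """Return owned voices that a higher-numbered version has superseded."""
    latest: Dict[str, int] = {}
    owned = [v for v in voices if v.get("is_owner")]
    for v in owned:
        base, n = split_version(v["name"])
        latest[base] = max(latest.get(base, 0), n)

    stale = []
    for v in owned:
        base, n = split_version(v["name"])
        if n < latest[base]:
            stale.append(v)
    return stale
```

Review the returned list by hand before passing any ID to client.voices.delete, since deletion is not something this sketch can undo.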

Best practices for source audio

Aspect      Recommendation
Length      10-30 seconds (under 10 → similarity drops; over 30 → no further gain)
Content     Varied prosody: questions, statements, exclamations
Background  Silent room, or denoised ahead of time
Format      WAV, 16-bit, 24 kHz or higher (MP3 works, but lossy artifacts can leak into the clone)
Avoid       Music, multiple speakers in the clip, heavy reverb, extreme audio compression
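
The length and format recommendations above can be checked locally before uploading. This is a minimal sketch using Python's standard wave module. The function name check_source_clip is hypothetical, the thresholds are taken from the table, and content issues such as music or reverb are out of its reach.

```python
import wave
from typing import List


def check_source_clip(path: str) -> List[str]:
    """Return a list of problems found in a WAV clip; empty means it passed.

    Hypothetical pre-upload check based on the recommendations above:
    10-30 seconds, at least 16-bit samples, at least 24 kHz.
    """
    problems: List[str] = []
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
        if duration < 10:
            problems.append(f"clip is {duration:.1f}s; under 10s similarity drops")
        elif duration > 30:
            problems.append(f"clip is {duration:.1f}s; over 30s adds no further gain")
        if w.getsampwidth() < 2:  # sample width is reported in bytes
            problems.append("sample width is below 16-bit")
        if rate < 24000:
            problems.append(f"sample rate is {rate} Hz; 24000 Hz or higher recommended")
    return problems
```

The wave module only reads uncompressed WAV files, so an MP3 source would need converting first.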

Consent and policy

Cartesia requires you to attest that the source voice is your own or that you have written permission from the voice owner. The platform monitors for misuse, and cloning public figures without consent is grounds for account termination. For commercial brand voices, document the talent release agreement with your legal team.


FAQ

Q: Similarity vs. stability mode?
A: Similarity stays closer to the source, which suits celebrity-voice character work. Stability smooths out variation, which suits long-form narration where source artifacts would compound. Default to stability for production unless you specifically want close resemblance to the source.

Q: Can I clone in a language different from the source?
A: Yes, clones work across languages. A 10-second English source clip can synthesize Spanish or French output while retaining the speaker's vocal characteristics. Accent transfer accuracy varies, so test on representative content.

Q: How big is my voice library quota?
A: Free tier: 3 voices. Pro tier: 50. Scale tier: 500+. Cloned voices count toward the limit; pre-built voices do not. Delete unused voices to reclaim slots.
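
The quota arithmetic can be sketched from a voice listing. The code assumes the tier limits quoted above (treating "500+" as 500) and assumes that is_owner marks the cloned voices that count toward the limit, which the listing example suggests but does not confirm. TIER_LIMITS and remaining_slots are hypothetical names.

```python
from typing import Dict, List

# Limits as quoted in the FAQ above; "500+" is treated as a floor of 500.
TIER_LIMITS: Dict[str, int] = {"free": 3, "pro": 50, "scale": 500}


def remaining_slots(voices: List[dict], tier: str) -> int:
    """Return how many more voices can be cloned on the given tier.

    Assumes voices with is_owner set are the ones counted against the
    quota, and that pre-built voices have is_owner unset or False.
    """
    used = sum(1 for v in voices if v.get("is_owner"))
    return max(TIER_LIMITS[tier] - used, 0)
```

For example, an account on the free tier that owns two cloned voices would have one slot left.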


Quick Use

  1. Record a clean 10-30 second audio sample of the target voice
  2. client.voices.clone(clip=open('sample.wav','rb'), name='...', mode='stability')
  3. Use the returned voice['id'] in subsequent client.tts.bytes(...) calls

Source & Thanks

Built by Cartesia. Voice cloning docs at docs.cartesia.ai/voices/clone.

cartesia-ai/cartesia-python
