Skills · May 11, 2026 · 4 min read

Cartesia Voice Cloning — Build a Voice Library from Audio

Cartesia voice cloning creates a custom voice from a 5-30 second audio sample. Upload, save, version, and share voices within your account, with consent attestation built in.

Cartesia · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next steps.

Status: Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset

Universal CLI command
npx tokrepo install e5dd6c2d-fc3d-485a-842d-3338e266e5ed
Introduction

Cartesia's voice cloning creates a high-fidelity custom voice from a 5-30 second audio sample, preserving accent, timbre, and pacing. Voices are saved to your account library, where they can be versioned and shared across team members. The platform requires consent attestation before you clone a real person's voice, which protects against misuse.

Best for: character voices in apps, branded customer support voices, audiobook narration with custom narrators.
Works with: REST upload, Python/JS SDKs.
Setup time: 5 minutes per voice.


Upload + clone a voice

import os

from cartesia import Cartesia

# Read the API key from the environment rather than hard-coding it
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

with open("narrator-sample.wav", "rb") as f:
    voice = client.voices.clone(
        clip=f,
        name="Brand Narrator — Sarah",
        description="Warm mid-30s American female. Used for TokRepo product walkthrough videos.",
        mode="similarity",   # "similarity" (closer to source) | "stability" (more natural)
        enhance=True,        # auto-clean noise before training
    )

print(voice["id"])

Use the cloned voice

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id=voice["id"],
    transcript="Welcome to TokRepo. Let's walk through what's new this week.",
    output_format={"container": "mp3"},
)
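The call above leaves the synthesized audio in memory. Here is a minimal sketch for writing it to disk, assuming `client.tts.bytes` returns either a single `bytes` object or an iterable of byte chunks (the shape may differ between SDK versions, so check yours). The helper name `save_audio` is hypothetical and not part of the Cartesia SDK.

```python
from pathlib import Path
from typing import Iterable, Union


def save_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write TTS output to `path` and return the number of bytes written.

    Hypothetical helper: accepts one bytes object or an iterable of byte
    chunks, because the return shape is an assumption here.
    """
    # Normalize a single bytes object into a one-element list of chunks
    chunks = [audio] if isinstance(audio, (bytes, bytearray)) else audio
    written = 0
    with Path(path).open("wb") as out:
        for chunk in chunks:
            written += out.write(chunk)
    return written
```

A call such as `save_audio(audio, "welcome.mp3")` would then persist the clip generated above.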

Voice library management

# List all voices in your account
voices = client.voices.list()
for v in voices:
    print(v["id"], v["name"], v["is_owner"], v["is_starred"])

# Update metadata
client.voices.update(voice["id"], name="Brand Narrator — Sarah (v2)", description="...")

# Delete (cleanup unused)
client.voices.delete(voice["id"])
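Before deleting anything, it can help to narrow the list to voices that are safe to remove. The sketch below assumes each entry is dict-like and carries the `id`, `name`, `is_owner`, and `is_starred` fields printed in the loop above. The function name `deletion_candidates` and the `keep_prefix` convention are hypothetical.

```python
from typing import Iterable, List, Mapping


def deletion_candidates(voices: Iterable[Mapping], keep_prefix: str = "Brand") -> List[str]:
    """Return IDs of owned, unstarred voices whose name does not start with `keep_prefix`."""
    return [
        v["id"]
        for v in voices
        if v["is_owner"]                              # never touch voices you do not own
        and not v["is_starred"]                       # starred voices are treated as in use
        and not v["name"].startswith(keep_prefix)     # hypothetical naming convention
    ]
```

Review the returned IDs by hand, then pass each one to `client.voices.delete(...)`.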

Best practices for source audio

| Aspect | Recommendation |
|---|---|
| Length | 10-30 seconds (under 10 s, similarity drops; over 30 s, no further gain) |
| Content | Cover varied prosody: questions, statements, exclamations |
| Background | Silent room, or denoised ahead of time |
| Format | WAV, 16-bit, 24 kHz or higher (MP3 is OK, but lossy artifacts can leak in) |
| Avoid | Music, multiple speakers in the clip, heavy reverb, extreme audio compression |
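The measurable parts of these recommendations (length, bit depth, sample rate) can be checked locally before uploading. This is a minimal sketch using Python's standard `wave` module, so it only reads uncompressed PCM WAV files; it cannot detect music, extra speakers, or reverb. The function name `check_sample` and its default thresholds are illustrative, with the thresholds taken from the recommendations above.

```python
import wave
from typing import List


def check_sample(path: str, min_s: float = 10.0, max_s: float = 30.0,
                 min_rate: int = 24_000) -> List[str]:
    """Return a list of warnings for a PCM WAV file; an empty list means it passes."""
    problems = []
    with wave.open(path, "rb") as w:
        # Duration in seconds = total frames / frames per second
        duration = w.getnframes() / w.getframerate()
        if not min_s <= duration <= max_s:
            problems.append(f"duration {duration:.1f}s outside {min_s:.0f}-{max_s:.0f}s")
        # Sample width is reported in bytes: 2 bytes = 16-bit
        if w.getsampwidth() < 2:
            problems.append("sample width below 16-bit")
        if w.getframerate() < min_rate:
            problems.append(f"sample rate {w.getframerate()} Hz below {min_rate} Hz")
    return problems
```

Running `check_sample("narrator-sample.wav")` before the upload step gives a quick pass or fail without spending a clone slot.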

Consent and policy

Cartesia requires you to attest that the source voice is your own or that you have written permission from the voice owner. The platform monitors for misuse, and cloning public figures without consent is grounds for account termination. For commercial brand voices, document the talent release agreement with your legal team.


FAQ

Q: Similarity vs. stability mode? A: Similarity stays closer to the source, which suits character work modeled on a specific voice. Stability smooths out variation, which suits long-form narration where source artifacts would otherwise compound. Default to stability for production unless you specifically want source resemblance.

Q: Can I clone in a language different from the source? A: Yes, clones work across languages. A 10-second English source clip can synthesize Spanish or French output that retains the speaker's vocal characteristics. Accent transfer accuracy varies, so test on representative content.
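One way to run that test is to generate the same kind of line in each target language and listen to the results side by side. The sketch below only builds the request arguments, reusing the parameters from the TTS call earlier on this page. The `language` key is an assumption and is not confirmed by this guide, so check the parameter name in the SDK docs. The function name `cross_language_requests` is hypothetical.

```python
from typing import Dict, List


def cross_language_requests(voice_id: str, transcripts: Dict[str, str]) -> List[dict]:
    """Build one TTS request per language for a side-by-side listening test."""
    return [
        {
            "model_id": "sonic-2",
            "voice_id": voice_id,
            "transcript": text,
            "language": lang,  # assumed parameter name; verify against the SDK
            "output_format": {"container": "mp3"},
        }
        for lang, text in transcripts.items()
    ]
```

Each dict could then be passed along as `client.tts.bytes(**request)`, provided the SDK accepts those keyword names.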

Q: How big is my voice library quota? A: Free tier: 3 voices. Pro tier: 50. Scale tier: 500+. Cloned voices count toward the limit; pre-built voices do not. Delete unused voices to reclaim slots.
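To see how close an account is to its limit, count the voices it owns against the tier figure quoted above. This sketch assumes that `is_owner` marks exactly the cloned voices that count toward the quota, which this guide does not confirm. The tier numbers are copied from the answer above, with "500+" treated as a floor of 500. The names `TIER_LIMITS` and `remaining_slots` are hypothetical.

```python
from typing import Iterable, Mapping

# Limits as quoted in the FAQ answer above; "500+" is treated as 500.
TIER_LIMITS = {"free": 3, "pro": 50, "scale": 500}


def remaining_slots(voices: Iterable[Mapping], tier: str) -> int:
    """Slots left on `tier`, counting only owned voices (assumed to be the clones)."""
    used = sum(1 for v in voices if v["is_owner"])
    # Clamp at zero so an over-quota account does not report negative slots
    return max(TIER_LIMITS[tier] - used, 0)
```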


Quick Use

  1. Record a clean 10-30 second audio sample of the target voice
  2. client.voices.clone(clip=open('sample.wav','rb'), name='...', mode='stability')
  3. Use the returned voice['id'] in subsequent client.tts.bytes(...) calls
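The steps above can be sketched as a single function. It takes the client as an argument and uses only the calls and parameter names shown on this page, so it inherits their assumptions, including that the clone result supports `voice["id"]`. The function name `clone_and_speak` is hypothetical.

```python
def clone_and_speak(client, sample_path: str, name: str, text: str):
    """Clone a voice from `sample_path`, then synthesize `text` with it.

    `client` is any object exposing voices.clone and tts.bytes with the
    signatures used in this guide.
    """
    # The context manager closes the file even if the clone call fails
    with open(sample_path, "rb") as clip:
        voice = client.voices.clone(clip=clip, name=name, mode="stability")
    return client.tts.bytes(
        model_id="sonic-2",
        voice_id=voice["id"],
        transcript=text,
        output_format={"container": "mp3"},
    )
```

Passing the client in rather than constructing it inside keeps the function easy to exercise with a stub before any API credits are spent.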


Source & Thanks

Built by Cartesia. Voice cloning docs at docs.cartesia.ai/voices/clone.

cartesia-ai/cartesia-python
