Skills · May 11, 2026 · 4 min read

Cartesia Voice Cloning — Build a Voice Library from Audio

Cartesia voice cloning creates a custom voice from a 5-30 second sample. Upload, save, version, share within your account. Consent built in.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so that agents can assess compatibility, risk, and next steps.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry: Asset
Universal CLI command:
npx tokrepo install e5dd6c2d-fc3d-485a-842d-3338e266e5ed

Introduction

Cartesia's voice cloning creates a high-fidelity custom voice from a 5-30 second audio sample, preserving accent, timbre, and pacing. Voices are saved to your account library, where they can be versioned and shared with team members. Before cloning a real person's voice, the platform requires a consent attestation to guard against misuse. Best for: character voices in apps, branded customer support voices, and audiobook narration with custom narrators. Works with: REST upload and the Python/JS SDKs. Setup time: about 5 minutes per voice.


Upload + clone a voice

import os

from cartesia import Cartesia

# Read the API key from the environment rather than hard-coding it
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

with open("narrator-sample.wav", "rb") as f:
    voice = client.voices.clone(
        clip=f,
        name="Brand Narrator — Sarah",
        description="Warm mid-30s American female. Used for TokRepo product walkthrough videos.",
        mode="similarity",   # "similarity" (closer to source) | "stability" (more natural)
        enhance=True,        # auto-clean noise before training
    )

print(voice["id"])

Use the cloned voice

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id=voice["id"],
    transcript="Welcome to TokRepo. Let's walk through what's new this week.",
    output_format={"container": "mp3"},
)
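
The call above leaves the synthesized audio in memory. The following is a minimal sketch for writing it to disk. `save_audio` is a hypothetical helper, not part of the Cartesia SDK. It accepts either a single bytes object or an iterable of byte chunks, because the exact return type of the TTS call is an assumption here and may differ between SDK versions.

```python
from pathlib import Path
from typing import Iterable, Union


def save_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write TTS output to `path` and return the number of bytes written.

    Hypothetical helper: handles both a bytes object and an iterable of
    byte chunks, since the SDK's return type is assumed, not confirmed.
    """
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
    else:
        # Concatenate streamed chunks into one buffer
        data = b"".join(audio)
    Path(path).write_bytes(data)
    return len(data)
```

With the `audio` value from the call above, this would be used as `save_audio(audio, "welcome.mp3")`.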

Voice library management

# List all voices in your account
voices = client.voices.list()
for v in voices:
    print(v["id"], v["name"], v["is_owner"], v["is_starred"])

# Update metadata
client.voices.update(voice["id"], name="Brand Narrator — Sarah (v2)", description="...")

# Delete (cleanup unused)
client.voices.delete(voice["id"])
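
Before deleting, it can help to narrow the listing to voices that are safe to review. The sketch below assumes each voice is dict-like with the `is_owner` and `is_starred` fields printed in the listing above. `owned_unstarred` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict, Iterable, List


def owned_unstarred(voices: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return voices you own that are not starred.

    These are candidates to review before deletion; nothing is deleted here.
    Field names are taken from the listing example and are assumptions
    about the response shape.
    """
    return [v for v in voices if v.get("is_owner") and not v.get("is_starred")]
```

The result could then be inspected by hand before any IDs are passed to `client.voices.delete(...)`.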

Best practices for source audio

| Aspect | Recommendation |
|---|---|
| Length | 10-30 seconds (under 10 s, similarity drops; over 30 s, no further gain) |
| Content | Cover varied prosody: questions, statements, exclamations |
| Background | Silent room, or denoised ahead of time |
| Format | WAV, 16-bit, 24 kHz or higher (MP3 is OK, but lossy artifacts can leak in) |
| Avoid | Music, multiple speakers in the clip, heavy reverb, extreme audio compression |
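
The length and format recommendations above can be checked locally before uploading. This is a minimal sketch using only Python's standard `wave` module. `check_sample` and its default thresholds are hypothetical, chosen to mirror the recommendations in this section, and the check only covers WAV files. It cannot detect music, reverb, or multiple speakers.

```python
import wave
from typing import List


def check_sample(path: str, min_seconds: float = 10.0,
                 max_seconds: float = 30.0, min_rate_hz: int = 24000) -> List[str]:
    """Return a list of problems found in a WAV clip; empty means it passes.

    Thresholds are assumptions mirroring the recommendations above,
    not limits enforced by the Cartesia API.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
        bits = wav.getsampwidth() * 8

    problems = []
    if duration < min_seconds:
        problems.append(f"clip is {duration:.1f}s, shorter than {min_seconds:.0f}s")
    if duration > max_seconds:
        problems.append(f"clip is {duration:.1f}s, longer than {max_seconds:.0f}s")
    if rate < min_rate_hz:
        problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if bits < 16:
        problems.append(f"bit depth {bits} is below 16")
    return problems
```

Calling `check_sample("narrator-sample.wav")` before the clone step gives a quick list of issues to fix in the recording.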

Consent and policy

Cartesia requires you to attest that the source voice is your own or that you have written permission from the voice owner. The platform monitors for misuse, and cloning public figures without consent is grounds for account termination. For commercial brand voices, document the talent release agreement with your legal team.


FAQ

Q: When should I use similarity mode versus stability mode?
A: Similarity stays closer to the source, which suits character work that must closely resemble a specific voice. Stability smooths out variation, which suits long-form narration where source artifacts would otherwise compound. Default to stability for production unless you specifically want source resemblance.

Q: Can I synthesize in a language different from the source clip?
A: Yes, clones work across languages. A 10-second English source clip can synthesize Spanish or French output while retaining the speaker's vocal characteristics. Accent transfer accuracy varies, so test on representative content.

Q: How big is my voice library quota?
A: Free tier: 3 voices. Pro tier: 50. Scale tier: 500+. Cloned voices count toward the limit; pre-built voices do not. Delete unused voices to reclaim slots.
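
The quota figures above can be turned into a quick remaining-slots check. This is a sketch under two assumptions: that `is_owner` in the voice listing marks the cloned voices that count toward the limit, and that the Scale tier is treated as exactly 500 even though the limit is stated as "500+". `TIER_LIMITS` and `remaining_slots` are hypothetical names, not part of the SDK.

```python
from typing import Any, Dict, Iterable

# Limits as stated in the FAQ; "500+" for Scale is treated as a floor of 500
TIER_LIMITS = {"free": 3, "pro": 50, "scale": 500}


def remaining_slots(voices: Iterable[Dict[str, Any]], tier: str) -> int:
    """Estimate how many more voices can be cloned on the given tier.

    Assumes voices with is_owner set are the ones counted against the
    quota; pre-built voices are assumed to have is_owner unset.
    """
    used = sum(1 for v in voices if v.get("is_owner"))
    return max(TIER_LIMITS[tier] - used, 0)
```

For example, `remaining_slots(client.voices.list(), "free")` would estimate the free-tier headroom, provided the listing yields dict-like items as in the examples above.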


Quick Use

  1. Record a clean 10-30 second audio sample of the target voice
  2. client.voices.clone(clip=open('sample.wav','rb'), name='...', mode='stability')
  3. Use returned voice['id'] in subsequent client.tts.bytes(...) calls

Source & Thanks

Built by Cartesia. Voice cloning docs at docs.cartesia.ai/voices/clone.

cartesia-ai/cartesia-python
