Workflows · May 7, 2026 · 3 min read

Pinecone Inference — Hosted Embeddings & Reranking API

Pinecone Inference is a managed embedding + reranking endpoint. Use llama-text-embed-v2 or other models without managing GPU infrastructure.

Pinecone · Community

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Score: Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command:
npx tokrepo install 42928dd4-41ee-451c-93ff-8122c6e90af7
Introduction

Pinecone Inference is the hosted embedding and reranking layer that complements Pinecone's vector index. Generate embeddings with llama-text-embed-v2, multilingual-e5, or pluggable third-party models without running your own GPU; the reranking endpoint scores candidate documents with bge-reranker-v2-m3 for higher RAG accuracy. Best for: anyone using Pinecone who'd rather not run an embedding service. Works with: the Pinecone Python/TypeScript SDKs and the REST API. Setup time: 2 minutes.


Generate embeddings

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone is a managed vector database",
        "Weaviate is also a vector database",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

# Use embeddings directly with a Pinecone index
index = pc.Index("my-index")
index.upsert(vectors=[
    {"id": "doc1", "values": embeddings[0].values, "metadata": {"text": "..."}},
    {"id": "doc2", "values": embeddings[1].values, "metadata": {"text": "..."}},
])

Embed a query, then search the index

# Embed the query
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["What is a managed vector database?"],
    parameters={"input_type": "query"},
)

# Search the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)

Rerank candidate documents

reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is a managed vector database?",
    documents=[r.metadata["text"] for r in results.matches],
    top_n=5,
    return_documents=True,
)

# reranked.data contains the top 5 most relevant, scored 0-1
for r in reranked.data:
    print(r.score, r.document.text)

Why use Inference vs running your own embedding service

  • No GPU to manage: Pinecone hosts and scales the models
  • Same SDK and API key as the index (no separate auth or billing)
  • Inference is included in Pinecone Standard and Enterprise plans
  • Latency is optimized for use alongside Pinecone's index (same network)

FAQ

Q: Is Pinecone Inference free? A: There's a free tier (2K embeddings/month). Beyond that, it's pay-as-you-go, billed as part of Pinecone's Standard plan: free for testing, and costs scale with your index usage.

Q: Which models are available? A: llama-text-embed-v2 (1024-dim dense), multilingual-e5-large, pinecone-sparse-english-v0 (sparse), and bge-reranker-v2-m3 (reranking). Pinecone adds models periodically; check their docs for the current list.
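If you'd rather check programmatically, recent versions of the Pinecone Python SDK expose a model-listing call. This is a hedged sketch: the method name and response shape may differ across SDK versions, so verify against the Inference docs before relying on it.

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Assumed API: list_models() appears in newer SDK releases; confirm it
# exists in your installed version.
for model in pc.inference.list_models():
    print(model)  # each entry describes a hosted model (name, type, dimension, ...)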

Q: Can I use Pinecone Inference without Pinecone the index? A: Yes — Inference is a separate API. Generate embeddings, store them anywhere (Postgres pgvector, your own DB). The bundled use case (embed + index in one Pinecone account) is just convenient.
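As a sketch of that standalone pattern, the snippet below embeds with Pinecone Inference and stores the vectors in Postgres via pgvector. The connection string and the docs table are hypothetical, and it assumes the pgvector extension is installed with the embedding column declared as vector(1024) to match llama-text-embed-v2.

import os

import psycopg  # pip install "psycopg[binary]"
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

docs = ["Pinecone is a managed vector database"]
emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=docs,
    parameters={"input_type": "passage"},
)

# Hypothetical schema: CREATE TABLE docs (id serial PRIMARY KEY,
#                                         text text, embedding vector(1024));
with psycopg.connect("postgresql://localhost/mydb") as conn:
    with conn.cursor() as cur:
        for i, text in enumerate(docs):
            # pgvector accepts the text form "[x1,x2,...]"; the cast makes it explicit
            vec = "[" + ",".join(str(x) for x in emb[i].values) + "]"
            cur.execute(
                "INSERT INTO docs (text, embedding) VALUES (%s, %s::vector)",
                (text, vec),
            )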


Quick Use

  1. pip install "pinecone>=5.0"
  2. pc = Pinecone(api_key=...); call pc.inference.embed(model=..., inputs=[...])
  3. For RAG, follow with pc.inference.rerank(...) on the top candidates (see the end-to-end sketch below)
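Putting those steps together, here is a minimal end-to-end retrieval sketch using the same calls shown above. It assumes an existing index named my-index whose records carry a text metadata field.

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")  # assumed to exist, populated as in the examples above

question = "What is a managed vector database?"

# 1. Embed the question (input_type="query" for search-side embeddings)
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[question],
    parameters={"input_type": "query"},
)

# 2. Retrieve candidates from the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)

# 3. Rerank the candidates down to the best few
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=question,
    documents=[m.metadata["text"] for m in results.matches],
    top_n=5,
    return_documents=True,
)

context = [r.document.text for r in reranked.data]

Retrieving more candidates than you ultimately keep (top_k larger than top_n) gives the reranker room to promote the truly relevant passages.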

Source & Thanks

Built by Pinecone. Commercial product with a free tier.

docs.pinecone.io/inference — Inference docs


