Workflows · May 7, 2026 · 3 min read

Pinecone Inference — Hosted Embeddings & Reranking API

Pinecone Inference is a managed embedding + reranking endpoint. Use llama-text-embed-v2 or other models without managing GPU infrastructure.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, metadata JSON, an adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Stage only
Trust: New
Entrypoint: Asset
Universal CLI install command
npx tokrepo install 42928dd4-41ee-451c-93ff-8122c6e90af7
Intro

Pinecone Inference is the hosted embedding + reranking layer that complements Pinecone's vector index. Generate embeddings with llama-text-embed-v2, multilingual-e5, or pluggable third-party models without running your own GPU; the reranking endpoint scores candidate documents with bge-reranker for higher RAG accuracy. Best for: anyone using Pinecone who'd rather not run an embedding service. Works with: Pinecone Python / TypeScript SDK, REST API. Setup time: 2 minutes.


Generate embeddings

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone is a managed vector database",
        "Weaviate is also a vector database",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

# Use embeddings directly with a Pinecone index
index = pc.Index("my-index")
index.upsert(vectors=[
    {"id": "doc1", "values": embeddings[0].values, "metadata": {"text": "..."}},
    {"id": "doc2", "values": embeddings[1].values, "metadata": {"text": "..."}},
])

Embed the query, then search the index

# Embed the query
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["What is a managed vector database?"],
    parameters={"input_type": "query"},
)

# Search the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)

Rerank candidate documents

reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is a managed vector database?",
    documents=[r.metadata["text"] for r in results.matches],
    top_n=5,
    return_documents=True,
)

# reranked.data contains the top 5 most relevant, scored 0-1
for r in reranked.data:
    print(r.score, r.document.text)

Why use Inference instead of running your own embedding service

  • No GPU to manage — Pinecone hosts and scales the model
  • Same SDK as the index (no separate auth or billing)
  • Inference is included in Pinecone Standard / Enterprise plans
  • Latency optimized for use with Pinecone's index (same network)

FAQ

Q: Is Pinecone Inference free? A: There's a free tier (2K embeddings/month). Beyond that it's pay-as-you-go, bundled into Pinecone's Standard plan. Free for testing; costs scale with your index usage.

Q: Which models are available? A: llama-text-embed-v2 (1024-dim), multilingual-e5-large, pinecone-sparse-english-v0, bge-reranker-v2-m3 (reranking). Pinecone adds models periodically — check their docs for the current list.
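
For keyword-style or hybrid retrieval, the sparse model returns index/value pairs instead of a dense vector. A minimal sketch, assuming the response items expose sparse_indices and sparse_values as shown in Pinecone's docs (the sparse index name here is hypothetical):

# Hedged sketch: assumes sparse embeddings carry .sparse_indices / .sparse_values
sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["Pinecone is a managed vector database"],
    parameters={"input_type": "passage"},
)

# Upsert into a sparse index (the name is illustrative)
sparse_index = pc.Index("my-sparse-index")
sparse_index.upsert(vectors=[{
    "id": "doc1",
    "sparse_values": {
        "indices": sparse[0].sparse_indices,
        "values": sparse[0].sparse_values,
    },
}])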

Q: Can I use Pinecone Inference without the Pinecone index? A: Yes — Inference is a separate API. Generate embeddings, store them anywhere (Postgres pgvector, your own DB). The bundled use case (embed + index in one Pinecone account) is just a convenience.
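
A minimal standalone sketch, assuming psycopg (v3) and pgvector; the docs table with a vector(1024) column and the DATABASE_URL variable are hypothetical:

import os

import psycopg
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Embed with Pinecone Inference, store in Postgres instead of a Pinecone index
emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["Pinecone Inference works without the Pinecone index"],
    parameters={"input_type": "passage"},
)

# pgvector accepts a '[v1,v2,...]' text literal
vec_literal = "[" + ",".join(str(v) for v in emb[0].values) + "]"

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    # Assumes: CREATE TABLE docs (id text PRIMARY KEY, embedding vector(1024));
    conn.execute(
        "INSERT INTO docs (id, embedding) VALUES (%s, %s::vector)",
        ("doc1", vec_literal),
    )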


Quick Use

  1. pip install pinecone (≥5.0)
  2. pc = Pinecone(api_key=...); call pc.inference.embed(model=..., inputs=[...])
  3. For RAG, follow with pc.inference.rerank(...) on the top candidates (see the end-to-end sketch below)
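
Putting the three steps together, a compact retrieval helper built from the calls above (the index name and cutoffs are illustrative):

def retrieve(pc, index, question, top_k=10, top_n=5):
    # 1. Embed the question as a query vector
    q = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[question],
        parameters={"input_type": "query"},
    )
    # 2. Pull top_k candidates from the index
    hits = index.query(vector=q[0].values, top_k=top_k, include_metadata=True)
    # 3. Rerank candidates and keep the top_n
    reranked = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query=question,
        documents=[m.metadata["text"] for m in hits.matches],
        top_n=top_n,
        return_documents=True,
    )
    return [(r.score, r.document.text) for r in reranked.data]

# Usage (assumes the "my-index" index from above):
# results = retrieve(pc, pc.Index("my-index"), "What is a managed vector database?")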

Source & Thanks

Built by Pinecone. Commercial product with free tier.

docs.pinecone.io/inference — Inference docs

🙏
