Quick Use
- pip install "pinecone>=5.0"
- pc = Pinecone(api_key=...); call pc.inference.embed(model=..., inputs=[...])
- For RAG, follow with pc.inference.rerank(...) on the top candidates
Intro
Pinecone Inference is the hosted embedding and reranking layer that complements Pinecone's vector index. Generate embeddings with llama-text-embed-v2, multilingual-e5, or pluggable third-party models without running your own GPU. The reranking endpoint scores candidate documents with bge-reranker to improve RAG accuracy. Best for: anyone using Pinecone who'd rather not run an embedding service. Works with: Pinecone Python / TypeScript SDK, REST API. Setup time: 2 minutes.
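The same embed call is available over plain REST for non-SDK stacks. A minimal sketch using requests; the endpoint path and the X-Pinecone-API-Version value shown here are assumptions to verify against the current docs:
import os
import requests

resp = requests.post(
    "https://api.pinecone.io/embed",  # assumed endpoint; check the docs
    headers={
        "Api-Key": os.environ["PINECONE_API_KEY"],
        "Content-Type": "application/json",
        "X-Pinecone-API-Version": "2024-10",  # assumed version string
    },
    json={
        "model": "llama-text-embed-v2",
        "parameters": {"input_type": "passage", "truncate": "END"},
        "inputs": [{"text": "Pinecone is a managed vector database"}],
    },
)
resp.raise_for_status()
print(resp.json()["data"][0]["values"][:4])  # first few dims of the embedding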
Generate embeddings
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone is a managed vector database",
        "Weaviate is also a vector database",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)
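Embedding requests are capped at a fixed number of inputs per call (96 for many models at the time of writing; check the docs for your model). A minimal batching sketch under that assumption; embed_in_batches is a hypothetical helper, not part of the SDK:
def embed_in_batches(pc, texts, batch_size=96):
    # Assumed per-request cap of 96 inputs; verify for your model.
    out = []
    for i in range(0, len(texts), batch_size):
        out.extend(pc.inference.embed(
            model="llama-text-embed-v2",
            inputs=texts[i:i + batch_size],
            parameters={"input_type": "passage", "truncate": "END"},
        ))
    return out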
# Use embeddings directly with a Pinecone index
index = pc.Index("my-index")
index.upsert(vectors=[
    {"id": "doc1", "values": embeddings[0].values, "metadata": {"text": "..."}},
    {"id": "doc2", "values": embeddings[1].values, "metadata": {"text": "..."}},
])
Embed a query, then search the index
# Embed the query
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["What is a managed vector database?"],
    parameters={"input_type": "query"},
)
# Search the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)
Rerank candidate documents
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is a managed vector database?",
    documents=[r.metadata["text"] for r in results.matches],
    top_n=5,
    return_documents=True,
)
# reranked.data contains the top 5 most relevant documents, scored 0-1
for r in reranked.data:
    print(r.score, r.document.text)
Why use Inference vs. running your own embedding service
- No GPU to manage — Pinecone hosts and scales the model
- Same SDK as the index (no separate auth or billing to manage)
- Inference is included in Pinecone Standard / Enterprise plans
- Latency optimized for use with Pinecone's index (same network)
FAQ
Q: Is Pinecone Inference free? A: There's a free tier (2K embeddings/month). Beyond that it's pay-as-you-go, billed through Pinecone's Standard plan. Free for testing; costs scale with your index usage.
Q: Which models are available? A: llama-text-embed-v2 (1024-dim), multilingual-e5-large, pinecone-sparse-english-v0, bge-reranker-v2-m3 (reranking). Pinecone adds models periodically — check their docs for the current list.
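For pinecone-sparse-english-v0 the result carries sparse rather than dense vectors. A minimal sketch; the result field names (sparse_indices, sparse_values), the sparse-only upsert shape, and the my-sparse-index index are assumptions to check against your SDK version:
# Sparse embeddings for keyword-style retrieval; field names and upsert
# shape below are assumptions; verify against your SDK version.
sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["managed vector database"],
    parameters={"input_type": "passage"},
)
sparse_index = pc.Index("my-sparse-index")  # hypothetical sparse index
sparse_index.upsert(vectors=[{
    "id": "doc1",
    "sparse_values": {
        "indices": sparse[0].sparse_indices,
        "values": sparse[0].sparse_values,
    },
    "metadata": {"text": "managed vector database"},
}])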
Q: Can I use Pinecone Inference without Pinecone the index? A: Yes — Inference is a separate API. Generate embeddings, store them anywhere (Postgres pgvector, your own DB). The bundled use case (embed + index in one Pinecone account) is just convenient.
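A minimal sketch of that standalone path, assuming psycopg2, a POSTGRES_DSN environment variable, and a Postgres table created with the pgvector extension (CREATE TABLE docs (id text PRIMARY KEY, embedding vector(1024))):
import os
import psycopg2
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["Pinecone is a managed vector database"],
    parameters={"input_type": "passage"},
)[0].values

# Assumed schema: CREATE EXTENSION vector;
#                 CREATE TABLE docs (id text PRIMARY KEY, embedding vector(1024));
conn = psycopg2.connect(os.environ["POSTGRES_DSN"])  # hypothetical env var
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO docs (id, embedding) VALUES (%s, %s::vector)",
        ("doc1", "[" + ",".join(map(str, emb)) + "]"),  # pgvector text literal
    )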
Source & Thanks
Built by Pinecone. Commercial product with free tier.
docs.pinecone.io/inference — Inference docs