Workflows · May 7, 2026 · 3 min read

Pinecone Inference — Hosted Embeddings & Reranking API

Pinecone Inference is a managed embedding + reranking endpoint. Use llama-text-embed-v2 or other models without managing GPU infrastructure.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, metadata JSON, an adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Stage only
Trust: New
Entrypoint: Asset
Universal CLI install command
npx tokrepo install 42928dd4-41ee-451c-93ff-8122c6e90af7
Intro

Pinecone Inference is the hosted embedding + reranking layer that complements Pinecone's vector index. Generate embeddings with llama-text-embed-v2, multilingual-e5, or pluggable third-party models without running your own GPU; the reranking endpoint scores candidate documents with bge-reranker for higher RAG accuracy. Best for: anyone using Pinecone who'd rather not run an embedding service. Works with: Pinecone Python / TypeScript SDK, REST API. Setup time: 2 minutes.


Generate embeddings

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone is a managed vector database",
        "Weaviate is also a vector database",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

# Use embeddings directly with a Pinecone index
index = pc.Index("my-index")
index.upsert(vectors=[
    {"id": "doc1", "values": embeddings[0].values, "metadata": {"text": "..."}},
    {"id": "doc2", "values": embeddings[1].values, "metadata": {"text": "..."}},
])

Embed the query, then search the index

# Embed the query
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["What is a managed vector database?"],
    parameters={"input_type": "query"},
)

# Search the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)

Rerank candidate documents

reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is a managed vector database?",
    documents=[r.metadata["text"] for r in results.matches],
    top_n=5,
    return_documents=True,
)

# reranked.data contains the top 5 most relevant, scored 0-1
for r in reranked.data:
    print(r.score, r.document.text)

Why use Inference instead of running your own embedding service

  • No GPU to manage — Pinecone hosts and scales the model
  • Same SDK as the index (no separate auth or billing)
  • Inference is included in Pinecone Standard / Enterprise plans
  • Latency optimized for use with Pinecone's index (same network)

FAQ

Q: Is Pinecone Inference free? A: There's a free tier (2K embeddings/month). Beyond that it's pay-as-you-go, bundled into Pinecone's Standard plan. Free for testing; costs scale with your index usage.

Q: Which models are available? A: llama-text-embed-v2 (1024-dim), multilingual-e5-large, pinecone-sparse-english-v0, bge-reranker-v2-m3 (reranking). Pinecone adds models periodically — check their docs for the current list.
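
For keyword-style or hybrid retrieval, the sparse model returns index/value pairs instead of a dense vector. A minimal sketch, assuming the response items expose sparse_indices and sparse_values as shown in Pinecone's docs (the sparse index name here is hypothetical):

# Hedged sketch: assumes sparse embeddings carry .sparse_indices / .sparse_values
sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["Pinecone is a managed vector database"],
    parameters={"input_type": "passage"},
)

# Upsert into a sparse index (the name is illustrative)
sparse_index = pc.Index("my-sparse-index")
sparse_index.upsert(vectors=[{
    "id": "doc1",
    "sparse_values": {
        "indices": sparse[0].sparse_indices,
        "values": sparse[0].sparse_values,
    },
}])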

Q: Can I use Pinecone Inference without the Pinecone index? A: Yes — Inference is a separate API. Generate embeddings, store them anywhere (Postgres pgvector, your own DB). The bundled use case (embed + index in one Pinecone account) is just a convenience.
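
A minimal standalone sketch, assuming psycopg (v3) and pgvector; the docs table with a vector(1024) column and the DATABASE_URL variable are hypothetical:

import os

import psycopg
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Embed with Pinecone Inference, store in Postgres instead of a Pinecone index
emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["Pinecone Inference works without the Pinecone index"],
    parameters={"input_type": "passage"},
)

# pgvector accepts a '[v1,v2,...]' text literal
vec_literal = "[" + ",".join(str(v) for v in emb[0].values) + "]"

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    # Assumes: CREATE TABLE docs (id text PRIMARY KEY, embedding vector(1024));
    conn.execute(
        "INSERT INTO docs (id, embedding) VALUES (%s, %s::vector)",
        ("doc1", vec_literal),
    )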


Quick Use

  1. pip install pinecone (≥5.0)
  2. pc = Pinecone(api_key=...); call pc.inference.embed(model=..., inputs=[...])
  3. For RAG, follow with pc.inference.rerank(...) on the top candidates (see the end-to-end sketch below)
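
Putting the three steps together, a compact retrieval helper built from the calls above (the index name and cutoffs are illustrative):

def retrieve(pc, index, question, top_k=10, top_n=5):
    # 1. Embed the question as a query vector
    q = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[question],
        parameters={"input_type": "query"},
    )
    # 2. Pull top_k candidates from the index
    hits = index.query(vector=q[0].values, top_k=top_k, include_metadata=True)
    # 3. Rerank candidates and keep the top_n
    reranked = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query=question,
        documents=[m.metadata["text"] for m in hits.matches],
        top_n=top_n,
        return_documents=True,
    )
    return [(r.score, r.document.text) for r in reranked.data]

# Usage (assumes the "my-index" index from above):
# results = retrieve(pc, pc.Index("my-index"), "What is a managed vector database?")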

Source & Thanks

Built by Pinecone. Commercial product with free tier.

docs.pinecone.io/inference — Inference docs

🙏
