# Pinecone Inference — Hosted Embeddings & Reranking API

> Pinecone Inference is a managed embedding + reranking endpoint. Use llama-text-embed-v2 or other models without managing GPU infrastructure.

## Install

`pip install pinecone` (≥5.0; the Inference client ships with the standard SDK)

## Quick Use

1. `pip install pinecone` (≥5.0)
2. `pc = Pinecone(api_key=...)`; call `pc.inference.embed(model=..., inputs=[...])`
3. For RAG, follow with `pc.inference.rerank(...)` on the top candidates

---

## Intro

Pinecone Inference is the hosted embedding + reranking layer that complements Pinecone's vector index. Generate embeddings with llama-text-embed-v2, multilingual-e5, or pluggable third-party models without running your own GPU. The reranking endpoint scores candidate documents with bge-reranker for higher RAG accuracy.

Best for: anyone using Pinecone who'd rather not run an embedding service.

Works with: Pinecone Python / TypeScript SDK, REST API (a raw REST sketch appears after the FAQ).

Setup time: 2 minutes.

---

### Generate embeddings

```python
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone is a managed vector database",
        "Weaviate is also a vector database",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

# Use the embeddings directly with a Pinecone index
index = pc.Index("my-index")
index.upsert(vectors=[
    {"id": "doc1", "values": embeddings[0].values, "metadata": {"text": "..."}},
    {"id": "doc2", "values": embeddings[1].values, "metadata": {"text": "..."}},
])
```

### Embed the query, then search

```python
# Embed the query (note input_type="query", not "passage")
query_emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["What is a managed vector database?"],
    parameters={"input_type": "query"},
)

# Search the index
results = index.query(
    vector=query_emb[0].values,
    top_k=10,
    include_metadata=True,
)
```

### Rerank candidate documents

```python
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is a managed vector database?",
    documents=[r.metadata["text"] for r in results.matches],
    top_n=5,
    return_documents=True,
)

# reranked.data contains the top 5 most relevant documents, scored 0-1
for r in reranked.data:
    print(r.score, r.document.text)
```

These three steps compose into a single retrieval call; a consolidated helper is sketched after the FAQ.

### Why use Inference vs running your own embedding service

- No GPU to manage — Pinecone hosts and scales the model
- Same SDK as the index (no extra auth or billing setup)
- Inference is included in Pinecone Standard / Enterprise plans
- Latency optimized for use with Pinecone's index (same network)

---

### FAQ

**Q: Is Pinecone Inference free?**
A: There's a free tier (2K embeddings/month). Beyond that it's pay-as-you-go, bundled into Pinecone's Standard plan. Free for testing; cost scales with your index usage.

**Q: Which models are available?**
A: llama-text-embed-v2 (1024-dim), multilingual-e5-large, pinecone-sparse-english-v0, and bge-reranker-v2-m3 (reranking). Pinecone adds models periodically — check their docs for the current list.

**Q: Can I use Pinecone Inference without the Pinecone index?**
A: Yes — Inference is a separate API. Generate embeddings and store them anywhere (Postgres pgvector, your own DB); a hedged pgvector sketch follows below. The bundled use case (embed + index in one Pinecone account) is just convenient.

---
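### Putting it together (sketch)

The embed, query, and rerank snippets above chain naturally into one retrieval step. Below is a minimal sketch that composes them, using only the SDK calls shown earlier; the function name `retrieve`, the index name `my-index`, and the metadata key `"text"` are illustrative choices from this guide, not part of the SDK.

```python
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")  # assumes an index populated as shown above


def retrieve(question: str, top_k: int = 10, top_n: int = 5):
    """Embed the question, fetch top_k candidates, rerank to top_n."""
    # 1. Embed the query (input_type="query", matching the passage/query split above)
    query_emb = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[question],
        parameters={"input_type": "query"},
    )

    # 2. First-stage retrieval from the vector index
    results = index.query(
        vector=query_emb[0].values,
        top_k=top_k,
        include_metadata=True,
    )

    # 3. Second-stage rerank over the candidates' stored text
    reranked = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query=question,
        documents=[m.metadata["text"] for m in results.matches],
        top_n=top_n,
        return_documents=True,
    )
    return [(r.score, r.document.text) for r in reranked.data]


for score, text in retrieve("What is a managed vector database?"):
    print(f"{score:.3f}  {text}")
```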
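### Using Inference without the Pinecone index (sketch)

To make the last FAQ answer concrete, here is a minimal sketch that stores Inference embeddings in Postgres with pgvector instead of a Pinecone index. The `psycopg` driver, the `docs` table, and its column names are assumptions for illustration; the `vector(1024)` column matches llama-text-embed-v2's 1024-dim output noted in the FAQ.

```python
import os

import psycopg  # assumed driver; any Postgres client works
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

docs = ["Pinecone is a managed vector database"]
emb = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=docs,
    parameters={"input_type": "passage"},
)

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # llama-text-embed-v2 produces 1024-dim vectors (see FAQ above)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id serial PRIMARY KEY, text text, embedding vector(1024))"
    )
    for text, e in zip(docs, emb):
        # pgvector accepts the '[x, y, ...]' text literal; cast it explicitly
        conn.execute(
            "INSERT INTO docs (text, embedding) VALUES (%s, %s::vector)",
            (text, str(e.values)),
        )
```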
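### Calling the REST API directly (sketch)

The same embedding endpoint is reachable over plain HTTP, which is what the SDKs wrap. A hedged sketch with `requests` follows; the endpoint path, the `X-Pinecone-API-Version` header value, and the response shape reflect Pinecone's REST docs at the time of writing and may change, so verify them against docs.pinecone.io before relying on this.

```python
import os

import requests

resp = requests.post(
    "https://api.pinecone.io/embed",
    headers={
        "Api-Key": os.environ["PINECONE_API_KEY"],
        "Content-Type": "application/json",
        # REST calls are versioned by header; check the docs for the current value
        "X-Pinecone-API-Version": "2024-10",
    },
    json={
        "model": "llama-text-embed-v2",
        "parameters": {"input_type": "query"},
        "inputs": [{"text": "What is a managed vector database?"}],
    },
    timeout=30,
)
resp.raise_for_status()

# Dense responses carry one {"values": [...]} entry per input (assumed shape)
values = resp.json()["data"][0]["values"]
print(len(values))  # 1024 for llama-text-embed-v2
```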
## Source & Thanks

> Built by [Pinecone](https://github.com/pinecone-io). Commercial product with a free tier.
>
> [docs.pinecone.io/inference](https://docs.pinecone.io/guides/inference) — Inference docs

---

Source: https://tokrepo.com/en/workflows/pinecone-inference-hosted-embeddings-reranking-api
Author: Pinecone