Reranking for RAG: Cross-Encoders, LLM Rerankers, and Latency Tradeoffs

How to choose the right second-stage ranking layer for RAG when retrieval is good enough to find the answer but not good enough to prioritize it.

Your hybrid search is finally working. You set up the vector database, built the keyword index, and wrote the fusion logic. The right chunk of information now consistently sits somewhere in the candidate set.

But there’s still a problem: the language model keeps generating a weak answer.

You look at the logs and see exactly what happened. A developer asked how to debug a specific PAYMENTS_API_TIMEOUT error in the staging environment. The hybrid search returned twenty candidates. It found a broad authentication overview document, a stale incident note from last year, and a generic guide on retry logic. And right there at rank eight, it found the exact staging runbook chunk containing the correct troubleshooting step.

The problem is that your prompt budget only had room for the top three chunks. The exact chunk sat too low in the ranking. The model answered using the wrong evidence simply because the right evidence did not fit…
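To make the cutoff concrete, here is a minimal sketch of the scenario above. Everything in it is illustrative: the `Candidate` type, the `staging-runbook` and `other-N` IDs, and the helper names are assumptions made up for this example, not part of any real pipeline. It only shows that a top-three cut drops a chunk sitting at rank eight, however relevant that chunk is.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str  # hypothetical chunk identifier
    rank: int    # 1-based position returned by hybrid search


def build_candidates(total: int = 20, target_rank: int = 8) -> list[Candidate]:
    """Build a placeholder candidate list with the runbook chunk at target_rank."""
    return [
        Candidate("staging-runbook" if r == target_rank else f"other-{r}", r)
        for r in range(1, total + 1)
    ]


def select_for_prompt(candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Keep only the top_k candidates by first-stage rank (the prompt budget)."""
    return sorted(candidates, key=lambda c: c.rank)[:top_k]


candidates = build_candidates()
selected = select_for_prompt(candidates)

# True only if the runbook chunk survived the cut.
in_prompt = any(c.doc_id == "staging-runbook" for c in selected)

print([c.doc_id for c in selected])  # ['other-1', 'other-2', 'other-3']
print(in_prompt)                     # False
```

The retrieval step did its job here, since the runbook chunk is in the candidate set. What fails is the ordering that decides which three chunks reach the prompt.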