The Embedder Question: Seven Models, 28,350 Judgements, Two Frontiers
We benchmarked seven open-weights embedding models on a corpus of 135 live web pages and 405 stratified queries, with relevance scored by an LLM judge on a 0/1/2 rubric. The chunker was held constant; only the embedder varied. In total, the judge produced 28,350 judgements. The top-to-bottom spread in retrieval quality was 0.32 in mean judge score, larger than the spread between any two chunkers in our previous study. Introducing cost produced two Pareto frontiers, and they contained different models. Two mid-tier models produced aggregate scores identical to ten decimal places despite observably different retrieval behaviours.
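The headline numbers can be sanity-checked with a little arithmetic. The sketch below assumes that every model was judged on the same number of retrieved results per query, which the summary does not state outright; under that assumption the totals imply ten judged results per model-query pair. The function names and the sample scores are hypothetical and exist only for illustration.

```python
# Figures taken from the summary above.
MODELS = 7
QUERIES = 405
TOTAL_JUDGEMENTS = 28_350


def judged_results_per_pair(total: int, models: int, queries: int) -> int:
    """Return judgements per (model, query) pair, assuming an even split.

    The even split is an assumption, not something the summary confirms.
    """
    per_pair, remainder = divmod(total, models * queries)
    if remainder:
        raise ValueError("total does not divide evenly across model-query pairs")
    return per_pair


def mean_judge(scores: list[int]) -> float:
    """Average a list of 0/1/2 relevance judgements."""
    if not scores:
        raise ValueError("no scores to average")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("scores must be 0, 1 or 2")
    return sum(scores) / len(scores)


k = judged_results_per_pair(TOTAL_JUDGEMENTS, MODELS, QUERIES)  # 10

# Hypothetical judgements for a single query, not data from the study.
example = mean_judge([2, 1, 0, 2])  # 1.25
```

Because mean judge score lives on a 0 to 2 scale, a spread of 0.32 is a little under a sixth of the full range.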