What Glean actually published in May 2026
On May 14, 2026 Glean's research team posted a blog titled "How Glean compares to ChatGPT and Claude on real enterprise work" [1]. The headline number: blind human graders preferred Glean's answers 1.9× more often than ChatGPT Enterprise and 1.6× more than Claude Enterprise across 280 queries drawn from real customer deployments. The post lists the rubric (four-point scale: correct, partially correct, off-topic, harmful), names the grader pool (24 graders, mix of in-house and contract), and links a spreadsheet with per-query scores.
That is more methodology than most vendor benchmarks ship with. Microsoft's Copilot eval pages give one number; Salesforce Einstein eval material rarely lists query count. So the first honest thing to say is that Glean has set a higher bar for vendor-published evals in this category, and the May 2026 post is a meaningful step away from the "we A/B'd internally and our product won" pattern that has dominated enterprise AI marketing since 2024.
The second honest thing: a vendor self-eval that flatters the vendor is still a vendor self-eval. The methodology gaps matter, not because Glean did anything dishonest, but because any rigorous procurement team needs to know what the 1.9× number can and cannot support.
Decomposing the 1.9× claim
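Read the blog carefully. The 1.9× is a preference ratio from a side-by-side comparison: for each of 280 queries, graders saw three anonymized answers (Glean, ChatGPT Enterprise, Claude Enterprise) and picked the best one. Glean was picked first 51% of the time, ChatGPT Enterprise 27%, and Claude Enterprise 22%. Divide Glean's share by ChatGPT Enterprise's (51 / 27 ≈ 1.9) and you get the headline. The same division against Claude Enterprise gives 51 / 22 ≈ 2.3 rather than the quoted 1.6×, so the Claude figure does not follow from these first-pick shares alone.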
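As a quick check on that arithmetic, the sketch below divides the first-pick shares quoted above and attaches a rough sampling margin to Glean's share. It assumes one independent pick per query (280 picks in total), which the post's multi-grader design may not match, and the function names are illustrative rather than anything Glean published.

```python
import math

# First-pick shares as quoted from the blog post.
shares = {"Glean": 0.51, "ChatGPT Enterprise": 0.27, "Claude Enterprise": 0.22}
N_QUERIES = 280  # assumption: one independent pick per query


def preference_ratio(winner: str, other: str) -> float:
    """Ratio of first-pick shares between two products."""
    return shares[winner] / shares[other]


def margin_of_error(share: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% margin for a proportion."""
    return z * math.sqrt(share * (1 - share) / n)


ratio_vs_chatgpt = preference_ratio("Glean", "ChatGPT Enterprise")
ratio_vs_claude = preference_ratio("Glean", "Claude Enterprise")
glean_margin = margin_of_error(shares["Glean"], N_QUERIES)

print(f"vs ChatGPT Enterprise: {ratio_vs_chatgpt:.2f}x")
print(f"vs Claude Enterprise:  {ratio_vs_claude:.2f}x")
print(f"Glean share: 51% +/- {glean_margin * 100:.1f} points")
```

The margin is the useful part for a buyer: with only 280 picks, each share carries several points of sampling noise, and a ratio of two noisy shares is noisier still.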
That is one specific task: answer-quality preference, on enterprise queries, against retrieval-thin consumer-shaped frontier-LLM products. The 280 queries were drawn from "real customer workloads" — but the post does not specify which customers, which industries, or what the query distribution looks like across those workloads. ChatGPT Enterprise was tested with its default connector stack, which Glean's footnote describes as "the standard Enterprise connectors enabled at the time of the test." Claude was tested through claude.ai with the MCP gateway in default configuration.
Neither competitor was tuned. No retrieval pipeline was wrapped around them. No reranker. No system prompt engineering beyond the defaults each vendor ships. That is fair as a baseline measurement and unfair as a procurement decision input. Enterprise buyers who deploy ChatGPT or Claude in 2026 do not run them naked — they build retrieval scaffolding, they tune connectors, they often layer a vector store and a permissioned index in front. Glean's eval measures the gap between Glean and the unmodified competitor, not the gap between Glean and a competently deployed competitor stack.
Query selection: representative of what, exactly?
The 280 queries are the largest single methodology question. Glean's blog says they came from "sampling real production traffic across a mix of customer industries" with PII removed. The post does not publish the industry mix, the query-type distribution (lookup vs. summarization vs. multi-hop reasoning vs. action), or the query-length statistics.
That matters because enterprise-search benchmark distributions are not uniform. BEIR, the most-cited information-retrieval benchmark, deliberately mixes 18 datasets across question answering, fact-checking, duplicate detection, and argument retrieval, because performance on one query type predicts very little about performance on others [2]. HELM, Stanford's Holistic Evaluation of Language Models framework, similarly stratifies by task type to avoid headline-number inflation [3]. MTEB does the same for embedding models [4].
Glean's 280-query set is almost certainly skewed toward the query shapes Glean handles well — short factual lookups across a permissioned corpus, the workload type the product was built for. That is not cheating. It is what "queries from our production traffic" means when your product is enterprise search. A buyer evaluating Glean against ChatGPT Enterprise for, say, multi-step research workflows or document drafting should not assume the 1.9× holds. Different query type, different bench.
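A stratified breakdown of the kind the post omits could be computed as in the sketch below. The records and query-type labels are hypothetical placeholders, not values from Glean's spreadsheet; a real run would load the per-query picks and tag each query with a type first.

```python
from collections import Counter, defaultdict

# Hypothetical per-query records: (query_type, product picked as best).
picks = [
    ("lookup", "Glean"),
    ("lookup", "Glean"),
    ("lookup", "ChatGPT Enterprise"),
    ("summarization", "Claude Enterprise"),
    ("summarization", "Glean"),
    ("multi_hop", "ChatGPT Enterprise"),
    ("multi_hop", "Claude Enterprise"),
]


def win_share_by_type(records):
    """Return {query_type: {product: share of first picks in that type}}."""
    counts = defaultdict(Counter)
    for query_type, winner in records:
        counts[query_type][winner] += 1
    return {
        qtype: {product: n / sum(c.values()) for product, n in c.items()}
        for qtype, c in counts.items()
    }


for qtype, product_shares in win_share_by_type(picks).items():
    print(qtype, {p: round(s, 2) for p, s in product_shares.items()})
```

A pooled win rate can hide a stratum where the leading product never wins, which is the failure mode a per-type table would expose.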
Graders, blinding, and the rubric
Twenty-four graders. Mix of in-house and contract. Blind to which vendor produced which answer. The rubric scored on a four-point scale and graders also picked an overall best answer per query. So far so good — paired blind preference is the gold-standard setup for this kind of comparison, and a 24-grader pool with inter-rater agreement reported in the blog (Krippendorff's alpha = 0.71, which is acceptable for subjective tasks) is a real effort.
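For readers who want to reproduce an agreement figure on their own grading data, here is a minimal sketch of Krippendorff's alpha for nominal labels, written from the standard coincidence-matrix definition. The post does not say which variant of alpha Glean used (nominal or ordinal), so treating the labels as nominal is an assumption, and the toy ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rating lists, one per graded item; graders may be
    missing on some items. Items with fewer than two ratings are ignored.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item, weighted by 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used
    return 1 - observed / expected


# Hypothetical "best answer" picks from three graders on four queries.
toy_units = [
    ["Glean", "Glean", "Glean"],
    ["Glean", "ChatGPT", "Glean"],
    ["Claude", "Claude", "Claude"],
    ["ChatGPT", "ChatGPT", "Claude"],
]
print(round(krippendorff_alpha_nominal(toy_units), 3))
```

For the four-point rubric scores, an ordinal distance function would be the more natural choice; the nominal form here fits the "pick the best answer" judgment.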
The harder question is what the graders were blind to and what they were not. They were blind to vendor identity. They were not blind to answer length, formatting, or citation style. Glean's product surfaces inline citations to the source documents by default; ChatGPT Enterprise and Claude Enterprise can do this but the default UI presentation differs. A grader who sees a heavily cited, well-formatted Glean answer next to a longer prose answer from Claude may rate the Glean answer as more trustworthy not because the underlying retrieval is better but because the surface presentation cues "enterprise-grade."
Forrester's TEI report on Glean's work-AI platform [5] notes a similar dynamic in customer interviews: users prefer answers that surface source attribution even when the underlying answer is equivalent. That is a real product strength of Glean. It is also a confound in a blind grader study.
Side-by-side: five enterprise-search vendors on the criteria buyers actually use
The Glean eval is a useful data point for one question (paired answer-quality preference vs. retrieval-thin competitors). Procurement teams have ten or twelve other questions. Here is the matrix that gets the budget approved.
| Criterion | Glean | ChatGPT Enterprise | Claude Enterprise | Microsoft 365 Copilot | Onyx (OSS) |
|---|---|---|---|---|---|
| Connector breadth | 100+ mature connectors | ~40 GA connectors | MCP-driven, growing catalog | 1,400+ Power Platform connectors + Graph | 60+ community connectors |
| Grounding architecture | Permissioned indexed corpus | Connector retrieval per session | MCP servers + retrieval per session | Microsoft Graph federation | Vector + lexical hybrid, self-hosted |
| Eval transparency | 280-query blind grader study (May 2026) | GPT-5 system card; no enterprise-search eval | Claude model card; no enterprise-search eval | Customer case studies, not blind evals | Public eval scripts in repo |
| Enterprise auth | SSO, SCIM, permission mirroring | SSO, SCIM, SOC 2 Type II | SSO, SCIM, SOC 2 Type II, MCP audit | Entra ID, Purview, SOC 2 | BYO auth, self-hosted |
| Audit log granularity | Per-query + per-document attribution | Per-session, admin console | Per-session + MCP tool-call logs | Purview audit logs | Self-hosted logging |
| Pricing transparency | Quote-only; ~$45-50/seat/mo reported | ~$60/user/month list | Quote-based, enterprise tier | $30/user/month add-on + E3/E5 base | Free / self-hosted |
How academic benchmarks would frame this comparison
BEIR, HELM, MTEB, MS MARCO, and KILT are the five evaluation frameworks most cited by IR and ML researchers working on retrieval-grounded systems [2][3][4]. None of them relies on paired human preference as the headline metric. The reasons are instructive for anyone reading vendor evals.
BEIR measures nDCG@10 across 18 heterogeneous tasks because preference judgments on long-form generated answers correlate weakly with downstream task success. MS MARCO uses MRR@10 on passage retrieval because that is what production search systems actually optimize. HELM stratifies by task and reports a multi-metric scorecard (accuracy, calibration, robustness, fairness, efficiency) because a single headline number hides the failure modes that matter for deployment. KILT specifically tests whether retrieved evidence supports the generated answer, because grounding is the actual product claim — not just preference.
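The two retrieval metrics named above are short enough to write out. This is a minimal sketch using the linear-gain form of DCG (relevance divided by log2 of rank plus one); other implementations use an exponential gain, so numbers may not match a given toolkit exactly. The function names and the toy relevance lists are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking over DCG of the ideal ranking.

    `ranked_relevances` are the grades of the returned documents, in order.
    `all_judged_relevances` are the grades of every judged document for the
    query, which defines the best achievable ranking.
    """
    ideal = dcg_at_k(sorted(all_judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(ranked_relevances_per_query, k=10):
    """Mean reciprocal rank of the first relevant result within the top k."""
    total = 0.0
    for relevances in ranked_relevances_per_query:
        for rank, rel in enumerate(relevances[:k], start=1):
            if rel > 0:
                total += 1 / rank
                break
    return total / len(ranked_relevances_per_query)


# Toy data: one query where the only relevant document is ranked third.
print(ndcg_at_k([0, 0, 3], [3]))
print(mrr_at_k([[0, 0, 1], [1, 0, 0]]))
```

Both metrics score the retrieval step on its own, before any answer is generated, so they are unaffected by how the final answer is formatted or cited.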
A research-grade version of Glean's May 2026 eval would publish nDCG on the retrieval step, faithfulness scores on the generated answers (using a framework like RAGAS or ARES), and a per-task-type breakdown. Glean's blog acknowledges this gap in a footnote: "Future work will publish stratified results across query types." That footnote is the most procurement-relevant sentence in the post.
What the 1.9× number tells a procurement team
Three things it supports. Glean's retrieval-grounded answer engine outperforms a frontier LLM with default-only retrieval scaffolding on Glean's own query distribution. Glean's product team has invested in evaluation infrastructure, which is a positive signal for the engineering culture. And Glean is willing to publish methodology, which sets a useful precedent the rest of the category should match.
Three things it does not support. It does not say Glean beats a competently deployed ChatGPT Enterprise stack with custom retrieval. It does not say Glean wins on the query types your business actually runs — until Glean publishes the stratified breakdown, your workload is not in the bench. And it does not say Glean wins on the dimensions that often decide enterprise deals: TCO, governance integration, MCP ecosystem fit, multi-cloud deployment, regulatory residency.
The procurement question is not "is Glean 1.9× better than ChatGPT." It is "on the 30 queries we actually care about, with our actual corpus, against the competitor we are actually considering with the configuration we would actually run, which product wins." That eval is yours to build, and no vendor will run it for you.
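One building block of such an eval is the blinding step: graders should see answers under neutral labels, in an order that varies per query. The sketch below shows one way to do that, assuming answers have already been collected per vendor. The vendor names, query IDs, and helper names are hypothetical.

```python
import random
import string


def blind_answers(answers_by_vendor, rng):
    """Shuffle vendor answers and relabel them A, B, C, ...

    Returns (blinded, key). `blinded` maps label -> answer text and goes to
    graders; `key` maps label -> vendor and stays with the eval owner.
    """
    vendors = sorted(answers_by_vendor)  # fixed order before shuffling
    rng.shuffle(vendors)
    labels = string.ascii_uppercase[: len(vendors)]
    blinded = {label: answers_by_vendor[v] for label, v in zip(labels, vendors)}
    key = dict(zip(labels, vendors))
    return blinded, key


def unblind(picks_by_query, keys_by_query):
    """Translate graders' label picks back to vendor names after grading."""
    return {qid: keys_by_query[qid][label] for qid, label in picks_by_query.items()}


# Hypothetical answers for two queries from three candidate stacks.
answers = {
    "q1": {"vendor_x": "answer x1", "vendor_y": "answer y1", "vendor_z": "answer z1"},
    "q2": {"vendor_x": "answer x2", "vendor_y": "answer y2", "vendor_z": "answer z2"},
}

rng = random.Random(2026)  # seeded so the assignment can be reproduced
keys = {}
for qid, by_vendor in answers.items():
    blinded, keys[qid] = blind_answers(by_vendor, rng)
    print(qid, sorted(blinded))

print(unblind({"q1": "A", "q2": "B"}, keys))
```

Relabeling hides vendor identity only. Answer length, formatting, and citation style still leak through, so normalizing presentation before grading is a separate step.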
A platform-first take on vendor evals
Disclosure: this site is published by ASCENDING, which builds Jarvis AI. Stating this up front because the next paragraph is going to recommend a structural approach that sidesteps single-vendor eval framing entirely. Read it with that frame.
Single-vendor self-evals are useful and limited for the same reason. They measure one product against fixed competitors at a fixed point in time. The half-life of "we beat ChatGPT" claims in this market is roughly six months — OpenAI ships GPT-5.1, the eval result drifts, the marketing page becomes wallpaper. A governance-first platform approach treats the LLM as a swappable backend and tests retrieval quality against your corpus on every model release. Jarvis Registry runs that pattern through the MCP gateway: retrieve once, route the same prompt to OpenAI, Anthropic, and Bedrock, log the answer set, let the eval framework pick the winner per query type.
That architecture does not make the Glean eval wrong. It makes the question different. Instead of "which vendor's answer wins on average," the platform question is "which combination of retrieval source and model wins on our queries, this quarter, on our governance constraints." If your procurement timeline is one year and your model landscape changes every six months, the second framing tends to age better.
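The "retrieve once, route to several models" pattern can be sketched generically as below. This is not Jarvis Registry's API or any vendor SDK: the backends are hypothetical stand-in callables, and `build_prompt` and `fan_out` are illustrative names. A real deployment would replace the stubs with actual model clients and add permission checks and logging.

```python
from typing import Callable, Dict, List

Backend = Callable[[str], str]  # takes a prompt, returns an answer


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine the query with retrieved passages into one grounded prompt."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Answer using only the sources below.\n{context}\n\nQuestion: {query}"


def fan_out(query: str, passages: List[str], backends: Dict[str, Backend]) -> Dict[str, str]:
    """Send the identical prompt to every backend and collect the answers."""
    prompt = build_prompt(query, passages)
    return {name: call(prompt) for name, call in backends.items()}


# Hypothetical stand-ins for real model clients.
stub_backends: Dict[str, Backend] = {
    "model_a": lambda prompt: f"[model_a] saw {len(prompt)} chars",
    "model_b": lambda prompt: f"[model_b] saw {len(prompt)} chars",
}

answers = fan_out(
    "What is our travel reimbursement limit?",
    ["Policy doc: the limit is set per region."],
    stub_backends,
)
for name, answer in answers.items():
    print(name, "->", answer)
```

Because every backend receives the same retrieved context, differences in the answers can be attributed to the model rather than to retrieval.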
Frequently asked
Is Glean's 1.9× claim against ChatGPT trustworthy?
It is trustworthy for what it measures: paired blind grader preference on 280 Glean-sampled enterprise queries against retrieval-thin default-config competitors [1]. It does not measure performance against a competently deployed ChatGPT Enterprise stack with custom retrieval, or against query types outside Glean's sampled distribution. Treat the headline as a real-but-narrow data point, not a procurement decision.
What methodology gaps should an enterprise evaluator flag?
Four. The query distribution is not stratified by task type, so per-task performance is hidden. Competitors were tested in default configuration, not as deployed. Graders were blind to vendor but not to answer formatting, which creates a presentation-cue confound. And the customer industry mix behind the 280 queries is not published, so workload representativeness is opaque.
How would a research-grade eval look different?
It would stratify by query type (lookup, summarization, multi-hop reasoning, action) per BEIR or HELM conventions [2][3]. It would report nDCG or MRR on the retrieval step, not just preference on the final answer. It would publish faithfulness scores using a framework like RAGAS to measure whether the answer is grounded in the retrieved evidence. And it would test competitors in their best-effort deployed configuration, not their out-of-the-box defaults.
Should I design my own enterprise-search benchmark?
Yes, if the procurement decision is six figures or more. Start with 30-50 queries drawn from your actual workload, stratified by query type. Define the rubric before you see answers. Have at least three graders, blind to vendor, and report inter-rater agreement. Test each vendor in the configuration you would actually deploy, not in defaults. Forrester's TEI methodology [5] has a useful template for the cost side; the quality side is yours to build.
Does ChatGPT Enterprise or Claude Enterprise publish comparable evals?
Not as of May 2026. OpenAI publishes GPT-5 system cards with general-capability evals; Anthropic publishes Claude model cards with the same. Neither has published a head-to-head enterprise-search eval against named competitors. Glean's blog is the only vendor-published artifact in this category, which is why the methodology audit matters — it is the only data point on the table.