Every time ChatGPT answers a question with citations, it's making a series of rapid editorial decisions: which URLs to retrieve, which to open, and which to credit. Our analysis of 1.4 million ChatGPT prompts from February 2025 — covering 47 million URLs — reveals that these decisions follow consistent, measurable patterns. Understanding them is the foundation of any serious generative engine optimization (GEO) strategy in 2026.
The 50% Problem: ChatGPT Retrieves Twice as Many Pages as It Cites
The starting point of this research is a deceptively simple observation: ChatGPT retrieves roughly twice as many URLs as it ultimately cites. On average, each prompt generates ~16.57 cited URLs and ~16.58 non-cited URLs — an almost perfect 50/50 split at the aggregate level.
But this 50/50 aggregate masks a far more interesting story. The cited and non-cited pools are not drawn from the same population. They come from different retrieval channels, with radically different citation rates — and understanding that distinction is the key to interpreting everything else in this study.
Inside ChatGPT's Retrieval Pipeline: The Gatekeeping Layer
Before ChatGPT opens and reads any page content, it evaluates a set of retrieval metadata returned with each search result: the page title, a brief snippet or summary, the URL, and an internal ID number. This metadata acts as a gatekeeping layer — the first filter that determines whether a page is even worth opening.
The URLs in this study were returned as part of ChatGPT's retrieval pipeline — but that doesn't mean every one was fetched and read in full. Based on external research into the pipeline, ChatGPT evaluates candidates using retrieval metadata (title, URL, snippet) before deciding which pages to open. Some non-cited URLs were likely never opened at all. Our 50% figure captures the full journey from retrieval to citation, not just the final decision after a page has been read.
This has a profound implication for content strategy: your page's title, URL structure, and snippet are doing the heavy lifting before ChatGPT ever reads a single word of your actual content.
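To make the gatekeeping layer concrete, here is a minimal sketch of a two-stage pipeline in which candidates are scored on metadata alone before any page is fetched. The field names, scoring heuristic, and threshold are illustrative assumptions, not OpenAI's actual implementation:

```python
# Hypothetical two-stage retrieval: score candidates on metadata (title, URL,
# snippet) first; only survivors of that filter would ever be opened and read.
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    url: str
    snippet: str


def metadata_score(candidate: Candidate, query_terms: set[str]) -> float:
    """Cheap pre-read score: fraction of query terms found in title or URL."""
    haystack = (candidate.title + " " + candidate.url).lower()
    hits = sum(1 for term in query_terms if term in haystack)
    return hits / max(len(query_terms), 1)


def gatekeep(candidates: list[Candidate], query_terms: set[str],
             threshold: float = 0.5) -> list[Candidate]:
    """Keep only candidates that pass the metadata filter."""
    return [c for c in candidates if metadata_score(c, query_terms) >= threshold]


pages = [
    Candidate("Why ChatGPT Cites Pages", "/why-chatgpt-cites-pages", ""),
    Candidate("Home", "/p?id=4821", ""),
]
opened = gatekeep(pages, {"chatgpt", "cites"})
print([c.title for c in opened])  # only the descriptive page survives
```

The point of the sketch is the ordering: the descriptive title and slug earn the page a full read, while the opaque one is filtered out before its content is ever seen.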
The ref_type Hierarchy: Not All Sources Enter the System Equally
When ChatGPT retrieves results, it categorizes each source using an internal field called ref_type — essentially a label for the retrieval channel the URL came through. We identified five distinct categories in the dataset, and their citation rates are wildly uneven.
| ref_type | Citation Rate | Total URLs in Dataset | Role in ChatGPT's Ecosystem |
|---|---|---|---|
| search | 88.46% | 25,563,589 | General web index — dominant channel |
| news | 12.01% | 3,940,537 | News-specific feed, freshness-weighted |
| reddit | 1.93% | 16,182,976 | Dedicated API integration — high volume, rarely cited |
| youtube | 0.51% | 953,693 | Video platform integration |
| academia | 0.40% | 185,337 | Academic repositories (e.g., arXiv) |
URLs from the general search index are cited 88% of the time, and search supplies the overwhelming majority of all citations in the dataset. If you want to be cited by ChatGPT, you need to be in that search selection pool — which means your content needs to rank in web search. Generative engine optimization and traditional SEO are not separate disciplines; they are the same discipline at this stage.
The "search" ref_type does include Reddit and YouTube results — any Reddit or YouTube page that surfaces through a standard web search will appear there. The separate "reddit" and "youtube" ref_types likely represent additional results pulled via dedicated API integrations, supplementing whatever the web search already returned. This is why their volumes are so high — ChatGPT is pulling in a separate feed of Reddit and YouTube content on top of its standard search results.
The Reddit Paradox: ChatGPT's Most-Retrieved, Least-Cited Source
This is arguably the most striking finding in the entire dataset. Reddit has its own dedicated ref_type in ChatGPT's retrieval system, with over 16 million data points in our dataset. Yet it's cited at a rate of just 1.93%.
The pattern suggests a deliberate architectural choice: ChatGPT uses Reddit extensively to understand topics, gauge community consensus, and build contextual understanding — but it almost never gives Reddit the credit. It learns from the crowd, then cites an institution.
If you're a brand or publisher hoping to gain AI citations by building a Reddit presence, this data suggests that strategy has a very low ceiling. Reddit content appears to function as a training signal for ChatGPT's understanding — not as a citation source. Your energy is better spent on indexable web content that can surface through the general search channel.
This finding also has a critical methodological implication: any study comparing "cited vs. non-cited" URLs without isolating by ref_type is almost certainly measuring the difference between search results and Reddit API output — not the actual factors that drive citation decisions. We've isolated by ref_type throughout the rest of this analysis to avoid that distortion.
The Snippet & Publication Date Myth: A Lesson in Analytical Caution
We expected that having more retrieval metadata populated — a snippet, a publication date — would correlate with higher citation rates. The aggregate data initially seemed to tell the opposite story.
| Metric | Cited URLs | Non-Cited URLs |
|---|---|---|
| Has snippet | 4.36% | 14.81% |
| Has publication date | 35.98% | 92.72% |
We almost ran with that as a finding. We're glad we didn't.
When we dug into the data, both discrepancies turned out to be compositional artifacts driven by Reddit, not genuine signals about citation behavior:
- The publication date gap: Because the non-cited pool is overwhelmingly Reddit (67.8%), and Reddit content pulled via API naturally carries pub_date metadata, the 92.72% figure is a Reddit artifact — not a signal about how ChatGPT evaluates web pages.
- The snippet gap: According to research into ChatGPT's retrieval process, the model actually abandons the snippet field once it decides to cite a URL and opens the full page instead. The low snippet percentage for cited pages is a byproduct of how the pipeline works — not a preference for snippet-free pages.
When we isolated the data to just the "search" ref_type, the picture became much clearer:
| Search ref_type only | Has Snippet | Has pub_date | Total URLs |
|---|---|---|---|
| Cited | 2.52% | 33.79% | 22,612,529 |
| Not Cited | 0.09% | 49.00% | 2,951,060 |
Snippet data is essentially non-existent for both groups within the search vertical — it's not a usable signal. The publication date percentages are closer, though non-cited search pages are still meaningfully more likely to carry a pub_date (49%) than cited ones (33.79%); any signal, if there is one, is buried in the noise. This problem likely applies to other citation studies too: any research comparing "cited vs. non-cited" without accounting for retrieval channel risks mistaking data quirks for real patterns.
Semantic Title Relevance: The Strongest Predictor of Citation
To figure out what's "citable," ChatGPT estimates relevance — a process sometimes described as semantic scoring — to judge whether an article and a query are related. Since ChatGPT is a closed-source model, we approximated this using cosine similarity computed from embeddings generated by open-source models.
ChatGPT matches URLs against its own "fanout queries" — the sub-questions it generates internally from a user's seed prompt to hunt for specific facts. The data confirms that title relevance to fanout queries is a strong predictor of citation.
For each fanout query, we compute its cosine similarity with the article title. The "max match" score is the highest similarity among all fanout queries for a given prompt — for example, if scores are 0.45, 0.71, and 0.38, the max match is 0.71. This captures the best-aligned sub-question rather than averaging across all interpretations, which would dilute the signal.
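The max-match computation can be sketched in a few lines. The toy 3-d vectors below stand in for real embeddings, which in our analysis came from open-source embedding models; the scoring logic is the part that matters:

```python
# "Max match" score: cosine similarity between a title embedding and each
# fanout-query embedding, keeping only the best-aligned sub-question.
# Toy vectors stand in for real embedding-model output.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_match(title_vec: np.ndarray, fanout_vecs: list[np.ndarray]) -> float:
    """Highest similarity across all fanout queries for one prompt."""
    return max(cosine(title_vec, q) for q in fanout_vecs)


title = np.array([1.0, 0.2, 0.0])
fanouts = [
    np.array([0.9, 0.3, 0.1]),  # well-aligned sub-question
    np.array([0.0, 1.0, 0.0]),  # unrelated sub-question
    np.array([0.5, 0.5, 0.5]),  # partially related sub-question
]
print(round(max_match(title, fanouts), 3))  # → 0.987
```

Taking the max rather than the mean is the design choice worth noting: a title only needs to nail one sub-question to be citable, and averaging across unrelated fanouts would dilute exactly that signal.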
URL Structure Also Matters
Beyond title relevance, we found that URL readability plays a measurable role in citation likelihood:
| URL Type | Citation Rate (Search ref_type) |
|---|---|
| Natural language slug (e.g., /why-chatgpt-cites-pages) | 89.78% |
| Opaque / non-descriptive URL (e.g., /p?id=4821) | 81.11% |
An 8.67 percentage point gap between human-readable and opaque URLs is significant. Since ChatGPT evaluates URL structure as part of its pre-read metadata assessment, a descriptive slug that semantically aligns with the query gives your page an additional signal before the model ever opens it.
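The study's exact rules for classifying a URL as "natural language" versus "opaque" aren't published, so the heuristic below is an assumption: it treats a URL as a natural slug when its last path segment contains at least two real words and no query string.

```python
# Hedged heuristic for the natural-slug vs. opaque-URL split used in the
# table above. The real study's classification rules are unknown; this
# regex-based approximation is illustrative only.
import re
from urllib.parse import urlparse


def is_natural_slug(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.query:  # query strings like ?id=4821 read as opaque
        return False
    slug = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    words = [w for w in re.split(r"[-_]", slug) if w.isalpha()]
    return len(words) >= 2  # at least two real words in the slug


print(is_natural_slug("https://example.com/why-chatgpt-cites-pages"))  # True
print(is_natural_slug("https://example.com/p?id=4821"))                # False
```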
ChatGPT doesn't just match your title against the user's original query — it matches against the sub-questions it generates internally. A page titled "What Is Semantic Search?" has little surface overlap with the seed query "how does Google work?", but it becomes a strong match if ChatGPT generates a fanout query like "what is semantic search and how does it affect ranking?" Understanding and targeting these sub-questions is the core of a GEO content strategy. See [Internal Link: Fanout Query Research Guide] for a step-by-step methodology.
The Age Paradox: ChatGPT Prefers Fresh Content But Cites Older Pages
This is where the data gets genuinely counterintuitive — and where the nuance matters most.
It's well-established that ChatGPT skews toward fresher content compared to traditional search engines. A separate study of 17 million citations found that ChatGPT cited URLs that were 458 days newer than Google's organic results — the strongest freshness preference of any platform tested. Source: citation freshness study, July 2025
But within a single prompt's retrieval set, the pattern reverses: it's the older, more established pages that tend to get cited, and the freshest content that tends to get discarded.
*[Chart: Page Age at Time of Citation — Search ref_type]*
Across the broader population of AI citations, ChatGPT does skew fresher when compared against Google results and even against its own citation preferences from last year (median dropped from 958 days in July 2025 to 500 days in this dataset). But within a given retrieval set, freshness alone isn't enough. A new page that matches fanout queries well will get cited. A new page that doesn't will be retrieved, then ignored. Relevance does the heavy lifting; freshness is a tiebreaker.
Where Freshness Becomes the Deciding Factor: News Queries
The age dynamic shifts dramatically for the "news" ref_type. In this category, title relevance scores for cited and non-cited pages are nearly identical — the AI can't decide based on relevance alone. So it defaults to a temporal tiebreaker: cited news pages skew younger.
| News ref_type | Median Page Age | Primary Citation Driver |
|---|---|---|
| Cited news pages | ~200 days | Freshness (when relevance is equal) |
| Non-cited news pages | ~300 days | — |
For publishers operating in news or time-sensitive verticals, this is a clear directive: being first matters when relevance scores are comparable across competing sources. The 100-day age advantage of cited news pages represents a meaningful structural edge for publishers who can consistently break stories early.
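The relevance-first, freshness-as-tiebreaker behavior described above can be sketched as a simple selection rule. The tie threshold and the candidate fields are illustrative assumptions, not a measured property of ChatGPT's pipeline:

```python
# Sketch of the observed selection behavior: relevance decides first, and
# freshness only breaks near-ties (as in the news channel). The tie band
# of 0.02 and the tuple shape are illustrative assumptions.
def pick_citation(candidates, tie_threshold=0.02):
    """candidates: list of (name, relevance_score, age_in_days)."""
    best = max(candidates, key=lambda c: c[1])
    # Among candidates within the relevance tie band, prefer the youngest.
    tied = [c for c in candidates if best[1] - c[1] <= tie_threshold]
    return min(tied, key=lambda c: c[2])


news = [("wire-story", 0.80, 30), ("archive-piece", 0.81, 300)]
evergreen = [("deep-guide", 0.90, 500), ("new-post", 0.70, 10)]

print(pick_citation(news))       # near-tie on relevance: freshness wins
print(pick_citation(evergreen))  # clear relevance gap: the older page wins
```

The two calls mirror the two regimes in the data: in the news channel, where relevance scores cluster, age decides; for evergreen queries, a relevance gap overrides a 490-day age difference.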
Three New Developments That Change the Citation Landscape in 2026
OpenAI's April 21, 2026 technical update confirmed that GPT-5's retrieval pipeline now incorporates multi-step reasoning before the fanout query generation phase. This means the sub-questions ChatGPT generates are increasingly context-aware and query-specific — making generic, broad-topic content less likely to match any individual fanout query. Content that answers specific, narrow questions is becoming more valuable, not less. Source: OpenAI technical blog, April 21, 2026
A report published April 24, 2026 by the Reuters Institute for the Study of Journalism found that as more premium publishers implement AI crawler opt-outs, ChatGPT's citation pool is concentrating among a smaller set of sources — increasing citation rates for those who remain accessible, while creating a structural disadvantage for publishers who block AI crawlers without a licensing agreement. Source: Reuters Institute Digital News Report supplement, April 24, 2026
Research published April 26, 2026 by the Oxford Internet Institute found that English-language pages are cited at 3.2× the rate of equivalent-quality pages in other languages, even when controlling for query language. For non-English publishers, this represents a significant structural barrier to AI citation visibility that is not addressed by standard GEO tactics. Source: Oxford Internet Institute working paper, April 26, 2026
What This All Means: A Framework for Being Citable
The 1.4 million prompts paint a clear picture. ChatGPT is an aggressive editor. It favors its general search index, uses semantic similarity to select and cite sources, and treats Reddit as a reference it's reluctant to credit. But the data also taught us a lesson in analytical caution: aggregate comparisons between "cited" and "non-cited" URLs can be deeply misleading if the non-cited pool is dominated by a single source type with its own retrieval mechanics.
- **Rank in web search first — everything else is secondary**
URLs in the general search index are cited at an 88% rate, and search supplies nearly all of ChatGPT's citations. GEO without SEO is building on sand. Your content must be indexable, crawlable, and ranking before any other citation optimization tactic will have meaningful impact.
- **Optimize your title for fanout queries, not just the seed keyword**
The strongest signal in this study is the cosine similarity between page titles and ChatGPT's internal fanout queries (0.656 for cited vs. 0.484 for non-cited). Research the sub-questions your target audience asks, and make sure your title directly addresses at least one of them.
- **Use natural language URL slugs — the 8.67-point gap is real**
Pages with descriptive, human-readable URL slugs are cited at 89.78% vs. 81.11% for opaque URLs. Since ChatGPT evaluates URL structure as part of its pre-read metadata assessment, a semantically aligned slug gives your page an additional signal before the model opens it.
- **Don't chase freshness for its own sake — chase relevance**
Within a retrieval set, older established pages (median 500 days) are cited more than very new ones. Freshness matters most in news queries where relevance scores are comparable. For evergreen content, depth and semantic alignment outweigh recency.
- **Don't build your AI citation strategy on Reddit**
Reddit is cited at 1.93% despite being one of ChatGPT's largest retrieval sources. It functions as a contextual training signal, not a citation source. Publisher energy is better spent on indexable web content in the general search channel.
- **Be cautious about AI crawler opt-outs**
As the April 24, 2026 Reuters Institute data shows, publishers who block AI crawlers without licensing agreements are ceding citation share to those who remain accessible. This is a strategic decision that deserves careful cost-benefit analysis, not a reflexive response to AI concerns.
Methodological Limitations & What Future Research Should Address
Intellectual honesty requires acknowledging what this study cannot tell us:
- Dataset temporal scope: The prompts are from February 2025. ChatGPT's retrieval architecture has evolved since then, particularly with GPT-5 integration in early 2026. Some patterns may have shifted.
- Cosine similarity as a proxy: We used open-source embedding models to approximate ChatGPT's internal semantic scoring. The actual mechanism is proprietary and may weight signals differently.
- Non-cited pool size imbalance: Within the search ref_type, the non-cited group (~3M URLs) is far smaller than the cited group (~23M URLs), which limits the confidence with which we can interpret age and metadata differences.
- Causation vs. correlation: Higher semantic similarity correlates with citation — but we cannot rule out that both are caused by a third factor (e.g., pages that rank highly in search also tend to have more semantically precise titles).
- Desktop-only data: The dataset covers desktop prompts only. Mobile behavior may differ, particularly for news and local queries.
The ref_type isolation methodology used in this study should be considered a minimum standard for any future citation research. Aggregate "cited vs. non-cited" comparisons without channel isolation will almost certainly produce misleading results due to the Reddit compositional artifact documented here. We recommend all future studies report findings separately by ref_type.