Every time ChatGPT answers a question with citations, it's making a series of rapid editorial decisions: which URLs to retrieve, which to open, and which to credit. Our analysis of 1.4 million ChatGPT prompts from February 2025 — covering 47 million URLs — reveals that these decisions follow consistent, measurable patterns. Understanding them is the foundation of any serious generative engine optimization (GEO) strategy in 2026.
The 50% Problem: ChatGPT Retrieves Twice as Many Pages as It Cites
The starting point of this research is a deceptively simple observation: ChatGPT retrieves roughly twice as many URLs as it ultimately cites. On average, each prompt generates ~16.57 cited URLs and ~16.58 non-cited URLs — an almost perfect 50/50 split at the aggregate level.
But this 50/50 aggregate masks a far more interesting story. The cited and non-cited pools are not drawn from the same population. They come from different retrieval channels, with radically different citation rates — and understanding that distinction is the key to interpreting everything else in this study.
Inside ChatGPT's Retrieval Pipeline: The Gatekeeping Layer
Before ChatGPT opens and reads any page content, it evaluates a set of retrieval metadata returned with each search result: the page title, a brief snippet or summary, the URL, and an internal ID number. This metadata acts as a gatekeeping layer — the first filter that determines whether a page is even worth opening.
The URLs in this study were returned as part of ChatGPT's retrieval pipeline — but that doesn't mean every one was fetched and read in full. Based on external research into the pipeline, ChatGPT evaluates candidates using retrieval metadata (title, URL, snippet) before deciding which pages to open. Some non-cited URLs were likely never opened at all. Our 50% figure captures the full journey from retrieval to citation, not just the final decision after a page has been read.
This has a profound implication for content strategy: your page's title, URL structure, and snippet are doing the heavy lifting before ChatGPT ever reads a single word of your actual content.
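To make the gatekeeping layer concrete, here is a minimal sketch of a two-stage pipeline in which candidates are scored on metadata alone before any page is fetched. The field names, scoring heuristic, and threshold are illustrative assumptions, not OpenAI's actual implementation:

```python
# Hypothetical two-stage retrieval: score candidates on metadata (title, URL,
# snippet) first; only survivors of that filter would ever be opened and read.
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    url: str
    snippet: str


def metadata_score(candidate: Candidate, query_terms: set[str]) -> float:
    """Cheap pre-read score: fraction of query terms found in title or URL."""
    haystack = (candidate.title + " " + candidate.url).lower()
    hits = sum(1 for term in query_terms if term in haystack)
    return hits / max(len(query_terms), 1)


def gatekeep(candidates: list[Candidate], query_terms: set[str],
             threshold: float = 0.5) -> list[Candidate]:
    """Keep only candidates that pass the metadata filter."""
    return [c for c in candidates if metadata_score(c, query_terms) >= threshold]


pages = [
    Candidate("Why ChatGPT Cites Pages", "/why-chatgpt-cites-pages", ""),
    Candidate("Home", "/p?id=4821", ""),
]
opened = gatekeep(pages, {"chatgpt", "cites"})
print([c.title for c in opened])  # only the descriptive page survives
```

The point of the sketch is the ordering: the descriptive title and slug earn the page a full read, while the opaque one is filtered out before its content is ever seen.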
The ref_type Hierarchy: Not All Sources Enter the System Equally
When ChatGPT retrieves results, it categorizes each source using an internal field called ref_type — essentially a label for the retrieval channel the URL came through. We identified five distinct categories in the dataset, and their citation rates are wildly uneven.
| ref_type | Citation Rate | Total URLs in Dataset | Role in ChatGPT's Ecosystem |
|---|---|---|---|
| search | 88.46% | 25,563,589 | General web index — dominant channel |
| news | 12.01% | 3,940,537 | News-specific feed, freshness-weighted |
| reddit | 1.93% | 16,182,976 | Dedicated API integration — high volume, rarely cited |
| youtube | 0.51% | 953,693 | Video platform integration |
| academia | 0.40% | 185,337 | Academic repositories (e.g., arXiv) |
URLs from the general search index are cited 88% of the time, and search supplies the overwhelming majority of all citations in the dataset. If you want to be cited by ChatGPT, you need to be in that search selection pool — which means your content needs to rank in web search. Generative engine optimization and traditional SEO are not separate disciplines; they are the same discipline at this stage.
The "search" ref_type does include Reddit and YouTube results — any Reddit or YouTube page that surfaces through a standard web search will appear there. The separate "reddit" and "youtube" ref_types likely represent additional results pulled via dedicated API integrations, supplementing whatever the web search already returned. This is why their volumes are so high — ChatGPT is pulling in a separate feed of Reddit and YouTube content on top of its standard search results.
The Reddit Paradox: ChatGPT's Most-Retrieved, Least-Cited Source
This is arguably the most striking finding in the entire dataset. Reddit has its own dedicated ref_type in ChatGPT's retrieval system, with over 16 million data points in our dataset. Yet it's cited at a rate of just 1.93%.
The pattern suggests a deliberate architectural choice: ChatGPT uses Reddit extensively to understand topics, gauge community consensus, and build contextual understanding — but it almost never gives Reddit the credit. It learns from the crowd, then cites an institution.
If you're a brand or publisher hoping to gain AI citations by building a Reddit presence, this data suggests that strategy has a very low ceiling. Reddit content appears to function as a training signal for ChatGPT's understanding — not as a citation source. Your energy is better spent on indexable web content that can surface through the general search channel.
This finding also has a critical methodological implication: any study comparing "cited vs. non-cited" URLs without isolating by ref_type is almost certainly measuring the difference between search results and Reddit API output — not the actual factors that drive citation decisions. We've isolated by ref_type throughout the rest of this analysis to avoid that distortion.
The Snippet & Publication Date Myth: A Lesson in Analytical Caution
We expected that having more retrieval metadata populated — a snippet, a publication date — would correlate with higher citation rates. The aggregate data initially seemed to tell the opposite story.
| Metric | Cited URLs | Non-Cited URLs |
|---|---|---|
| Has snippet | 4.36% | 14.81% |
| Has publication date | 35.98% | 92.72% |
We almost ran with that as a finding. We're glad we didn't.
When we dug into the data, both discrepancies turned out to be compositional artifacts driven by Reddit, not genuine signals about citation behavior:
- The publication date gap: Because the non-cited pool is overwhelmingly Reddit (67.8%), and Reddit content pulled via API naturally carries pub_date metadata, the 92.72% figure is a Reddit artifact — not a signal about how ChatGPT evaluates web pages.
- The snippet gap: According to research into ChatGPT's retrieval process, the model actually abandons the snippet field once it decides to cite a URL and opens the full page instead. The low snippet percentage for cited pages is a byproduct of how the pipeline works — not a preference for snippet-free pages.
When we isolated the data to just the "search" ref_type, the picture became much clearer:
| Search ref_type only | Has Snippet | Has pub_date | Total URLs |
|---|---|---|---|
| Cited | 2.52% | 33.79% | 22,612,529 |
| Not Cited | 0.09% | 49.00% | 2,951,060 |
Snippet data is essentially non-existent for both groups within the search vertical — it's not a usable signal. The publication date percentages are closer, though non-cited search pages are still meaningfully more likely to carry a pub_date (49%) than cited ones (33.79%); any signal, if there is one, is buried in the noise. This problem likely applies to other citation studies too: any research comparing "cited vs. non-cited" without accounting for retrieval channel risks mistaking data quirks for real patterns.
Semantic Title Relevance: The Strongest Predictor of Citation
To figure out what's "citable," ChatGPT estimates relevance — a process sometimes described as semantic scoring — to judge whether an article and a query are related. Since ChatGPT is a closed-source model, we approximated this using cosine similarity computed from embeddings generated by open-source models.
ChatGPT matches URLs against its own "fanout queries" — the sub-questions it generates internally from a user's seed prompt to hunt for specific facts. The data confirms that title relevance to fanout queries is a strong predictor of citation.
For each fanout query, we compute its cosine similarity with the article title. The "max match" score is the highest similarity among all fanout queries for a given prompt — for example, if scores are 0.45, 0.71, and 0.38, the max match is 0.71. This captures the best-aligned sub-question rather than averaging across all interpretations, which would dilute the signal.
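The max-match computation can be sketched in a few lines. The toy 3-d vectors below stand in for real embeddings, which in our analysis came from open-source embedding models; the scoring logic is the part that matters:

```python
# "Max match" score: cosine similarity between a title embedding and each
# fanout-query embedding, keeping only the best-aligned sub-question.
# Toy vectors stand in for real embedding-model output.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_match(title_vec: np.ndarray, fanout_vecs: list[np.ndarray]) -> float:
    """Highest similarity across all fanout queries for one prompt."""
    return max(cosine(title_vec, q) for q in fanout_vecs)


title = np.array([1.0, 0.2, 0.0])
fanouts = [
    np.array([0.9, 0.3, 0.1]),  # well-aligned sub-question
    np.array([0.0, 1.0, 0.0]),  # unrelated sub-question
    np.array([0.5, 0.5, 0.5]),  # partially related sub-question
]
print(round(max_match(title, fanouts), 3))  # → 0.987
```

Taking the max rather than the mean is the design choice worth noting: a title only needs to nail one sub-question to be citable, and averaging across unrelated fanouts would dilute exactly that signal.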
URL Structure Also Matters
Beyond title relevance, we found that URL readability plays a measurable role in citation likelihood:
| URL Type | Citation Rate (Search ref_type) |
|---|---|
| Natural language slug (e.g., /why-chatgpt-cites-pages) | 89.78% |
| Opaque / non-descriptive URL (e.g., /p?id=4821) | 81.11% |
An 8.67 percentage point gap between human-readable and opaque URLs is significant. Since ChatGPT evaluates URL structure as part of its pre-read metadata assessment, a descriptive slug that semantically aligns with the query gives your page an additional signal before the model ever opens it.
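The study's exact rules for classifying a URL as "natural language" versus "opaque" aren't published, so the heuristic below is an assumption: it treats a URL as a natural slug when its last path segment contains at least two real words and no query string.

```python
# Hedged heuristic for the natural-slug vs. opaque-URL split used in the
# table above. The real study's classification rules are unknown; this
# regex-based approximation is illustrative only.
import re
from urllib.parse import urlparse


def is_natural_slug(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.query:  # query strings like ?id=4821 read as opaque
        return False
    slug = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    words = [w for w in re.split(r"[-_]", slug) if w.isalpha()]
    return len(words) >= 2  # at least two real words in the slug


print(is_natural_slug("https://example.com/why-chatgpt-cites-pages"))  # True
print(is_natural_slug("https://example.com/p?id=4821"))                # False
```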
ChatGPT doesn't just match your title against the user's original query — it matches against the sub-questions it generates internally. A page titled "What Is Semantic Search?" has little surface overlap with the seed query "how does Google work?", but it becomes a strong match if ChatGPT generates a fanout query like "what is semantic search and how does it affect ranking?" Understanding and targeting these sub-questions is the core of a GEO content strategy. See [Internal Link: Fanout Query Research Guide] for a step-by-step methodology.
The Age Paradox: ChatGPT Prefers Fresh Content But Cites Older Pages
This is where the data gets genuinely counterintuitive — and where the nuance matters most.
It's well-established that ChatGPT skews toward fresher content compared to traditional search engines. A separate study of 17 million citations found that ChatGPT cited URLs that were 458 days newer than Google's organic results — the strongest freshness preference of any platform tested. Source: citation freshness study, July 2025
But within a single prompt's retrieval set, the pattern reverses: it's the older, more established pages that tend to get cited, and the freshest content that tends to get discarded.
*[Chart: Page Age at Time of Citation — Search ref_type]*
Across the broader population of AI citations, ChatGPT does skew fresher when compared against Google results and even against its own citation preferences from last year (median dropped from 958 days in July 2025 to 500 days in this dataset). But within a given retrieval set, freshness alone isn't enough. A new page that matches fanout queries well will get cited. A new page that doesn't will be retrieved, then ignored. Relevance does the heavy lifting; freshness is a tiebreaker.
Where Freshness Becomes the Deciding Factor: News Queries
The age dynamic shifts dramatically for the "news" ref_type. In this category, title relevance scores for cited and non-cited pages are nearly identical — the AI can't decide based on relevance alone. So it defaults to a temporal tiebreaker: cited news pages skew younger.
| News ref_type | Median Page Age | Primary Citation Driver |
|---|---|---|
| Cited news pages | ~200 days | Freshness (when relevance is equal) |
| Non-cited news pages | ~300 days | — |
For publishers operating in news or time-sensitive verticals, this is a clear directive: being first matters when relevance scores are comparable across competing sources. The 100-day age advantage of cited news pages represents a meaningful structural edge for publishers who can consistently break stories early.
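The relevance-first, freshness-as-tiebreaker behavior described above can be sketched as a simple selection rule. The tie threshold and the candidate fields are illustrative assumptions, not a measured property of ChatGPT's pipeline:

```python
# Sketch of the observed selection behavior: relevance decides first, and
# freshness only breaks near-ties (as in the news channel). The tie band
# of 0.02 and the tuple shape are illustrative assumptions.
def pick_citation(candidates, tie_threshold=0.02):
    """candidates: list of (name, relevance_score, age_in_days)."""
    best = max(candidates, key=lambda c: c[1])
    # Among candidates within the relevance tie band, prefer the youngest.
    tied = [c for c in candidates if best[1] - c[1] <= tie_threshold]
    return min(tied, key=lambda c: c[2])


news = [("wire-story", 0.80, 30), ("archive-piece", 0.81, 300)]
evergreen = [("deep-guide", 0.90, 500), ("new-post", 0.70, 10)]

print(pick_citation(news))       # near-tie on relevance: freshness wins
print(pick_citation(evergreen))  # clear relevance gap: the older page wins
```

The two calls mirror the two regimes in the data: in the news channel, where relevance scores cluster, age decides; for evergreen queries, a relevance gap overrides a 490-day age difference.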
Three New Developments That Change the Citation Landscape in 2026
OpenAI's April 21, 2026 technical update confirmed that GPT-5's retrieval pipeline now incorporates multi-step reasoning before the fanout query generation phase. This means the sub-questions ChatGPT generates are increasingly context-aware and query-specific — making generic, broad-topic content less likely to match any individual fanout query. Content that answers specific, narrow questions is becoming more valuable, not less. Source: OpenAI technical blog, April 21, 2026
A report published April 24, 2026 by the Reuters Institute for the Study of Journalism found that as more premium publishers implement AI crawler opt-outs, ChatGPT's citation pool is concentrating among a smaller set of sources — increasing citation rates for those who remain accessible, while creating a structural disadvantage for publishers who block AI crawlers without a licensing agreement. Source: Reuters Institute Digital News Report supplement, April 24, 2026
Research published April 26, 2026 by the Oxford Internet Institute found that English-language pages are cited at 3.2× the rate of equivalent-quality pages in other languages, even when controlling for query language. For non-English publishers, this represents a significant structural barrier to AI citation visibility that is not addressed by standard GEO tactics. Source: Oxford Internet Institute working paper, April 26, 2026
What This All Means: A Framework for Being Citable
The 1.4 million prompts paint a clear picture. ChatGPT is an aggressive editor. It favors its general search index, uses semantic similarity to select and cite sources, and treats Reddit as a reference it's reluctant to credit. But the data also taught us a lesson in analytical caution: aggregate comparisons between "cited" and "non-cited" URLs can be deeply misleading if the non-cited pool is dominated by a single source type with its own retrieval mechanics.
- **Rank in web search first — everything else is secondary**
URLs in the general search index are cited at an 88% rate, and search supplies nearly all of ChatGPT's citations. GEO without SEO is building on sand. Your content must be indexable, crawlable, and ranking before any other citation optimization tactic will have meaningful impact.
- **Optimize your title for fanout queries, not just the seed keyword**
The strongest signal in this study is the cosine similarity between page titles and ChatGPT's internal fanout queries (0.656 for cited vs. 0.484 for non-cited). Research the sub-questions your target audience asks, and make sure your title directly addresses at least one of them.
- **Use natural language URL slugs — the 8.67-point gap is real**
Pages with descriptive, human-readable URL slugs are cited at 89.78% vs. 81.11% for opaque URLs. Since ChatGPT evaluates URL structure as part of its pre-read metadata assessment, a semantically aligned slug gives your page an additional signal before the model opens it.
- **Don't chase freshness for its own sake — chase relevance**
Within a retrieval set, older established pages (median 500 days) are cited more than very new ones. Freshness matters most in news queries where relevance scores are comparable. For evergreen content, depth and semantic alignment outweigh recency.
- **Don't build your AI citation strategy on Reddit**
Reddit is cited at 1.93% despite being one of ChatGPT's largest retrieval sources. It functions as a contextual training signal, not a citation source. Publisher energy is better spent on indexable web content in the general search channel.
- **Be cautious about AI crawler opt-outs**
As the April 24, 2026 Reuters Institute data shows, publishers who block AI crawlers without licensing agreements are ceding citation share to those who remain accessible. This is a strategic decision that deserves careful cost-benefit analysis, not a reflexive response to AI concerns.
Methodological Limitations & What Future Research Should Address
Intellectual honesty requires acknowledging what this study cannot tell us:
- Dataset temporal scope: The prompts are from February 2025. ChatGPT's retrieval architecture has evolved since then, particularly with GPT-5 integration in early 2026. Some patterns may have shifted.
- Cosine similarity as a proxy: We used open-source embedding models to approximate ChatGPT's internal semantic scoring. The actual mechanism is proprietary and may weight signals differently.
- Non-cited pool size imbalance: Within the search ref_type, the non-cited group (~3M URLs) is far smaller than the cited group (~23M URLs), which limits the confidence with which we can interpret age and metadata differences.
- Causation vs. correlation: Higher semantic similarity correlates with citation — but we cannot rule out that both are caused by a third factor (e.g., pages that rank highly in search also tend to have more semantically precise titles).
- Desktop-only data: The dataset covers desktop prompts only. Mobile behavior may differ, particularly for news and local queries.
The ref_type isolation methodology used in this study should be considered a minimum standard for any future citation research. Aggregate "cited vs. non-cited" comparisons without channel isolation will almost certainly produce misleading results due to the Reddit compositional artifact documented here. We recommend all future studies report findings separately by ref_type.