How to Write Content That Gets Cited by AI Search (2026 Craft Guide)
Updated 2026-05-22
Questions this guide answers
- How do I write content that AI engines cite?
- What content patterns does ChatGPT cite?
- How do I make my blog posts AI-friendly?
- Does article length matter for AI citations?
- Should I add FAQ schema to my articles?
Direct answer
To write content that gets cited by AI search engines like ChatGPT, Perplexity, Google AI Overviews, Claude, and Gemini, structure passages — not pages — for retrieval. Nine content patterns consistently increase citation likelihood: definition blocks, bulleted lists, comparison tables, sourced statistics, direct-answer paragraphs, real FAQ blocks, step-by-step lists, glossary entries, and methodology disclosures. AI engines extract passages from pages. Pages shaped only for human scrolling lose to pages that are also shaped for passage-level lift.
Why some content gets cited and most doesn't
Retrieval-augmented generation (RAG) — the system behind ChatGPT Search, Perplexity, Google AI Overviews, AI Mode, Claude with web search, and Gemini — does not read your article the way a human does. It chunks the page (a NAACL 2025 Findings paper found that fixed-window chunks around 200 words match or beat semantic chunking on most retrieval tasks), embeds each chunk, ranks chunks against the user's query, and lifts the highest-ranked passages into the answer. The model then chooses which to cite.
That has one large implication for writers: the unit of retrieval is the passage, not the page. A 2,500-word essay where the key claim is buried in paragraph 14 will lose to a 600-word page where the same claim is the first sentence under a clean H2. A wall-of-text page with no extractable structure will lose to a less-authoritative competitor whose answer is in a four-row table.
Recent work on structural preservation makes the same point from the engineering side: flat chunking that destroys structural boundaries complicates citation tracking, while structure-aware chunking pushes citation accuracy above 0.9 on benchmarks like HotpotQA. Translation for writers: when you give the retriever clean structural boundaries — H2s, lists, tables, definition blocks — you make the engine's job easier. Easier extraction means higher rank. Higher rank means citation.
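To make the chunk, embed, rank loop concrete, here is a minimal sketch in Python. It is not how any particular engine works: the bag-of-words vectors are a toy stand-in for a learned embedding model, and the function names (`chunk_words`, `rank_chunks`) are illustrative assumptions. The 200-word default simply echoes the chunk size quoted above.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=200):
    """Split text into fixed windows of `size` words (the last may be shorter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query, chunks):
    """Return (score, chunk_index) pairs, best match first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), i) for i, c in enumerate(chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical page: one direct answer followed by unrelated filler.
    page = (
        "Answer engine optimization is the practice of shaping content "
        "so AI answer engines cite it. "
        + "Our company history goes back many years and includes several "
        "unrelated product launches. " * 30
    )
    chunks = chunk_words(page, size=40)
    for score, index in rank_chunks("what is answer engine optimization", chunks)[:3]:
        print(f"chunk {index}: score={score:.3f}")
```

Even with this crude scoring, the chunk that states the answer directly outranks the filler, which is the passage-level effect described above.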
The rest of this guide is the craft layer: nine specific content shapes that get extracted more often, with before/after examples. None of these are hacks. They are the same writing habits good reference writers — Wikipedia editors, HubSpot's senior writers, TechRadar's review team — have used for years. Those habits now have a measurement layer underneath them.
9 content patterns with before/after examples
Each pattern below shows a "before" (commodity prose, the version most blog posts default to) and an "after" (the citable shape). Apply at least four of the nine on every page you want AI engines to cite.
1. Definition blocks
A single-sentence definition that AI can lift verbatim is one of the most disproportionately cited content shapes in our dataset.
Before: "When people talk about Answer Engine Optimization, they usually mean a kind of evolution of SEO that takes into account the fact that AI tools are now answering questions for users, which means you have to think about how those tools work."
After: "Answer Engine Optimization (AEO) is the practice of optimizing content, structured data, and source-layer presence so that AI answer engines (ChatGPT, Perplexity, Google AI Overviews, Claude, Gemini) cite your brand when they answer buyer questions."
The "after" version is 36 words, leads with the term, follows with a precise definition, and names the engines. It can be extracted and dropped into an answer with zero rewriting. The "before" version cannot.
2. Bulleted lists: scannable, RAG-extractable
Bulleted lists are the most reliably lifted shape in answer engines. They have natural chunk boundaries, parallel grammar, and finite item counts the model can summarize.
Before: "There are several reasons your content might not be getting cited, including issues with your page structure, your authority on the topic, how recent the page is, whether it's easy for crawlers to access, and a few other technical factors that all interact in different ways."
After: "Five common reasons content doesn't get cited by AI:
- The page isn't crawlable by GPTBot, OAI-SearchBot, or PerplexityBot.
- Key claims are buried below the fold instead of in a direct-answer paragraph.
- The page has no structural boundaries (no H2s, no lists, no tables).
- The topic has no authority signal (no sourced data, no named methodology).
- The page hasn't been updated in 18+ months."
3. Comparison tables: AI engines disproportionately cite tabular content
Tables compress comparison into a structure the retriever can extract whole. TechRadar's review tables and HubSpot's "X vs Y" matrices are two of the most cited shapes in our 17,551-citation dataset for a reason.
Before: "ChatGPT tends to prefer different kinds of content compared to Perplexity, and Google AI Overviews has its own preferences too, while Claude often favors editorial sources..."
After:
| Engine | Strongest content preference | Recency weight |
|---|---|---|
| ChatGPT Search | Passage-extractable lists + definitions | Medium-high |
| Perplexity | Recent, sourced, structured | Very high |
| Google AI Overviews | SEO-ranking content + clean H2/bullet structure | Medium |
| Claude (web search) | Authoritative editorial sources | Medium |
| Gemini | Inherits Google ranking + grounding sources | Medium |
The table is liftable as a single unit. The prose version requires the reader to assemble the comparison themselves, and the engine to do the same.
4. Sourced statistics
Standalone numbers without provenance get downweighted. Numbers with a named, datable source are some of the most-cited fragments in the corpus.
Before: "Most content is not cited by AI engines, and only a small fraction of brands actually capture any meaningful citation share."
After: "Across our 17,551-citation SolCrys dataset spanning ChatGPT, Perplexity, Google AI Overviews, and Gemini over 30 days, owned content from any single vendor captured under 1% of category citations. Ahrefs's 75K-brand study, published in 2025, found YouTube mentions correlated more strongly (~0.737) with AI brand visibility than any other factor."
The "after" version is two sentences, three numbers, two sources, two methods. It is exactly the shape AI engines lift into answers about "how rare are AI citations."
5. Direct-answer paragraphs
A direct-answer paragraph is a self-contained 40–80 word summary placed immediately under the H2 (or, for the page-level version, at the very top). It is the single highest-leverage shape on this list for ChatGPT Search and Perplexity.
Before: A 600-word section that opens with three paragraphs of context before stating the answer.
After: A 60-word paragraph that states the answer in the first sentence, qualifies in the second, and lists the components in the third, followed by the long-form discussion below it.
The "after" shape is what every section in this article uses. It is also what Wikipedia opens every article with. There is a reason it dominates citation share.
6. Real FAQ blocks
A real FAQ block (a visible H2 of questions with their answers, written for humans first) is a high-extraction shape. A schema-only FAQ (FAQPage JSON-LD without visible Q&A on the page) is a policy violation under Google's structured data guidelines and, since Google retired FAQ rich results in May 2026, no longer eligible for visible search treatment either. Schema for non-visible content is the wrong move.
Before: A FAQPage JSON-LD block in the page head listing five questions, with no visible FAQ on the page. (Policy violation, and Google has explicitly stopped surfacing FAQ rich results.)
After: A visible "## FAQ" H2 with each question as an H3 (or bold), each answer as a clean 40–100 word paragraph. Add FAQPage schema only if it mirrors the visible content exactly.
7. Step-by-step lists: numbered procedures
Numbered steps are an extraction-friendly shape for procedural prompts ("how do I…"). They have explicit order, parallel grammar, and a finite count.
Before: "First, you'd want to audit your existing pages to see what's working, and then it makes sense to think about what content shapes you're missing, after which you can start adding structure where it's most needed and so on."
After: "How to retrofit existing pages for AI citation:
1. Audit the page for the 9 content patterns in this guide.
2. Identify the 2–3 patterns missing.
3. Add a direct-answer paragraph under the H1 (40–80 words).
4. Convert at least one prose section into a comparison table or bulleted list.
5. Add a sourced-statistic block with named source and date.
6. Re-test the page in a tracked prompt set after 14 days."
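The first step of that retrofit procedure, auditing a page for the patterns, can be roughed out in code. The sketch below assumes the page is available as Markdown and checks for a handful of the shapes with simple regular expressions. The names `audit_page` and `missing_patterns`, the pattern labels, and the heuristics themselves are hypothetical. A real audit needs an HTML parser and a human read.

```python
import re


def audit_page(markdown):
    """Report which extractable shapes a Markdown page appears to contain."""
    lines = markdown.splitlines()
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]

    # First block that is not a heading, list item, or table row.
    first_prose = next(
        (b for b in blocks if not re.match(r"(#|[-*] |\d+\. |\|)", b)), ""
    )
    word_count = len(first_prose.split())

    return {
        "h2_headings": any(re.match(r"## ", line) for line in lines),
        "bulleted_list": any(re.match(r"\s*[-*] ", line) for line in lines),
        "numbered_steps": any(re.match(r"\s*\d+\. ", line) for line in lines),
        # A pipe-table separator row such as |---|---|
        "table": any(re.match(r"\s*\|\s*:?-{3,}", line) for line in lines),
        "direct_answer_40_80_words": 40 <= word_count <= 80,
    }


def missing_patterns(markdown):
    """List the checked shapes that the page does not appear to contain."""
    return [name for name, present in audit_page(markdown).items() if not present]


if __name__ == "__main__":
    page = "# Title\n\nShort intro.\n\n## Section\n\n- one\n- two\n"
    print(missing_patterns(page))
```

The output is only a to-do list for the second step; it says nothing about whether the content inside each shape is any good.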
8. Glossary entries
A glossary entry is a definition block with even tighter shape: term in bold, single-sentence definition, optional one-sentence example. It is the most consistently cited shape on Wikipedia (978 citations in our dataset, the #1 cited domain) and a major part of why HubSpot's glossary pages (HubSpot is at 380 citations) over-perform their content investment.
Before: A paragraph defining "retrieval-augmented generation" inside a longer discussion of AI systems.
After:
**Retrieval-Augmented Generation (RAG)**: A system architecture in which a language model retrieves relevant document chunks from an external source at query time and grounds its answer in those chunks. Used by ChatGPT Search, Perplexity, Google AI Overviews, Claude, and Gemini.
9. Methodology disclosures: "we measured X by Y"
A methodology disclosure earns the trust signal AI engines use to weight authority. It also lets the engine cite your data with a defensible source attribution.
Before: "Our research shows that comparison tables get cited more often than other content shapes."
After: "We measured citation share by tracking 17,551 citations across ChatGPT, Perplexity, Google AI Overviews, and Gemini over a 30-day window, scoped to a 22-prompt AEO category prompt set (1,936 total responses, 2,219 unique cited domains). Within that sample, comparison tables appeared in citation passages at roughly 1.8x the rate of equivalent-length prose sections on the same topics."
The "after" version is liftable, attributable, and defensible. The "before" version is a marketing claim.
What Ahrefs's 75K-brand study found
Ahrefs published a study in 2025 analyzing 75,000 brands across ChatGPT, Google AI Mode, and AI Overviews to find which signals correlate with brand visibility in AI answers. The headline numbers, with the appropriate caveat that correlation is not causation:
- YouTube mentions showed the strongest correlation with AI brand visibility (~0.737), outperforming every other measured factor across all three engines.
- Branded web mentions correlated 0.66–0.71 with AI visibility, beating backlinks roughly 3:1 (0.664 vs 0.218).
- Branded search volume (0.352) and Domain Rating (0.266) correlated weakly with ChatGPT visibility, meaning classic SEO authority metrics are not sufficient on their own.
- Site page volume (~0.194) had almost no relationship with AI visibility. Publishing more pages does not buy you more citations.
- Recency matters: a separate Ahrefs analysis of 17 million citations found AI assistants prefer fresher content than traditional Google rankings do, with ChatGPT being the most freshness-biased.
The takeaway for writers: scaling page count is a losing strategy; investing in fewer, better-shaped pages that are cited externally (YouTube, third-party editorial, branded mentions) is the winning strategy. This corroborates what our 17,551-citation dataset already showed: owned content alone barely registers; source-layer presence dominates.
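For readers who want to sanity-check the ratios, the coefficients quoted in this section can be lined up in a few lines of Python. The numbers are copied from the text above; the labels are ours. Note that the figures are not all measured against the same engine, so the ordering is a rough guide rather than a strict ranking.

```python
# Correlation coefficients as quoted in this article from the Ahrefs study.
correlations = {
    "YouTube mentions": 0.737,
    "Branded web mentions": 0.664,
    "Branded search volume": 0.352,
    "Domain Rating": 0.266,
    "Backlinks": 0.218,
    "Site page volume": 0.194,
}

# Strongest signal first.
ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
for signal, r in ranked:
    print(f"{signal:<22} r = {r:.3f}")

# The "3:1" comparison between branded mentions and backlinks.
ratio = correlations["Branded web mentions"] / correlations["Backlinks"]
print(f"branded mentions vs backlinks: {ratio:.2f}x")
```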
What the top-cited domains in our dataset consistently do
A note of intellectual honesty here: SolCrys's own owned domain captures only 150 of the 17,551 citations in our 30-day AEO category dataset — about 0.85%. Our own citation success is modest. The right move is to look at what the *top-cited domains in our dataset* actually do, and copy the craft choices that earned them their position.
Wikipedia (978 citations, #1 in our dataset): every article opens with a one-sentence definition (pattern #1), a 40–80 word lede (pattern #5), and a structured infobox (pattern #3). Sections are short, headings are explicit, lists are everywhere.
TechRadar (908 citations, #2): every review uses a comparison table (pattern #3), a "Pros / Cons" bulleted list (pattern #2), a sourced spec block, and a clear verdict paragraph at the top (pattern #5).
HubSpot (380 citations, #4): long-form deep guides with multiple H2s, each opening with a direct-answer paragraph (pattern #5), heavy use of glossary entries (pattern #8), and named methodology where data is presented (pattern #9).
The common thread: structure that respects the reader and the retriever. None of these publishers use AI-only schema, llms.txt files, or other engine-specific hacks. They use the patterns above, consistently, across thousands of pages.
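The citation counts in this section are easier to compare as shares of the full sample. A small sketch of that arithmetic, using only the counts quoted in this article (the `share` helper is our own naming):

```python
TOTAL_CITATIONS = 17_551  # all citations in the 30-day dataset
TOTAL_RESPONSES = 1_936   # all engine responses in the same window

# Citation counts per domain, as quoted in this article.
cited = {
    "Wikipedia": 978,
    "TechRadar": 908,
    "HubSpot": 380,
    "SolCrys (owned)": 150,
}


def share(count, total=TOTAL_CITATIONS):
    """Citation share as a percentage of all citations in the sample."""
    return 100 * count / total


for domain, count in cited.items():
    print(f"{domain:<16} {count:>4}  {share(count):.2f}%")

# Average number of citations attached to each response.
print(f"citations per response: {TOTAL_CITATIONS / TOTAL_RESPONSES:.1f}")
```

Run as written, this puts Wikipedia at 5.57%, TechRadar at 5.17%, HubSpot at 2.17%, and our owned domain at 0.85%, with about 9.1 citations per response.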
Engine-by-engine content preferences
Each engine has a different retrieval stack, which means different content shapes pay off differently. The summary table:
| Engine | Retrieval stack | Strongest content preference | Watch out for |
|---|---|---|---|
| ChatGPT Search | Bing index + GPTBot/OAI-SearchBot + RAG | Passage-extractable lists, definitions, direct-answer paragraphs | Buried answers; pages without H2 structure |
| Perplexity | Real-time RAG over multi-source fetch | Recent, sourced, structured content with named methodology | Stale dates, unsourced claims |
| Google AI Overviews / AI Mode | Google ranking + grounding | Content that already ranks in SEO + clean H2/bullet structure | Pure AEO plays that ignore SEO fundamentals |
| Claude (web search) | Brave Search + Claude grounding | Authoritative editorial sources, longer-form analysis | Thin content, low-authority domains |
| Gemini | Google index + Gemini grounding | Mirrors Google AIO preferences | Same as Google AIO |
The pattern: ChatGPT and Perplexity reward passage shape most aggressively; Google AIO and Gemini reward SEO ranking first and then add RAG; Claude rewards authority and depth. Writing for one engine in isolation is a losing approach; the nine patterns above work across all five.
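None of these preferences matter if the crawlers named in this guide (GPTBot, OAI-SearchBot, PerplexityBot) are blocked from the page. Here is a minimal sketch of a robots.txt check using Python's standard `urllib.robotparser`. The sample robots.txt content and the example URL are hypothetical; substitute the file your own site serves.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot"]

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_access(robots_txt, url, agents=AI_CRAWLERS):
    """Map each crawler user agent to whether robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = crawler_access(SAMPLE_ROBOTS_TXT, "https://example.com/guides/aeo")
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only tests robots.txt rules. It does not detect blocks applied at the CDN or firewall level, which need to be checked separately.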
Anti-patterns (what to stop doing)
The AI-search content market has produced a long list of "tricks" that do not work, do not survive policy review, or have been explicitly refuted by the engines themselves. Six to retire:
- Question-shaped H2 quotas. Forcing every H2 into a "How do I…?" shape is not a Google ranking factor — Google has explicitly stated this. Use question H2s where the content is genuinely a question; do not force them.
- FAQ schema on pages without visible FAQs. Policy violation under Google's structured data guidelines, and since Google retired FAQ rich results in May 2026, no longer eligible for visible search treatment anyway. Schema must mirror visible content.
- `llms.txt` as a content-discoverability lever. Google's May 2026 AI Features guidance explicitly states no special AI files are required. Ship `llms.txt` only if it does not delay higher-leverage work. See Why llms.txt is not a strategy.
- "AI-only" schema. There is no special AI schema. Google's May 2026 guidance says any structured data you use should match visible page content. Stop chasing vendor-invented schema standards.
- Mass programmatic SEO. Google's spam guidelines flag scaled content abuse — programmatic pages with thin or near-duplicate content trigger downranking, not citation lift. The Ahrefs 75K study showed page count has near-zero correlation with AI visibility (~0.194).
- LLM-generated content without human editing. Pure model output lacks the E-E-A-T signals AI engines (and Google's quality systems) reward. Use AI as a draft layer, not the published layer.
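On the FAQ schema item above: one way to keep schema and visible content from drifting apart is to generate both from the same source. The sketch below builds FAQPage JSON-LD from the question and answer pairs that are rendered on the page, then checks that the two match. The function names are our own, and this is an illustration of the "schema mirrors visible content" rule, not a validator for Google's guidelines.

```python
import json


def faq_jsonld(visible_faq):
    """Build FAQPage JSON-LD from the same (question, answer) pairs shown on the page."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in visible_faq
        ],
    }


def schema_mirrors_page(jsonld, visible_faq):
    """True when the schema's questions and answers equal the visible ones, in order."""
    in_schema = [
        (item["name"], item["acceptedAnswer"]["text"]) for item in jsonld["mainEntity"]
    ]
    return in_schema == list(visible_faq)


if __name__ == "__main__":
    # Two entries abbreviated from this article's own FAQ.
    faq = [
        ("Does article length matter?", "Less than writers assume."),
        ("Should I use schema?", "Yes, but only schema that mirrors visible content."),
    ]
    print(json.dumps(faq_jsonld(faq), indent=2))
```

If the visible FAQ is edited by hand later, rerunning `schema_mirrors_page` against the published JSON-LD is a cheap way to catch a mismatch before it ships.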
Get the content-pattern checklist + run a free audit
The fastest way to apply the nine patterns to your own pages is to start with the ones AI engines are already trying to extract from your site and failing on. Our free 25-prompt AI visibility audit shows the buyer prompts where your brand is missing from AI answers, the URLs that *are* getting cited (mentions-you data), and the content-shape gaps on each of those pages.
Use the audit as the input to a rewrite plan: pick the 5 highest-value missing prompts, apply the four most-relevant patterns above to each target page, republish, and re-test in 14 days. That is the entire AEO content loop — measurement, craft, ship, verify.
For the strategic version of this same idea — how to choose which pages get the rewrite investment — read How to earn AI citations: the content-source match framework and How AI engines choose sources. For platform-specific craft notes, see Optimize for ChatGPT Search and Google endorses AEO fundamentals.
*Last updated 2026-05-22. Citation data drawn from a continuous 30-day cross-engine measurement (22 prompts × 4 engines, 1,936 responses, 17,551 citations, 2,219 unique cited domains). Ahrefs 75K-brand correlation data: Ahrefs Blog, 2025. Google FAQ rich results retirement: Google Search Central, May 2026. RAG chunking research: NAACL 2025 Findings; EACL 2026 T2-RAGBench.*
FAQ
How long until I see citations after publishing?
Realistically 2–8 weeks for ChatGPT and Perplexity to begin citing a new page, longer for Google AI Overviews (which is gated on SEO ranking first). Pages that already rank organically can be picked up by AIO within days of publishing. Pages with no SEO foundation will rarely be cited by AIO regardless of shape.
Does article length matter?
Less than writers assume. The Ahrefs 75K study found that site page volume barely correlates (~0.194) with AI visibility, so sheer quantity of content is a weak lever. What matters is whether the page contains extractable passages on the right topic. A 600-word page with a strong direct answer, table, and FAQ outperforms a 4,000-word page with the same claim buried in paragraph 18.
What about question-shaped H2s?
Use them when the content under the H2 is a real answer to a real question. Do not force every H2 into question shape — Google has explicitly said this is not a ranking factor, and the practice produces awkward, hard-to-skim pages.
Should I use schema?
Yes, but only schema that mirrors visible content (Article, Product, Organization, Breadcrumb, HowTo, FAQPage — when there's a visible FAQ). Google's May 2026 guidance is clear: no special AI schema exists. Honest structured data is a comprehension signal; deceptive schema is a policy violation.
Do I need a real FAQ block?
A visible FAQ block (questions and answers on the page) is one of the nine high-extraction shapes. A schema-only FAQ with no visible questions is a policy violation and won't earn citation. Write the FAQ for humans first; add schema only if it matches.
Will AI search reward me for publishing more pages?
No. The data is unusually clear here — Ahrefs's 75K study found near-zero correlation between page count and AI visibility, and our own dataset shows owned domain page volume does not translate to owned citation share (we publish heavily and capture 0.85% of category citations). Better pages, not more pages.
Are AI citations guaranteed if I follow these patterns?
No, and any vendor claiming "guaranteed AI citation lift" is overselling. The nine patterns increase the *likelihood* of passage-level retrieval and citation; they do not guarantee it. Authority, source-layer presence, and engine-specific quirks still matter. Treat this as craft, not a hack.