How Reddit became the biggest single source of LLM citations
Reddit accounts for 40 percent of LLM citations across the major AI engines. The $60M Google deal, the OpenAI license, the lawsuits, and what it means for brands in 2026.
Originally published April 13, 2026
In June 2025, Semrush analyzed 150,000 LLM citations across the major AI engines and found one number that should have ended every internal debate about AI visibility budget. Reddit was the cited source 40.1 percent of the time. Wikipedia came in second at 26.3 percent, and YouTube third at 23.5 percent. No other platform came close. Reddit, a site most marketing teams ignored five years ago, had become the single most-cited source across ChatGPT, Perplexity, Gemini, and Google AI Overviews. This is how it got there and what it means for any brand setting a 2026 AI visibility budget.
Soar is a community marketing agency that has run 4,200+ community campaigns across 280+ brands since 2017. Most of those campaigns are now scoped against this exact dynamic: a single seeded Reddit thread can compound across four AI engines from one program, and most brands still buy Reddit and AI visibility as separate services. The merge is the point.
40.1 percent: Reddit's share of citations across ChatGPT, Perplexity, Gemini, and Google AI Overviews. Source: Semrush, 150K-citation analysis, June 2025
$203 million: Reddit's disclosed 2024 AI licensing contract value. Source: Reddit S-1, TechCrunch
$60 million: Reddit's annual licensing deal with Google for Gemini training. Source: TechCrunch
More than 73 percent: Year-over-year growth in Reddit's citation share for tech and electronics. Source: Tinuiti via SaaS Intelligence, Q1 2026
The $203 million year that changed everything
The inflection point was 2024. In February, Reddit signed a deal with Google reportedly worth $60 million per year to license content for training Gemini (TechCrunch). On May 16, 2024, OpenAI followed with its own license to train on Reddit data (TechCrunch). The OpenAI deal's terms were never formally announced; the widely cited ~$70 million per year figure is derived from Reddit's IPO disclosures, not an OpenAI press release. Reddit's S-1 disclosed $203 million in aggregate AI licensing contract value, with a minimum of $66.4 million expected to be recognized as revenue in 2024.
Those deals did two things. They told every other AI lab that Reddit content was worth paying for, and they gave Reddit a financial reason to police unlicensed scraping. Reddit went from "the messy forum nobody in marketing took seriously" to the most valuable public training corpus in twelve months. Every subsequent licensing conversation and lawsuit downstream of those two contracts effectively prices Reddit as a paid input, not a free one.
Why LLMs love Reddit
The commercial story explains the licensing. The technical story explains why the licensing was worth it. Three structural properties make Reddit unusually valuable to a model training pipeline.
Long-form human text with real reasoning. Most of the internet is either short-form or structured. Reddit threads are the opposite: long comment chains where humans argue through a problem in natural language, with context, caveats, and corrections. A single top thread can run 400 comments and 10,000 words. That is the shape LLMs need to learn reasoning patterns from, and it is rare everywhere else.
Upvote-based quality signal. Reddit's voting gives the training pipeline a cheap, reliable quality filter. Top-voted comments are disproportionately accurate. An LLM training on Reddit can weight by upvotes and filter noise at nearly zero cost. No other public platform has a signal that clean at that scale.
Topic diversity. Reddit has communities for every vertical, from r/homelab to r/specializedfishtank. Wikipedia does not have an article on the right tire pressure for a Rivian R1T in Wisconsin winter. A Reddit thread does. That niche coverage lets LLMs answer specific questions well, and Reddit is nearly the only public source with it at scale.
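The upvote-based quality signal described above can be sketched in a few lines. This is a minimal illustration, not a description of any lab's actual pipeline: the `Comment` fields, the `min_score` and `min_words` thresholds, and the log scaling are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    score: int  # net upvotes; hypothetical field name


def filter_and_weight(comments, min_score=5, min_words=20):
    """Drop low-score or very short comments, then weight the rest.

    Thresholds are illustrative assumptions. Weights use log1p(score)
    so a 5,000-upvote comment does not drown out a 50-upvote one, and
    are normalized to sum to 1.
    """
    kept = []
    for c in comments:
        if c.score < min_score or len(c.text.split()) < min_words:
            continue
        kept.append((c, math.log1p(c.score)))

    total = sum(w for _, w in kept)
    if total == 0:
        return []
    return [(c, w / total) for c, w in kept]
```

The point of the sketch is cost: the filter needs nothing beyond the score Reddit already attaches to every comment, which is why the signal is cheap to use at scale.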
The 40.1 percent number, and its volatility
The Semrush study (June 2025) analyzed 150,000 citations from ChatGPT, Perplexity, Gemini, and Google AI Overviews. Reddit appeared in 40.1 percent of cited sources, Wikipedia in 26.3 percent, YouTube in 23.5 percent. No other single domain cracked 5 percent.
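For teams that want to reproduce this kind of number on their own prompt set, the sketch below shows one plausible way to compute a citation share. The article does not describe Semrush's exact methodology, so the definition here (the fraction of AI answers that cite a domain at least once) is an assumption, and `domain_share` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def _domain(url):
    """Return the host of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_share(answers):
    """Compute the fraction of answers that cite each domain.

    `answers` is a list with one entry per AI answer; each entry is the
    list of URLs that answer cited. A domain counts once per answer,
    however many of its URLs appear. This definition is an assumption.
    """
    counts = Counter()
    for urls in answers:
        counts.update({_domain(u) for u in urls})

    n = len(answers)
    if n == 0:
        return {}
    return {domain: count / n for domain, count in counts.items()}
```

Under this definition shares do not need to sum to 100 percent, because a single answer can cite several domains. Subdomains such as `en.wikipedia.org` are kept distinct here; grouping them under a registered domain would need extra handling.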
Tinuiti tracked citation share in Q1 2026 and reported that Reddit's share grew more than 73 percent year-over-year in tech and electronics specifically (SaaS Intelligence summary). The trend line still points up. The cross-engine structural weight is the durable signal; per-engine numbers are tactical and worth tracking but not strategic.
Perplexity's manual boost
Perplexity is the engine that treats Reddit most openly as an authority domain. DataStudios' writeup on Perplexity's source selection identifies a small set of domains Perplexity "manually boosts" as trusted sources, and Reddit is on the list alongside GitHub, Amazon, and LinkedIn (DataStudios). This is a product decision, not an algorithmic accident.
Perplexity sells itself as an answer engine for hard questions, and hard questions are disproportionately answered well inside Reddit threads. If your brand is discussed positively in relevant threads, you have a disproportionate chance of being cited by Perplexity above what the raw algorithm would predict. The implication for B2B brands is direct: if your buyer is a Perplexity user, the highest-leverage place to be talked about positively is in the subreddits where they research vendors.
The lawsuits, and what they tell you
On June 4, 2025, Reddit sued Anthropic in California Superior Court, alleging Anthropic scraped Reddit content without a license from December 2021 through October 2024, including more than 100,000 crawls after Anthropic publicly claimed to have stopped (TechCrunch). In October 2025, Reddit sued Perplexity along with several data-scraping defendants on the same core allegation (FinancialContent).
The interesting part is not the litigation. It is what the litigation confirms. Reddit is willing to spend legal dollars defending the commercial value of its corpus because the corpus is genuinely valuable. The fact that Reddit is litigating tells you how central its content has become to the economics of AI, and how unlikely it is that any major engine cuts Reddit out of its retrieval mix in the next two years.
What this means for brand visibility
Showing up on Reddit is the highest-leverage AI visibility intervention available. One seeded thread that ranks well inside a relevant subreddit has a chance of being training data for the next model checkpoint, retrieval data for ChatGPT Search and Perplexity today, and a cited source in a Google AI Overview. The same content hits four engines from one pipeline. Nothing else in the GEO toolkit has that leverage. The community prioritization framework we use to pick the first targets is covered in our post on how to prioritize the first communities and prompts to target.
Brand-owned communities are the most efficient version of this work. A single thread is a one-time deposit. A branded subreddit is compounding. Every post in r/YourBrand becomes a permanent source in the training and retrieval pipeline. The Foundation Inc case studies in the 2026 branded subreddit guide report that r/MintMobile drives 44 percent of Mint Mobile's social referral traffic and r/1Password drives around 46 percent of 1Password's. You get the brand reach outcome and the AI citation outcome from the same program.
The old "Reddit marketing is risky" objection no longer holds. Reddit's Rule 5 explicitly permits employees of a company to start and maintain a subreddit as long as no compensation changes hands for moderation actions. The dozen brands that have figured out the semi-official model, documented in our semi-official subreddit post, run working communities without drama. The risk is no longer Reddit. The risk is missing Reddit while competitors compound.
Frequently asked questions
Is Reddit's citation share going to stay this high?
Cross-engine, almost certainly through 2027. Reddit's licensing deals lock its content into the training corpora of the two largest engines, and Perplexity's manual boost is a product decision rather than an algorithmic accident. Per-engine shares move quarter to quarter, but the structural weight across all four major engines is durable for at least the next 18 to 24 months.
Will the Anthropic and Perplexity lawsuits remove Reddit content from those engines?
No. The lawsuits are about licensing and unpaid use, not about removing content. The most likely outcome is that Anthropic and Perplexity license Reddit data formally, the way Google and OpenAI already did. The corpus stays in the training and retrieval mix; the commercial terms get formalized.
How fast does a seeded Reddit thread show up in AI answers?
Real-time-retrieval engines (Perplexity, ChatGPT Search, AI Overviews) can pick up a thread within days if it ranks in Google Search. Training-data engines (base ChatGPT, base Claude, base Gemini) require a model checkpoint to refresh, which takes months. The fast win is retrieval citation; the slow compounding win is training-data inclusion.
Do I need a branded subreddit, or are seeded threads enough?
For most brands, the right answer is both. Seeded threads in relevant external subreddits build category authority and citation surface area. A branded subreddit is the compounding asset: every post is a permanent training source, and the brand controls the moderation. The branded subreddit becomes the highest-leverage piece once it has 1,000 to 2,000 engaged members.
What is the most common Reddit AI visibility mistake?
Treating Reddit as a one-time content seed instead of a compounding program. Brands publish three threads, measure nothing, and conclude Reddit does not work. The brands that win run a sustained cadence (8 to 12 quality threads per month) across 15 to 25 prioritized subreddits, measured against AI citation share, not Reddit upvotes.
Conclusion
Reddit is the single largest source of LLM citations across the major AI engines, and the economic weight behind that position is not going anywhere. Google reportedly pays $60 million per year to train on it. OpenAI signed its own license. Perplexity manually boosts it. Reddit is suing the labs that did not pay. Semrush put Reddit at 40.1 percent of all LLM citations, well ahead of Wikipedia at 26.3 percent and YouTube at 23.5 percent. If you care about showing up in AI answers, the highest-leverage thing you can ship this quarter is a plan for how your brand shows up on Reddit. Everything else in the GEO toolkit is secondary to that decision.