How Reddit became the biggest single source of LLM citations

April 13, 2026 in ai-visibility·8 min read

In June 2025, Semrush analyzed 150,000 LLM citations across the major AI engines and found one number that should have ended every internal debate about AI visibility budget. Reddit was the source 40.1 percent of the time. Wikipedia came in second at 26.3 percent, and YouTube third at 23.5 percent. No other platform came close. Reddit, a site most marketing teams ignored five years ago, had become the single largest training and retrieval source for the major AI engines combined. This is how it got there and why we bundle community marketing and AI visibility into one service.

The $203 million year that changed everything

The inflection point was 2024. In February, Reddit signed a deal with Google reportedly worth $60 million per year to license its content for training (TechCrunch). On May 16, 2024, OpenAI followed with its own license to train on Reddit data (TechCrunch). The OpenAI deal's terms were never formally announced; the widely cited ~$70 million per year figure is derived from Reddit's IPO disclosures, not an OpenAI press release. Reddit's S-1 showed $203 million in total AI licensing contract value for 2024, with a minimum of $66.4 million recognized as revenue that year.

$203M: Reddit's total disclosed 2024 AI licensing contract value (TechCrunch)

Those deals did two things. They told every other AI lab that Reddit content was worth paying for, and they gave Reddit a financial reason to police unlicensed scraping. Reddit went from "the messy forum nobody in marketing took seriously" to the most valuable public training corpus in twelve months.

Why LLMs love Reddit

The commercial story explains the licensing. The technical story explains why the licensing was worth it.

Long-form human text with real reasoning. Most of the internet is either short-form or structured. Reddit threads are the opposite: long comment chains where humans argue through a problem in natural language, with context, caveats, and corrections. A single top thread can run 400 comments and 10,000 words. That is the shape LLMs need to learn reasoning patterns from, and it is rare everywhere else.

Upvote-based quality signal. Reddit's voting gives the training pipeline a cheap, reliable quality filter. Top-voted comments are disproportionately accurate. An LLM training on Reddit can weight by upvotes and filter noise at nearly zero cost. No other public platform has a signal that clean at that scale.
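As a rough illustration of that filtering idea, the sketch below drops comments under a score threshold and gives the rest a log-scaled weight. The `Comment` type, the `min_score` threshold, and the log weighting are all assumptions made for illustration; none of the labs has published that it uses this scheme.

```python
import math
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    score: int  # net votes (upvotes minus downvotes)


def weight_by_upvotes(comments, min_score=5):
    """Drop low-scoring comments and return (text, weight) pairs.

    The threshold and the log scaling are illustrative assumptions:
    log scaling keeps a 5,000-point comment from drowning out a
    50-point one.
    """
    kept = [c for c in comments if c.score >= min_score]
    return [(c.text, math.log1p(c.score)) for c in kept]


# Hypothetical comments, not real Reddit data.
sample = [
    Comment("Detailed answer with caveats", 420),
    Comment("Short but correct follow-up", 12),
    Comment("Off-topic joke", 1),
    Comment("Spam link", -8),
]
weighted = weight_by_upvotes(sample)
```

Here the joke and the spam link fall below the threshold, and the detailed answer carries more weight than the short follow-up.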

Topic diversity. Reddit has communities for every vertical, from r/homelab to r/specializedfishtank. Wikipedia does not have an article on the right tire pressure for a Rivian R1T in Wisconsin winter. A Reddit thread does. That niche coverage lets LLMs answer specific questions well, and Reddit is nearly the only public source with it at scale.

The 40.1 percent number

The Semrush study (June 2025) analyzed 150,000 citations from the major AI engines, including Google AI Overviews. Reddit appeared in 40.1 percent of cited sources, Wikipedia in 26.3 percent, and YouTube in 23.5 percent. No other single domain cracked 5 percent.
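If you want to run the same kind of measurement on your own prompt set, a minimal sketch of a domain-share calculation might look like this. It assumes each citation is a URL and that "share" means the fraction of cited URLs hosted on a given domain; Semrush's actual methodology may differ, and the sample URLs are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_share(cited_urls):
    """Return {domain: percent of citations}, rounded to one decimal."""
    domains = []
    for url in cited_urls:
        host = urlparse(url).netloc.lower()
        # Treat www.reddit.com and reddit.com as the same domain.
        if host.startswith("www."):
            host = host[4:]
        domains.append(host)
    counts = Counter(domains)
    total = len(domains)
    return {d: round(100 * n / total, 1) for d, n in counts.items()}


# Hypothetical citations collected from AI answers (not real data).
citations = [
    "https://www.reddit.com/r/homelab/comments/abc/",
    "https://reddit.com/r/rivian/comments/def/",
    "https://en.wikipedia.org/wiki/Tire",
    "https://www.youtube.com/watch?v=xyz",
    "https://www.reddit.com/r/homelab/comments/ghi/",
]
shares = domain_share(citations)
```

Running it per engine as well as in aggregate makes single-engine swings visible instead of hiding them inside one average.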

40.1%: Reddit's share of all LLM citations across major engines (Semrush, June 2025)

The number is not static. Semrush follow-up tracking showed Reddit's citation share inside ChatGPT specifically dropped from roughly 60 percent to 10 percent between early August and mid-September 2025 after an OpenAI retrieval change. The 40.1 percent figure is an average across engines, so it smooths over swings like this one. The point is that Reddit's structural weight is high enough that even a sharp drop in one engine leaves it the dominant source across the category.

Tinuiti tracked citation share in Q1 2026 and reported that Reddit's share grew more than 73 percent in tech and electronics specifically (SaaS Intelligence summary). The trend line still points up.

73%: Year-over-year growth in Reddit's citation share for tech and electronics (Tinuiti, Q1 2026)

Perplexity's manual boost

Perplexity is the engine that treats Reddit most openly as an authority domain. DataStudios' writeup on Perplexity's source selection identifies a small set of domains Perplexity "manually boosts" as trusted sources, and Reddit is on the list alongside GitHub, Amazon, and LinkedIn (DataStudios). This is a product decision, not an algorithmic accident. Perplexity sells itself as an answer engine for hard questions, and hard questions are disproportionately answered well inside Reddit threads. If your brand is discussed positively in relevant threads, your odds of being cited by Perplexity are higher than the raw algorithm alone would predict.

The lawsuits, and what they tell you

On June 4, 2025, Reddit sued Anthropic in California Superior Court, alleging Anthropic scraped Reddit content without a license from December 2021 through October 2024, including more than 100,000 crawls after Anthropic publicly claimed to have stopped (TechCrunch). In October 2025, Reddit sued Perplexity along with several data-scraping defendants on the same core allegation (FinancialContent).

The interesting part is not the litigation. It is what the litigation confirms. Reddit is willing to spend legal dollars defending the commercial value of its corpus because the corpus is genuinely valuable. The fact that Reddit is litigating tells you how central its content has become to the economics of AI.

What this means for brand visibility

Showing up on Reddit is the highest-leverage AI visibility intervention available. One seeded thread that ranks well inside a relevant subreddit has a chance of being training data for the next model checkpoint, retrieval data for AI search engines today, and a cited source in a Google AI Overview. The same content hits four engines from one pipeline. Nothing else in the GEO toolkit has that leverage. We send every new client to our community prioritization framework to figure out which subreddits to touch first.

Reddit is not a social channel. It is the training set for every major LLM, and every thread you own compounds across four engines at once.

Soar, community-to-citation pipeline

Brand-owned communities are the most efficient version of this work. A single thread is a one-time deposit. A branded subreddit compounds. Every post in r/YourBrand can become a lasting source in the training and retrieval pipeline. The Foundation Inc case studies in the 2026 branded subreddit guide report that r/MintMobile drives 44 percent of Mint Mobile's social referral traffic and r/1Password drives around 46 percent of 1Password's. The same program delivers both outcomes: referral traffic and source material for AI training and retrieval.

The old "Reddit marketing is risky" objection no longer holds. Reddit's Rule 5 explicitly permits employees of a company to start and maintain a subreddit as long as no compensation changes hands for moderation actions. The dozen brands that have figured out the semi-official model, documented in our semi-official subreddit post, run working communities without drama. The risk is no longer Reddit. The risk is missing Reddit.

Conclusion

Reddit is the single largest source of LLM citations across the major AI engines in aggregate, and the economic weight behind that position is not going anywhere. Google reportedly pays $60 million per year to train on it. OpenAI signed its own license. Perplexity manually boosts it. Reddit is suing companies that did not pay. Semrush put Reddit at 40.1 percent of all LLM citations, well ahead of Wikipedia at 26.3 percent and YouTube at 23.5 percent. If you care about showing up in AI answers, the highest-leverage thing you can ship this quarter is a plan for how your brand shows up on Reddit. Everything else in the GEO toolkit is secondary to that decision.

How Soar saves you time and money

The highest-leverage AI visibility intervention in 2026 is seeding content on Reddit. Yet most agencies sell "AI visibility" and "Reddit marketing" as separate services, forcing clients to hire two vendors for what is really one pipeline. We bundle them because it is the same work. A single Soar engagement hits Reddit, ChatGPT, Claude, Perplexity, and Google AI Overviews from one program, one retainer, and one operating rhythm. When a seeded Reddit thread shows up in a Perplexity answer a month later, it is because the pipeline was designed to connect the two.

Most brands overpay by splitting budget across specialist vendors that do not talk to each other. The Reddit agency seeds threads, the GEO agency writes content, the SEO agency audits the site, and nobody connects the Reddit work to the visibility outcome. The brand pays for three programs and gets two-thirds of the compound value. Our combined program usually costs less than the sum of the three specialist retainers it replaces. To get an assessment, request a proposal. We will run your top 20 prompts through Parse and scope from there.
