
Reddit Tries to Block Bots, Web Crawlers to Stop Unlicensed AI Data Scraping

After Reddit's own AI deals with Google and OpenAI, the social platform is now trying to stop others from scraping its data without paying up first.

By Kate Irwin, Reporter


Reddit is updating its robots.txt file, which implements the Robots Exclusion Protocol, to try to block bots and web crawlers from swiping data and content from its site.

Reddit says "good faith actors" like the Internet Archive will continue to have access to its platform, however, and adds that most Reddit users won't be affected by or notice the change. Reddit will also continue its practice of rate-limiting, which may help prevent third-party scraping.
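
For readers unfamiliar with rate-limiting, here is a minimal sketch of one common approach, a token bucket. It is not Reddit's implementation, which the company has not detailed here; the class name, capacity, and refill rate are all assumptions chosen for illustration.

```python
class TokenBucket:
    """Illustrative rate limiter: each client may burst up to `capacity`
    requests, then is held to `refill_rate` requests per second.
    All names and numbers are hypothetical."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity   # start with a full bucket
        self.last_seen = 0.0     # timestamp of the previous request

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time since the last request,
        # never exceeding the bucket's capacity.
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_seen = now
        # Spend one token per request; refuse when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_rate=1.0)
    # Three requests at the same instant, then one a second later.
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, bucket.allow(t))
```

A scraper that fires requests faster than the bucket refills starts getting refused, while an ordinary user browsing at human speed never empties it.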

This isn't an ironclad solution; as Google notes, there are loopholes to evade robots.txt rules.

"The instructions in robots.txt files cannot enforce crawler behavior to your site; it's up to the crawler to obey them," Google states. "While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not."

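To make Google's point concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The rules, bot names, and URL below are hypothetical and are not taken from Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration only;
# this is NOT Reddit's actual file.
ROBOTS_TXT = """\
User-agent: archive-bot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/technology/"
    for agent in ("archive-bot", "some-ai-scraper"):
        print(agent, is_allowed(agent, url))
```

Note that the check runs entirely on the crawler's side. Nothing in the file stops a scraper that simply skips this step, which is why robots.txt works only for bots that choose to comply.
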
This means that AI startups could still swipe Reddit data and train their models on the sly—even though Reddit's policies explicitly forbid it. This month, Business Insider reported that both OpenAI and Anthropic have been circumventing robots.txt files to scrape websites anyway. It's unclear whether Reddit's Tuesday update directly addresses these firms' methods.

"You may not use content on Reddit as...input for any model training without explicit consent from Reddit. Commercial use of any model trained with Reddit data is prohibited without explicit approval," the company's policies state.

Last month, Reddit hinted that further restrictions and changes were coming in a post on its public content policy. "We see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content," the company says. "Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests."

Reddit has made some data deals of its own. In February, Google and Reddit entered into a $60 million content licensing deal that allows Google to use Reddit's API and lets Reddit use Google's Vertex AI. Reddit responses later began appearing in Google Search AI Overviews, with mixed results.

ChatGPT may also start citing Reddit posts soon, thanks to another official partnership announced last month. It's unclear whether Reddit content will help train OpenAI's next models, but it's possible given AI firms' seemingly endless hunger for new data. Reddit may soon have to be more specific about these arrangements, as the FTC launched an investigation into its licensing of user data in March.

All this comes after Reddit limited access to its API last year, in part to prevent AI companies from scraping its data for free. That prompted a developer revolt, a brief subreddit blackout, and the demise of some popular Reddit clients.
