
Google, OpenAI Heavily Weight News Content in AI Training Without Payment

A study by PCMag parent company Ziff Davis finds that AI developers give high-quality news content extra weight when training their models, yet no business model exists to pay those sites and keep the content flowing.

By Emily Forlini, Senior Reporter


AI giants like Google, OpenAI, and Meta are placing greater emphasis on content from reputable news sources when training large language models, according to a new study by Ziff Davis.

The findings could help the public understand where chatbots get their information and give media companies like Ziff Davis, the Chicago Tribune, News Corp, and The New York Times more leverage when seeking copyright protection or payment for material that's gobbled up by AI.

"Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites," the study says. "Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology."

Ziff Davis is the parent company of PCMag. The study was conducted by the company's lead AI attorney, George Wukoson, and its chief technology officer, Joey Fortuna. It examined open-source replicas of datasets that AI companies have admitted to using, including Common Crawl, C4, OpenWebText, and OpenWebText2.

OpenAI admits to giving more weight to datasets it deems high-quality, including news media, copyrighted books, and links embedded in popular Reddit posts. This is a way of ranking all the content LLMs scrape from the web with the goal of producing better answers for users.

For example, OpenAI gave WebText2 a 22% weight in training GPT-3 even though the dataset accounted for only 3.8% of the tokens. Over 12% of the URLs embedded in OpenWebText2, the open-source version Ziff Davis examined, come from a group of 15 top media publishers, including News Corp, The New York Times, Gannett, Ziff Davis, Vox Media, Axel Springer, Alden Capital, Hearst, The Washington Post, BuzzFeed, Future, IAC, and Bustle.
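To put those two figures in perspective, here is a rough Python sketch, not the study's actual method. The first function divides a dataset's training weight by its share of tokens to show how heavily it is oversampled. The second estimates what fraction of a URL list belongs to a set of publisher domains. The function names, the sample URLs, and the `.example` domains are all hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def oversampling_factor(training_weight: float, token_share: float) -> float:
    """How many times more often a dataset is sampled than its size alone implies."""
    return training_weight / token_share


def publisher_url_share(urls, publisher_domains):
    """Fraction of URLs whose host is, or is a subdomain of, a listed domain."""
    if not urls:
        return 0.0
    hits = 0
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in publisher_domains):
            hits += 1
    return hits / len(urls)


# Figures reported in the article: 22% training weight, 3.8% of tokens.
factor = oversampling_factor(0.22, 0.038)
print(f"WebText2 oversampling factor: {factor:.1f}x")

# Hypothetical URLs and publisher domains, for illustration only.
publisher_domains = {"publisher-a.example", "publisher-b.example"}
sample_urls = [
    "https://www.publisher-a.example/story/1",
    "https://publisher-b.example/review/2",
    "https://blog.hobbyist.example/post/3",
    "https://forum.other.example/thread/4",
]
share = publisher_url_share(sample_urls, publisher_domains)
print(f"Share of sample URLs from listed publishers: {share:.0%}")
```

By this arithmetic, WebText2 is sampled at nearly six times the rate its token count alone would suggest.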

The contents of the datasets also change over time. For example, OpenAI placed a high emphasis on content from The Washington Post in OpenWebText, but decreased its prominence for the release of OpenWebText2.


Ziff Davis says the findings quantify how important the news media is to the future of AI chatbots, even though AI companies have no obligation to pay publishers for it. This "long-running exploitation of high-quality publisher content (extremely lucrative for the LLM companies) [implies] lost licensing revenue from some of the world’s most highly valued companies."

Without payment for content, publishers could be put out of business, threatening the continuous flow of high-quality information in the AI era.

The report comes after a federal judge dismissed a lawsuit against OpenAI from Raw Story and AlterNet, which said the AI company used their content to train LLMs without permission, Reuters reports. A related case filed by The New York Times is ongoing. OpenAI has also signed licensing deals with many top media companies.

OpenAI's latest product launch, ChatGPT search, now cites some of its sources in addition to summarizing the content within them.
