AI companies are looting news websites and researchers documented everything
McGill University in Montreal just dropped a study that is making waves across journalism and tech. Called the AI News Audit, the research led by professors Taylor Owen and Aengus Bridgman from the Centre for Media, Technology and Democracy documented something many people already suspected but few had managed to prove with such clear numbers: large Artificial Intelligence models are using journalistic content on a massive scale, without giving proper credit and without paying a dime to the people who produced it.
Think about it: if a company were caught stealing jewelry or pirating movies, it would probably already be facing a serious lawsuit. But in the world of news and copyright, the consequences are practically nonexistent — at least for now. That is exactly the scenario this Canadian research lays out on the table, backed by hard data, testing the most popular AI models on the planet and showing how journalism is being consumed, reused, and recycled by these platforms without the slightest acknowledgment. The study arrives at a time when the conversation about the future of journalism has never been more urgent. 📰
What the McGill University audit uncovered
The study went well beyond theory and speculation. The researchers designed a fairly straightforward methodology, split into two lines of investigation. The first examined how journalistic content was used to train AI models. The second analyzed how those models cite — or fail to cite — sources when they incorporate web searches into the answers they deliver to users.
To run the tests, they used four of the leading generative Artificial Intelligence models on the market: ChatGPT, Gemini, Claude, and Grok. The dataset included a sample of 2,267 Canadian news articles. The results were, to put it mildly, alarming.
When web search functionality was enabled, 52% of responses contained at least one link to a Canadian news site. Sounds reasonable at first glance, right? But here is the detail that changes everything: the source was named in the body of the response only 28% of the time. In other words, going by that figure, roughly 72% of web-search responses never named a source in the text itself. The model delivered the information as if it were its own knowledge, without mentioning who did the original reporting.
Another revealing finding: when the researchers specifically asked about a story from a particular outlet, the models identified the source between 74% and 97% of the time. This shows the technology is perfectly capable of giving proper credit. The decision not to do so, therefore, is a design choice, as the study itself highlights. AI companies could consistently name their sources but choose not to. 😬
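The gap between linking to an outlet and actually naming it in the text can be made concrete with a small script. What follows is only a rough sketch, not the McGill team's code: the outlet table, the function names, and the sample responses are all made up for illustration, and a real audit would need far more careful matching.

```python
import re
from urllib.parse import urlparse

# Hypothetical domain-to-outlet table, for illustration only.
OUTLETS = {
    "cbc.ca": "CBC",
    "theglobeandmail.com": "The Globe and Mail",
}

URL_PATTERN = re.compile(r"https?://\S+")


def linked_domains(response: str) -> set[str]:
    """Collect the host names of every URL found in a response."""
    domains = set()
    for url in URL_PATTERN.findall(response):
        host = urlparse(url.rstrip(".,;:)")).hostname or ""
        domains.add(host.removeprefix("www."))
    return domains


def classify(response: str) -> tuple[bool, bool]:
    """Return (has_link, names_source) for one chatbot response."""
    has_link = any(domain in OUTLETS for domain in linked_domains(response))
    # Strip URLs first so an outlet name inside a link does not count as credit.
    body = URL_PATTERN.sub("", response)
    names_source = any(
        re.search(rf"\b{re.escape(name)}\b", body, re.IGNORECASE)
        for name in OUTLETS.values()
    )
    return has_link, names_source


def attribution_rates(responses: list[str]) -> tuple[float, float]:
    """Share of responses that link to an outlet, and share that name one."""
    results = [classify(r) for r in responses]
    linked = sum(has_link for has_link, _ in results)
    named = sum(names for _, names in results)
    return linked / len(results), named / len(results)


# Invented sample responses, not data from the study.
SAMPLE = [
    "According to CBC, rents rose last year. https://www.cbc.ca/news/a",
    "Rents rose last year. https://www.theglobeandmail.com/b",
    "Rents rose last year.",
]

link_rate, name_rate = attribution_rates(SAMPLE)
print(f"linked: {link_rate:.0%}, named in text: {name_rate:.0%}")
```

The second sample response is the case the audit flags: it carries a link, but the text itself never says who reported the story.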
The problem goes beyond attribution
Professor Aengus Bridgman explained in an interview that chatbots display journalistic content precisely because that content contains accurate and reliable information. This means the AI companies themselves recognize the enormous value journalism provides. They are using that value in consumer-facing products, and Bridgman argues there should be financial and institutional recognition of that contribution.
Even when links are included in AI-generated summaries, most people simply never click on them. So in practice, AI companies are allowing users to consume the news without ever visiting the sites that produced it. The result is that subscription and advertising revenue stays with the AI platforms, not with the newsrooms that invested in reporting, editing, and publishing.
Bridgman went so far as to suggest that including links in responses may largely be just a credibility-building exercise on the part of chatbots. The underlying message is something like: trust us, look at our sources. But if the user already received the full information in the summary, what motivation is there to click?
Paywalls are being bypassed
The audit also identified cases where AI models cited stories that were protected by paywalls — those barriers that require a subscription to access the full content. This raises a serious concern: the payment mechanisms that block human readers may not be working the same way against automated data collection by AI bots.
The McGill team is already conducting additional research specifically on this paywall-piercing issue. Other independent studies have also found that the technical barriers news sites put up to prevent data scraping by AI companies are largely ignored. It is like putting up a fence around your property and then discovering your neighbor has a helicopter. 🚁
Bridgman noted that AI companies are using different approaches to respond to news-related queries. In some cases, the models act like an ordinary person trying to get informed about a topic. When they hit a paywall, they simply look for the same information for free elsewhere on the internet, piecing together fragments until they can assemble the essentials of a story.
An analogy that says it all about the problem
To illustrate how absurd the situation is, consider a direct comparison. Imagine you wanted to watch a newly released movie without paying. Instead of buying a ticket, you could search for trailers, clips, and scenes posted on social media. With powerful computers, it would be possible to quickly stitch all of that together into an edit that closely resembles the original film.
Now imagine that, without any scruples, you started charging people for that Frankenstein version of the movie, without paying a cent to the people who wrote the screenplay, directed, edited, and acted in the original production.
The logical outcome is predictable: eventually, there would be no more trailers, no more clips, no more new movies. Nobody would keep investing in an industry whose product is systematically stolen and redistributed by third parties who pocket all the profit. That is exactly the risk local and independent journalism faces right now. And this kind of journalism is considered essential for civic literacy and democracy.
Copyright in journalism: a battle that has already begun
The discussion around copyright in the context of Artificial Intelligence is not new, but it has gotten significantly louder over the past two years. Several news organizations around the world have already taken AI companies to court, alleging their content was used without permission to train models that now compete directly against them. The most high-profile case so far involves The New York Times, which sued OpenAI and Microsoft in December 2023, alleging that millions of the newspaper’s articles were used to train ChatGPT.
The McGill University study arrives precisely to reinforce the technical argument behind these legal disputes. Previously, AI companies could claim their models simply learned general language patterns and that there was no way to determine exactly what was or was not used in training. With a methodology that demonstrates models can complete and reproduce information from specific articles, sustaining that argument becomes much harder.
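To give a feel for what "reproducing information from a specific article" can look like in measurable terms, here is one common and very simple technique: counting shared word sequences (n-grams) between an article and a model's output. This is an assumption-laden illustration and not the methodology the study describes; the function names and the example sentences are invented.

```python
import re


def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_share(article: str, model_output: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the article."""
    output_grams = ngrams(model_output, n)
    if not output_grams:
        return 0.0
    return len(output_grams & ngrams(article, n)) / len(output_grams)


# Invented example text, not taken from any real article or model.
ARTICLE = (
    "The city council voted on Tuesday to approve the new transit budget "
    "after months of debate"
)
COPIED = "voted on Tuesday to approve the new transit budget"
UNRELATED = "Rain is expected across the region this weekend with cooler temperatures"

print(overlap_share(ARTICLE, COPIED))
print(overlap_share(ARTICLE, UNRELATED))
```

A high share suggests the output tracks the article's wording closely, while a low share says little on its own, since a model can paraphrase a story without reusing its phrasing.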
The study’s data works almost like a fingerprint, showing that certain content left identifiable traces in the models — which is powerful evidence in any copyright discussion. And the impact does not fall only on large media conglomerates. Smaller outlets, independent news agencies, and freelance journalists also had their work consumed by these models, and they rarely have the resources to defend themselves legally. This creates an enormous asymmetry between those who produce the content and those who profit from it. 📊
What Canada is doing differently
Canada is one of the few countries that has taken concrete steps to tackle this issue. Since 2023, the Online News Act has been in effect, legislation that requires tech giants profiting from news to financially compensate the outlets that produce it.
Google, for example, began paying Canadian publishers 100 million Canadian dollars per year. Meta took a different and more aggressive route: it simply blocked access to news on its platforms in Canada to avoid having to pay. Now, however, there are reports that Meta may be considering paying some outlets, but with a condition: that those same outlets speak out against the legislation designed to protect them. A maneuver that, at the very least, raises serious ethical questions.
After learning about the results of the McGill audit, Canadian Culture Minister Marc Miller stated that the Online News Act is about people paying their fair share, and that this principle does not change with the emergence of AI. He said that having news cannibalized and regurgitated undermines the spirit of how that news is used and the purpose for which it was created, and that a serious conversation needs to happen with the platforms that claim to use it, including AI companies.
And in the United States, where do things stand?
In the United States, a similar piece of legislation called the Journalism Competition and Preservation Act (JCPA) had bipartisan support but stalled in Congress in 2023. Pressure from the tech lobby and its allies has been effective in blocking attempts to guarantee fair compensation for journalism.
Researchers and press freedom advocates argue that a new version of the JCPA is long overdue: one that specifically addresses how AI companies are transforming the way people consume information, and that keeps the local news industry from being suffocated.
To help push that process forward, there is a call for American academics to connect with Owen and Bridgman at McGill, who are willing to share their models and methodologies so similar audits can be conducted in the United States. Research like this may not offer definitive answers to every question surrounding AI and journalism, but it certainly helps build a sharper picture of what is happening.
What changes for journalism from here
The publication of this study by McGill University has the potential to accelerate conversations that were moving in slow motion. Governments and regulators in different countries were already paying closer attention to the question of copyright in the age of Artificial Intelligence, but the speed at which regulations advance tends to be far slower than the speed at which technology evolves. The European Union moved ahead with the AI Act, which demands greater transparency about the data used to train models, but there is still plenty of room for interpretation.
In other countries, the debate is also gaining traction, even if more quietly. The McGill study offers valuable technical arguments for these local discussions, especially because it demonstrates the problem empirically and with reproducible results. When you have hard data, the conversation moves out of the realm of guesswork and into the realm of evidence, which completely changes the dynamics of any negotiation or regulatory process.
For journalism as a profession and as an industry, this moment demands heightened attention. Newsrooms that have not yet established clear policies on how their content can or cannot be used by Artificial Intelligence platforms are, in practice, leaving the door wide open. Some outlets have already started adding specific clauses to their terms of use and blocking the crawler bots used for training data collection. Others are choosing to negotiate licensing deals directly with AI companies, as the Associated Press and Axel Springer did with OpenAI. These are different paths, but they all start from the same recognition: journalistic content has value, and that value needs to be respected. 💡
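For newsrooms that go the blocking route, the usual first step is a robots.txt file that tells specific crawlers to stay out. The sketch below uses Python's standard urllib.robotparser module to check what such a file would permit. The robots.txt contents, the news.example address, and the is_allowed helper are assumptions made for illustration; GPTBot is used as an example of a crawler name an outlet might list. Keep in mind that robots.txt is a voluntary convention: it only works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt (an assumption for illustration): block one AI crawler,
# allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, not a real news site.
STORY_URL = "https://news.example/politics/story"

print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", STORY_URL))
print("Browser allowed:", is_allowed(ROBOTS_TXT, "Mozilla/5.0", STORY_URL))
```

A check like this shows what the rules say, not what a crawler actually does, which is why some outlets pair it with contractual terms or licensing deals.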
The full picture and what is at stake
The McGill professors summed up the situation bluntly in their report: AI companies have built commercial products that depend, in significant part, on the reporting that Canadian journalists produce. They did this without compensation, without attribution, and without any obligation to sustain the infrastructure they are feeding off of. The result is a system that accelerates the economic decline of the very journalism it depends on.
It is a cycle that feeds on itself in a destructive way. AI companies need quality journalistic content to deliver reliable answers. But by consuming that content without giving back, they weaken the newsrooms that produce it. And weaker newsrooms mean less investigative reporting, less local coverage, less diversity of sources — and ultimately, an AI model that will have access to increasingly poor information.
The McGill University study does not solve the problem on its own, but it puts a magnifying glass on it in a way that will be hard to ignore. And amid so much talk about the future of Artificial Intelligence, it is worth remembering that this future is being built, in large part, on the work of journalists who were never asked if they wanted to be part of this story.
The debate is just getting started, and the next chapters promise to be eventful. 🚀
