Tuesday, October 28, 2025

AI25024 AI and Copyright V01 281025

Who will stop the AI data heist taking place in plain sight?


Katie Prescott

The Louvre heist had us gasping at the ingenuity of the jewel thieves. Getting less attention is the repeated daylight robbery online, as a growing industry of data launderers scrape websites, steal copyrighted information and flog it to AI companies.

Far from a daredevil evasion of authority, this Great E-scrape is a grubby, damaging practice happening in plain sight and it needs to be stopped. Fast.

Reddit is the latest business to try to prevent its information being taken by the army of “would-be bank robbers” as it calls them.

It is suing three companies, which it claims hide their identities and locations, and even mimic real people, to dodge its website protections and pilfer at will.

These businesses allegedly worked around Reddit’s technical protections by scraping Google search results — three billion of them — which revealed information from forum posts, comments and discussions.

“Scrape even the toughest sites with market-leading success rates,” Lithuania’s Oxylabs, one of the accused, boasts on its website, advertising its “dynamic CAPTCHA bypassing to ensure uninterrupted access”.

Reddit is a prime target. The website has over 100 million users, discussing all sorts of topics and interests within hundreds of thousands of communities. This web chatter is gold dust for generative AI companies looking for constantly refreshed, real-life human-made information to feed their machines.

The US-listed business has already done deals with Google and OpenAI for its content, so is understandably keen to protect its valuable asset. Yet the response from scraping companies is always the same: they are doing nothing wrong. Hey, the internet is open and should be free for all, right? “No single company owns public data ... and private companies should not be permitted to place limits on who may share and receive information online,” Oxylabs said.

Is scraping legal? No. Not when it bypasses controls, scrapes protected content or collects personal data without a valid basis. Yet it continues and it is so easy. Websites have something known as “robots.txt” files, which tell crawlers and scrapers what they can access but have all the effectiveness of a road sign. They only work if you’re prepared to obey the rules. Primarily, we need a resolution to the sludgy dispute between publishers and AI companies over copyright. After all, there would be no theft without a market for the goods.
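
To see why a robots.txt file is closer to a road sign than a locked door, consider a minimal Python sketch using the standard library’s urllib.robotparser. The rules, the bot name “ExampleAIBot” and the example.com addresses are all invented for illustration and are not taken from Reddit’s or any other site’s actual file. The file can only answer the question “may I fetch this?”; a scraper that never asks is not stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return what the site has asked for. Nothing here enforces the answer."""
    return parser.can_fetch(user_agent, url)


rules = load_rules(ROBOTS_TXT)
thread = "https://example.com/forum/some-thread"

# A well-behaved crawler checks first and stays away when told to.
print(may_fetch(rules, "ExampleAIBot", thread))
# An ordinary visitor falls under the catch-all rules.
print(may_fetch(rules, "OrdinaryBrowser", thread))
print(may_fetch(rules, "OrdinaryBrowser", "https://example.com/private/notes"))
```

Under these assumed rules the three checks should report False, True and False. The check is entirely voluntary: a crawler that skips the call to may_fetch, or announces itself under a different name, meets no resistance from the file itself.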

In its lawsuit, Reddit alleged that Perplexity, an AI-powered search engine, was illegally using its data, and it set a trap which it described as the equivalent of a marked banknote. Putting out a specific piece of information, Reddit found it was used only by Perplexity.

Perplexity punched back in a Reddit post (!) saying it does not train AI models on content and argued that efforts to stop it citing Reddit’s content were “the opposite of the open internet”. If anyone quotes the statement, they could be breaching Reddit rules, Perplexity seemed to mock.

Along with the loss of their IP, publishers face another cost from scrapers: they are forced to implement stronger guardrails to keep them out. Furthermore, the AI-related surge in traffic means more pressure on websites, which have to increase their server power even though these “visitors” are just bots, as Jimmy Wales, founder of the (very open) Wikipedia, said at The Times Tech Summit.

It has been a year since Feryal Clark, the UK’s former AI minister, told me that the AI copyright issue would be resolved by … last Christmas. Instead, we are stuck with an industry mushrooming online, capitalising on the lack of a definitive legal framework.

Speed is of the essence. But don’t expect any 2025 Christmas miracles. Early next spring the UK government is due to lay out its proposals for reform and a litany of legal cases might also slowly help shape new rules in the United States.

In the meantime, sites like Wikipedia and Reddit are there for everyone to look at and, like the Louvre’s stolen crown jewels, are ripe for the picking.

Katie Prescott is Technology Business Editor of The Times
