OpenAI has announced a web crawler called GPTBot, whose job will be to scour the internet for public data to improve artificial intelligence (AI) offerings, specifically the ChatGPT maker's large language models GPT-4 and potentially GPT-5.
The name “web crawler” gives away what the function is — crawling the web.
Web crawler (or spider) bots scan the web for content. “Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results,” explains technology company Cloudflare.
“Web pages crawled with the GPTBot user agent,” says OpenAI, “may potentially be used to improve future models.”
GPTBot will, however, steer clear of “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”
“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI says.
The AI research and deployment company has given publishers and website owners the option either to fully opt out of GPTBot's surveillance or to allow only partial access.
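The opt-out works through a site's robots.txt file. Per OpenAI's documentation, blocking GPTBot entirely looks like this:

```txt
# Disallow GPTBot from crawling any part of the site
User-agent: GPTBot
Disallow: /
```

Partial access is configured the same way, using Allow and Disallow lines for specific directories, with the paths being whatever the site owner chooses.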
Although the option to opt out of web crawling by GPTBot is welcome and suggests a respect for privacy, it does put the onus of taking steps to disable access upon publishers and website owners.
Instead, an opt-in feature, where one is asked for permission, would have been more respectful.
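Publishers who do add such a rule can verify it locally with Python's standard-library robots.txt parser. This is a minimal sketch; the robots.txt content and the example.com URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that blocks GPTBot site-wide
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is blocked; crawlers without a matching entry are unaffected
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/article")) # True
```

Note that robots.txt is purely advisory: it blocks only crawlers that choose to honour it.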
Besides, GPTBot has only now become publicly known. It is unclear whether it, or any other such OpenAI web crawler, has already been collecting information, and if so, for how long: days, months, or years.
OpenAI trains its machine learning models on public web data. This choice has led to questions of ethics and legality.
For one, the aspect of consent for the reuse of information is absent. The source of information isn’t typically highlighted in an ordinary interaction with a chatbot powered by an AI model. A chatbot user also isn't redirected to the source, so the latter doesn't benefit.
In this scenario, a source of information is forced to compete with a platform that rechannels that same information, while also acting as a one-stop shop for any other information necessary, clearly handing the latter the advantage.
“Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site?” asks Alistair Barr, writing for Business Insider.
In addition, some of the information on the web is copyrighted. OpenAI’s free use of copyrighted material (text, images, sounds, videos, and more) to improve its models and grow its revenue therefore becomes a contentious issue, and grounds for claims of copyright infringement.
Comedian Sarah Silverman sued OpenAI for copyright infringement in July, and she is one among several authors who have raised legal objections.
On the other hand, OpenAI and the Associated Press reached a deal in July for the ChatGPT maker to license the New York-based news agency’s archive of news stories.
As you are no doubt aware, Swarajya is a media product that is directly dependent on support from its readers in the form of subscriptions. We do not have the muscle and backing of a large media conglomerate, nor are we playing for the large advertisement sweepstakes.