
OpenAI Announces Its Web Crawler GPTBot, Tells You How To Block The Bot Collecting AI Training Data

GPTBot, OpenAI’s web crawler, will help improve the ChatGPT maker's AI models.

Karan Kamble | Aug 09, 2023, 05:56 PM | Updated 05:56 PM IST


OpenAI has announced a web crawler called GPTBot, whose job will be to scour the internet for public data to improve artificial intelligence (AI) offerings, specifically the ChatGPT maker's large language models GPT-4 and potentially GPT-5.

The name “web crawler” gives away what the function is — crawling the web.

Web crawler (or spider) bots scan the web for content. “Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results,” technology company Cloudflare explains.

“Web pages crawled with the GPTBot user agent,” says OpenAI, “may potentially be used to improve future models.”

GPTBot will, however, steer clear of “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI says.

The AI research and deployment company has given publishers and website owners the option to either opt out of GPTBot's crawling entirely or allow it partial access. OpenAI's documentation for GPTBot explains how to do both.
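For readers who want to see what such an opt-out looks like in practice, here is a minimal sketch in Python. It assumes the opt-out is expressed through a site's robots.txt file with a `User-agent: GPTBot` entry, as OpenAI's documentation describes. The directory names `/public/` and `/private/`, the `example.com` URLs, and the helper `crawler_allowed` are hypothetical and used only for illustration. The script relies on Python's standard `urllib.robotparser` to check how a crawler that honors robots.txt would interpret the rules; it does not show how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Full opt-out: tells the GPTBot user agent to stay away from the whole site.
FULL_OPT_OUT = """\
User-agent: GPTBot
Disallow: /
"""

# Partial access: the directory names are hypothetical placeholders.
PARTIAL_ACCESS = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("full opt-out", FULL_OPT_OUT, "https://example.com/any/page"),
        ("partial, public", PARTIAL_ACCESS, "https://example.com/public/post"),
        ("partial, private", PARTIAL_ACCESS, "https://example.com/private/post"),
    ]
    for label, rules, url in checks:
        print(label, "->", crawler_allowed(rules, "GPTBot", url))
```

Because these rules name only GPTBot, other crawlers, such as those run by search engines, are not affected by them. Note also that robots.txt is a request rather than an enforcement mechanism, so it works only to the extent that the crawler chooses to honor it.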

Although the option to opt out of web crawling by GPTBot is welcome and suggests a respect for privacy, it does put the onus of taking steps to disable access upon publishers and website owners.

Instead, an opt-in feature, where one is asked for permission, would have been more respectful.

Besides, GPTBot has become publicly known only now. It is unclear whether it, or any other OpenAI web crawler, has already been collecting information and, if so, for how long: days, months, or years?

OpenAI trains its machine learning models on public web data. This choice has led to questions of ethics and legality.

For one, the aspect of consent for the reuse of information is absent. The source of information isn’t typically highlighted in an ordinary interaction with a chatbot powered by an AI model. A chatbot user also isn't redirected to the source, so the latter doesn't benefit.

In this scenario, a source of information is forced to compete with a platform that rechannels that same information, while also acting as a one-stop shop for any other information necessary, clearly handing the latter the advantage.

“Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site?” asks Alistair Barr, writing for Business Insider.

In addition, some of the information on the web is copyrighted.

OpenAI’s free use of copyrighted material (text, images, sounds, videos, and so on) to improve its models and grow its revenue therefore becomes a contentious issue, and potential grounds for copyright infringement claims.

Comedian Sarah Silverman sued OpenAI for copyright infringement in July, and she is one of several authors who have taken legal action.

On the other hand, OpenAI and the Associated Press joined hands in July for the ChatGPT maker to license the New York-based news agency’s archive of news stories.
