OpenAI has announced a web crawler called GPTBot, whose job will be to scour the internet for public data to improve artificial intelligence (AI) offerings, specifically the ChatGPT maker's large language models GPT-4 and potentially GPT-5.
The name “web crawler” gives away what the function is — crawling the web.
Web crawler (or spider) bots scan the web for content. “Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results,” explains technology company Cloudflare.
“Web pages crawled with the GPTBot user agent,” says OpenAI, “may potentially be used to improve future models.”
GPTBot will, however, steer clear of “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”
“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI says.
The AI research and deployment company has given publishers and website owners the option either to fully opt out of GPTBot's surveillance or to allow only partial access.
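The opt-out works through a site's robots.txt file. Per OpenAI's documentation, blocking GPTBot entirely looks like this:

```txt
# Disallow GPTBot from crawling any part of the site
User-agent: GPTBot
Disallow: /
```

Partial access is configured the same way, using Allow and Disallow lines for specific directories, with the paths being whatever the site owner chooses.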
Although the option to opt out of web crawling by GPTBot is welcome and suggests a respect for privacy, it does put the onus of taking steps to disable access upon publishers and website owners.
Instead, an opt-in feature, where one is asked for permission, would have been more respectful.
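Publishers who do add such a rule can verify it locally with Python's standard-library robots.txt parser. This is a minimal sketch; the robots.txt content and the example.com URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that blocks GPTBot site-wide
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is blocked; crawlers without a matching entry are unaffected
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/article")) # True
```

Note that robots.txt is purely advisory: it blocks only crawlers that choose to honour it.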
Besides, GPTBot has only now become publicly known. It is unclear whether it, or any other such OpenAI web crawler, has already been collecting information, and if so, for how long: days, months, or years.
OpenAI trains its machine learning models on public web data. This choice has led to questions of ethics and legality.
For one, the aspect of consent for the reuse of information is absent. The source of information isn’t typically highlighted in an ordinary interaction with a chatbot powered by an AI model. A chatbot user also isn't redirected to the source, so the latter doesn't benefit.
In this scenario, a source of information is forced to compete with a platform that rechannels that same information, while also acting as a one-stop shop for any other information necessary, clearly handing the latter the advantage.
“Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site?” asks Alistair Barr, writing for Business Insider.
In addition, some of the information on the web is copyrighted. OpenAI’s free use of copyrighted material (text, images, sounds, videos, and more) to improve its models and grow its revenue therefore becomes a contentious issue, and grounds for claims of copyright infringement.
Comedian Sarah Silverman sued OpenAI for copyright infringement in July, and she is one among several authors who have raised legal objections.
On the other hand, OpenAI and the Associated Press reached a deal in July for the ChatGPT maker to license the New York-based news agency’s archive of news stories.
As you are no doubt aware, Swarajya is a media product that is directly dependent on support from its readers in the form of subscriptions. We do not have the muscle and backing of a large media conglomerate, nor are we playing for the large advertisement sweepstakes.