Major Global Sites Take Steps to Restrict Data Access for OpenAI and Other AI Crawlers

According to the latest data from the artificial intelligence (AI) content detector Originality.AI, about 20% of the world's top 1,000 websites have taken steps to block crawler bots from collecting their web data for use by AI services.

In the absence of clear legal or regulatory rules governing AI's use of copyrighted material, websites of all sizes are taking matters into their own hands. OpenAI launched its GPTBot crawler in early August, announcing that the data collected may be used to improve future models, promising to exclude paywalled content, and giving websites instructions for blocking the crawler. Prominent news sites including The New York Times, Reuters, and CNN subsequently began blocking GPTBot, and many others have followed suit.

According to Originality.AI, the share of the world's top 1,000 most popular websites blocking the ChatGPT bot rose from 9.1% on August 22 to 12% on August 29. Amazon, Quora, and Indeed are among the largest sites blocking it. The data suggests that larger sites are more likely to have taken restrictive measures against AI crawler bots.

Another crawler that regularly collects web data used by some AI services is the Common Crawl Bot, which is blocked by 6.77% of the top 1,000 global websites.

Any page that can be accessed through a web browser can be "crawled" by a crawler program, which operates much like a browser but stores the data in a database rather than displaying it to the user. This is how Google and other search engines gather information. Website owners have always been able to issue directives telling these crawlers to stay away from their sites, but compliance is entirely voluntary, and malicious operators can simply ignore the directives.
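The directives mentioned above are conventionally published in a site's robots.txt file. As a rough illustration only, the sketch below uses Python's standard `urllib.robotparser` to check which crawlers a given robots.txt turns away. The sample file, the `example.com` URL, and the `blocked_agents` helper are hypothetical and not taken from any site named in this article; `CCBot` is assumed here as the user-agent token for the Common Crawl Bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; not copied from any real website.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/article"):
    """Return the user agents that the given robots.txt text disallows for url."""
    parser = RobotFileParser()
    # Parse the text directly so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # GPTBot and CCBot are disallowed; other crawlers fall under the "*" rule.
    print(blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"]))
    # → ['GPTBot', 'CCBot']
```

A check like this only reports what the site has asked for. Nothing in the file stops a crawler that chooses not to consult it, which is why compliance remains voluntary.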

Google and other web companies argue that their data-crawling efforts fall under fair use, but many publishers and intellectual property holders have long objected to the practice, and Google has faced multiple lawsuits as a result. The issue has gained renewed attention with the rise of large language models and generative AI, as AI companies send out their own crawlers to collect data for training their models and feeding their chatbots.

Because Google and other search engines direct users to ad-supported sites, publishers see at least some value in allowing search crawlers access to their sites. In the age of AI, however, publishers are taking more active steps to limit crawlers' access, because for now there is no benefit to handing their data over to AI companies. Many media companies are in talks with AI companies about licensing their data for a fee, but these talks are still in the early stages.

Some media organizations that feel search engines such as Google have taken their content over the past 20 years are wary of OpenAI and other fast-commercializing AI services, saying "we won't fall for it again." OpenAI is expected to bring in more than $1 billion in revenue over the next year, according to The Information.

News media companies are struggling to find the balance between embracing and resisting artificial intelligence. This industry desperately needs to find innovative ways to improve the profitability of its labor-intensive operations. But at the same time, introducing AI into newsroom workflows raises a host of challenging ethical questions.

If too many websites continue to restrict AI crawlers, the owners of these crawlers may find it harder to improve and update their AI offerings, and high-quality data will become more difficult to find. Originality.AI's findings show that the blocking rate of GPTBot among the top 1,000 websites is increasing by about 5% per week.

