Baidu stops Google and Bing from scraping its content.

Baidu, the Chinese online search giant, has taken measures to obstruct content scraping by Google and Microsoft Bing. As the AI industry continues to progress, more companies are expected to reassess their data-sharing strategies, which could lead to further changes in how data is indexed and retrieved on the internet.

This trend is coming to light at a time when generative AI developers worldwide are intensifying their collaborations with content publishers to access high-quality content for their projects. For example, OpenAI recently entered a pact with Time magazine for access to its entire archive, dating back to the magazine’s first issue more than a hundred years ago.

Baidu’s move is a response to the rising demand for extensive datasets to train AI models and applications, and it is akin to measures taken by other firms to safeguard their online content. In July, Reddit barred several search engines, with the exception of Google, from indexing its posts and discussions. Google has a paid agreement with Reddit for data access to train its AI services.

The move by Baidu to limit major search engines from accessing its Baidu Baike content underscores the escalating significance of data in the age of AI. With firms pouring substantial resources into AI development, the value of extensive, well-organised datasets has surged markedly. This has changed the way web platforms control access to their material, with many choosing to restrict, or charge for, access to their data.

The change was spotted in the latest modification to the Baidu Baike robots.txt file, which now blocks the Googlebot and Bingbot crawlers.
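The exact contents of the updated file have not been published, but a robots.txt that denies only these two crawlers while leaving all others unaffected would typically look like the following illustrative sketch (not the actual Baidu Baike file):

```
# Deny Google's crawler site-wide
User-agent: Googlebot
Disallow: /

# Deny Microsoft Bing's crawler site-wide
User-agent: bingbot
Disallow: /

# All other crawlers remain allowed (empty Disallow = no restriction)
User-agent: *
Disallow:
```

By convention, a crawler obeys the most specific User-agent group that matches it, so Googlebot and bingbot are denied everything while other agents fall through to the unrestricted wildcard group.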

Snapshots on the Wayback Machine suggest that this modification happened on August 8. Before that, Google and Bing had permission to index Baidu Baike’s main repository of nearly 30 million entries, although certain targeted subdomains on the site were restricted.
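How such a rule change takes effect for well-behaved crawlers can be sketched with Python's standard-library robots.txt parser. The rules string and URL below are hypothetical, mirroring the kind of block described (Googlebot and bingbot disallowed, everyone else allowed); they are not the actual Baidu Baike file.

```python
# Sketch: evaluating hypothetical robots.txt rules with the stdlib parser.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
parser.modified()  # mark the rules as loaded so can_fetch() evaluates them

# A compliant crawler consults these rules before fetching any page.
url = "https://baike.baidu.com/item/example"  # hypothetical entry URL
print(parser.can_fetch("Googlebot", url))     # blocked site-wide
print(parser.can_fetch("bingbot", url))       # blocked site-wide
print(parser.can_fetch("SomeOtherBot", url))  # allowed via "User-agent: *"
```

Note that robots.txt is purely advisory: it only stops crawlers that choose to honour it, which is why cached or previously indexed content can linger in search results.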

According to reports, over the past year Microsoft has considered limiting the availability of its internet-search data to rival search-engine providers, particularly those that use the data for chatbots and generative AI services.

In contrast, the Chinese-language Wikipedia, with 1.43 million entries, remains accessible to search-engine crawlers. A check by the South China Morning Post found that Baidu Baike entries still appear in both Bing and Google search results, possibly because the search engines are serving older cached content.

OpenAI struck a similar content partnership with the Financial Times in April.