Google has outlined plans to turn the Robots Exclusion Protocol (REP), better known as robots.txt, into an internet standard after 25 years. To that end, it has also published the C++ robots.txt parser that underpins its Googlebot web crawler on GitHub, where anyone can access it.
“We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers,” Google said. “Together with the original author of the protocol, webmasters, and other search engines, we’ve documented how the REP is used on the modern web, and submitted it to the IETF.”
The REP is one of the cornerstones of web search engines, and it helps website owners manage their server resources more easily. Web crawlers — like Googlebot — are how Google and other search engines routinely scan the internet to discover new web pages and add them to their list of known pages.
Crawlers are also used by sites like the Wayback Machine to periodically collect and archive web pages, and can be designed with an intent to scrape data from specific websites for analytics purposes.
A website’s robots.txt file tells automated crawlers which content to scan and which to exclude, keeping low-value pages from being indexed and served. It can also bar crawlers from folders that hold confidential information and keep those files from being indexed by search engines.
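As a rough illustration of how such a file is consulted, the sketch below uses Python's standard-library `urllib.robotparser` rather than Google's C++ parser, so its matching behavior may differ from Googlebot's. The robots.txt contents, the `ExampleBot` user agent, and the `example.com` URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a crawler calling itself "ExampleBot" may fetch each path.
for path in ("/index.html", "/private/report.pdf", "/search/results"):
    allowed = parser.can_fetch("ExampleBot", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

A well-behaved crawler would run a check like this before each request. The file is a request to crawlers, not an access control, so it only works with crawlers that choose to honor it.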
In open-sourcing the parser used to decipher robots.txt files, Google aims to eliminate confusion by establishing a standardized syntax for writing and parsing rules.
“This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly,” Google wrote in a blog post.
It said the library will help developers build their own parsers that “better reflect Google’s robots.txt parsing and matching.”
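To give a sense of what "parsing and matching" involves, here is a minimal sketch of a home-grown matcher. It is not Google's code and deliberately ignores user-agent groups and wildcards. It assumes the rule, as described in the REP draft, that the most specific (longest) matching path wins and that a tie goes to "allow". The function names `parse_rules` and `is_allowed` are hypothetical.

```python
def parse_rules(robots_txt):
    """Collect (directive, path) pairs, treating the file as one rule group."""
    rules = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lower()
        if key in ("allow", "disallow"):
            rules.append((key, value.strip()))
    return rules


def is_allowed(rules, path):
    """Apply longest-match semantics; a path with no matching rule is allowed."""
    best_len, verdict = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len, verdict = len(prefix), directive == "allow"
    return verdict


rules = parse_rules("User-agent: *\nDisallow: /docs/\nAllow: /docs/public/\n")
print(is_allowed(rules, "/docs/public/a.html"))
print(is_allowed(rules, "/docs/secret.html"))
```

Even this toy version has to make decisions, such as how to break ties and what an empty rule means, which is the kind of ambiguity a shared reference parser is meant to settle.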
The robots.txt standard is currently in its draft stage, and Google has requested feedback from developers. The standard will be modified as web creators specify “how much information they want to make available to Googlebot, and by extension, eligible to appear in Search.”