White House Web Site Now ‘Crawler’ Friendly

The minute President Obama assumed office, before he had even taken his oath, WhiteHouse.gov was updated to reflect the new executive. As expected, the new WhiteHouse.gov incorporated some of the tools used online throughout the campaign. In addition, WhiteHouse.gov became dramatically more accessible to search engines with an update to its robots.txt file.

Robots.txt is a standard way for Web sites to tell search engines' Web crawlers which pages they may crawl and index for search purposes. In 2007, our Hiding in Plain Sight report noted that many federal Web sites abused the robots.txt file to hide content from search engines, making it hard for users to find federal information online. One of the worst offenders was the White House Web site itself, with almost 2,400 specific (and chuckle-worthy) exclusions from the search index. In comparison, the new robots.txt file has just two lines and excludes almost none of the site's content from Web crawlers' reach: the only folder blocked is the 'includes' folder, which generally holds files that are assembled into other pages rather than viewed on their own.
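As a rough illustration of what that means in practice, the short sketch below uses Python's standard urllib.robotparser module to show how a crawler would read a two-line robots.txt that blocks only an 'includes' folder. The directives and URLs are inferred from the description above for illustration, not quoted from WhiteHouse.gov itself.

    # A minimal sketch, assuming the new file simply allows everything
    # except /includes/ (an assumption based on the description above);
    # this is not code or content from WhiteHouse.gov itself.
    from urllib.robotparser import RobotFileParser

    ROBOTS_TXT = [
        "User-agent: *",          # applies to every crawler
        "Disallow: /includes/",   # the lone exclusion described above
    ]

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT)

    # Ordinary content pages may be crawled and indexed...
    print(parser.can_fetch("ExampleBot", "http://www.whitehouse.gov/agenda/"))
    # ...while support files under /includes/ are off limits.
    print(parser.can_fetch("ExampleBot", "http://www.whitehouse.gov/includes/header.html"))

Run as written, the first check prints True and the second prints False: everything outside the 'includes' folder is fair game for crawlers, which is what makes the new file so much more open than the old one.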

The widespread abuse of robots.txt on federal government Web sites is a questionable practice that limits the availability of government information. We applaud the White House for stepping up its commitment to transparency and setting a good example for other federal Web sites to follow.