Resource Access

robots.txt

The robots.txt file allows websites to tell clients which resources they can or cannot access. This file is managed by an administrator and must be published at the service's root path (e.g. https://example.com/robots.txt). We regularly retrieve this file and parse it according to the Robots Exclusion Protocol specification. The file may be cached for a minimum of 24 hours, though other HTTP caching properties will still be respected.
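
As a rough illustration of this retrieval and caching step, the sketch below uses Python's standard urllib.robotparser with a simple in-memory cache. The helper name, cache structure, and timeout are assumptions, and validation against HTTP caching headers is omitted for brevity.

  import time
  import urllib.request
  import urllib.robotparser

  MIN_CACHE_SECONDS = 24 * 60 * 60  # reuse a parsed file for at least 24 hours

  _cache = {}  # origin -> (fetched_at, parser)

  def get_robots(origin):
      """Fetch and parse robots.txt for an origin such as 'https://example.com'."""
      cached = _cache.get(origin)
      if cached and time.time() - cached[0] < MIN_CACHE_SECONDS:
          return cached[1]  # still within the minimum cache window

      parser = urllib.robotparser.RobotFileParser()
      with urllib.request.urlopen(origin + "/robots.txt", timeout=10) as response:
          parser.parse(response.read().decode("utf-8", errors="replace").splitlines())

      _cache[origin] = (time.time(), parser)
      return parser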

All client requests are evaluated against a service's robots.txt policies. If the file is unavailable (for example, the server reports that it does not exist), then we assume any resource may be accessed.
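
A minimal sketch of that evaluation, reusing the hypothetical get_robots helper above; the user-agent string is likewise an assumption.

  import urllib.error
  from urllib.parse import urlsplit

  USER_AGENT = "ExampleClient/1.0"  # assumed client identifier

  def is_allowed(url):
      """Evaluate a request URL against the origin's robots.txt policies."""
      parts = urlsplit(url)
      try:
          parser = get_robots(f"{parts.scheme}://{parts.netloc}")  # hypothetical helper above
      except urllib.error.HTTPError as err:
          if 400 <= err.code < 500:
              return True  # file unavailable (e.g. 404): any resource may be accessed
          raise            # other failures are covered under Service Unreachable below
      return parser.can_fetch(USER_AGENT, url)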

Dropped Requests

Rule Match

If a client request matches the pattern of a Disallow rule, then the request will not be performed and a local error will be returned instead.
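
For illustration, such a request could be short-circuited with a locally raised error, as in the sketch below; the error type and helper names are assumptions.

  import urllib.request

  class RobotsDisallowedError(Exception):
      """Local error used in place of a dropped network request."""

  def fetch(url):
      """Perform the request only if no Disallow rule matches it."""
      if not is_allowed(url):  # hypothetical helper from the sketch above
          raise RobotsDisallowedError(f"robots.txt disallows fetching {url}")
      with urllib.request.urlopen(url, timeout=10) as response:
          return response.read()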

Service Unreachable

If we encounter a problem retrieving the robots.txt file, then we treat it as if a Disallow: / rule were in place and perform no requests (see the sketch after the list below). Retrieval problems may include:

  • Network Errors (e.g. name resolution, connectivity)
  • HTTP Errors (e.g. server-related status codes)
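
The sketch below shows one way this fallback could look. The function name is hypothetical; the allow_all and disallow_all attributes are the flags urllib.robotparser itself uses to represent blanket allow and disallow policies.

  import urllib.error
  import urllib.request
  import urllib.robotparser

  def parser_or_disallow(origin):
      """Fall back to a blanket Disallow: / policy when retrieval fails."""
      parser = urllib.robotparser.RobotFileParser()
      try:
          with urllib.request.urlopen(origin + "/robots.txt", timeout=10) as response:
              parser.parse(response.read().decode("utf-8", errors="replace").splitlines())
      except urllib.error.HTTPError as err:
          if 400 <= err.code < 500:
              parser.allow_all = True      # unavailable file: any resource may be accessed
          else:
              parser.disallow_all = True   # HTTP error (e.g. 5xx): perform no requests
      except OSError:
          parser.disallow_all = True       # network error: name resolution, connectivity
      return parser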