Resource Access
robots.txt
The robots.txt file allows websites to tell clients which resources they can or cannot access. This file is managed by an administrator and must be published at the service's root path (e.g. https://example.com/robots.txt). We regularly retrieve this file and parse it according to the Robots Exclusion Protocol specification. The file may be cached for a minimum of 24 hours, but any other HTTP caching directives on the response are also respected.
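As an illustration of that caching behaviour, the sketch below (Python, with illustrative names) computes how long a fetched robots.txt may be reused: at least 24 hours, extended when the response advertises a longer lifetime via Cache-Control: max-age. The interplay with other caching headers is simplified here.

```python
import re

MIN_TTL_SECONDS = 24 * 60 * 60  # 24-hour minimum cache lifetime

def cache_ttl(cache_control: str | None) -> int:
    """Return how long (in seconds) a fetched robots.txt may be reused."""
    if cache_control:
        match = re.search(r"max-age=(\d+)", cache_control)
        if match:
            # Honour a longer HTTP-advertised lifetime, never a shorter one.
            return max(MIN_TTL_SECONDS, int(match.group(1)))
    return MIN_TTL_SECONDS
```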
All client requests are evaluated against a service's robots.txt policies. If the file is unavailable (for example, it does not exist), we assume any resource may be accessed.
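A minimal sketch of this evaluation, using Python's standard urllib.robotparser; robots_body, USER_AGENT, and the function names are illustrative, and a missing file is represented by passing None, which yields a policy that allows every resource.

```python
from urllib import robotparser

USER_AGENT = "ExampleClient/1.0"  # illustrative user-agent token

def build_policy(robots_body: str | None) -> robotparser.RobotFileParser:
    policy = robotparser.RobotFileParser()
    # Parsing an empty rule set makes can_fetch() allow every resource,
    # matching the behaviour when the file is unavailable.
    policy.parse([] if robots_body is None else robots_body.splitlines())
    return policy

def is_allowed(policy: robotparser.RobotFileParser, url: str) -> bool:
    # can_fetch applies the parsed Allow/Disallow rules for our user agent.
    return policy.can_fetch(USER_AGENT, url)
```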
Dropped Requests
Rule Match
If a client request matches the pattern of a Disallow rule, the request will not be performed and a local error will be returned instead.
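The drop behaviour might look like the following sketch, where RobotsDisallowedError and USER_AGENT are illustrative names and the policy argument is a urllib.robotparser.RobotFileParser holding the service's parsed rules.

```python
from urllib import robotparser

USER_AGENT = "ExampleClient/1.0"  # illustrative user-agent token

class RobotsDisallowedError(Exception):
    """Local error raised when a Disallow rule blocks the request."""

def perform_request(policy: robotparser.RobotFileParser, url: str) -> None:
    # Drop the request locally when the URL matches a Disallow rule.
    if not policy.can_fetch(USER_AGENT, url):
        raise RobotsDisallowedError(f"robots.txt disallows {url}")
    # Otherwise the real HTTP request would be issued here.
```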
Service Unreachable
If we encounter a problem retrieving the robots.txt file, we treat it as a Disallow: / rule and perform no requests for the service (a sketch of this fallback follows the list below). Retrieval problems may include:
- Network Errors (e.g. name resolution, connectivity)
- HTTP Errors (e.g. server-related status codes)
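A minimal sketch of this fallback, assuming Python's standard urllib modules; the URL, timeout, and helper name are illustrative, and the status-code classification follows the behaviour described above: server-related errors and network errors collapse to Disallow: /, while an unavailable file allows everything.

```python
from urllib import error, request, robotparser

def load_policy(robots_url: str) -> robotparser.RobotFileParser:
    policy = robotparser.RobotFileParser()
    try:
        with request.urlopen(robots_url, timeout=10) as response:
            policy.parse(response.read().decode("utf-8").splitlines())
    except error.HTTPError as exc:
        if exc.code >= 500:
            # Server-related status code: treat as Disallow: /.
            policy.parse(["User-agent: *", "Disallow: /"])
        else:
            # File unavailable (e.g. 404): any resource may be accessed.
            policy.parse([])
    except error.URLError:
        # Network error (name resolution, connectivity): treat as Disallow: /.
        policy.parse(["User-agent: *", "Disallow: /"])
    return policy
```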