How to exclude and include URLs in Site Audit?
When do you need to use it?
The "Include and Exclude URLs" feature in the Site Audit settings lets you customize website crawling without modifying your robots.txt rules. Through a simple interface, you can specify which pages or domains to include in, or exclude from, the crawling process.
Setting up rules for crawling
Accessing the feature
To access the "Include and Exclude URLs" feature, navigate to the Site Audit section of your project settings and look for the "Include and Exclude URLs" option.
You can create rules in two categories:
- Include rules: Define which pages should be included in the crawl.
- Exclude rules: Specify which pages should be explicitly excluded from the crawl.
How rules interact
- Independent rules: When multiple rules are set in either the "Include" or "Exclude" category, they operate independently: the system applies each rule separately to determine which URLs to crawl or exclude.
- Priority of exclusion: If URLs fall under both "Include" and "Exclude" rules, the exclusion rules take precedence to ensure precise control over the crawling scope.
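The interaction above can be sketched in a few lines of Python. This is an illustrative model only, not the product's actual implementation; the function name and the simple substring matching are assumptions made for the example.

```python
def should_crawl(url: str, include_rules: list[str], exclude_rules: list[str]) -> bool:
    """Decide whether a URL is crawled, modeling the precedence described above."""
    # Exclusion takes precedence: a URL matching any exclude rule is skipped,
    # even if it also matches an include rule.
    if any(rule in url for rule in exclude_rules):
        return False
    # Include rules act as a filter: if any are defined, only matching URLs
    # are crawled; with no include rules, everything not excluded is crawled.
    if include_rules:
        return any(rule in url for rule in include_rules)
    return True

# A URL matched by both an include and an exclude rule is not crawled.
print(should_crawl("https://example.com/blog/post", ["/blog/"], ["/blog/"]))  # False
print(should_crawl("https://example.com/blog/post", ["/blog/"], []))          # True
```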
Types of rules available
The rules have been simplified to the following match types:
- Contains
- Equals
- Starts with
- Ends with
- Robots.txt rule: Users can apply robots.txt syntax for precise control, as detailed in Google's Robots.txt Specifications.
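The four simple match types can be illustrated as follows. This is a hedged sketch of the likely matching semantics; the product's actual behavior (for example, case sensitivity or URL normalization) may differ.

```python
def matches(url: str, pattern: str, match_type: str) -> bool:
    """Check a URL against one rule, for the four simple match types."""
    if match_type == "contains":
        return pattern in url
    if match_type == "equals":
        return url == pattern
    if match_type == "starts with":
        return url.startswith(pattern)
    if match_type == "ends with":
        return url.endswith(pattern)
    raise ValueError(f"Unknown match type: {match_type}")

url = "https://example.com/blog/post.html"
print(matches(url, "/blog/", "contains"))                   # True
print(matches(url, ".html", "ends with"))                   # True
print(matches(url, "https://example.com", "starts with"))   # True
print(matches(url, "/blog/", "equals"))                     # False
```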
Changes to crawling behavior
- Inclusion as a filter: Specifying "Include" rules acts as an additional filter to the domain scope, meaning only the URLs matching your defined rules are crawled.
- Exclusion for specificity: By setting "Exclude" rules, you signal the crawler to omit those URLs, enhancing the focus of your site audit.
- Respecting robots.txt: The feature honors the "Respect robots.txt rules" setting: if it is enabled, any "Include" rules that conflict with robots.txt are ignored; if it is disabled, those rules are applied. In other words, robots.txt rules have the highest priority.
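The priority order described above can be sketched using Python's standard robots.txt parser. The function name and the substring-based include matching are assumptions for illustration; only the robots.txt parsing uses a real library API.

```python
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(url: str, robots_txt: str, include_rules: list[str],
                     respect_robots: bool) -> bool:
    """Model how robots.txt outranks Include rules when the setting is enabled."""
    if respect_robots:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        # robots.txt has the highest priority: a disallowed URL is skipped
        # even when an Include rule explicitly matches it.
        if not parser.can_fetch("*", url):
            return False
    return any(rule in url for rule in include_rules)

robots = "User-agent: *\nDisallow: /private/"
url = "https://example.com/private/page"
print(allowed_to_crawl(url, robots, ["/private/"], respect_robots=True))   # False
print(allowed_to_crawl(url, robots, ["/private/"], respect_robots=False))  # True
```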
Migration of old projects
Some existing projects, particularly those with custom robots.txt settings and the "Respect robots.txt rules" checkbox disabled, will be migrated to the new rules system for improved accuracy and performance.