How to control Sitechecker's Web Crawler?
"Crawler" is a generic term for any program (such as a robot or spider) that is used to automatically discover and scan websites by following links from one webpage to another. Sitechecker's Web Crawler doesn't crawl all websites on the internet. It crawls only websites and pages that users requested to scan.
Parameters of Sitechecker's Web Crawler:
- User-Agent: SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)
- The current list of IPs we use: https://crawler.bmp.rocks/crawlerInfo/getIpList
Tools on the Sitechecker platform that use SiteCheckerBotCrawler:
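If you want to check whether an address from your server logs belongs to the crawler, a comparison against the published list could look roughly like the sketch below. The response format of the IP list URL is not described here, so the sketch assumes you have already copied the entries into a plain list of strings; the function name is_crawler_ip and the sample addresses are placeholders, not real crawler IPs.

```python
import ipaddress


def is_crawler_ip(client_ip: str, crawler_ips: list[str]) -> bool:
    """Return True if client_ip matches any listed address or CIDR range."""
    address = ipaddress.ip_address(client_ip)
    for entry in crawler_ips:
        # strict=False lets a bare address such as "192.0.2.10" act as a
        # single-host network, so plain IPs and CIDR ranges are both accepted.
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder entries from the reserved documentation ranges, NOT the real list.
example_list = ["192.0.2.10", "198.51.100.0/24"]

print(is_crawler_ip("198.51.100.42", example_list))  # True
print(is_crawler_ip("203.0.113.5", example_list))    # False
```

Always take the real entries from the list linked above, since the addresses may change over time.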
- Site Audit
- Site Monitoring
- On-Page Checker/Page Details report
How SiteCheckerBotCrawler scans your website
SiteCheckerBotCrawler's crawling process starts with a user's request to crawl a specific domain or URL.
In the On-Page Checker and the Page Details report, SiteCheckerBotCrawler scans only the specified URL and its internal and external links.
In Site Audit, SiteCheckerBotCrawler scans all URLs it finds on the website, starting from the homepage. So, if your website has a page with no internal links from other pages, the crawler won't detect that page (unless it is in the website's sitemap or in your Google Search Console).
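To illustrate why a page without internal links stays undiscovered, here is a minimal sketch of link-following discovery. It is not Sitechecker's actual implementation: the function discover_pages, the breadth-first order, and the sample site map are assumptions made for illustration only.

```python
from collections import deque


def discover_pages(link_graph: dict[str, list[str]], start: str) -> set[str]:
    """Follow internal links breadth-first from `start` and return every page reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: each key is a page, each value lists the pages it links to.
site = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/post-1", "/"],
    "/pricing": [],
    "/blog/post-1": [],
    "/old-landing": ["/"],  # nothing links TO this page, so it is an orphan
}

found = discover_pages(site, "/")
orphans = set(site) - found

print(sorted(found))    # ['/', '/blog', '/blog/post-1', '/pricing']
print(sorted(orphans))  # ['/old-landing']
```

In this sketch /old-landing links out to the homepage, but no page links to it, so a crawl that starts from the homepage never reaches it.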
How to block SiteCheckerBotCrawler from scanning your website
There are a few ways to block SiteCheckerBotCrawler:
1. Block using robots.txt file
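Add a rule to the robots.txt file of your website that disallows the SiteCheckerBotCrawler user agent (a User-agent: SiteCheckerBotCrawler line followed by Disallow: /).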
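Before publishing a blocking rule, you can test it locally with Python's standard robots.txt parser. The sketch below is only an illustration: the rule text uses standard robots.txt syntax and is an assumption rather than an official Sitechecker snippet, and the helper is_allowed, the second user agent, and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

BOT_UA = "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"

# Assumed rule in standard robots.txt syntax: block the bot from the whole site.
BLOCK_RULES = """\
User-agent: SiteCheckerBotCrawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(BLOCK_RULES, BOT_UA, "https://example.com/any-page"))              # False
print(is_allowed(BLOCK_RULES, "SomeOtherBot/2.0", "https://example.com/any-page"))  # True
```

The rule only affects the named user agent; other crawlers keep whatever access the rest of your robots.txt gives them.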
2. Block using .htaccess file
Add a rule to the .htaccess file of your website that denies requests from the SiteCheckerBotCrawler user agent. If the rule contains yourdomain.com, don't forget to replace it with your own domain!
You can also block the bot by IP address. Check this guide to learn more about how to block bots via the .htaccess file.
3. Block using the firewall
If you use a web application firewall (WAF) to manage your incoming traffic, block SiteCheckerBotCrawler by creating a specific rule in the WAF. This guide is a good example of how to block bots using the Cloudflare Firewall.
How to allow SiteCheckerBotCrawler to scan your website
To allow SiteCheckerBotCrawler to scan the website, make sure that our bot isn't blocked by any of the methods described above.
1. Check the website's robots.txt file
Make sure that there is no disallow rule for the SiteCheckerBotCrawler user agent. If such a rule exists, change it so that it allows the bot (for example, replace Disallow: / with Allow: /).
2. Check the website's .htaccess file
Make sure that SiteCheckerBotCrawler isn't blocked in the .htaccess file by user agent or IP address. If you find that the bot is blocked, delete this rule. If you don't know how to work with the .htaccess file, contact your web developer or hosting provider.
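One way to confirm that a robots.txt file lets the bot through is to run it through Python's standard robots.txt parser. The sketch below assumes a file that blocks all other bots but contains an explicit allow group for SiteCheckerBotCrawler; the rule text, the helper can_crawl, and the example.com URL are illustrative assumptions, not an official snippet.

```python
from urllib.robotparser import RobotFileParser

BOT_UA = "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"

# Assumed robots.txt: everything is closed to bots in general,
# but SiteCheckerBotCrawler gets its own group that allows the whole site.
ALLOW_RULES = """\
User-agent: SiteCheckerBotCrawler
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl(ALLOW_RULES, BOT_UA, "https://example.com/any-page"))              # True
print(can_crawl(ALLOW_RULES, "SomeOtherBot/2.0", "https://example.com/any-page"))  # False
```

To test your live file instead, paste its contents into the ALLOW_RULES string and run the same check.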
3. Check rules in a web application firewall (if you are using one)
Make sure that there is no rule in the web application firewall (WAF) that blocks SiteCheckerBotCrawler requests to the website. Here, too, the bot can be blocked by user agent as well as by IP address. If you don't know how to work with the WAF, contact the support team of that service so they can delete the rule blocking SiteCheckerBotCrawler for you.
How to allow SiteCheckerBotCrawler/1.0 on Cloudflare
1. Log in to your Cloudflare account.
2. Select the account associated with the website.
3. Select WAF.
4. Go to the Firewall rules tab.
5. Create a new Access Rule.
6. Set the action of the rule to ‘Allow’.
7. Select “User Agent” as the match criterion and enter our user agent string “SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)” into the Match Value field.
8. Set the Priority value to 3 (Medium) or above so that this rule is applied correctly relative to any other rules you have already set up to restrict robots or crawlers.
9. Click “Save” at the bottom of the page when done.
10. Crawl your website with Sitechecker to confirm that your adjustments were successful.
If everything worked and the rule lets our requests through, SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro) will now be able to crawl your website!
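As a quick check before starting a full crawl, you can send a single request that carries the bot's user agent and look at the status code. This is only a rough sketch: the helper names, the example.com URL, and the set of status codes treated as blocked are assumptions. Because the request comes from your own machine, it tests user agent rules only, not rules based on the crawler's IP addresses.

```python
import urllib.error
import urllib.request

BOT_UA = "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for_user_agent(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send one request and return the HTTP status code the server answers with."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises 4xx/5xx answers as exceptions; the code is still what we want.
        return error.code


def looks_blocked(status: int) -> bool:
    """Heuristic (an assumption): these codes often mean a firewall or rule refused the request."""
    return status in {401, 403, 429, 503}


# Example usage (performs a real network request, so it is left commented out):
# code = status_for_user_agent("https://example.com/", BOT_UA)
# print(code, "blocked" if looks_blocked(code) else "allowed")
```

A 200 response here suggests that no user agent rule is rejecting the bot; the Sitechecker crawl itself remains the final confirmation.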