In your settings, you can choose the following kinds of block:
Your site returns 403 Forbidden to IP addresses associated with a given bot or company.
Hard block obeys the indexed directory exclusion (see below).
Generally, only the IP addresses associated with the crawler bots are blocked. For Google and Amazon, however, all of their addresses are blocked, including Google Cloud, AWS, and other services.
The "Others" include the following lists: ahrefsbot betteruptimebot bunnycdn cloudflare duckduckbot facebookbot freshpingbot imagekit imgix marginalia mojeekbot molliewebhook outageowl pingdombot rssapi stripewebhook telegrambot twitterbot uptimerobot webpagetestbot.
When hard block is active, soft block is also applied, just in case.
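To make the mechanism concrete, here is a minimal sketch of an IP-based hard block, assuming a simplified request handler and a hypothetical blocklist of CIDR ranges (the addresses below are documentation placeholders, not the real lists):

```python
import ipaddress

# Hypothetical placeholder ranges; the real lists are maintained per bot/company
# and, for Google and Amazon, cover all of their addresses.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),     # placeholder crawler range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder cloud-provider range
]

def is_hard_blocked(client_ip: str) -> bool:
    """True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

def handle_request(client_ip: str) -> tuple[int, str]:
    """Blocked addresses receive 403 Forbidden; everyone else is served normally."""
    if is_hard_blocked(client_ip):
        return 403, "Forbidden"
    return 200, "OK"
```

The indexed-folder exception to this check is sketched at the end of this section.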
The soft block is currently implemented as a standard robots.txt.
Essentially, your site requests that the specified robots ignore everything except the indexed folder (see below).
The soft block helps your site avoid suspicion of what search companies call "deceptive hiding", in case you care about their rankings.
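As a rough sketch, a generated robots.txt of this kind could be built like this, assuming a couple of hypothetical user-agent tokens (the real list is driven by your settings):

```python
# Hypothetical user-agent tokens for illustration only; the actual list follows
# the options chosen in your settings.
BLOCKED_AGENTS = ["ExampleBot", "AnotherCrawler"]

def generate_robots_txt(agents: list[str]) -> str:
    """Ask each listed robot to skip everything except the indexed folder."""
    lines = []
    for agent in agents:
        lines += [
            f"User-agent: {agent}",
            "Disallow: /",
            "Allow: /indexed/",
            "",
        ]
    return "\n".join(lines)

print(generate_robots_txt(BLOCKED_AGENTS))
```

Per RFC 9309, the longer Allow: /indexed/ rule takes precedence over Disallow: /, so compliant crawlers may still fetch the indexed folder.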
Caution: a single instance of robots.txt is served for the entire site.
The robots.txt that you upload to your files will never be used, because the generated version (controlled by your settings) takes precedence.
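A small sketch of that precedence, assuming a hypothetical routing step: requests for /robots.txt are answered from the generated text and never reach your uploaded files.

```python
from typing import Callable

def serve(path: str, generated_robots: str,
          read_uploaded_file: Callable[[str], str]) -> str:
    """Answer /robots.txt from the generated version; uploads are ignored for it."""
    if path == "/robots.txt":
        return generated_robots       # settings-controlled version takes precedence
    return read_uploaded_file(path)   # everything else is served from your files
```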
Caution: AI companies, especially in their early stages, are known to ignore robots.txt.
The battle for user data goes so far that many companies try to circumvent IP blocklists, not to mention robots.txt.
Details: [1], [2].
A special folder named indexed is created in the root of your file tree. Crawling is allowed inside this folder.
For example, suppose you have Facebook blocked via the "Others" radio button in your settings. You can still place a social preview for Facebook in the indexed folder, bypassing the block.
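To tie this back to the hard block, here is a sketch of how the indexed exclusion could be honored, reusing the hypothetical is_hard_blocked check from the earlier example:

```python
def handle_request(client_ip: str, path: str) -> tuple[int, str]:
    """Even hard-blocked addresses may fetch content under /indexed/."""
    if is_hard_blocked(client_ip) and not path.startswith("/indexed/"):
        return 403, "Forbidden"
    return 200, f"(serving {path})"  # placeholder for normal file serving
```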