About our spider

Openindex operates a web crawling cluster for research and development of universal and focused search engines. The cluster runs several enhanced Apache Nutch crawlers on Apache Hadoop.

Crawler ethics

Our crawler aims to follow general crawler ethics: politeness, adherence to the robots exclusion standard, and identification. We aim to send no more than one HTTP request to the same host every few seconds. We also respect the Crawl-delay directive. The spiders can be identified by the following HTTP User-Agent string:

Mozilla/5.0 (compatible; OpenindexSpider; +https://www.openindex.io/saas/about-our-spider/)
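If you want to pick out our spider's requests in your own logs or request handlers, a minimal sketch could look like the following. The helper name is_openindex_spider is hypothetical and not part of any Openindex software. It only checks for the OpenindexSpider token in the User-Agent header, which any client can forge, so treat a match as a hint rather than proof.

```python
import re

# Product token taken from the User-Agent string shown above.
SPIDER_TOKEN = "OpenindexSpider"

# Match the token as a whole word, ignoring case.
_SPIDER_RE = re.compile(r"\b" + re.escape(SPIDER_TOKEN) + r"\b", re.IGNORECASE)


def is_openindex_spider(user_agent):
    """Return True if the User-Agent header value names OpenindexSpider.

    Hypothetical helper for illustration; a missing header (None or "")
    is treated as "not the spider".
    """
    return bool(_SPIDER_RE.search(user_agent or ""))


ua = "Mozilla/5.0 (compatible; OpenindexSpider; +https://www.openindex.io/saas/about-our-spider/)"
print(is_openindex_spider(ua))
```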

How do I control which webpages of my website are crawled?

Our software obeys the robots.txt exclusion standard, described at www.robotstxt.org, and responds to the agent names "OpenindexSpider" and "Openindex". To block our spider from your entire site, put the following in your robots.txt file. To restrict only certain locations, replace / with the path you want to exclude:

User-agent: OpenindexSpider
Disallow: /
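Before publishing a robots.txt file, you may want to check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser module, not our crawler's own parser, so it only approximates how OpenindexSpider will behave. The /private/ path, the Crawl-delay value of 10 and the example.com URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: the path and the delay are assumed values,
# not recommendations from Openindex.
ROBOTS_TXT = """\
User-agent: OpenindexSpider
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Pass the short agent name, not the full User-Agent header.
private_ok = rules.can_fetch("OpenindexSpider", "https://example.com/private/report.html")
public_ok = rules.can_fetch("OpenindexSpider", "https://example.com/index.html")
other_ok = rules.can_fetch("SomeOtherBot", "https://example.com/private/report.html")
delay = rules.crawl_delay("OpenindexSpider")

print(private_ok, public_ok, other_ok, delay)
```

Because the rules name only OpenindexSpider, other agents are unaffected. Use a separate User-agent: * group if you want the same restrictions to apply to every robot.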

If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag, as described at www.robotstxt.org.
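As described at www.robotstxt.org, the robots META tag goes in the head of a page, for example <meta name="robots" content="noindex, nofollow">. To confirm that your pages actually serve the tag, a small check like the one below may help. It is a sketch built on Python's standard html.parser module. The class and function names are hypothetical, and it does not reproduce how our crawler itself reads the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots META directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page used only to exercise the helper.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>Hello</body></html>"
)

print(sorted(robots_directives(PAGE)))
```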
