Custom crawl components
At Openindex we spend a lot of time on research and development. We have developed some great custom components that make crawling much easier.
Openindex data extractor
Over the years we developed the Openindex content extractor as an alternative to Tika's boilerpipe extractor. The Openindex extractor picks up less 'noise' and therefore more relevant content from an HTML document. It can also extract a date and an image from an article, and it determines whether or not a URL is a content page (article, forum thread, product page, etc.). See our data extraction tool in action! If you operate a web crawler or want to get information from the web without tuning for every website, don't hesitate to contact us.
Awesome spider trap detector
Spider traps, also known as crawler traps or black holes, are a major problem when crawling the internet on any scale larger than trivial. Using a well-trained and tuned neural network, we developed an algorithm that detects spider traps. We use this detector in our Sitesearch service to avoid crawling useless pages over and over again. See our spider trap detection tool in action! If you operate a web crawler and this problem keeps plaguing you, don't hesitate to contact us.
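To give an idea of what a spider trap looks like from the crawler's side, here is a minimal sketch of a rule-based URL check. It is not the Openindex detector, which relies on a trained neural network. The function name `looks_like_trap` and all thresholds are assumptions chosen for illustration only. The sketch flags URLs with very deep paths, repeating path segments, or an unusually large number of query parameters, which are common symptoms of a trap.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=3, max_params=8):
    """Rough heuristic: return True if the URL shows common trap symptoms.

    All thresholds are illustrative assumptions, not tuned values.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]

    # Symptom 1: the path is nested far deeper than normal content.
    if len(segments) > max_depth:
        return True

    # Symptom 2: the same path segment keeps coming back (/a/b/a/b/a/b/).
    if segments and max(Counter(segments).values()) >= max_repeats:
        return True

    # Symptom 3: the query string carries a large number of parameters.
    if len(parse_qsl(parts.query, keep_blank_values=True)) > max_params:
        return True

    return False


if __name__ == "__main__":
    for candidate in (
        "http://example.com/news/2015/article.html",
        "http://example.com/a/b/a/b/a/b/",
    ):
        print(candidate, looks_like_trap(candidate))
```

Fixed rules like these miss many traps (endless calendars, session IDs in URLs) and can flag legitimate pages, which is why a trained model is the more robust approach.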