As an Apache Nutch Committer, Openindex has deep expertise in web crawling and web scraping. We can set up a Nutch-based web crawler service that meets your needs, or take it off your hands entirely: we do the crawling for you (Crawling as a Service) and deliver only the data you need, as a feed or posted directly to your system.
In either case, we use Hadoop to set up a single machine or a cluster, depending on the scale of the crawl. The crawl can cover a subset of the internet or a specific data source, and our Data Extraction software extracts the right information from a wide variety of unstructured content.
Data Extraction comprises a number of services at various levels of the data acquisition and processing pipeline: crawling, scraping, parsing, and submitting the results to Solr or another application. All of these processes can also be offered as a service (Crawling as a Service), so you don’t have to install, configure, or maintain any applications yourself.
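The last step of that pipeline, submitting parsed documents to Solr, can be sketched with Solr's JSON update API. This is only an illustration: the core name `crawl` and the document fields are assumptions for the example, not our actual schema.

```python
import json
import urllib.request

# Hypothetical Solr core name "crawl"; adjust to your own setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/crawl/update?commit=true"

def build_solr_update(docs):
    """Serialize parsed documents into the JSON body Solr's update handler expects."""
    return json.dumps(docs).encode("utf-8")

def submit(docs, url=SOLR_UPDATE_URL):
    """POST the documents to Solr and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=build_solr_update(docs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example document as a crawler might produce it (fields are illustrative).
docs = [{"id": "https://example.com/", "title": "Example Domain", "content": "..."}]
payload = build_solr_update(docs)
```

Calling `submit(docs)` against a running Solr instance would index the batch; here only the payload is built, so the sketch runs without a server.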
Let us feed your search engine
Named Entity Recognition (Entity Search)
Collect specific data from the web
Scrape certain websites
Spider trap detector
Data as a Service
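To give an idea of what a spider trap detector looks for, a minimal heuristic flags URLs whose path repeats the same segment many times or nests implausibly deep, both classic trap symptoms. The thresholds below are illustrative assumptions, not the rules our crawler actually applies.

```python
from urllib.parse import urlparse

def looks_like_spider_trap(url, max_repeats=3, max_depth=15):
    """Heuristic: flag URLs with heavily repeated path segments or
    implausibly deep nesting, which often indicate crawler traps."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    for seg in set(segments):
        if segments.count(seg) >= max_repeats:
            return True
    return False

print(looks_like_spider_trap("https://example.com/a/b/c"))            # False
print(looks_like_spider_trap("https://example.com/cal/cal/cal/cal"))  # True
```

A real detector would combine such URL heuristics with content-based signals, such as near-duplicate pages.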
Try the metadata extraction demo below:
Enter a URL and see which metadata our parser extracts directly.
To recognize specific entities such as names and locations in the body text, use our Entity Extraction demo instead.
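A minimal sketch of the kind of extraction the metadata demo performs, using only Python's standard library HTML parser; the demo itself runs our own parser, so this is purely an illustration of the idea.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name/property=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "content" in attrs:
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A small inline page instead of a live URL, to keep the sketch self-contained.
page = """<html><head><title>Example page</title>
<meta name="description" content="A sample page">
<meta property="og:type" content="article">
</head><body>Hello</body></html>"""

parser = MetaExtractor()
parser.feed(page)
print(parser.title)  # Example page
print(parser.meta)   # {'description': 'A sample page', 'og:type': 'article'}
```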
Data Extraction uses the following techniques: