Applications

Data Extraction comprises a number of services at various stages of data acquisition and processing: crawling, scraping, parsing and indexing into Solr or another application. All of these can also be offered as a service (Crawling as a Service), so you don’t have to install, configure or maintain any applications yourself.

Advanced parser

Our advanced parser extracts information from individual web pages, such as the detected language, the article date, the body text and an optional image. It can also detect currencies and prices for web store products, international phone numbers, email addresses, and acronyms embedded in the text. It can even determine whether a particular HTML page is an article, a wiki, a homepage, a forum thread, an online store product, and more.
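To give a feel for this kind of field extraction, here is a minimal sketch using regular expressions. It is not the Openindex parser: the patterns, the function name `extract_fields` and the sample text are all assumptions made for illustration, and real pages need far more robust handling.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not production rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# International numbers written as "+<country code> <digit groups>".
PHONE_RE = re.compile(r"\+\d{1,3}(?:[ -]?\d{2,4}){2,4}")
# A currency symbol or ISO code followed by an amount, e.g. "€ 19,95".
PRICE_RE = re.compile(r"(€|\$|£|EUR|USD|GBP)\s?(\d+(?:[.,]\d{2})?)")


def extract_fields(text: str) -> dict:
    """Return email addresses, phone numbers and (currency, amount) pairs."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "prices": PRICE_RE.findall(text),
    }


sample = "Contact sales@example.com or call +31 50 123 4567. Price: € 19,95."
fields = extract_fields(sample)
```

Each key in `fields` holds a list, so a page with several prices or addresses yields all of them.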

Let us feed your search engine

Our web crawler can be used to send data to your existing external Apache Solr / Lucene search engine. We can help you set up a crawler that will feed your search engine and address any issues that arise: content extraction, crawler traps, duplicates, etc.
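As a rough illustration of what feeding a search engine involves, the sketch below builds a JSON update request for Solr's `/update` handler. The Solr URL, the collection name `pages`, the field names and the helper `build_solr_update` are assumptions for the example; the request is only constructed, not sent.

```python
import json
import urllib.request


def build_solr_update(solr_url: str, collection: str, docs: list) -> urllib.request.Request:
    """Build (but do not send) a JSON update request for a Solr collection."""
    url = f"{solr_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical crawled page; the field names depend on your Solr schema.
docs = [{"id": "https://example.com/a", "title": "Example", "lang": "en"}]
request = build_solr_update("http://localhost:8983/solr", "pages", docs)
# Sending it would be: urllib.request.urlopen(request)
```

Using the page URL as the document `id` is one common way to make re-crawled pages overwrite their earlier versions instead of creating duplicates.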

Named Entity Recognition (Entitysearch)

As part of Openindex Data Extraction we can extract all kinds of relevant entities from a given text, for example names, people, companies, organizations, locations, cities, products and brands. Try our demo!

Collect specific data from the web

We can provide a crawler that collects specific data from the internet. For example, we can provide a list of domains that use a particular CMS, contain particular words or content, or include a particular widget. These datasets can be very useful for research or sales leads, for example.

Scrape certain websites

We can provide you with a scraper that collects specific data from specific websites. This is a great solution if, for example, you regularly want to obtain all product descriptions from a certain (set of) webshop(s).

Spider trap detector

The spider trap detector we developed is also available separately for on-site installation. It has proven successful in practice. We supply it on various platforms for a fixed license fee.
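For readers unfamiliar with spider traps, the following is a generic heuristic sketch, not the Openindex detector, whose method is not described here. It flags URLs that are unusually deep or that repeat the same path segment many times; the function name and both thresholds are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url: str, max_depth: int = 12, max_repeats: int = 3) -> bool:
    """Flag URLs with very deep paths or heavily repeated path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A segment occurring max_repeats times or more suggests a looping link structure.
    return any(count >= max_repeats for count in Counter(segments).values())


trap = looks_like_trap("https://example.com/a/b/a/b/a/b")
normal = looks_like_trap("https://example.com/shop/shoes/42")
```

A check like this only catches the simplest loops; calendar pages or session IDs in query strings would need additional rules.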

Data as a Service

Openindex is happy to do the crawling, parsing or scraping for you. In that case we automatically deliver the data you need, at regular intervals or just once, in a file, as a feed, or directly into your application.

Try the metadata extraction demo below:

Enter a URL and see which meta information is extracted directly by our parser.

To recognize specific entities such as names and locations in the body of the text, it is best to use our Entity Extraction demo.

Techniques

Data Extraction uses the following techniques:

Apache Nutch

Nutch is the foundation of our crawl solution. Openindex actively participates in the development of Nutch. In addition, Openindex's version of Nutch contains a number of improvements over the standard version.

SAX

The SAX parser is the basis of our information extraction. It can apply specific extractors depending on the type of web page (e.g. forum, article). It also detects the language of the page and its date. Openindex has developed its own, much improved version of the standard SAX parser.

Part of Speech tagging (OpenNLP)

Nouns are recognized by means of Part-of-Speech tagging. Nouns are usually more interesting for data processing than adverbs, auxiliary verbs and the like.
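The filtering step that follows tagging can be sketched as below. The example assumes the tagger has already produced (token, tag) pairs with Penn Treebank-style tags; the tag set in use depends on the model, and the sample sentence and the function name `keep_nouns` are made up for illustration.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_tokens: list) -> list:
    """Keep only the tokens whose part-of-speech tag marks them as nouns."""
    return [token for token, tag in tagged_tokens if tag in NOUN_TAGS]


# Hypothetical tagger output for "The crawler quickly indexes pages for Openindex".
tagged = [
    ("The", "DT"), ("crawler", "NN"), ("quickly", "RB"), ("indexes", "VBZ"),
    ("pages", "NNS"), ("for", "IN"), ("Openindex", "NNP"),
]
nouns = keep_nouns(tagged)
```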

Host Deduplication

The crawler can recognize duplicate hosts and deal with them intelligently.
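One simple way to think about host deduplication is to compare fingerprints of a sample of pages per host. The sketch below is an assumption about how such a check could look, not a description of the Openindex crawler; the function names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def host_fingerprint(pages: dict) -> str:
    """Hash a host's sampled pages ({path: body}) into one fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(pages):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(pages[path].encode("utf-8")).digest())
    return digest.hexdigest()


def group_duplicate_hosts(samples: dict) -> list:
    """Return groups of hosts whose sampled pages are byte-for-byte identical."""
    groups = defaultdict(list)
    for host, pages in samples.items():
        groups[host_fingerprint(pages)].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) > 1]


samples = {
    "example.com": {"/": "home", "/about": "about us"},
    "www.example.com": {"/": "home", "/about": "about us"},
    "other.org": {"/": "something else"},
}
duplicates = group_duplicate_hosts(samples)
```

Exact hashing only finds identical copies; near-duplicate hosts (for example with a differing footer) would need a similarity measure instead.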

Apache Jena

An open source Java framework for building Semantic Web and Linked Data applications. It provides an API to extract data from and write data to RDF graphs.

SPARQL

SPARQL is a query language for RDF data. It lets applications on the Semantic Web request information from RDF-based sources.
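As a small illustration, the sketch below embeds a SPARQL query that lists names of people and builds the GET URL that the SPARQL 1.1 Protocol defines for sending it (the query goes in a `query` parameter). The endpoint address, the choice of the FOAF vocabulary and the helper name `sparql_get_url` are assumptions for the example; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Example query: names of all resources typed as foaf:Person.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?person a foaf:Person ;
          foaf:name ?name .
}
LIMIT 10
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the URL for a SPARQL query sent via HTTP GET."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


# Hypothetical local endpoint.
url = sparql_get_url("http://localhost:3030/dataset/sparql", QUERY)
```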

Pricing Crawling as a Service

| | Starter | Small | | Large | Enterprise | Custom |
| Price | 25 / month | 125 / month | 500 / month | 1500 / month | 3000 / month | CALL / month |
| Documents (number of web pages to be crawled) | 10.000 | 100.000 | 1.000.000 | 10.000.000 | 100.000.000 | CUSTOM |
| Start-up fee (getting you up and running) | € 100,- | € 200,- | € 300,- | € 400,- | € 500,- | CUSTOM |
| Spider trap detector (don’t get caught in meaningless webs) | € 450,- | € 450,- | € 450,- | € 450,- | € 450,- | CUSTOM |
| Advanced Parsing (have your documents properly processed) | € 1.000,- | € 1.000,- | € 1.000,- | € 1.000,- | € 1.000,- | CUSTOM |
