Gain insight with Data Extraction
Collect data with software built by our Apache Nutch committer and gain the insights you need.
Reliable and accurate insights:
Flexibility for all users:
Make Data Extraction easy with our applications
Collect information from across the web with Data Extraction. Get to know various aspects of data collection, such as crawling, scraping, and parsing. Learn more about the applications we offer.
- Web Crawler
- Advanced Parser
- Entity Extraction
- Collect specific data from the web
- Scrape specific websites
- Avoid duplicate links: detect spider traps
- Data as a Service
A web crawler (also known as a spider) roams the internet looking for new pages so that search engines can index them. We help you set up the web crawler so you don’t have to worry about it.
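For readers who want to see the idea in code, here is a minimal sketch of a breadth-first crawl loop. It is not our Apache Nutch setup; the `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary are hypothetical names used only for illustration, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns the URLs in the order they were visited.
    """
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # every URL ever queued, to avoid revisits
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# An in-memory stand-in for the web (assumption, for illustration only).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/a#top">A again</a>',
}

if __name__ == "__main__":
    for page in crawl("http://example.com/", PAGES.get):
        print(page)
```

A production crawler also has to respect robots.txt, throttle requests per host, and persist its frontier, which is the part we take off your hands.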
Our parser retrieves all kinds of data from the internet. It detects languages, main texts, images and product prices. It also distinguishes an article from a homepage and a forum thread from a webshop product and so on. This allows you to search for specific information.
Entity extraction identifies the relevant parts of a text: names of people, companies, organizations, locations, cities, and products. Curious how this works? Try the demo on this webpage!
Our crawler can find particular information on the internet. For instance, it can provide you with a list of domains that use a specific CMS or that contain certain words or content. This makes doing research and finding sales opportunities easy.
Use our scraper to gather specific data from certain websites. This is useful if you want to analyze product descriptions from online stores.
Our spider trap detector detects and bypasses spider traps. This keeps irrelevant and duplicate pages out of your index. We offer the spider trap detector for a fixed license fee on various platforms.
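To give an impression of what spider trap detection involves, here is a minimal sketch of a URL-based heuristic. It is not the detector we license; the function name `looks_like_spider_trap` and all thresholds are assumptions chosen for illustration. It flags URLs that are unusually long, unusually deep, or that repeat the same path segment several times, which are common symptoms of endlessly generated links.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_spider_trap(url, max_depth=12, max_repeats=3, max_length=512):
    """Heuristically flag URLs that are typical of crawler traps.

    All thresholds are illustrative assumptions, not tuned values.
    """
    # Very long URLs often come from endlessly appended parameters or paths.
    if len(url) > max_length:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Very deep paths suggest a link that keeps nesting into itself.
    if len(segments) > max_depth:
        return True
    # The same segment repeated many times, e.g. /a/b/a/b/a/b.
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


if __name__ == "__main__":
    print(looks_like_spider_trap("http://example.com/shop/shoes/red"))
    print(looks_like_spider_trap("http://example.com/a/b/a/b/a/b"))
```

A heuristic like this only looks at the URL itself; catching traps that produce distinct-looking URLs with duplicate content requires comparing the pages as well.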
To make things easy for you, we offer Data as a Service, where we handle the crawling, parsing, and scraping for you. With Data as a Service, you automatically receive the data you need, either periodically or as a one-time delivery. We deliver it as a file, as a feed, or directly into your application.
Try our demo
Curious about our data extraction? Enter a URL and see which metadata our parser extracts directly from the page.
Data Extraction: the techniques
In Data Extraction, we use the following techniques:
- Apache Nutch
- Part-of-speech tagging (OpenNLP)
- Host Deduplication
- Apache Jena
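Of the techniques listed above, host deduplication is the simplest to illustrate. The sketch below is one possible approach, not a description of our implementation: it assumes a small sample of pages has already been fetched per host and treats hosts as duplicates when those samples are identical. The names `host_fingerprint` and `group_duplicate_hosts` are hypothetical.

```python
import hashlib
from collections import defaultdict


def host_fingerprint(pages):
    """Hash a host's sampled pages (path -> body) into one fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(pages):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and body hash
        digest.update(hashlib.sha256(pages[path].encode("utf-8")).digest())
    return digest.hexdigest()


def group_duplicate_hosts(pages_by_host):
    """Return lists of hosts whose sampled pages are identical."""
    groups = defaultdict(list)
    for host, pages in pages_by_host.items():
        groups[host_fingerprint(pages)].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) > 1]


if __name__ == "__main__":
    sample = {
        "example.com": {"/": "home", "/about": "about us"},
        "www.example.com": {"/": "home", "/about": "about us"},
        "other.org": {"/": "different"},
    }
    print(group_duplicate_hosts(sample))
```

Exact hashing only catches byte-identical copies; mirrors that differ in small details, such as a timestamp in the footer, would need a near-duplicate measure instead.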