Web data extraction
The web provides us with a wide variety of completely unstructured and chaotic HTML pages. Our HTML data extraction software deals with this problem in a generic way.
This tool displays the information we extract from a given web page. It uses the HTML parsing, extraction and classification technology that we use for our Sitesearch service, and it is used by some of our customers. Try it, and if you think we got it wrong, contact support so we can fix it.
Some examples are:
- travel article on The Guardian
- CPU related forum thread
- a webshop selling shoes
- wiki article on David Attenborough
The fields displayed include the detected language, the extracted article date, the extracted main text and an optional image.
It can also extract the currency and price of webshop products, as well as international phone numbers, e-mail addresses and acronyms embedded in the text. It can even determine whether a given HTML page is an article, a wiki, a homepage, a forum thread, a webshop product or another page type.
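To make the list of fields more concrete, here is a minimal Python sketch of how such an extraction result could be represented. It is an illustration only and not our actual API: the names ExtractionResult, PageType and find_email_addresses are hypothetical, and the simple regular expression stands in for the real e-mail detection, which is not described here.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PageType(Enum):
    """Page types named in the text; the real classifier knows more."""
    ARTICLE = "article"
    WIKI = "wiki"
    HOMEPAGE = "homepage"
    FORUM_THREAD = "forum thread"
    WEBSHOP_PRODUCT = "webshop product"


@dataclass
class ExtractionResult:
    """Hypothetical container for the fields described above."""
    language: Optional[str] = None        # detected language
    article_date: Optional[str] = None    # extracted article date
    main_text: str = ""                   # extracted main text
    image_url: Optional[str] = None       # optional image
    currency: Optional[str] = None        # webshop products only
    price: Optional[float] = None         # webshop products only
    phone_numbers: List[str] = field(default_factory=list)
    email_addresses: List[str] = field(default_factory=list)
    acronyms: List[str] = field(default_factory=list)
    page_type: Optional[PageType] = None


# Deliberately simple pattern (an assumption for this sketch); it does not
# cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_email_addresses(text: str) -> List[str]:
    """Return the e-mail addresses found in text, in order, without duplicates."""
    found: List[str] = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


sample = "Contact support@example.com or sales@example.co.uk. Again: support@example.com"
result = ExtractionResult(
    language="en",
    main_text=sample,
    email_addresses=find_email_addresses(sample),
    page_type=PageType.ARTICLE,
)
print(result.email_addresses)
```

Fields that do not apply to a page, such as the price on a wiki article, simply stay empty in this sketch.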