Web scraping

Extracting data from web pages is known as "web scraping". Data Splitter uses HTTP, the Hypertext Transfer Protocol, to fetch the web pages in the Input URLs list. It then uses the "rules" specified in the solution to parse the fetched web pages, one by one.
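
The fetch-then-parse loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Data Splitter's implementation: the names fetch_page and scrape are hypothetical, and the "rules" here are plain regular expressions standing in for whatever rule format a real solution uses.

```python
import re
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Fetch one page over HTTP and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(urls, rules, fetch=fetch_page):
    """Fetch each URL in turn and apply every rule to the page text.

    Assumption: `rules` maps a field name to a regular expression with one
    capture group. This is a hypothetical stand-in, not Data Splitter's
    actual rule syntax.
    """
    results = []
    for url in urls:  # pages are processed one by one
        html = fetch(url)
        record = {"url": url}
        for field, pattern in rules.items():
            match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            record[field] = match.group(1).strip() if match else None
        results.append(record)
    return results
```

The fetch parameter is there so the loop can be tried against saved pages without touching the network, for example by passing a function that reads from a dictionary of page text.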

Most web pages are written in HTML, HyperText Markup Language. The Data Splitter sample web scraping solutions are HTML parsers. Developing or modifying a web scraping solution requires at least a basic knowledge of HTML syntax and tags.
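
To give a feel for what an HTML parser does with tags, here is a minimal sketch using Python's standard html.parser module. It collects the text of each table cell, grouped by row. The class and function names (TableRows, table_rows) are hypothetical, and the sketch assumes reasonably well-formed markup with explicit closing tags; it does not show how Data Splitter itself parses pages.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text that appears inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return every table row in `html` as a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Fed a small table with a header row and one data row, table_rows returns one list per row, with the header cells first.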

Breakage

A scraper may stop working, or "break", if a web page's layout changes, for example if a table's column order changes or if text used to locate information on the page is reworded. When designing a scraper, it's best to make as few assumptions as possible about the source page's organization.
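
As an illustration of making fewer assumptions, the sketch below looks up a table column by its header text instead of by a fixed position, so a reordered table still yields the same values. The function name column_by_header is hypothetical, and the rows are assumed to be already extracted as lists of strings with the header row first.

```python
def column_by_header(rows, header):
    """Return the values under `header`, wherever that column sits.

    Assumption: `rows` is a list of rows, each a list of cell strings,
    and the first row holds the column headers.
    """
    head, *body = rows
    names = [name.strip().lower() for name in head]
    # Raises ValueError if the header text is missing from the table.
    index = names.index(header.strip().lower())
    # Skip rows that are too short to contain the column.
    return [row[index] for row in body if index < len(row)]


# The same data in two layouts: the columns are swapped in the second.
old_layout = [["Name", "Price"], ["Widget", "9.99"], ["Gadget", "4.50"]]
new_layout = [["Price", "Name"], ["9.99", "Widget"], ["4.50", "Gadget"]]
```

Calling column_by_header with "Price" gives the same list of prices for both layouts, whereas reading "the second column" would silently return names after the change. The lookup still depends on the header text itself, so it breaks if the header is renamed.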

Data Splitter's features can be used to minimize breakage:

string sets can provide lists of alternative text locators
the "ignore case" box on search strings (string sets, variables, node strings, etc.) can make matches tolerant of capitalization changes
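
The two ideas in this list can be sketched generically in Python: try several alternative locator strings, optionally ignoring case, and read the text that follows whichever one appears first. The function names find_locator and text_after are hypothetical, and the sketch is not a description of how Data Splitter's string sets work internally.

```python
import re


def find_locator(text, alternatives, ignore_case=True):
    """Return the earliest match of any alternative locator, or None."""
    flags = re.IGNORECASE if ignore_case else 0
    best = None
    for locator in alternatives:
        # re.escape makes the locator match literally, not as a pattern.
        match = re.search(re.escape(locator), text, flags)
        if match and (best is None or match.start() < best.start()):
            best = match
    return best


def text_after(text, alternatives, ignore_case=True):
    """Return the rest of the line following the first locator found."""
    match = find_locator(text, alternatives, ignore_case)
    if match is None:
        return None
    rest = text[match.end():]
    return rest.splitlines()[0].strip() if rest else ""
```

With the alternatives "Cost:" and "Price:", the same call keeps working whether a page labels the value "Price:", "PRICE:" or "Cost:". With ignore_case set to False, a page that switches to "PRICE:" no longer matches "Price:".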

It's possible to create scraping solutions that are flexible enough to work on different web pages, even pages on different sites. Contact Data Splitter support for more information.

More info

Wikipedia's article on web scraping is a good starting point for understanding the technical and legal issues.