What is web scraping?
Web scraping is a general term for any technique that extracts specific information or content from a website over HTTP and transforms it into another format suitable for use in another context.
A typical application of web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site.
Web scraping differs from screen scraping in that a website is not really a visual screen but live HTML and/or JavaScript content with a graphical interface in front of it. Web scraping therefore does not work at the visual interface, as screen scraping does, but on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
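To illustrate working on the markup structure rather than on rendered pixels, here is a minimal sketch using only Python's standard library. The sample page, the `LinkExtractor` class, and the `extract_links` helper are hypothetical names made up for this example. Note also that `html.parser` is an event-driven parser, not a full Document Object Model, so this approximates the idea rather than implementing it. The sketch pulls every link out of the HTML and transforms the result into JSON, a format usable in another context.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="item"><a href="/widgets/1">Blue widget</a></li>
    <li class="item"><a href="/widgets/2">Red widget</a></li>
  </ul>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects the href and text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> currently open, if any
        self._text_parts = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; anchors without
            # an href leave _current_href as None and are skipped.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append({"href": self._current_href, "text": text})
            self._current_href = None


def extract_links(html):
    """Return a list of {"href", "text"} dicts for the links in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
print(json.dumps(links, indent=2))
```

Because the extraction is driven by element names and attributes, it keeps working if the page's visual styling changes, provided the underlying markup stays the same.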
Web scraping also differs from screen scraping in that screen scraping typically reads the same dynamic screen “page” many times, whereas web scraping typically reads each web page only once, across many different static web pages. Recursive web scraping, which follows links to other pages across many websites, is called “web harvesting”. Google is the largest web harvester in the world, with Yahoo a close second.
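The link-following idea behind web harvesting can be sketched as follows. This is an illustrative sketch under stated assumptions, not how any particular search engine works. `FAKE_WEB` is a hypothetical in-memory stand-in for real websites so the example runs without network access, and `fetch`, `HrefCollector`, and `harvest` are names invented for the example. The traversal is written iteratively with a breadth-first queue instead of literal recursion, and a set of already-seen URLs ensures each page is scraped only once.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. Replace fetch() below
# with a real HTTP request to harvest live pages.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> '
                            '<a href="http://example.org/">Other site</a>',
    "http://example.com/b": '<a href="/">Home</a>',
    "http://example.org/": "<p>No links here</p>",
}


class HrefCollector(HTMLParser):
    """Records the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Stand-in for an HTTP GET; returns None for unknown pages."""
    return FAKE_WEB.get(url)


def harvest(start_url, max_pages=100):
    """Visit pages breadth-first by following links; return URLs in visit order."""
    visited = []
    seen = {start_url}          # URLs already queued, so each is fetched once
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue            # page could not be retrieved; skip it
        visited.append(url)
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


print(harvest("http://example.com/"))
```

The `max_pages` limit is a design choice for the sketch: without some bound, following links across many sites has no natural stopping point.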
