Web Scrapers Generator BrowserExt

BrowserExt - PHP extension for web scraping


Posted 2013-06-16

The BrowserExt PHP extension is a programmatic web browser based on QtWebKit and intended for web scraping.

  • Supports JavaScript and AJAX
  • Uses XPath to select elements
  • Lets you fill in forms and click on elements of the document
  • Lets you retrieve attributes, properties, and other parameters of elements, and iterate through the elements in the tree
  • Lets you download files from links
  • Lets you scroll the page vertically
  • Supports a list of proxy servers and checks the proxies in several threads

https://github.com/scraperlab/browserext

A short example:

Usage

First, create an instance of the browser class:

Next, load a page:

Each page is loaded in a new tab. To load it in the same tab, pass true as the second parameter. To return to the previous page, call `back()`. You can also click on a link or button by passing its XPath.

After a click, the resulting page is likewise loaded in a new tab; to load it in the same tab, pass true as the second parameter. You can select elements by XPath:

This method returns an array of PhpWebElement objects. For each element you can retrieve its attributes, properties, tag name, value, and more:
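As a rough stand-in for the idea, the sketch below uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) rather than BrowserExt itself, so none of these calls should be read as the extension's API. It selects elements by XPath from an invented HTML string and reads the tag name, attributes, and text of each one; the markup and variable names are assumptions made for illustration.

```php
<?php
// Stand-in sketch using PHP's DOM extension, NOT the BrowserExt API.
// The HTML below is invented for illustration.
$html = '<html><body>'
      . '<a id="first" href="/one">One</a>'
      . '<a id="second" href="/two">Two</a>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = [];
// Select every link that carries an href attribute.
foreach ($xpath->query('//a[@href]') as $node) {
    $links[] = [
        'tag'  => $node->tagName,               // element name
        'id'   => $node->getAttribute('id'),    // attribute lookup by name
        'href' => $node->getAttribute('href'),
        'text' => $node->textContent,           // text inside the element
    ];
}

print_r($links);
```

The same pattern (query by XPath, then inspect each returned element) is what the paragraph above describes for PhpWebElement, although the method names there may differ.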

You can navigate to the parent or child elements, which are also objects of the PhpWebElement class:
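A minimal sketch of such a walk, again using PHP's DOM extension as a stand-in rather than the BrowserExt API. It starts at a link in an invented document, climbs through every parent element, and collects the tag names in an array; `parentNode` and `tagName` are DOM extension members, not PhpWebElement ones.

```php
<?php
// Stand-in sketch using PHP's DOM extension, NOT the BrowserExt API.
// The HTML below is invented for illustration.
$html = '<html><body><div><ul><li><a href="/x">X</a></li></ul></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Take the first link as the starting element.
$link = $xpath->query('//a')->item(0);

$tags = [];
// Climb from the link's parent upward. The loop ends when the parent is
// no longer an element, i.e. when we reach the document node above <html>.
for ($node = $link->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
    $tags[] = $node->tagName;
}

print_r($tags);
```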

In the example above, we iterate through all the parents of an element and store their tag names in an array. You can also evaluate a relative XPath against an element:
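The sketch below illustrates relative XPath with PHP's DOM extension as a stand-in (it is not BrowserExt code). The second argument to `DOMXPath::query` is the context node, and the leading `.` in the expression makes it relative to that node. The HTML and the `item` class are invented for the example.

```php
<?php
// Stand-in sketch using PHP's DOM extension, NOT the BrowserExt API.
// The HTML below is invented for illustration.
$html = '<html><body>'
      . '<div class="item"><a href="/a1">A1</a> <a href="/a2">A2</a></div>'
      . '<div class="item"><a href="/b1">B1</a> <a href="/b2">B2</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//div[@class="item"]') as $item) {
    // "./a[1]" and "./a[2]" are resolved relative to the current $item.
    $first  = $xpath->query('./a[1]', $item)->item(0);
    $second = $xpath->query('./a[2]', $item)->item(0);
    $rows[] = $first->textContent . ' ' . $second->textContent;
}

// Display the text of the first and second link of every item.
foreach ($rows as $row) {
    echo $row, "\n";
}
```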

This code loops through all the elements with the class `item` and displays the text of the first and second links in each. You can also retrieve the XPath of an element or click on it:
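Clicking has no equivalent in a static DOM tree, but retrieving an element's XPath can be sketched with PHP's DOM extension, used here as a stand-in for BrowserExt. `DOMNode::getNodePath()` returns an absolute XPath for a node, and feeding that path back into a query finds the same node again. The markup is an assumption for illustration.

```php
<?php
// Stand-in sketch using PHP's DOM extension, NOT the BrowserExt API.
// The HTML below is invented for illustration.
$html = '<html><body><ul><li>one</li><li>two</li></ul></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pick the second list item, then ask the node for its own XPath.
$second = $xpath->query('//li[2]')->item(0);
$path = $second->getNodePath();
echo $path, "\n";

// Round trip: the returned path selects the very same node.
$again = $xpath->query($path)->item(0);
var_dump($again->isSameNode($second));
```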

The browser can use a list of proxy servers for loading pages. Each new page is loaded through a new proxy:

In the example above, an array of two proxies is passed to the browser. The second parameter specifies whether the proxies should be checked. The next command returns the list of proxies that remain after the check.
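How the extension picks the proxy for each page is not stated here. One plausible reading of "each new page is loaded through a new proxy" is simple round-robin rotation, which the plain-PHP sketch below illustrates. The `ProxyRotation` class, the proxy addresses, and the page paths are all hypothetical and are not part of BrowserExt; the typed properties need PHP 7.4 or later.

```php
<?php
// Hypothetical helper, NOT part of BrowserExt: hands out proxies in turn.
final class ProxyRotation
{
    private array $proxies;
    private int $next = 0;

    public function __construct(array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('Proxy list must not be empty.');
        }
        // Reindex so positions run 0..n-1 regardless of the input keys.
        $this->proxies = array_values($proxies);
    }

    // Return the next proxy, wrapping around at the end of the list.
    public function next(): string
    {
        $proxy = $this->proxies[$this->next];
        $this->next = ($this->next + 1) % count($this->proxies);
        return $proxy;
    }
}

// Made-up proxy addresses and page paths, for illustration only.
$rotation = new ProxyRotation(['10.0.0.1:8080', '10.0.0.2:3128']);

$used = [];
foreach (['/page1', '/page2', '/page3'] as $page) {
    $used[$page] = $rotation->next();
}

print_r($used);
```

With two proxies and three pages, the third page wraps around to the first proxy again.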