BrowserExt - PHP extension for web scraping
BrowserExt is a PHP extension that provides a programmatic web browser, based on QtWebKit and intended for web scraping.
- Uses XPath to select elements
- Lets you fill in forms and click on elements of the document
- Lets you retrieve attributes, properties, and other parameters of elements, and iterate through the elements in the tree
- Lets you download files by following links
- Lets you scroll the page vertically
- Supports a list of proxy servers, with proxies checked in several threads
A short example:
First, create an instance of the browser class:
Next, load a page:
Each page is loaded in a new tab. To load it in the same tab, pass true as the second parameter. To go back to the previous page, call `back()`. You can click on a link or button by passing its XPath.
As with loading, the resulting page opens in a new tab; to load it in the same tab, pass true as the second parameter. You can select elements by XPath:
This method returns an array of PhpWebElement objects. For each element you can retrieve its attributes, properties, tag name, value, and other details:
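BrowserExt's own method names are not reproduced here. As a rough stand-in, the sketch below uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) to show the same idea: select elements with an XPath expression, then read each element's tag name, text, and attributes. The HTML snippet and the variable names are assumptions made for illustration, and nothing here calls the BrowserExt API.

```php
<?php
// Stand-in using PHP's built-in DOM extension, NOT BrowserExt itself.
// The markup below is a made-up sample document.
$html = '<html><body>'
      . '<a id="first" href="/one" class="nav">One</a>'
      . '<a id="second" href="/two">Two</a>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = [];
// Select every <a> element that has an href attribute.
foreach ($xpath->query('//a[@href]') as $node) {
    // Collect all attributes of the element as name => value pairs.
    $attributes = [];
    foreach ($node->attributes as $attr) {
        $attributes[$attr->name] = $attr->value;
    }
    $links[] = [
        'tag'        => $node->nodeName,
        'text'       => $node->textContent,
        'attributes' => $attributes,
    ];
}

print_r($links);
```

With BrowserExt the selection runs against the live page rendered by QtWebKit rather than a static string, so the results also reflect changes made by scripts on the page.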
You can navigate to the parent or to the child elements, which are also PhpWebElement objects:
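A minimal sketch of walking up the tree, again written against PHP's built-in DOM classes rather than PhpWebElement (whose method names are not shown in this text). It starts from one element, follows `parentNode` until the document root is reached, and stores each ancestor's tag name in an array. The sample markup is an assumption.

```php
<?php
// Stand-in using PHP's built-in DOM extension, NOT BrowserExt itself.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p><span id="leaf">text</span></p></div></body></html>');
$xpath = new DOMXPath($doc);

// Start from the innermost element.
$node = $xpath->query('//span[@id="leaf"]')->item(0);

// Walk upwards; the loop stops once the parent is the document, not an element.
$tags = [];
for ($parent = $node->parentNode; $parent instanceof DOMElement; $parent = $parent->parentNode) {
    $tags[] = $parent->nodeName;
}

print_r($tags); // p, div, body, html
```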
The code above iterates through all ancestors of an element and stores their tag names in an array. A relative XPath can also be evaluated against an element:
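The idea of a relative XPath can be sketched with `DOMXPath::query()`, which accepts a context node as its second argument. This is a stand-in built on PHP's bundled DOM extension, not the BrowserExt call itself; the markup and the class name `item` in the sample are assumptions chosen to match the description.

```php
<?php
// Stand-in using PHP's built-in DOM extension, NOT BrowserExt itself.
$html = '<html><body>'
      . '<div class="item"><a href="/a1">A1</a><a href="/a2">A2</a></div>'
      . '<div class="item"><a href="/b1">B1</a><a href="/b2">B2</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$lines = [];
foreach ($xpath->query('//div[@class="item"]') as $item) {
    // Expressions starting with "./" are resolved relative to $item.
    $first  = $xpath->query('./a[1]', $item)->item(0);
    $second = $xpath->query('./a[2]', $item)->item(0);
    $lines[] = $first->textContent . ' ' . $second->textContent;
}

echo implode("\n", $lines), "\n";
```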
This code loops through all elements with the class `item` and displays the text of the first and second links in each. You can also retrieve an element's XPath or click on it:
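Retrieving the XPath of an element can be illustrated with `DOMNode::getNodePath()` from PHP's bundled DOM extension. This is only an analogy for the corresponding BrowserExt feature, whose method name is not given here. The sketch checks that the returned path leads back to the same node. Clicking has no equivalent in the static DOM extension, so it is not shown.

```php
<?php
// Stand-in using PHP's built-in DOM extension, NOT BrowserExt itself.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li>one</li><li>two</li></ul></body></html>');
$xpath = new DOMXPath($doc);

// Pick the second list item, then ask the DOM for an XPath that locates it.
$second = $xpath->query('//li[2]')->item(0);
$path   = $second->getNodePath();

// Evaluating the generated path should return the very same node.
$again     = $xpath->query($path)->item(0);
$roundTrip = $again->isSameNode($second);

echo $path, "\n";
var_dump($roundTrip);
```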
The browser can use a list of proxy servers for loading pages. Each new page is loaded through a new proxy:
The example above passes an array of two proxies to the browser. The second parameter specifies whether the proxies should be checked. The next command returns the list of proxies that remain after checking.
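The behaviour described above (check the proxies, keep the ones that pass, use a different proxy for each page load) can be sketched in plain PHP. Everything in this sketch is hypothetical: `filterProxies()`, `proxyForLoad()`, the stub checker, and the placeholder addresses are illustrative names, not part of BrowserExt. The round-robin order is an assumption, and the check runs sequentially here, whereas the extension checks proxies in several threads.

```php
<?php
// Hypothetical helpers for illustration only; not the BrowserExt implementation.

/** Keep only the proxies that the $isAlive callback accepts. */
function filterProxies(array $proxies, callable $isAlive): array
{
    return array_values(array_filter($proxies, $isAlive));
}

/** Choose the proxy for the n-th page load, cycling through the list (assumed order). */
function proxyForLoad(array $proxies, int $loadIndex): string
{
    if ($proxies === []) {
        throw new InvalidArgumentException('No proxies left after checking');
    }
    return $proxies[$loadIndex % count($proxies)];
}

// Placeholder addresses from the documentation range 192.0.2.0/24.
$proxies = ['192.0.2.1:3128', '192.0.2.2:8080', '192.0.2.3:8080'];

// Stub checker: pretends the second proxy is dead. A real check would
// attempt a request through each proxy.
$isAlive = function (string $proxy): bool {
    return $proxy !== '192.0.2.2:8080';
};

$remaining = filterProxies($proxies, $isAlive);

// Each new page load uses the next remaining proxy.
$used = [];
for ($load = 0; $load < 3; $load++) {
    $used[] = proxyForLoad($remaining, $load);
}

print_r($remaining);
print_r($used);
```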