
PHP Web Scraper

A simple web scraper in PHP. Depending on the commands used in the script, the generated web scraper can implement the following operations:

  • single-threaded and multi-threaded page loading, with an optional delay before loading
  • cookie support and logging in to the website
  • support for a list of proxy servers
  • data extraction using both XPath and regular expressions
  • saving images and files by link
  • filling out forms and sending the data to the server
  • export of collected data to CSV, XML, SQL, or another text format; export to Excel is also available
  • filtering of data before export
  • working with a MySQL database

System requirements: PHP 5 with the standard curl and tidy extensions.
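A quick way to confirm the requirements before running the scraper is to ask PHP which extensions are loaded. The following is only a minimal sketch: the helper name missing_extensions is hypothetical and is not part of the generated scraper.php; it relies on the standard PHP function extension_loaded.

```php
<?php
// Hypothetical pre-flight check; not part of the generated scraper.php.
// Returns the names of the required extensions that are NOT loaded.
function missing_extensions(array $required)
{
    $missing = array();
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// The two extensions named in the system requirements.
$missing = missing_extensions(array('curl', 'tidy'));

if ($missing) {
    echo 'Missing extensions: ' . implode(', ', $missing) . PHP_EOL;
} else {
    echo 'All required extensions are loaded.' . PHP_EOL;
}
```

Running this from the command line with the same PHP binary that will run the scraper should show whether curl or tidy still needs to be enabled.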

The generated web scraper consists of a single file named scraper.php. When you open this file in a text editor, the changeable settings follow the comments containing the web scraper description and license.

If the program uses Excel export, the scraper is delivered as a zip file that also contains the required libraries. The settings remain the same.

The settings are described in the table below:

Setting | Default value | Description
$storefile_folder | 'files' | Sets the folder where all images and files saved by storefile are stored. By default the web scraper always writes to the same folder. To create a new folder each time the web scraper is launched, set $storefile_folder = date('Y-m-d H-i-s'); the folder name is then derived from the current time.
$proxy_filename | 'proxy.txt' | Defines the file containing the list of proxy servers.
$log_filename | 'scraper.log' | Defines the file to which the web scraper's log is written.
$use_cookies | true | Defines whether curl uses and processes cookies. Cookies are saved to and loaded from the cookie.txt file, which the web scraper generates.
$console_out | true | Defines whether the web scraper prints to the console while it is working. If false, output goes to the log file only.
$filesystem_encoding | 'UTF-8' | Defines the filesystem encoding. The web scraper needs it to generate file names with national letters correctly. If the web scraper is launched under Windows, this parameter needs to be set to 'windows-1251'.
$loadpage_delay | 0 | Specifies the delay used when pages and files are loaded one at a time (single-threaded loading).
$number_of_connections | 5 | Specifies the number of connections used for multi-threaded (parallel) loading of pages and files.
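
Put together, the settings block near the top of scraper.php might look roughly like the sketch below. The variable names and default values are taken from the table; the ordering, comments, and exact layout are assumptions and may differ in the file that is actually generated.

```php
<?php
// Sketch of the settings section with the documented defaults.
// Layout and comments are illustrative, not copied from a generated file.

$storefile_folder = 'files';          // folder for images and files saved by storefile
// Alternative: a new folder per launch, named after the current time.
// $storefile_folder = date('Y-m-d H-i-s');

$proxy_filename = 'proxy.txt';        // file with the list of proxy servers
$log_filename = 'scraper.log';        // file the scraper's log is written to
$use_cookies = true;                  // let curl use cookies (stored in cookie.txt)
$console_out = true;                  // false: write to the log file only
$filesystem_encoding = 'UTF-8';       // 'windows-1251' when running under Windows
$loadpage_delay = 0;                  // delay for single-threaded loading
$number_of_connections = 5;           // connections for parallel loading
```

Only the values on the right-hand side are meant to be edited; the variable names are what the rest of the script expects.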