A simple web scraper in PHP. Depending on the commands used in the script, the generated web scraper can perform the following operations:
System requirements: PHP 5 with the standard curl and tidy extensions.
The generated web scraper consists of a single file named scraper.php. If you open this file in a text editor, you will find the configurable settings just after the comments containing the web scraper description and license.
If Excel export is used in the program, the scraper is delivered as a zip file that also contains the required libraries. The settings remain the same.
The settings are described in the following table:
|Setting||Default Value||Description|
Sets the directory where all images and files saved by storefile are stored. By default, the web scraper always writes images to the same directory. To create a new directory each time the web scraper is launched, you can specify:
$storefile_folder = date('Y-m-d H-i-s');
The name of the directory will then be derived from the launch time.
|$proxy_filename||'proxy.txt'||Specifies the file containing the list of proxy servers.|
|$log_filename||'scraper.log'||Specifies the file to which the web scraper's log is written.|
|$use_cookies||true||Specifies whether curl uses and processes cookies. Cookies are saved to and loaded from the cookie.txt file generated by the web scraper.|
|$console_out||true||Specifies whether there is console output while the web scraper is running. If false, output goes only to the log file.|
|$filesystem_encoding||'UTF-8'||Specifies the filesystem encoding, which the web scraper needs in order to generate file names with national characters correctly. If the web scraper runs under Windows, set this parameter to 'windows-1251'.|
|$loadpage_delay||0||Specifies the delay used when loading pages and files one at a time.|
|$number_of_connections||5||Specifies the number of connections used for multi-threaded (parallel) loading of pages and files.|
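Put together, the settings block near the top of scraper.php might look like the following. This is a minimal sketch using the default values from the table; the exact layout and comments in a generated scraper.php may differ.

```php
<?php
// Directory for files and images saved by storefile; using date() here
// creates a new directory on every launch of the web scraper.
$storefile_folder = date('Y-m-d H-i-s');

// File with the list of proxy servers.
$proxy_filename = 'proxy.txt';

// File to which the web scraper's log is written.
$log_filename = 'scraper.log';

// Whether curl should use and process cookies (stored in cookie.txt).
$use_cookies = true;

// Whether to print output to the console in addition to the log file.
$console_out = true;

// Filesystem encoding; set to 'windows-1251' when running under Windows
// so that file names with national characters are generated correctly.
$filesystem_encoding = 'UTF-8';

// Delay used when loading pages and files one at a time.
$loadpage_delay = 0;

// Number of connections for multi-threaded (parallel) loading.
$number_of_connections = 5;
```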