Web Scrapers Generator BrowserExt

Browser Mode

To switch to browser mode, click the "Browser" tab on the right side of the Editor.

browser-en-small.jpg

To load a page, type its address in the address field and press "Enter". Once the page is loaded, you can set up scraping rules with the "Text", "Image", "Link", "HTML", and "RegExp" buttons, which let you select the relevant elements of the page. By default, clicking a link in the loaded page does not follow the link. To follow a link, click the "Load Link" button and then click the link.

Attention! Some websites do not allow themselves to be loaded in a frame. Such websites cannot be loaded in the built-in browser and can be processed only by a script. One example is yandex.ru and its subpages.

Let's take a detailed look at the buttons:

  • The "Text" button creates a Text rule, which extracts text elements of the page, for example the anchor text of a link. It is similar to the gettext function in the script.
  • The "Image" button creates an Image rule, which gets the link to an image or the image itself. It is similar to the getimglink function in the script.
  • The "Select Link" button creates a Link rule, which gets the URL of a link. It is similar to the getlink function in the script.
  • The "HTML" button creates an HTML rule, which gets the HTML code of a document element. It can also be used to reach elements inside that HTML code: click "Load" to load the code as a separate page and create rules there. It is similar to the gethtml function in the script.
  • The "RegExp" button creates a regular expression rule, which retrieves text matched by a regular expression. It is similar to the getregexp function in the script.

When one of the buttons listed above is active and you click an element of the loaded page, a window with an input field for the rule name appears. The selected element is highlighted with a red dashed line, and the corresponding rule appears in the table below the page with its name, type, and XPath. If you hover the mouse cursor over a rule in the table, the corresponding element is outlined with a green dashed line in the browser.

For a RegExp rule, a special window shows the HTML code of the page together with fields for the rule name, the regular expression, and the number of the group to return.
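
To make the role of the group number concrete, here is a minimal Python sketch. It is not the tool's own getregexp implementation; the helper name extract_group and the sample HTML are assumptions made for illustration only.

```python
import re

def extract_group(html, pattern, group=0):
    """Return the requested group from every match of `pattern` in `html`.

    Group 0 is the whole match; group 1 is the first parenthesized part.
    """
    return [m.group(group) for m in re.finditer(pattern, html)]

# Hypothetical page fragment used only for this example.
page = '<span class="price">12.50 USD</span><span class="price">7.00 USD</span>'

# Group 1 returns only the number captured by the parentheses.
prices = extract_group(page, r'class="price">([\d.]+) USD', 1)
print(prices)
```

With group 0 the same expression would return the whole matched fragment, including the attribute text, which is why the group number matters.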

Several buttons are located above the table at the bottom of the page.

  • To merge rules, tick two rules and press "Merge". Merging compares the two XPath expressions and forms a common path. This is needed to generate a single rule for all elements of a list: create a rule for the first element of the list, then a rule for the second element, then merge the two rules. As a result, all elements of the list are selected.
  • The "Delete" button deletes the selected rules.
  • The "Group Filter" drop-down box filters the rules by a particular group.
  • The "Export rule" button opens a window for managing the export rules of the page.
  • The "Extracted data" button opens a window showing the data extracted from the page, divided into groups.
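
The idea behind "Merge" can be sketched in Python as follows. This is only an illustration of one way to generalize two XPath expressions, assuming absolute paths of equal depth that differ only in positional indexes; the function name merge_xpaths is hypothetical and the tool's actual algorithm may differ.

```python
import re

def merge_xpaths(first, second):
    """Generalize two XPath expressions into one common path.

    Steps that are identical are kept. Steps that differ only by a
    positional index such as li[1] / li[2] lose the index, so the
    merged path matches every sibling of that kind.
    """
    steps_a = first.strip("/").split("/")
    steps_b = second.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depth")

    merged = []
    for step_a, step_b in zip(steps_a, steps_b):
        if step_a == step_b:
            merged.append(step_a)
            continue
        # Drop a trailing [number] predicate and compare the tag names.
        name_a = re.sub(r"\[\d+\]$", "", step_a)
        name_b = re.sub(r"\[\d+\]$", "", step_b)
        if name_a != name_b:
            raise ValueError("paths point to different kinds of elements")
        merged.append(name_a)
    return "/" + "/".join(merged)

# Rules created for the first and the second item of a list.
general = merge_xpaths("/html/body/ul/li[1]/a", "/html/body/ul/li[2]/a")
print(general)  # /html/body/ul/li/a
```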

Every rule in the table has an "Actions" button, which opens a drop-down menu with the options available for that rule.

  • The "Edit XPath" action lets you change the XPath of the rule. You can select another object in the browser at the same time, but the rule type does not change.
  • The "Parameters" action opens a window with the settings of the rule. Each rule type has its own set of settings, which are described below.
  • The "Load" action is available for Link and Form rules. It loads the page at the link's address or the page produced by the form. You can set up rules on the newly opened page too. To go back, press the "Back" button above the page, next to the page address.
  • The "Grouping" action opens a window where you can change the group of the rule. All rules go to group1 by default.
  • The "Filter" action opens a window where you can set data filtering for the rule. For a description of the filtering types, refer to the filter_value function. The results of filtering can be viewed in the "Extracted data" window. The names of rules that have a filter are marked with an asterisk.
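
As a rough mental model of the rules table, the following Python sketch represents a rule with its name, type, XPath, group, and optional filter. The class and field names are assumptions chosen for this example and do not reflect the tool's internal data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    name: str
    rule_type: str                      # e.g. "Text", "Image", "Link", "HTML", "RegExp"
    xpath: str
    group: str = "group1"               # all rules go to group1 by default
    filter_expr: Optional[str] = None   # hypothetical filter setting

    @property
    def display_name(self):
        # Rules with a filter are shown with an asterisk after the name.
        return self.name + "*" if self.filter_expr else self.name

def rules_in_group(rules, group):
    """Mimic the "Group Filter" drop-down: keep only rules of one group."""
    return [rule for rule in rules if rule.group == group]

rules = [
    Rule("title", "Text", "/html/body/h1"),
    Rule("price", "Text", "/html/body/span", filter_expr="not empty"),
    Rule("photo", "Image", "/html/body/img", group="group2"),
]
print([rule.display_name for rule in rules_in_group(rules, "group1")])
```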

Rule Parameters

Each rule type has a number of parameters. Let's look at the whole list of parameters:

  • Just text node. The rule returns only the text nodes that belong directly to the selected node. Otherwise, the text of all nested elements is returned as well.
  • Next element. The rule selects the element that follows the element addressed by the XPath. Type specifies the type of the new rule. For example, you can first select a text element and then specify the next element with the Link type.
  • Select the word. Only particular words are taken from the text, from word k to word n. Counting starts from 1.
  • Concatenate strings into one. All elements of the rule are concatenated into one string using the delimiter specified in the input field.
  • Replace characters. The first field is a regular expression that specifies the characters to search for; the second field is the replacement string.
  • Save the file specified with link. The file that the link points to is saved. Similar to storefile.
  • Add to download queue. The link is added to the download queue. Similar to continue.
  • Save the image file. The image that the link points to is saved. Similar to storefile.
  • Returns only the contents (innerHTML). The HTML code is returned without the enclosing top-level tag, only its contents.
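
Several of the text parameters above can be approximated in a few lines of Python. This is a sketch under assumptions, not the tool's implementation: the helper names are hypothetical, words are assumed to be separated by whitespace, and the range "from k to n" is assumed to include word n.

```python
import re
import xml.etree.ElementTree as ET

def own_text(node):
    """"Just text node": text that belongs directly to the node itself."""
    return (node.text or "") + "".join(child.tail or "" for child in node)

def all_text(node):
    """Default behaviour: text of the node and of all nested elements."""
    return "".join(node.itertext())

def select_words(text, k, n):
    """"Select the word": words k..n, counting from 1 (n assumed inclusive)."""
    return " ".join(text.split()[k - 1:n])

def concatenate(values, delimiter):
    """"Concatenate strings into one" with the given delimiter."""
    return delimiter.join(values)

def replace_chars(text, pattern, replacement):
    """"Replace characters": regular expression plus replacement string."""
    return re.sub(pattern, replacement, text)

# Hypothetical element used only for this example.
node = ET.fromstring("<p>Total: <b>12.50</b> USD</p>")
print(repr(own_text(node)))
print(all_text(node))                                    # Total: 12.50 USD
print(select_words("Price: 12.50 USD per item", 2, 3))   # 12.50 USD
print(concatenate(["red", "green", "blue"], "; "))       # red; green; blue
print(replace_chars("1 299,00", r"\s", ""))              # 1299,00
```

Note that with "Just text node" the number inside the nested b element is skipped, while the default behaviour includes it.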

Export Rule For The Page

The "Export rule" button is located above the table of rules at the bottom of the page. Pressing it opens a window for setting up the export rules of the page. The window shows a table of export rules. Each rule corresponds to a call of the store function in the script. You can add a new rule and edit, delete, or move existing rules up and down.

browser-exp-en.jpg

Each export rule is configured on the basis of an export profile (the Export tab in the "Editor"): the values of variables or scraping rules are assigned to the parameters of the profile. More general parameters are also set, such as the name of the file to save to and the name of the variable that will store the result.

When you click "Add" or "Edit", a new window with the parameters of the export rule is shown. Select the export profile name in the "Select the profile" drop-down box. The parameters table then lists the parameters of the selected profile. For profiles of csv type, you must add the parameters yourself and move them into the required order. For other profiles, the list of parameters is formed automatically, and you only need to set the value of each parameter by selecting a scraping rule or variable from the list. You can also define a constant (a string must be enclosed in quotation marks). You must also specify the name of the file to save to, without the extension (this is not required for profiles of RDB type, because the data is stored in a database and not in a file). In addition, you can set a variable in which the result will be stored. If the data must be added to an array, specify the append-to-array operator, and use the @global directive to make the variable global. For instance, @global prod[].
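
The reason parameter order matters for csv profiles can be shown with a short Python sketch: the order of the parameters becomes the order of the columns in each line. The function name export_csv and the sample records are assumptions for illustration, and the sketch writes no header row because the text does not say whether the tool does.

```python
import csv
import io

def export_csv(rows, columns):
    """Write one CSV line per record, with columns in the given order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # A parameter with no value for this record becomes an empty cell.
        writer.writerow([row.get(column, "") for column in columns])
    return buffer.getvalue()

# Hypothetical extracted records; keys stand for scraping rule names.
records = [
    {"title": "Lamp", "price": "12.50"},
    {"title": "Desk", "price": "80.00"},
]
print(export_csv(records, ["title", "price"]))
print(export_csv(records, ["price", "title"]))
```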

Also, for a profile of RDB type, clicking the name of a parameter marks it with an asterisk, which means that this parameter will be updated if the record is found in the database. To remove the asterisk, click the name of the parameter once again.
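
The effect of the asterisk can be pictured as "insert the record if it is new, otherwise update only the marked parameters". The Python sketch below uses SQLite purely for illustration. How the tool actually looks up an existing record is not described here, so the key column, the table layout, and the function name store_record are all assumptions.

```python
import sqlite3

def store_record(conn, table, key_col, record, update_cols):
    """Insert `record`, or update only `update_cols` if the key already exists.

    `update_cols` plays the role of the parameters marked with an asterisk.
    Table and column names are trusted here; this is a sketch, not
    production code.
    """
    found = conn.execute(
        f"SELECT 1 FROM {table} WHERE {key_col} = ?", (record[key_col],)
    ).fetchone()
    if found is None:
        columns = list(record)
        column_list = ", ".join(columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
            [record[c] for c in columns],
        )
    elif update_cols:
        assignments = ", ".join(f"{c} = ?" for c in update_cols)
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_col} = ?",
            [record[c] for c in update_cols] + [record[key_col]],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, title TEXT, price TEXT)")

# First run inserts the record; second run updates only the marked "price".
store_record(conn, "products", "sku",
             {"sku": "A1", "title": "Lamp", "price": "12.50"}, ["price"])
store_record(conn, "products", "sku",
             {"sku": "A1", "title": "Lamp v2", "price": "11.00"}, ["price"])
print(conn.execute("SELECT sku, title, price FROM products").fetchall())
```

In this sketch the title keeps its original value on the second run because only the price is marked for update.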

browser-exp2-en.jpg

Forms

If you fill in a form on the loaded page and press "Enter", a Form rule with the XPath of the form is generated automatically, and the page returned by the form is loaded. You can go "Back" and submit the form again with the same settings by clicking "Load" in the actions menu.