Web Scrapers Generator BrowserExt

Creating a Web Scraper for Movies from IMDb Top 250 (video)

Consider the example of creating a web scraper for the IMDb Top 250 movie list, which is located at http://www.imdb.com/chart/top. The page contains a table with links to the individual movie pages. We will first collect these links and then load each of them in turn. On each movie page we will collect the title, year, position in the Top 250, actors, and other characteristics. The collected data will be saved in CSV format.

To start, create a web scraper named "imdb top 250" with the start page http://www.imdb.com/chart/top. Once the web scraper is created, the editor opens in browser mode and loads the start page. Now we can create web scraping rules.

We need to extract all the links to the movies from the Top 250 table. To do this, press the "Link" button on the toolbar at the top and click the link in the first row of the table. The link should be highlighted with a red dashed outline. In the window that appears, enter a name for the rule, such as "link", and press Enter. The first rule is created.

Similarly, create a second rule named "link2" for the link in the second row of the table.

The created rules

Next, tick both rules and click "Merge". All the links to the movie pages should now be selected. To see what data the rule extracts, click "Extracted data". Also, do not forget to save.
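One way to picture what merging does: if each rule is thought of as a path to the clicked element, two paths that differ only in the row index can be generalized by dropping that index, so that the result matches every row. The Python sketch below illustrates this idea only. The path syntax and the `merge_paths` helper are hypothetical and are not taken from the tool, which may generalize selections differently.

```python
def merge_paths(path_a, path_b):
    """Generalize two element paths into one that matches both.

    Identical steps are kept. Steps with the same tag name but a
    different index lose the index, so they match every such sibling.
    """
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depth and cannot be merged")

    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            merged.append(a)
            continue
        tag_a = a.split("[")[0]
        tag_b = b.split("[")[0]
        if tag_a != tag_b:
            raise ValueError(f"cannot merge steps {a!r} and {b!r}")
        merged.append(tag_a)  # drop the index: match all siblings with this tag
    return "/" + "/".join(merged)


# Hypothetical paths for the links in the first and second table rows.
link = "/table/tbody/tr[1]/td[2]/a"
link2 = "/table/tbody/tr[2]/td[2]/a"
print(merge_paths(link, link2))  # /table/tbody/tr/td[2]/a
```

Two examples are enough here because they show which step of the path varies; everything else stays fixed.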

We could extract data from the other columns in the same way, but instead we will load the collected links. To do this, click "Load" in the "Actions" menu of the rule in the table at the bottom of the page. This loads the page that the first link in the list points to, i.e. a movie page.
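The collect-then-load step can be sketched in Python with the standard library. This is only an illustration of the idea, not the code the generator produces: the `LinkCollector` class, the sample markup, and the title identifiers in it are invented for the example, and no page is actually fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> found inside a table cell."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "a" and self.in_cell:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the start page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


# Invented sample markup standing in for the chart table.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/title/tt0000001/">First movie</a></td></tr>
  <tr><td><a href="/title/tt0000002/">Second movie</a></td></tr>
</table>
"""

collector = LinkCollector("http://www.imdb.com/chart/top")
collector.feed(SAMPLE_HTML)
for url in collector.links:
    print(url)  # each of these pages would be loaded in turn
```

Resolving each link against the start page matters because tables like this often use relative links, which cannot be loaded on their own.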

Clicking "Load"

On the movie page, create rules of the type "Text" for the title, year, rating, and description of the movie. Each of these rules matches a single element. To select all the actors, for example, first select one actor, then a second, merge the two rules, and change the group of the resulting rule to "group2" or "actors" ("Actions" -> "Grouping"). All the actors will then be selected. "Extracted data" will now show two tabs, one for each group. If everything were in a single group, only one row would be scraped, because the rules for the title, year, and the others have one element each. Genres can be selected into a separate group in the same way.
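The effect of grouping can be pictured with a small Python sketch. It assumes that the rules in one group are aligned element by element into rows and that the row count is limited by the shortest rule, which is consistent with the behaviour described above but is not confirmed by the tool. The `rows_for_group` helper and the sample values are made up for illustration.

```python
def rows_for_group(rules):
    """Turn {rule_name: [values]} into rows, one per aligned element.

    zip() stops at the shortest list, so a one-element rule limits
    the whole group to a single row.
    """
    names = list(rules)
    return [dict(zip(names, values)) for values in zip(*rules.values())]


# Placeholder values, not real extracted data.
extracted = {
    "title": ["Example Movie"],
    "year": ["1994"],
    "actors": ["Actor A", "Actor B", "Actor C"],
}

# Everything in one group: only one row, two actors are lost.
single_group = rows_for_group(extracted)
print(len(single_group))  # 1

# Actors moved to their own group: one row per actor.
main_rows = rows_for_group({"title": extracted["title"], "year": extracted["year"]})
actor_rows = rows_for_group({"actors": extracted["actors"]})
print(len(main_rows), len(actor_rows))  # 1 3
```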

We can also select the movie's position in the chart, i.e. the text "Top 250 #". To keep only the number, select the third word. To do this, go to "Actions" -> "Options" in the rule, tick "Select the word", and enter 3 in both fields (select from the third word to the third). "Extracted data" will then show that only the number is extracted.
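The word selection can be approximated in Python as follows. This is a sketch under an assumption: a "word" is taken to be a run of letters or digits, so the "#" sign is dropped. The tool's exact definition of a word is not stated here, and both the `select_words` helper and the sample position 42 are invented for the example.

```python
import re


def select_words(text, first, last):
    """Return words first..last (1-based, inclusive), joined by spaces."""
    words = re.findall(r"\w+", text)  # runs of letters, digits, underscores
    return " ".join(words[first - 1:last])


print(select_words("Top 250 #42", 3, 3))  # 42
```

Passing the same number for both bounds, as in the options dialog, selects exactly one word.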

The collected data

We are going to save the extracted data to CSV, i.e. all the data about a movie should go on one line, separated by commas. However, the rule for the actors yields several elements. To join the scraped data into one string, go to "Actions" -> "Options" in the rule, check "Concatenate strings into one", and enter a comma followed by a space in the input box, i.e. the string to be inserted between the elements. The extracted data of every rule will then consist of a single element.
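In Python terms, this option behaves like a string join. The sketch below is a minimal illustration; the `concatenate` helper and the actor names are placeholders, and the whitespace trimming is an assumption rather than documented behaviour of the tool.

```python
def concatenate(values, separator=", "):
    """Join all extracted strings into one, with the separator in between."""
    return separator.join(value.strip() for value in values)


actors = ["Actor A", "Actor B", "Actor C"]  # placeholder names
print(concatenate(actors))  # Actor A, Actor B, Actor C
```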

Now click "Export rule", then "Add", and in the window that appears select the csv profile and set the filename to "imdb_top_250". Next, add all the rules stored for the page to the parameters section: click "Add" and select a rule. You can then change the order of the stored data. This sets up the export rule for all the movie pages, because the same rules are executed for each movie page.

The export rule for the page

By default, the data in the CSV file is separated by commas; you can specify a different delimiter in the export profile settings ("Export" tab).
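The role of the delimiter can be seen in a short Python sketch using the standard `csv` module. It is not the export code the tool generates: the `to_csv` helper, the header row, and the sample values are assumptions made for illustration.

```python
import csv
import io


def to_csv(rows, fieldnames, delimiter=","):
    """Serialize a list of dict rows to CSV text with the given delimiter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields = ["position", "title", "year", "actors"]
# Placeholder row; the actors field is already joined with ", ".
row = {
    "position": "42",
    "title": "Example Movie",
    "year": "1994",
    "actors": "Actor A, Actor B",
}

print(to_csv([row], fields))                 # comma-delimited
print(to_csv([row], fields, delimiter=";"))  # semicolon-delimited
```

With the comma delimiter, the joined actors field itself contains commas, so the writer wraps it in quotes. With a semicolon delimiter, no quoting is needed for this row, which is one reason to choose a different delimiter.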

You can also view a different page with the same rules. To do this, go back to the page with the list of movies by clicking or by going to "Loaded pages" and selecting the desired page. Then, in the top menu, select "Load link" and click any of the selected links.

The web scraper is ready, and the program can now be generated. To do this, go to the generation page by clicking "Generate" on the right side of the editor, select "PHP web scraper", and click the "Generate" button.