In this post I’ll talk about the RSelenium package as a tool to navigate websites and how it can be combined with the rvest package to scrape dynamic web pages. To understand this post, you’ll need basic knowledge of rvest, HTML and CSS. You can download the full R script HERE!
Observation: Even if you are not familiar with them, I explain everything I do as fully as possible. For that reason, those who already know this material might find some parts of the post redundant. Feel free to read what you need and skip what you already know!
Let’s compare the following websites:
On IMDb, if you search for a particular movie (for example, this one), you can see that the URL changes, and that URL is different from any other movie (for example, this one). The same behavior is shown if you search for different actors.
On the other hand, if you go to Premier League Player Stats, you will notice that modifying the filters or clicking the pagination button to access more data doesn’t produce changes on the URL.
As I understand it, the first website is an example of a static web page, while the second one is an example of a dynamic webpage.
The following definitions were taken from https://www.pcmag.com/.
Static Web Page: A Web page (HTML page) that contains the same information for all users. Although it may be periodically updated from time to time, it does not change with each user retrieval.
Dynamic Web Page: A Web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.
rvest is a great tool to scrape data from static web pages (check out Creating a Movies Dataset to see an example!).
But when it comes to dynamic web pages, rvest alone can’t get the job done. This is when RSelenium joins the party…
Java
You need to have Java installed. You can use Windows’ Command Prompt to check this. Just type java -version and press Enter. You should see something that looks like this:
If it throws an error, it might mean that you don’t have Java installed. You can download it from HERE.
R Packages
The following packages need to be installed and loaded in order to run the code written in this post.
Starting a Selenium server and browser is pretty straightforward using rsDriver().
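The original code chunk did not survive in this copy of the post, so here is a minimal sketch of what starting the server might look like. The wrapper name start_selenium and the port number are my own choices (any free port should do); rsDriver() and its browser, port, chromever and verbose arguments come from RSelenium.

```r
# Sketch: start a Selenium server plus a Chrome session with RSelenium.
# `start_selenium` is a hypothetical helper; the port is an arbitrary choice.
start_selenium <- function(port = 4567L, chromever = "latest") {
  rD <- RSelenium::rsDriver(
    browser   = "chrome",
    port      = port,
    chromever = chromever,
    verbose   = FALSE
  )
  # rsDriver() returns a list holding the server process and the client
  list(rD = rD, remDr = rD$client)
}

# Usage (opens a browser window, so it is left commented out):
# session <- start_selenium()
# rD      <- session$rD
# remDr   <- session$remDr
```

Keeping both rD and remDr around matters later: remDr drives the browser, while rD is needed to stop the server.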
However, when you run the code above it may produce the following error:
This error is addressed in this StackOverflow post. Basically, it means that there is a mismatch between the ChromeDriver and the Chrome Browser versions. As mentioned in the post, each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683.
The parameter chromever defined this way always uses the latest compatible ChromeDriver version (the code was edited from this StackOverflow post).
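The matching rule described above (same major, minor and build numbers) can be sketched in a few lines of base R. This is not the code from the StackOverflow post; pick_chromedriver is a hypothetical helper and the version strings are only illustrative. In practice the vector of available drivers might come from something like binman::list_versions("chromedriver"), which is an assumption about your setup.

```r
# Pick the newest ChromeDriver whose major.minor.build matches the browser.
# `pick_chromedriver` is a hypothetical helper, not part of RSelenium.
pick_chromedriver <- function(chrome_version, available) {
  # Keep the first three components, e.g. "73.0.3683.103" -> "73.0.3683"
  prefix <- sub("^([0-9]+\\.[0-9]+\\.[0-9]+)\\..*$", "\\1", chrome_version)
  matches <- available[startsWith(available, paste0(prefix, "."))]
  if (length(matches) == 0) {
    stop("No ChromeDriver found for Chrome ", chrome_version)
  }
  # numeric_version compares component by component, so .68 beats .20
  as.character(max(numeric_version(matches)))
}

# Illustrative list of installed driver versions
available <- c("72.0.3626.69", "73.0.3683.20", "73.0.3683.68", "74.0.3729.6")
chromever <- pick_chromedriver("73.0.3683.103", available)
```

The resulting string is what you would pass as chromever to rsDriver().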
After you run rD <- RSelenium::rsDriver(..), if everything worked correctly, a new Chrome window will open. This window should look like this:
You can find more information about rsDriver() in the Basics Vignette.
In this section I’ll apply different methods to the remDr object created above. I’m only going to describe the methods that I think will be used most frequently. For a complete reference, check the package documentation.
- navigate(url): Navigate to a given URL.
- goBack(): Equivalent to hitting the back button on the browser.
- goForward(): Equivalent to hitting the forward button on the browser.
- refresh(): Reload the current page.
- getCurrentUrl(): Retrieve the URL of the current page.
- maxWindowSize(): Set the size of the browser window to maximum. By default, the browser window is small, and some elements of the website you navigate to might not be available right away (I’ll talk more about this in the next section).
- getPageSource()[[1]]: Get the current page source. This method combined with rvest is what makes it possible to scrape dynamic web pages. The document returned by the method can then be read using rvest::read_html(). The method returns a list object, which is the reason for the [[1]].
- open(silent = FALSE): Send a request to the remote server to instantiate the browser. I use this method when the browser closes for some reason (for example, inactivity). If you have already started the Selenium server, you should run this instead of rD <- RSelenium::rsDriver(..) to re-open the browser.
- close(): Close the current session.
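As a rough illustration of how these methods fit together, the sketch below navigates to a page, maximizes the window and hands the rendered source to rvest. The function name browse_and_parse is hypothetical, and remDr is assumed to be the client created when Selenium was started.

```r
# Sketch: load a page in the browser and parse what the browser rendered.
# `browse_and_parse` is a hypothetical helper; `remDr` is the RSelenium client.
browse_and_parse <- function(remDr, url) {
  remDr$navigate(url)
  remDr$maxWindowSize()
  # getPageSource() returns a list; the HTML string is its first element
  html <- remDr$getPageSource()[[1]]
  rvest::read_html(html)
}

# Usage (needs a running browser session):
# page <- browse_and_parse(remDr, "https://www.r-project.org")
# rvest::html_text(rvest::html_element(page, "title"))
```

Because the source is taken from the live browser, it reflects whatever filters or pagination clicks happened before the call.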
Working with Elements
- findElement(using, value): Search for an element on the page, starting from the document root. The located element is returned as an object of the webElement class. To use this function you need some basic knowledge of HTML and CSS (or XPath, etc.). This Chrome extension, called SelectorGadget, might help.
- highlightElement(): Utility function to highlight the current element. This helps to check that you selected the element you wanted.
- sendKeysToElement(): Send a sequence of keystrokes to an element. The keystrokes are sent as a list. Plain text is entered as an unnamed element of the list. Keyboard entries are defined in selKeys and should be listed with the name key.
- clearElement(): Clear a TEXTAREA or text INPUT element’s value.
- clickElement(): Click the element. You can click links, check boxes, dropdown lists, etc.
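To make the sendKeysToElement() convention concrete, here is a small sketch that types a query into a text box and presses Enter. The helper name search_for and the selector in the usage comment are hypothetical; the methods are the ones listed above.

```r
# Sketch: type text into an input and press Enter.
# `search_for` is a hypothetical helper; `css` must match a text input.
search_for <- function(remDr, css, text) {
  box <- remDr$findElement(using = "css selector", value = css)
  box$clearElement()
  # Plain text is an unnamed list element; special keys are named `key`
  box$sendKeysToElement(list(text, key = "enter"))
  invisible(box)
}

# Usage (hypothetical selector, needs a running browser session):
# search_for(remDr, "input#search", "Shane Duffy")
```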
Other Methods
Even though I have never used them, I believe these methods are worth mentioning. For more information, check the package documentation.
In this example, I’ll scrape data from Premier League Player Stats. This is what the website looks like:
You will notice that when you modify the Filters, the URL does not change. So you can’t use rvest alone to dynamically scrape this website. Also, if you scroll down to the end of the table you’ll see that there are pagination buttons. If you click them, you get more data, but again, the URL does not change. Here you can see what those pagination buttons look like:
Observation: Even though choosing a different stat does change the URL, I’ll work as if it didn’t.
Target Dataset
The dataset I want will have the following variables:
- Player: Indicates the player name.
- Nationality: Indicates the nationality of the player.
- Season: Indicates the season the stats correspond to.
- Club: Indicates the club the player belonged to in the season.
- Position: Indicates the player position in the season.
- Stats: One column for each Stat.
For simplicity, I’ll scrape data from seasons 2017/18 and 2018/19, and only from the Goals, Assists, Minutes Played, Passes, Shots and Fouls stats. This means that our dataset will have a total of 11 columns.
Before we start…
In order to run the code below, you have to start a Selenium server and browser, and create the remDr object. This procedure was described in the Start Selenium section.
First Steps
The code chunk below navigates to the website, increases the window size to find elements that might be hidden (for example, when the window is small I can’t see the Filters) and then clicks the “Accept Cookies” button.
You might notice two things:
- The use of the Sys.sleep() function. Here, this function is used to give the website enough time to load. Sometimes, if the element you want to find isn’t loaded when you search for it, the search will produce an error.
- The use of CSS selectors. To select an element using CSS you can press F12 and inspect the page source (right-clicking the element and selecting Inspect will show you which part of that code refers to the element) and/or use this Chrome extension, called SelectorGadget. I recommend learning a little about HTML and CSS and using these two approaches simultaneously. SelectorGadget helps, but sometimes you will need to inspect the source to get exactly what you want. In the next subsection I’ll show how I selected certain elements by inspecting the page source.
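A fixed Sys.sleep() works, but it either waits too long or not long enough. As an alternative to what the post does, the sketch below retries a lookup until it succeeds or a timeout passes. This is a swapped-in technique, not the author’s code; wait_for is a hypothetical helper and the selector in the usage comment is made up.

```r
# Sketch: retry `find()` until it succeeds or `timeout` seconds have passed.
# `wait_for` is a hypothetical helper, not part of RSelenium.
wait_for <- function(find, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement() throws when nothing matches; treat that as "not yet"
    result <- tryCatch(find(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for element")
    Sys.sleep(interval)
  }
}

# Usage (hypothetical selector, needs a running browser session):
# btn <- wait_for(function() {
#   remDr$findElement(using = "css selector", value = ".accept-cookies")
# })
# btn$clickElement()
```

The lookup is passed as a function so the same helper can wrap any findElement() call.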
Getting Values to Iterate Over
I know that in order to get the data, I’ll have to iterate over different lists of values. In particular, I need a list of stats, seasons, and player positions.
We can use rvest to scrape the website and get these lists. To do so, we need to find the corresponding nodes. As an example, after the code I’ll show where I searched for the required information in the page source for the stats and seasons lists.
The code below uses rvest to create the lists we’ll use in the loops.
Observation: Even though in the source we don’t see each word with its first letter uppercased, when we check the dropdown list we see exactly that (for example, we have “Clean Sheets” instead of “Clean sheets”). I was getting an error when trying to scrape this type of stat, and making the names look like the dropdown list solved the issue. That’s the reason behind str_to_title().
Stats
This is my view when I open the stats dropdown list and right click and inspect the Clean Sheets stat.
Taking a closer look at the source where that element is present we get:
Seasons
This is my view when I open the seasons dropdown list and right click and inspect the 2016/17 season.
Taking a closer look at the source where that element is present we get:
As you can see, we have an attribute named data-dropdown-list whose value is FOOTBALL_COMPSEASON and inside we have li tags where the attribute data-option-name changes for each season. This will be useful when defining how to iterate using RSelenium.
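To show how those two attributes can be used, here is a minimal sketch on a hand-written HTML fragment. The fragment only mimics the structure described above (the ul wrapper is my assumption, and the real page source is longer); season_css is a hypothetical helper that builds the selector you could later pass to findElement().

```r
library(rvest)

# Hand-written fragment imitating the seasons dropdown (structure assumed)
html <- '
<ul data-dropdown-list="FOOTBALL_COMPSEASON">
  <li data-option-name="2018/19">2018/19</li>
  <li data-option-name="2017/18">2017/18</li>
  <li data-option-name="2016/17">2016/17</li>
</ul>'

# Collect the value of data-option-name for every season in the dropdown
seasons <- read_html(html) %>%
  html_elements("[data-dropdown-list='FOOTBALL_COMPSEASON'] li") %>%
  html_attr("data-option-name")

# Build the CSS selector for one season, ready for remDr$findElement()
season_css <- function(season) {
  sprintf(
    "[data-dropdown-list='FOOTBALL_COMPSEASON'] li[data-option-name='%s']",
    season
  )
}
```

With a live session, read_html(html) would be replaced by read_html(remDr$getPageSource()[[1]]).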
Positions
The logic behind getting the CSS for the positions is similar to the one described above, so I won’t be showing it.
Webscraping Loop
The code has comments on each step, so you can check it out! But before that, I’ll give an overview of the loop.
Preallocate stats vector. This list will have a length equal to the number of stats to be scraped.
For each stat:
- Click the stat dropdown list
- Click the corresponding stat
- Preallocate seasons vector. This list will have a length equal to the number of seasons to be scraped.
- For each season inside stat:
  - Click the seasons dropdown list
  - Click the corresponding season
  - Preallocate positions vector. This list will have length = 4 (positions are fixed: GOALKEEPER, DEFENDER, MIDFIELDER and FORWARD).
  - For each position inside season inside stat:
    - Click the position dropdown list
    - Click the corresponding position
    - Check that there is a table with data (if not, go to next position)
    - Scrape the first table
    - While “Next Page” button exists:
      - Click “Next Page” button
      - Scrape new table
      - Append new table to table
    - Change stat colname and add position data
    - Go to the top of the website
  - Rowbind each position table
  - Add season data
- Rowbind each season table
- Assign the table to the corresponding stat element.
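The innermost part of the outline (scrape the first table, then keep clicking “Next Page” and appending) can be sketched independently of the website. This is a simplified stand-in, not the post’s loop: collect_pages is a hypothetical helper that takes the three actions as functions, and the selector named in the usage comment is made up.

```r
# Sketch: scrape a paginated table by repeating scrape -> click next.
# `collect_pages` is a hypothetical helper; the three arguments are functions.
collect_pages <- function(scrape_table, has_next, click_next, max_pages = 100) {
  pages <- list(scrape_table())            # scrape the first table
  while (has_next() && length(pages) < max_pages) {
    click_next()                           # click "Next Page"
    pages[[length(pages) + 1]] <- scrape_table()
  }
  do.call(rbind, pages)                    # append all tables row-wise
}

# Usage with RSelenium and rvest (selectors are hypothetical):
# next_css <- ".paginationNextContainer"
# stat_table <- collect_pages(
#   scrape_table = function() {
#     rvest::html_table(rvest::html_element(
#       rvest::read_html(remDr$getPageSource()[[1]]), "table"))
#   },
#   has_next   = function() length(remDr$findElements("css selector", next_css)) > 0,
#   click_next = function() remDr$findElement("css selector", next_css)$clickElement()
# )
```

The max_pages cap is a safety net in case the “Next Page” check never turns false.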
The result of this loop is a populated list with a number of elements equal to the number of stats scraped. Each of these elements is a tibble.
This may take some time to run, so you can choose fewer stats to try it out.
As I mentioned, you can check the code!
Observation: Be careful when you add more stats to the loop. For example, Clean Sheets has the Position filter hidden, so the code should be modified (for example, by adding some “if” statement).
Data Wrangling
Finally, some data wrangling is needed to create our dataset. data_topStats is a list with 6 elements, and each one of those elements is a tibble. The next code chunk removes the Rank column from each tibble, reorders the columns and then makes a full join by all the non-stat variables using reduce() (the reason behind this full join is that not all players have all stats). In the last line of code I replace NA values with zero in the stats variables.
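The wrangling steps just described might look roughly like the sketch below. It is not the original chunk: the input is a made-up, two-stat stand-in for data_topStats with placeholder names, and it assumes dplyr 1.0 or later for across().

```r
library(dplyr)
library(purrr)
library(tidyr)

# Non-stat variables, used both for column order and as join keys
keys <- c("Season", "Position", "Club", "Player", "Nationality")

# Made-up stand-in for data_topStats (two stats, placeholder values)
data_topStats <- list(
  Goals = tibble(
    Rank = 1:2, Player = c("Player A", "Player B"),
    Club = c("Club X", "Club Y"), Nationality = c("Country 1", "Country 2"),
    Goals = c(5, 4), Position = "DEFENDER", Season = "2018/19"
  ),
  Assists = tibble(
    Rank = 1L, Player = "Player A", Club = "Club X",
    Nationality = "Country 1", Assists = 1,
    Position = "DEFENDER", Season = "2018/19"
  )
)

dataset <- data_topStats %>%
  map(~ select(.x, -Rank)) %>%                         # drop Rank
  map(~ select(.x, all_of(keys), everything())) %>%    # reorder columns
  reduce(full_join, by = keys) %>%                     # one column per stat
  mutate(across(-all_of(keys), ~ replace_na(.x, 0)))   # missing stat -> 0
```

Player B has no Assists row in the toy input, so the full join leaves an NA there that the last line turns into zero.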
This is what the data looks like.
Season | Position | Club | Player | Nationality | Goals | Assists | Minutes Played | Passes | Shots | Fouls |
---|---|---|---|---|---|---|---|---|---|---|
2018/19 | DEFENDER | Brighton and Hove Albion | Shane Duffy | Ireland | 5 | 1 | 3088 | 1305 | 37 | 22 |
2018/19 | DEFENDER | AFC Bournemouth | Nathan Aké | Netherlands | 4 | 0 | 3412 | 1696 | 25 | 28 |
2018/19 | DEFENDER | Cardiff City | Sol Bamba | Cote D’Ivoire | 4 | 1 | 2475 | 550 | 22 | 35 |
2018/19 | DEFENDER | Wolverhampton Wanderers | Willy Boly | France | 4 | 0 | 3168 | 1715 | 24 | 29 |
2018/19 | DEFENDER | Everton | Lucas Digne | France | 4 | 4 | 2966 | 1457 | 34 | 39 |
2018/19 | DEFENDER | Wolverhampton Wanderers | Matt Doherty | Ireland | 4 | 5 | 3147 | 1399 | 46 | 30 |
The framework described here is an approach to working in parallel with RSelenium.
First, we load the libraries we need.
The function defined below stops Selenium on each core.
We determine the number of cores we’ll use. In this example, I use four cores.
We have to list the ports that are going to be used to start Selenium.
We use clusterApply() to start Selenium on each core. Pay attention to the use of the superassignment operator (<<-). When you run this function, you will see that four Chrome windows are opened.
This is an example of pages that we will open in parallel. This list will change depending on the particular scenario.
Use parLapply() to work in parallel. When you run this, you will see that each browser opens one website, and one is still blank. This is a simple example; I haven’t defined any scraping, but of course you can!
When you are done, stop Selenium on each core and stop the cluster.
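The whole sequence can be sketched with the parallel package alone. To keep the example runnable without a browser, each worker stores a plain value where the real code would store its Selenium session; the comments mark where the RSelenium calls would go. It uses two workers instead of four, and the ports and URLs are arbitrary placeholders.

```r
library(parallel)

ports <- c(4567L, 4568L)            # one port per worker (arbitrary values)
cl <- makeCluster(length(ports))

# <<- stores the value in each worker's global environment, so it is
# still there when later calls run on the same worker.
invisible(clusterApply(cl, ports, function(port) {
  # With Selenium, each worker would run something like:
  #   rD    <<- RSelenium::rsDriver(browser = "chrome", port = port)
  #   remDr <<- rD$client
  worker_port <<- port
  NULL
}))

# Check what each worker kept (returned in worker order)
stored <- unlist(clusterEvalQ(cl, worker_port))

urls <- c("https://www.r-project.org", "https://cran.r-project.org")
results <- parLapply(cl, urls, function(url) {
  # With Selenium: remDr$navigate(url), then scrape the page source
  paste("worker on port", worker_port, "would open", url)
})

# With Selenium, first run clusterEvalQ(cl, { remDr$close(); rD$server$stop() })
stopCluster(cl)
```

Without the superassignment, worker_port (or remDr) would vanish as soon as the setup function returned, and the parLapply() call would not find it.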
Observation: Sometimes, when working in parallel some of the browsers close for no apparent reason (or at least a reason that I don’t understand).
Workaround browser closing for no reason
Consider the following scenario: your loop navigates to a certain website, clicks some elements and then gets the page source to scrape using rvest. If in the middle of that loop the browser closes, you will get an error (for example, it won’t navigate to the website, or the element won’t be found). You can work around these errors using tryCatch(), but when you skip the iteration where the error occurred and try to navigate to the website in the following iteration, an error will occur again (because there is no browser open!).
You could, for example, use remDr$open() at the beginning of the loop and remDr$close() at the end, but I think that will open and close many browsers and make the process slower.
So I created this function that handles part of the problem (even though the iteration where the browser closed will not finish, the next one will and the process won’t stop).
It basically tries to get the current URL using remDr$getCurrentUrl(). If no browser is open, this will throw an error, and if we get an error, it will open a browser.
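A minimal sketch of that idea follows. The name ensure_browser and the TRUE/FALSE return value are my own additions; the logic is the one described above.

```r
# Sketch: re-open the browser if the session is gone.
# `ensure_browser` is a hypothetical name for the function described above.
ensure_browser <- function(remDr) {
  tryCatch(
    {
      remDr$getCurrentUrl()      # throws if no browser is open
      invisible(TRUE)            # browser was already alive
    },
    error = function(e) {
      remDr$open(silent = TRUE)  # start a fresh browser session
      invisible(FALSE)           # browser had to be re-opened
    }
  )
}

# Usage at the top of each loop iteration:
# for (url in urls) {
#   ensure_browser(remDr)
#   remDr$navigate(url)
# }
```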
Closing Selenium
Sometimes, even if the browser window is closed, when you re-run rD <- RSelenium::rsDriver(..) you might encounter an error like:
This means that the connection was not completely closed. You can execute the lines of code below to stop Selenium.
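The original lines are missing from this copy, so the following is only a sketch of one way to shut things down. stop_selenium is a hypothetical wrapper; rD$server$stop() is the stop function stored in the object returned by rsDriver(), and the optional taskkill call is a blunt Windows-only fallback that ends every Java process, not just Selenium.

```r
# Sketch: close the browser session and stop the Selenium server.
# `stop_selenium` is a hypothetical helper; `rD` is the rsDriver() result.
stop_selenium <- function(rD, kill_java = FALSE) {
  # The session may already be gone, so ignore errors when closing it
  try(rD$client$close(), silent = TRUE)
  stopped <- rD$server$stop()
  if (kill_java && .Platform$OS.type == "windows") {
    # Last resort: kills ALL Java processes on the machine
    system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
  }
  invisible(stopped)
}

# Usage:
# stop_selenium(rD)
```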
You can check this StackOverflow post for more information.
Wrapper Functions
You can create functions in order to type less. Suppose that you navigate to a certain website where you have to click one link that sends you to a site with different tabs. You can use something like this:
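The original function is not included in this copy; a wrapper in that spirit might look like the sketch below. Everything specific in it is hypothetical: the name open_tab, its arguments, and the selectors in the usage comment.

```r
# Sketch: navigate to a page, follow one link, then open a given tab.
# `open_tab` and its arguments are hypothetical names for illustration.
open_tab <- function(remDr, url, link_css, tab_css, wait = 2) {
  remDr$navigate(url)
  Sys.sleep(wait)   # give the page time to load
  remDr$findElement(using = "css selector", value = link_css)$clickElement()
  Sys.sleep(wait)   # give the linked page time to load
  remDr$findElement(using = "css selector", value = tab_css)$clickElement()
  # getCurrentUrl() returns a list; return the URL string invisibly
  invisible(remDr$getCurrentUrl()[[1]])
}

# Usage (hypothetical URL and selectors):
# open_tab(remDr, "https://example.com", "a.details", "li.tab-stats")
```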
Observation: this function is theoretical, it won’t work if you run it.
I won’t show it here, but you can create functions to find elements, check if an element exists in the DOM (Document Object Model), try to click an element if it exists, parse the data table you are interested in, etc. You can check this StackOverflow post for examples.
The following list contains different videos, posts and StackOverflow posts that I found useful when learning and working with RSelenium.
- The ultimate online collection toolbox: Combining RSelenium and Rvest (Part I and Part II). If you know about rvest and just want to learn about RSelenium, I’d recommend watching Part II. It gives an overview of what you can do when combining RSelenium and rvest. It has nice and practical examples. As a final comment regarding these videos, I wouldn’t pay too much attention to setting up Docker because at least I didn’t need to work that way in order to get RSelenium going. In fact, at least now, getting it going is pretty straightforward.
- RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium. I found this post really useful when trying to set up RSelenium. The solution given in this StackOverflow post, which is mentioned in the article, seems to be enough.
- Dungeons and Dragons Web Scraping with rvest and RSelenium. This is a great post! It starts with a general tutorial for scraping with rvest and then dives into RSelenium. If you are not familiar with rvest, you can start here.
- RSelenium Tutorial. This post might be helpful too.
- RSelenium Package Website. It has more advanced and detailed content. I just took a look at the Basics Vignette.
- These StackOverflow posts helped me when working with dropdown lists:
- RSelenium: server signals port is already in use. This post gives a solution to the “port already in use” problem. Even though it is not marked as best, the last line of code of the second answer is useful.
- Data Scraping in R. Thanks to this post I found the Premier League Stats website, which was exactly what I was looking for to write a post about RSelenium. Also, I took some hints from the answer marked as best.
- CSS Tutorials: