
Web Scraping Dynamic Websites with Java and Selenium

https://youtu.be/PF0iyeDmu9E

Get the starter code for this tutorial:



Web Scraping Stock Data with Selenium and Python

https://youtu.be/8aedS8sQ9lg

First, install Selenium: run pip install selenium

Install chromedriver: https://chromedriver.chromium.org/

If you’re not on a Mac, follow this guide: https://zwbetz.com/download-chromedriver-binary-and-add-to-your-path-for-automated-functional-testing/

Make sure that you choose the version that matches the version of Chrome that you currently have installed. This will most likely be the stable release.

Then run the following:

cd ~/Downloads (or whichever directory you saved the unzipped chromedriver to)

mkdir ~/bin

mv chromedriver ~/bin

cd ~/bin

chmod +x chromedriver

vi ~/.zshrc (or vi ~/.bash_profile if you don’t have zsh installed)

Add this line: export PATH="$HOME/bin:$PATH"

source ~/.zshrc or source ~/.bash_profile

If you’re on macOS Catalina or later, you will need to allow chromedriver to execute by going to the Security & Privacy section of System Preferences.
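
If the driver is still blocked after that, a common workaround (assuming the quarantine attribute macOS adds to downloaded files is what’s blocking it) is to remove that attribute from the binary:

xattr -d com.apple.quarantine ~/bin/chromedriver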

Get the xPath finder chrome extension here: https://chrome.google.com/webstore/detail/xpath-finder/ihnknokegkbpmofmafnkoadfjkhlogph/related?hl=en

Here is the code for this video:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time

driver = webdriver.Chrome()
driver.get('https://www.londonstockexchange.com/prices-and-markets/stocks/stocks-and-prices.htm')

wait = WebDriverWait(driver, 10)

def search(ticker):
    # Type the ticker into the search box, wait for the autocomplete
    # suggestions to appear, then pick one with the arrow keys and Enter.
    search_el = driver.find_element(By.ID, 'head_solr_search_input')
    search_el.send_keys(ticker)
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'ui-menu-item')))
    search_el.send_keys(Keys.DOWN)
    search_el.send_keys(Keys.DOWN)
    search_el.send_keys(Keys.ENTER)


def get_stock_url():
    # After the search, the browser has navigated to the stock's page.
    return driver.current_url

def get_last_price():
    # Absolute XPath copied from the browser; brittle, but fine for this demo.
    return driver.find_element(By.XPATH, "/html/body/div[4]/div/div[3]/div[2]/div[1]/div[5]/div[1]/table/tbody/tr[1]/td[2]").text

search('UU') 
print(get_stock_url())
print(get_last_price())
time.sleep(3)

driver.quit()
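
If you’d rather not modify your PATH at all, newer versions of Selenium (4 and later) let you point directly at the chromedriver binary instead. A minimal sketch, assuming you moved chromedriver to ~/bin as described above:

import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at a specific chromedriver binary rather than relying on PATH.
driver = webdriver.Chrome(service=Service(os.path.expanduser('~/bin/chromedriver')))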

Web scraping files from websites with Python

https://www.youtube.com/watch?v=0v1kp2JTZMA

In this video, you will learn how to scrape and download files from websites using Python, Beautiful Soup and the requests module.

As someone who really likes podcasts, I often find myself searching for the entire archive of a podcast online.

The problem is many of these podcasts have hundreds of files. 

Manually downloading these files would take forever!

Web scraping for beginners with python

Obviously, there has to be a better solution. 

Remember that scene from The Social Network, where Mark Zuckerberg downloads a whole load of files with ‘a little wget magic’?

I always thought being able to download files like that would be really useful. 

But instead of using the wget command, let’s use Python for fun. 

Let’s start by importing some modules we’ll need:

from bs4 import BeautifulSoup as bs
import requests

In this example, we’re going to be downloading the archive episodes from the Pure Humbug podcast here: http://purehumbug.com/shows/2006/1-99/

This is a really simple site to scrape. 

One of the things to notice is that the mp3 links on the website are relative, meaning they don’t include the full URL. For example, ‘/shows/2006/1-99/001_Get_This_030406_No_Guest_Host.mp3’. So, to download the files, we are going to need to concatenate the domain name ‘http://purehumbug.com’ with each link.
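
As an aside: plain concatenation works here because every link on the page is relative and starts with a slash. If you ever scrape a page that mixes relative and absolute links, the standard library’s urllib.parse.urljoin handles the joining for you. A quick sketch:

from urllib.parse import urljoin

# urljoin resolves a (possibly relative) link against the page it came from.
print(urljoin('http://purehumbug.com/shows/2006/1-99/',
              '/shows/2006/1-99/001_Get_This_030406_No_Guest_Host.mp3'))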

Let’s add some global variables to make things easy to change: 

DOMAIN = 'http://purehumbug.com'
URL = 'http://purehumbug.com/shows/2006/1-99/'
FILETYPE = '.mp3'

Now, let’s create a little function to give us a Beautiful Soup object which we can use to find all the mp3 links on this page.

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

All this method is doing is constructing and returning a Beautiful Soup object, by passing two parameters. Firstly, the HTML of the page, which we get by calling .text on the response object returned by the requests.get method. Secondly, we specify the parser we want to use. In this case, ‘html.parser’.

Once we’ve created that we can use it to find all the links on the web page and iterate through them. 

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')

Here we call the .find_all method on the soup object returned from the get_soup function we just created, specifying that we only want the ‘a’ elements. (That’s the (‘a’) part.) On the second line, we pull the relative link out of the element’s href attribute.

Next we should make sure that we are only processing files which are of the type we want to download. In this case, mp3 files. 

    if FILETYPE in file_link:
        print(file_link)

We’ll output the link here just so we can verify that the script is actually doing something. 

Next let’s download the file. 

        with open(link.text, 'wb') as file:
            response = requests.get(DOMAIN + file_link)
            file.write(response.content)

All we are doing here is saying we want to give the file the name of whatever is in the text section of the a element, which in this case is the show name. 

‘wb’ here means write in binary mode.

On the second line, we create another response object using the requests.get method. This time taking the link to the file and concatenating it with the domain name, because the file links in this case are relative as mentioned above. 

Then finally, we write the file to the file system. 

And we’re done!

If you have any questions or any feedback (definitely welcome!), leave them below or in the video’s comments.

Full code below:

from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'http://purehumbug.com'
URL = 'http://purehumbug.com/shows/2006/1-99/'
FILETYPE = '.mp3'

def get_soup(url):
    # Fetch the page and parse its HTML into a Beautiful Soup object.
    return bs(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    # Only process anchors that actually have an href and point at an mp3.
    if file_link and FILETYPE in file_link:
        print(file_link)
        # Name the file after the link text (the show name) and save the bytes.
        with open(link.text, 'wb') as file:
            response = requests.get(DOMAIN + file_link)
            file.write(response.content)

As you can see, we don’t need a lot of code to do this, but it can save so much time!
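
One caveat: requests.get as used above reads each whole mp3 into memory before writing it out, which is fine for podcast episodes. If you adapt this for very large files, you may prefer to stream the download instead. Here’s a minimal sketch (the function name and chunk size are just illustrative choices):

import requests

def download(url, filename, chunk_size=8192):
    # Stream the response so the whole file is never held in memory at once.
    with requests.get(url, stream=True) as response:
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)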


How to Login and Scrape Websites with Python

https://www.youtube.com/watch?v=SA18JCBtlXY
Watch and code along with the video

In this tutorial, we are going to cover web scraping a website which has a login. We are going to use the Python module requests to log in to the website and Beautiful Soup 4 to parse the HTML.

The first thing to do is inspect the website that we want to log in to and scrape. The example website I will be using in this tutorial is https://shorttracker.co.uk/

Find the login form and then open the Chrome developer tools.

Click on the Network tab, and make sure you click ‘Preserve log’. This, as you would expect, preserves the network log, which is important because logging in to a website almost always redirects the user once they are authenticated, meaning the network log would be lost when the new page loads, unless we check the Preserve log checkbox.

Check the Preserve log checkbox

Then log in to the website.

Once you’ve logged in, look through the requests in the Network tab for something that looks like ‘login’. The actual name of this will vary based on the website you’re scraping.

The Login Route

The request method is going to be POST. The status code will either be 200 or something in the 300s; 302 just means that the website redirects the user after they log in.

Make a note of the request URL; we are going to need it.

Inspect the headers.

You’re going to want to include the ‘User-Agent’ header in your code.

'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'

There may be other headers which you need to include in the request, but it’s difficult to say what they will be, because each website is different. See the code at the bottom of this page to see what else I included in the headers for this particular website.

Scroll down to the bottom of the login route information. You will see something like ‘Form Data’. You’ll know it’s what you’re looking for because it will have your username and your password in there. Something like this:

The Form Data sent to authenticate the user.

Notice the csrfmiddlewaretoken. That’s used to prevent cross-site request forgery attacks. It’s something we are going to need, but we’re not going to copy and paste the actual token. We just need the key ‘csrfmiddlewaretoken’.

Make a note of all the keys in this form (except ‘next’; we don’t need it for this tutorial). Make sure that you get the keys exactly right; if you don’t, you won’t be able to log in.

Let’s get down to some code!

First, let’s import the modules we are going to use and define some constants: the URLs and the headers.

from bs4 import BeautifulSoup as bs
import requests

URL = 'https://shorttracker.co.uk/'
LOGIN_ROUTE = 'accounts/login/'

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36', 'origin': URL, 'referer': URL + LOGIN_ROUTE}

Easy peasy.

Next, we are going to want to create a session object. This session object is going to hold all the information we need to persist the login when we visit different pages. Normally, when you browse a website that you’ve logged into via a web browser, you don’t have to log in each time you visit a new page; this is because the browser retains your session data. This is what we are going to do with the session object. Every request we make will be made from this object.

s = requests.session()

In the example website used in this tutorial, we have to get hold of the csrf token; if we don’t, our requests are going to start getting blocked by the server.

To do that, all we need to do is make a standard GET request to the website and take the token we receive from the response to that request. Then we can use it in all subsequent requests.

Note that you might not have to do this step, as not all websites use csrf tokens. You’ll know if you do, because you’ll see it in the form data when you log in, in the network requests above.

So a simple get request and then pull the csrftoken from the cookies of that request:

csrf_token = s.get(URL).cookies['csrftoken']

Wunderbar!

Now, we just need to construct the login payload that we are going to post to the login route of the website we want to scrape. Again, make sure that the keys you use here match the keys you saw in the form data exactly, otherwise you won’t be able to log in.

login_payload = {
        'login': '<yourloginhere>',        # replace with your username
        'password': '<yourpasswordhere>',  # replace with your password
        'csrfmiddlewaretoken': csrf_token
        }

Now we’re ready to login.

We’re just going to create a simple POST request which goes to the login route, includes the headers we defined earlier, and of course includes our login payload.

login_req = s.post(URL + LOGIN_ROUTE, headers=HEADERS, data=login_payload)

Now, we should be logged in. We can verify this by printing the status code of the request. If we have logged in, it should be 200 (requests follows the redirect for us, so we see the status of the final page rather than the 302).

print(login_req.status_code)
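
Bear in mind that a failed login often comes back as a 200 as well, because the site simply re-renders the login form. A slightly more robust check, if you want one, is to look at where the redirect left you; a small sketch:

# requests follows redirects, so login_req.url is the page we ended up on.
print(login_req.url)
print(LOGIN_ROUTE not in login_req.url)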

I’ve noticed that the requests session object doesn’t always retain the cookies that it should. So, to make sure we keep them, let’s save them so we can include them in any subsequent requests if we need to.

cookies = login_req.cookies
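
If you do find a later request losing its logged-in state, you can pass those saved cookies explicitly; a quick sketch (the session normally handles this for you):

# Explicitly send the saved login cookies with a follow-up request.
page = s.get(URL + 'watchlist', headers=HEADERS, cookies=cookies)
print(page.status_code)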

Now you’re all logged in and can scrape to your heart’s content.

I’m just going to use Beautiful Soup to get a table from the watchlist page of the example website, just to show that this example works.

soup = bs(s.get(URL + 'watchlist').text, 'html.parser')
tbody = soup.find('table', id='companies')
print(tbody)

And… we’re done. At least for this tutorial. Hopefully, you have found this useful.

If you have any questions leave them below or on the video’s comments.

Here is the complete code for this tutorial.

from bs4 import BeautifulSoup as bs
import requests

URL = 'https://shorttracker.co.uk/'
LOGIN_ROUTE = 'accounts/login/'

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36', 'origin': URL, 'referer': URL + LOGIN_ROUTE}

s = requests.session()

csrf_token = s.get(URL).cookies['csrftoken']

login_payload = {
        'login': '<yourloginhere>',        # replace with your username
        'password': '<yourpasswordhere>',  # replace with your password
        'csrfmiddlewaretoken': csrf_token
        }

login_req = s.post(URL + LOGIN_ROUTE, headers=HEADERS, data=login_payload)

print(login_req.status_code)

cookies = login_req.cookies

soup = bs(s.get(URL + 'watchlist').text, 'html.parser')
tbody = soup.find('table', id='companies')
print(tbody)

Introduction to Scrapy | Web Scraping for Beginners

https://youtu.be/CoYTfF2KbFg

Here is the website we scraped: https://uk.advfn.com/stock-market/london/hsbc-HSBA/share-chat

Below is the code from this video.

import scrapy

hsbc_url = 'https://uk.advfn.com/stock-market/london/hsbc-HSBA/share-chat?page='

class StockSpider(scrapy.Spider):
    name = 'stocks'

    def start_requests(self):
        # Generate one request per chat page, from page 355 down to page 1.
        urls = [hsbc_url + str(i) for i in range(355, 0, -1)]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Pair each chat message with its author: the message text comes from
        # the a.msgStyle links, and the author is pulled out of the profile URL
        # embedded in the third cell of each row of the chat table.
        for chat, author in zip(response.css('a.msgStyle::text').getall(),
                                response.css('table.TableElement').css('td:nth-of-type(3)')):
            yield {
                'chat_message': chat,
                'author': author.re(r'profile\/[^"]+')[0].split('profile/')[-1]
            }
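
This spider isn’t part of a full Scrapy project, so you can run it on its own with scrapy runspider and write the scraped items straight to a file. Assuming you saved the code above as stocks_spider.py (the filename is just an example):

scrapy runspider stocks_spider.py -o chats.json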


Scraping Tabular Data from Websites

Website scraped in this example: https://www.sunrise-and-sunset.com/en/sun/united-kingdom/London

Code from this video.

import requests
from bs4 import BeautifulSoup as bs
from datetime import date

URL = 'https://www.sunrise-and-sunset.com/en/sun/united-kingdom/London'

TODAY = str(date.today())

soup = bs(requests.get(URL).text, 'html.parser')

# The sunrise/sunset data lives in a striped table; grab its body rows.
table = soup.find('table', attrs={'class': 'table table-striped table-hover well'})
rows = table.find('tbody').find_all('tr')

for row in rows[:-1]:
    cells = row.find_all('td')
    date_cell, sunrise, sunset, day_length = cells[0], cells[1], cells[2], cells[3]

    date_text = date_cell.text
    date_value = date_cell.find('time')['datetime']

    # Only print the row whose machine-readable date matches today's date.
    if date_value == TODAY:
        sunrise = sunrise.find('time')['datetime']
        sunset = sunset.find('time')['datetime']
        day_length = day_length.find('time').text

        print(date_text)
        print(date_value)
        print(sunrise)
        print(sunset)
        print(day_length)
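
If you want more than just today’s row, the same table can be written straight to a CSV file with the standard library’s csv module. A minimal sketch (the output filename is just an example):

import csv
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.sunrise-and-sunset.com/en/sun/united-kingdom/London'

soup = bs(requests.get(URL).text, 'html.parser')
table = soup.find('table', attrs={'class': 'table table-striped table-hover well'})

# Write one CSV row per table row: date, sunrise, sunset, day length.
with open('sun_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'sunrise', 'sunset', 'day_length'])
    for row in table.find('tbody').find_all('tr')[:-1]:
        cells = row.find_all('td')
        writer.writerow([
            cells[0].find('time')['datetime'],
            cells[1].find('time')['datetime'],
            cells[2].find('time')['datetime'],
            cells[3].find('time').text,
        ])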