Part 2: Scrape HTML Content From a Page
Now that you have an idea of what you're working with, it's time to get started using Python. First, you'll want to get the site's HTML code into your Python script so that you can interact with it. For this task, you'll use Python's requests library. Type the following in your terminal to install it:
$ pip3 install requests
Then open up a new file in your favorite text editor. All you need to retrieve the HTML are a few lines of code:
Code:
import requests

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
This code performs an HTTP request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.
If you take a look at the downloaded content, then you'll notice that it looks very similar to the HTML you were inspecting earlier with developer tools. To improve the structure of how the HTML is displayed in your console output, you can print the object's .content attribute with pprint().
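As a minimal sketch of that idea (using a hard-coded byte string here so the example runs offline; with the real response you'd pass page.content instead):

```python
from pprint import pprint

# Stand-in for page.content -- the raw bytes the server sends back.
content = (
    b"<html><head><title>Job Search</title></head>"
    b"<body><h2 class='title'>Python Developer</h2></body></html>"
)

# pprint() wraps long output at sensible points, which makes a big
# blob of HTML easier to scan in a console than plain print().
pprint(content)
```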
Static Websites
The website you're scraping in this tutorial serves static HTML content. In this scenario, the server that hosts the site sends back HTML documents that already contain all the data you'll get to see as a user.
When you inspected the page with developer tools earlier on, you discovered that a job posting consists of the following long and messy-looking HTML:
Code:
<section class="card-content" data-jobid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="mux-company-logo thumbnail"></div>
<div class="summary">
<header class="card-header">
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="4" data-m_impr_j_coc="" data-m_impr_j_jawsid="371676273" data-m_impr_j_jobid="0" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="30.1882" data-m_impr_j_lid="619" data-m_impr_j_long="-95.6732" data-m_impr_j_occid="11838" data-m_impr_j_p="3" data-m_impr_j_postingid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" data-m_impr_j_pvc="4496dab8-a60c-4f02-a2d1-6213320e7213" data-m_impr_s_t="t" data-m_impr_uuid="0b620778-73c7-4550-9db5-df4efad23538" href="https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" onclick="clickJobTitle('plid=619&pcid=4&poccid=11838','Software Developer',''); clickJobTitleSiteCat('{"events.event48":"true","eVar25":"Python Developer","eVar66":"Monster","eVar67":"JSR2CW","eVar26":"_LanceSoft Inc","eVar31":"Woodlands_WA_","prop24":"2019-07-02T12:00","eVar53":"1500127001001","eVar50":"Aggregated","eVar74":"regular"}')">Python Developer
</a></h2>
</header>
<div class="company">
<span class="name">LanceSoft Inc</span>
<ul class="list-inline">
</ul>
</div>
<div class="location">
<span class="name">
Woodlands, WA
</span>
</div>
</div>
<div class="meta flex-col">
<time datetime="2017-05-26T12:00">2 days ago</time>
<span class="mux-tooltip applied-only" data-mux="tooltip" title="Applied">
<i aria-hidden="true" class="icon icon-applied"></i>
<span class="sr-only">Applied</span>
</span>
<span class="mux-tooltip saved-only" data-mux="tooltip" title="Saved">
<i aria-hidden="true" class="icon icon-saved"></i>
<span class="sr-only">Saved</span>
</span>
</div>
</div>
</section>
It can be difficult to wrap your head around such a long block of HTML code. To make it easier to read, you can use an HTML formatter to automatically clean it up a little more. Good readability helps you better understand the structure of any code block. While it may or may not help to improve the formatting of the HTML, it's always worth a try.
Note: Keep in mind that every website will look different. That's why it's necessary to inspect and understand the structure of the site you're currently working with before moving forward.
The HTML above definitely has a few confusing parts in it. For example, you can scroll to the right to see the large number of attributes that the <a> element has. Luckily, the class names on the elements that you're interested in are relatively straightforward:
class="title": the title of the job posting
class="company": the company that offers the position
class="location": the location where you'd be working
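To preview how you might use those class names once Beautiful Soup enters the picture, here's a minimal sketch over a stripped-down version of the job card shown above (the real markup carries many more attributes):

```python
from bs4 import BeautifulSoup

# A stripped-down job card using the same class names as the real page.
html = """
<section class="card-content">
  <h2 class="title"><a href="#">Python Developer</a></h2>
  <div class="company"><span class="name">LanceSoft Inc</span></div>
  <div class="location"><span class="name">Woodlands, WA</span></div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's
# `class` keyword.
title = soup.find("h2", class_="title").get_text(strip=True)
company = soup.find("div", class_="company").get_text(strip=True)
location = soup.find("div", class_="location").get_text(strip=True)

print(title, "|", company, "|", location)
```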
By now, you've successfully harnessed the power and user-friendly design of Python's requests library. With only a few lines of code, you managed to scrape static HTML content from the web and make it available for further processing.
However, there are a few more challenging situations you might encounter when you're scraping websites. Before you begin using Beautiful Soup to pick out the relevant information from the HTML that you just scraped, take a quick look at two of these situations.
Hidden Websites
Some pages contain information that's hidden behind a login. That means you'll need an account to be able to see (and scrape) anything from the page. The process of making an HTTP request from your Python script is different from how you access a page in your browser, so just because you can log in to the page through your browser doesn't mean you'll be able to scrape it with your Python script.
However, there are some advanced techniques that you can use with requests to access the content behind logins. These techniques will allow you to log in to websites while making the HTTP request from within your script.
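One such technique relies on requests.Session, which persists cookies across requests. The sketch below runs offline: the cookie is set by hand as a stand-in for whatever a real login endpoint would send, and example.com is a placeholder URL.

```python
import requests

session = requests.Session()

# Stand-in for the cookie a real login endpoint would set after you
# POST your credentials, e.g.:
#   session.post(LOGIN_URL, data={"username": ..., "password": ...})
session.cookies.set("sessionid", "abc123", domain="example.com")

# Every request prepared through this session automatically carries
# the stored cookies -- that's what keeps you "logged in".
request = requests.Request("GET", "https://example.com/protected")
prepared = session.prepare_request(request)

print(prepared.headers["Cookie"])
```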
Dynamic Websites
Static sites are easier to work with because the server sends you an HTML page that already contains all the information as a response. You can parse an HTML response with Beautiful Soup and begin to pick out the relevant data.
On the other hand, with a dynamic website, the server might not send back any HTML at all. Instead, you'll receive JavaScript code as a response. This will look completely different from what you saw when you inspected the page with your browser's developer tools.
Note: To offload work from the server to the clients' machines, many modern websites avoid crunching numbers on their servers whenever possible. Instead, they'll send JavaScript code that your browser will execute locally to produce the desired HTML.
As mentioned before, what happens in the browser is not related to what happens in your script. Your browser will diligently execute the JavaScript code it receives from the server and create the DOM and HTML for you locally. However, making a request to a dynamic website from your Python script will not provide you with the HTML page content.
When you use requests, you'll only receive what the server sends back. In the case of a dynamic website, you'll end up with some JavaScript code, which you won't be able to parse using Beautiful Soup. The only way to go from the JavaScript code to the content you're interested in is to execute the code, just like your browser does. The requests library can't do that for you, but there are other solutions that can.
For example, requests-html is a project created by the author of the requests library that allows you to easily render JavaScript using syntax that's similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.
Note: Another popular choice for scraping dynamic content is Selenium. You can think of Selenium as a slimmed-down browser that executes the JavaScript code for you before passing on the rendered HTML response to your script.
You won't go deeper into scraping dynamically generated content in this tutorial. For now, it's enough for you to remember that you'll need to look into one of the options mentioned above if the page you're interested in is generated in your browser dynamically.