Beautiful Soup: Build a Web Scraper With Python Part 1

Gauloran

Global Moderatör
7 Tem 2013
8,188
637
In this tutorial, you’ll build a web scraper that fetches Software Developer job listings from the Monster job aggregator site. Your web scraper will parse the HTML to pick out the relevant pieces of information and filter that content for specific words.

You can scrape any site on the Internet that you can look at, but the difficulty of doing so depends on the site. This tutorial offers you an introduction to web scraping to help you understand the overall process. Then, you can apply this same process for every website you’ll want to scrape.

Part 1: Inspect Your Data Source
The first step is to head over to the site you want to scrape using your favorite browser. You’ll need to understand the site structure to extract the information you’re interested in.

Explore the Website
Click through the site and interact with it just like any normal user would. For example, you could search for Software Developer jobs in Australia using the site’s native search interface:

search_result.ed99b1b57c4f.png


You can see that there’s a list of jobs returned on the left side, and there are more detailed descriptions about the selected job on the right side. When you click on any of the jobs on the left, the content on the right changes. You can also see that when you interact with the website, the URL in your browser’s address bar also changes.

Decipher the Information in URLs
A lot of information can be encoded in a URL. Your web scraping journey will be much easier if you first become familiar with how URLs work and what they’re made of. Try to pick apart the URL of the site you’re currently on:

https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia
You can deconstruct the above URL into two main parts:

The base URL represents the path to the search functionality of the website. In the example above, the base URL is https://www.monster.com/jobs/search/.
The query parameters represent additional values that can be declared on the page. In the example above, the query parameters are ?q=Software-Developer&where=Australia.
Any job you’ll search for on this website will use the same base URL. However, the query parameters will change depending on what you’re looking for. You can think of them as query strings that get sent to the database to retrieve specific records.

Query parameters generally consist of three things:

Start: The beginning of the query parameters is denoted by a question mark (?).
Information: The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
Separator: Every URL can have multiple query parameters, which are separated from each other by an ampersand (&).
Equipped with this information, you can pick apart the URL’s query parameters into two key-value pairs:

q=Software-Developer selects the type of job you’re looking for.
where=Australia selects the ******** you’re looking for.
Try to change the search parameters and observe how that affects your URL. Go ahead and enter new values in the search bar up top:

search_bar.f4ae83849d4e.png


Next, try to change the values directly in your URL. See what happens when you paste the following URL into your browser’s address bar:

https://www.monster.com/jobs/search/?q=Programmer&where=New-York
You’ll notice that changes in the search box of the site are directly reflected in the URL’s query parameters and vice versa. If you change either of them, then you’ll see different results on the website. When you explore URLs, you can get information on how to retrieve data from the website’s server.

Inspect the Site Using Developer Tools
Next, you’ll want to learn more about how the data is structured for display. You’ll need to understand the page structure to pick what you want from the HTML response that you’ll collect in one of the upcoming steps.

Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed. In this tutorial, you’ll see how to work with the developer tools in Chrome. The process will be very similar to other modern browsers.

In Chrome, you can open up the developer tools through the menu View → Developer → Developer Tools. You can also access them by right-clicking on the page and selecting the Inspect option, or by using a keyboard shortcut.

Developer tools allow you to interactively explore the site’s DOM to better understand the source that you’re working with. To dig into your page’s DOM, select the Elements tab in developer tools. You’ll see a structure with clickable HTML elements. You can expand, collapse, and even edit elements right in your browser:

dev_tools.4a10589b5ca3.png


You can think of the text displayed in your browser as the HTML structure of that page. If you’re interested, then you can read more about the difference between the DOM and HTML on CSS-TRICKS.

When you right-click elements on the page, you can select Inspect to zoom to their ******** in the DOM. You can also hover over the HTML text on your right and see the corresponding elements light up on the page.

Task: Find a single job posting. What HTML element is it wrapped in, and what other HTML elements does it contain?

Play around and explore! The more you get to know the page you’re working with, the easier it will be to scrape it. However, don’t get too overwhelmed with all that HTML text. You’ll use the power of programming to step through this maze and cherry-pick only the interesting parts with Beautiful Soup.​
 
Üst

Turkhackteam.org internet sitesi 5651 sayılı kanun’un 2. maddesinin 1. fıkrasının m) bendi ile aynı kanunun 5. maddesi kapsamında "Yer Sağlayıcı" konumundadır. İçerikler ön onay olmaksızın tamamen kullanıcılar tarafından oluşturulmaktadır. Turkhackteam.org; Yer sağlayıcı olarak, kullanıcılar tarafından oluşturulan içeriği ya da hukuka aykırı paylaşımı kontrol etmekle ya da araştırmakla yükümlü değildir. Türkhackteam saldırı timleri Türk sitelerine hiçbir zararlı faaliyette bulunmaz. Türkhackteam üyelerinin yaptığı bireysel hack faaliyetlerinden Türkhackteam sorumlu değildir. Sitelerinize Türkhackteam ismi kullanılarak hack faaliyetinde bulunulursa, site-sunucu erişim loglarından bu faaliyeti gerçekleştiren ip adresini tespit edip diğer kanıtlarla birlikte savcılığa suç duyurusunda bulununuz.