What Is Web Scraping?

Gauloran · 20 Dec 2020

Web scraping is the process of gathering information from the Internet. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the term "web scraping" usually refers to a process that involves automation. Some websites don't like it when automatic scrapers gather their data, while others don't mind.

If you're scraping a page respectfully for educational purposes, then you're unlikely to have any problems. Still, it's a good idea to do some research on your own and make sure that you're not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data From The Modern Web.

Why Scrape the Web?
Say you're a surfer (both online and in real life) and you're looking for employment. However, you're not looking for just any job. With a surfer's mindset, you're waiting for the perfect opportunity to roll your way!

There's a job site that you like that offers exactly the kinds of jobs you're looking for. Unfortunately, a new position only pops up once in a blue moon. You think about checking up on it every day, but that doesn't sound like the most fun and productive way to spend your time.

Thankfully, the world offers other ways to apply that surfer's mindset! Instead of looking at the job site every day, you can use Python to help automate the repetitive parts of your job search. Automated web scraping can speed up the data collection process. You write your code once, and it will get the information you want many times and from many pages.
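To make the "write once, run many times" idea concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the `title` class name, and the `extract_titles()` helper are all made up for illustration. A real job site will use different markup, and you would first have to download the page.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded job-listing page.
SAMPLE_PAGE = """
<div class="job"><h2 class="title">Surf Instructor</h2></div>
<div class="job"><h2 class="title">Python Developer</h2></div>
"""


class JobTitleParser(HTMLParser):
    """Collect the text inside every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a title element.
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Return the job titles found in an HTML string (assumed layout)."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

Calling `extract_titles(SAMPLE_PAGE)` returns the two titles from the sample. The same function could then be applied to every page you fetch, which is where the time savings come from.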

In contrast, when you try to get the information you want manually, you might spend a lot of time clicking, scrolling, and searching. This is especially true if you need large amounts of data from websites that are regularly updated with new content. Manual web scraping can take a lot of time and repetition.

There's so much information on the Web, and new information is constantly added. Something among all that data is likely of interest to you, and much of it is just out there for the taking. Whether you're actually on the job hunt, gathering data to support your grassroots organization, or finally looking to get all the lyrics from your favorite artist downloaded to your computer, automated web scraping can help you accomplish your goals.

Challenges of Web Scraping
The Web has grown organically out of many sources. It combines a ton of different technologies, styles, and personalities, and it continues to grow to this day. In other words, the Web is kind of a hot mess! This can lead to a few challenges you'll see when you try web scraping.

One challenge is variety. Every website is different. While you'll encounter general structures that tend to repeat themselves, each website is unique and will need its own personal treatment if you want to extract the information that's relevant to you.

Another challenge is durability. Websites constantly change. Say you've built a shiny new web scraper that automatically cherry-picks precisely what you want from your resource of interest. The first time you run your script, it works flawlessly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!

This is a realistic scenario, as many websites are in active development. Once the site's structure has changed, your scraper might not be able to navigate the sitemap correctly or find the relevant information. The good news is that many changes to websites are small and incremental, so you'll likely be able to update your scraper with only minimal adjustments.

However, keep in mind that because the Internet is dynamic, the scrapers you'll build will probably require constant maintenance. You can set up continuous integration to run scraping tests periodically to ensure that your main script doesn't break without your knowledge.
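One possible shape for such a periodic test is sketched below. Everything in it is hypothetical: the two HTML layouts, the regex-based `extract_titles()` stand-in, and the `check_extraction()` helper are only meant to show how a layout change can be reported as a clear message instead of a surprise traceback.

```python
import re

# Hypothetical "before" and "after" versions of the same page element.
OLD_LAYOUT = '<h2 class="title">Surf Instructor</h2>'
NEW_LAYOUT = '<h3 class="job-title">Surf Instructor</h3>'


def extract_titles(html):
    """Stand-in extractor that only understands the old layout."""
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)


def check_extraction(extract, html, min_items=1):
    """Return a list of problems; an empty list means the scraper looks healthy."""
    try:
        items = extract(html)
    except Exception as exc:
        return [f"extractor raised {type(exc).__name__}: {exc}"]

    problems = []
    if len(items) < min_items:
        problems.append(f"expected at least {min_items} item(s), got {len(items)}")
    if any(not str(item).strip() for item in items):
        problems.append("some extracted items are empty")
    return problems
```

A scheduled job could call `check_extraction()` on a freshly downloaded page and fail the build when the returned list is not empty. With the sample data, the old layout passes and the new layout reports that no items were found.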

APIs: An Alternative to Web Scraping
Some website providers offer Application Programming Interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML and instead access the data directly using formats like JSON and XML. HTML is primarily a way to visually present content to users.

When you use an API, the process is generally more stable than gathering the data through web scraping. That's because APIs are made to be consumed by programs, rather than by human eyes. A change to the design of a website doesn't necessarily mean that the structure of its API has changed.
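As a rough illustration of why structured formats are easier for programs to consume, the sketch below reads job titles from a JSON string. The response body and its `jobs`, `title`, and `location` fields are invented for this example. A real API defines its own structure, which you would look up in its documentation.

```python
import json

# Hypothetical response body, standing in for what a job API might return.
SAMPLE_RESPONSE = """
{
  "jobs": [
    {"title": "Surf Instructor", "location": "Lisbon"},
    {"title": "Python Developer", "location": "Remote"}
  ]
}
"""


def job_titles(response_body):
    """Return the title of every job in a JSON response (assumed structure)."""
    payload = json.loads(response_body)
    return [job["title"] for job in payload["jobs"]]
```

Here the data is addressed by key rather than by its position in a page layout, so a visual redesign of the website would not affect this code as long as the API's structure stays the same.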

However, APIs can change as well. The challenges of both variety and durability apply to APIs just as they do to websites. Additionally, it's much harder to inspect the structure of an API by yourself if the provided documentation is lacking in quality.
