- Python Automation Cookbook
- Jaime Buelta
How it works...
Let's see each of the components of the script:
- A loop that goes through all the found links, in the main function:
Note that there's a retrieval limit of 10 pages, and that each new link is only queued if it hasn't been added already.
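The loop can be sketched as follows. This is a minimal, self-contained version: the `process_link` callable here is a stand-in for the recipe's function and is assumed to return the links found on each page.

```python
def crawl(start_url, process_link, max_pages=10):
    """Visit pages breadth-first, capped at max_pages, skipping
    links that were already visited or already queued.

    process_link is assumed to return the links found on a page.
    """
    to_check = [start_url]
    checked = []
    max_checks = max_pages  # retrieval limit of 10 pages
    while to_check and max_checks:
        link = to_check.pop(0)
        new_links = process_link(link)
        checked.append(link)
        max_checks -= 1
        for new_link in new_links:
            # Only queue a link that isn't already visited or pending
            if new_link not in checked and new_link not in to_check:
                to_check.append(new_link)
    return checked
```

The two membership checks are what prevent the crawler from revisiting pages or looping forever on pages that link to each other.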
- Downloading and parsing the link, in the process_link function:
It downloads the file and checks that the status is correct, to skip errors such as broken links. It also checks that the type (as described in Content-Type) is an HTML page, to skip PDFs and other formats. Finally, it parses the raw HTML into a BeautifulSoup object.
It also parses the source link using urlparse, so later, in step 4, it can skip all the references to external sources. urlparse divides a URL into its composing elements:
>>> from urllib.parse import urlparse
>>> urlparse('http://localhost:8000/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html')
ParseResult(scheme='http', netloc='localhost:8000', path='/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html', params='', query='', fragment='')
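The download-and-validate part of this step can be sketched as a small helper. `parse_if_html` is a hypothetical name, and the function takes the response fields directly so the checks are easy to see; the status and Content-Type checks mirror the description above:

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def parse_if_html(source_link, status_code, content_type, body):
    """Validate a downloaded response and parse it.

    Returns (parsed_source, page), or (None, None) when the link
    should be skipped (error status, or non-HTML content).
    """
    if status_code != 200:
        return None, None  # broken link or server error
    if 'html' not in content_type:
        return None, None  # skip PDFs and other formats
    parsed_source = urlparse(source_link)
    page = BeautifulSoup(body, 'html.parser')
    return parsed_source, page
```

In the recipe these values come from the HTTP response object; keeping the checks in one place makes it easy to extend them, for example to follow redirects or accept XHTML.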
- It finds the text to search, in the search_text function:
It searches the parsed object for the specified text. Note that the search is done as a regex, and only against the text of the page. It prints the resulting matches, including source_link, referencing the URL where the match was found:
for element in page.find_all(text=re.compile(text)):
print(f'Link {source_link}: --> {element}')
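Wrapped into a self-contained function, that loop looks like the sketch below. Returning the matches as well as printing them is an addition for convenience, not part of the recipe's function:

```python
import re

from bs4 import BeautifulSoup


def search_text(source_link, page, text):
    """Print (and return) every text node on the page matching
    the regex `text`, tagged with the URL it was found on."""
    matches = []
    for element in page.find_all(text=re.compile(text)):
        print(f'Link {source_link}: --> {element}')
        matches.append(str(element))
    return matches
```

Because `text` is compiled as a regex, a search for `python|crocodile` would match either word anywhere in a text node.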
- The get_links function retrieves all links on a page:
It searches the parsed page for all <a> elements and retrieves their href attributes, but only those that have an href attribute whose value is a fully qualified URL (starting with http). This discards links that are not URLs, such as '#' links, or links that are internal to the page.
An extra check ensures that they have the same source as the original link; only then are they registered as valid links. The netloc attribute makes it possible to detect that a link comes from the same domain as the parsed URL generated in step 2.
Finally, the links are returned, where they'll be added to the loop described in step 1.
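A minimal sketch of such a function, assuming `parsed_source` is the ParseResult produced by urlparse earlier:

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def get_links(parsed_source, page):
    """Collect fully qualified links on the page that stay within
    the same domain as the source URL."""
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        if not link:
            continue  # <a> element without an href attribute
        if not link.startswith('http'):
            continue  # skip '#' anchors and other non-URL links
        parsed_link = urlparse(link)
        if parsed_link.netloc != parsed_source.netloc:
            continue  # skip links to external domains
        links.append(link)
    return links
```

Comparing netloc values, rather than raw URL prefixes, correctly treats `localhost:8000` and `localhost:9000` as different sources.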