Book: Python Automation Cookbook
Author: Jaime Buelta
Updated: 2025-04-04 16:17:47

How to do it...

The full script, crawling_web_step1.py, is available on GitHub at the following link: https://github.com/PacktPublishing/Python-Automation-Cookbook/blob/master/Chapter03/crawling_web_step1.py. The most relevant parts are shown here:

...

def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    # Error handling. See GitHub for details
    ...
    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)

def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        # Check that the link is valid. See GitHub for details
        ...
        links.append(link)
    return links
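
The link check elided in get_links is not shown here, so the following is only a minimal sketch of what such a check might look like, not the code from the repository. It assumes the crawler should skip empty and same-page links, resolve relative links against the source page, and stay on the same host. The helper name normalize_link is hypothetical, and unlike get_links it takes the source URL as a string rather than an already parsed result.

```python
from urllib.parse import urljoin, urlparse


def normalize_link(source_link, href):
    """Return an absolute same-site URL for href, or None to skip it.

    Hypothetical helper: the name and the exact rules are assumptions,
    not the checks used in crawling_web_step1.py.
    """
    # Skip anchors without an href and same-page fragment links
    if not href or href.startswith('#'):
        return None

    # Resolve relative links against the page they were found on
    absolute = urljoin(source_link, href)
    parsed = urlparse(absolute)

    # Skip non-web schemes such as mailto: or javascript:
    if parsed.scheme not in ('http', 'https'):
        return None

    # Only follow links that stay on the same host as the source page
    if parsed.netloc != urlparse(source_link).netloc:
        return None

    return absolute
```

With a helper like this, the loop in get_links would append only the values that are not None, which keeps the crawl inside the site being searched.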

Search for references to python. The script returns a list of the URLs that contain the term, along with the matching paragraph. Notice that the run also reports a couple of errors because of broken links:

$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python
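
The search term python matches paragraphs that spell it Python, which suggests a case-insensitive match. The search_text function is not shown in this section, so the following is only a sketch of that matching step under stated assumptions: it works on a plain list of paragraph strings instead of a BeautifulSoup page, and the name find_matches is hypothetical.

```python
import re


def find_matches(source_link, paragraphs, text):
    """Return one formatted line per paragraph that mentions text.

    Hypothetical helper: it stands in for the search step and takes
    plain strings rather than a parsed page.
    """
    # re.escape treats the search term as literal text, not as a regex
    pattern = re.compile(re.escape(text), flags=re.IGNORECASE)
    return [
        f'Link {source_link}: --> {paragraph}'
        for paragraph in paragraphs
        if pattern.search(paragraph)
    ]


paragraphs = [
    'A smaller article , that contains a reference to Python',
    'Nothing relevant here',
]
for line in find_matches('http://localhost:8000/', paragraphs, 'python'):
    print(line)
```

Only the first paragraph is printed, in the same Link ... --> format as the listing above.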

Another good search term is crocodile. Try it out:

$ python crawling_web_step1.py http://localhost:8000/ -p crocodile
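
Both commands pass a starting URL followed by a -p option with the search term. One way to read such a command line is with argparse from the standard library. This is a sketch rather than the parser in the script: only the positional URL and the -p flag come from the commands above, while the names url and pattern and the default value are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse a starting URL and an optional -p search term."""
    parser = argparse.ArgumentParser(description='Search a site for a term')
    # Positional argument: the URL where crawling starts
    parser.add_argument('url', help='base URL to start crawling from')
    # The -p flag appears in the commands above; the dest name and
    # the default value are assumptions made for this sketch
    parser.add_argument('-p', dest='pattern', default='python',
                        help='text to search for')
    return parser.parse_args(argv)


args = parse_args(['http://localhost:8000/', '-p', 'crocodile'])
print(args.url, args.pattern)
```

Passing the argument list explicitly, as done here, makes the parser easy to try out without touching sys.argv.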
