- The full script, crawling_web_step1.py, is available on GitHub at the following link: https://github.com/PacktPublishing/Python-Automation-Cookbook/blob/master/Chapter03/crawling_web_step1.py. The most relevant parts are shown here:
...
def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    # Error handling. See GitHub for details
    ...
    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)

def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        # Check that the link is valid. See GitHub for details
        ...
        links.append(link)
    return links
- Search for references to python. The script returns a list of the URLs whose pages contain the term, along with the matching paragraph. Notice that there are a couple of errors because of broken links:
$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python
- Another good search term is crocodile. Try it out:
$ python crawling_web_step1.py http://localhost:8000/ -p crocodile
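The listing above leaves out the link validation step in get_links and points to GitHub for the details. As a rough illustration of what such a check might involve, here is a minimal sketch that uses only the standard library. The helper name normalize_link and its rules (skip same-page anchors, resolve relative links, stay on the same host) are assumptions made for this example, not the script's actual code.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(parsed_source, href):
    """Return an absolute same-site URL for href, or None to skip it.

    Hypothetical helper for illustration; the real script may differ.
    """
    if not href or href.startswith('#'):
        # Missing href, or an anchor within the same page
        return None
    # Resolve relative links against the page they were found on
    absolute = urljoin(parsed_source.geturl(), href)
    # Drop any fragment so the same page is not returned twice
    absolute, _fragment = urldefrag(absolute)
    parsed = urlparse(absolute)
    if parsed.scheme not in ('http', 'https'):
        # Skip mailto:, javascript:, and similar schemes
        return None
    if parsed.netloc != parsed_source.netloc:
        # Stay on the site being crawled
        return None
    return absolute


if __name__ == '__main__':
    source = urlparse('http://localhost:8000/index.html')
    for href in ('files/a.html', '#top', 'mailto:me@example.com',
                 'http://example.com/', '/archive.html#section'):
        print(href, '->', normalize_link(source, href))
```

Inside a loop like the one in get_links, a helper of this kind would be called on each href, and only non-None results would be appended to links. Restricting the crawl to the source host is a design choice that keeps it from wandering off to external sites.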