Crawling

I want to emphasize here that Burp has a very capable crawling mechanism that maps the site structure as accurately as possible. Crawling may seem like a simple task, but with modern dynamic applications it is not. As pentesters, we have all watched scanners get stuck in huge loops during the crawling phase because of the way URL schemes are implemented, so that the scan never seems to finish, especially when testing a shopping cart. It is really frustrating when this happens, because you then have to fall back on completely manual strategies. Burp, on the other hand, takes a much smarter approach. Its crawler mimics the way a user would browse the application in a browser. It simulates user clicks, navigation, and input submissions, and constructs a map of the site structure. The result is a very detailed structure, as shown in the following diagram:

The crawler navigates the application intelligently, using its understanding of the URLs and of the dependencies that must be satisfied to reach them. This is a major plus point for the scanner, as it means it works well with modern applications. It understands Cross-Site Request Forgery (CSRF) tokens that change with every request, and navigates the application accordingly. Here's a graphical representation to help you understand this:
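
The same idea can also be sketched in code. What follows is a minimal illustration of per-request CSRF token handling, not Burp's actual implementation: `ToyApp`, the `csrf_token` field name, and the `submit_like_a_user` helper are all hypothetical and exist only for this example. The point is that the client fetches the form first and submits whatever hidden fields came with it, instead of replaying a stored request.

```python
from html.parser import HTMLParser
import secrets


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.hidden[a["name"]] = a.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields found in an HTML page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.hidden


class ToyApp:
    """Hypothetical in-memory app that issues a fresh, single-use CSRF token."""

    def __init__(self):
        self.expected = None

    def get_form(self):
        self.expected = secrets.token_hex(8)
        return (
            '<form method="post" action="/cart">'
            f'<input type="hidden" name="csrf_token" value="{self.expected}">'
            '<input type="text" name="qty">'
            "</form>"
        )

    def post(self, data):
        ok = self.expected is not None and data.get("csrf_token") == self.expected
        self.expected = None  # a token can be used only once
        return 200 if ok else 403


def submit_like_a_user(app, user_input):
    """Load the form, then submit it with the hidden fields it carried."""
    fields = hidden_fields(app.get_form())
    fields.update(user_input)
    return app.post(fields)
```

Calling `submit_like_a_user(ToyApp(), {"qty": "1"})` succeeds because the token is read fresh each time, whereas replaying an earlier set of fields is rejected by the toy application.
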

Another well-known feature of the crawler is that it understands the flow of the application even when the URL does not change. For example, when purchasing a product, the URL may stay the same while a certain number of steps have to be followed to complete the purchase. The crawler will follow these steps just like a real-world user.

One of the bigger problems with automated scanners is how they interpret hyperlinks. Traditional automated scanners store hyperlinks and visit them directly as and when they find them. Burp's crawler works differently: it keeps a note of these pages and tries to find a path to each of them, just like a normal user would. If it cannot access a page directly, it goes back to the root node and tries to traverse to that particular page from there. A better way to understand this is through the following representation:
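
The notion of reaching a page from the root node can also be expressed as a small graph search. The sketch below uses a breadth-first search to find the shortest click path from the root to a target page; it is an illustration of the idea only, since the source does not say which algorithm Burp uses. The `LINKS` map and the `path_from_root` function are hypothetical names invented for this example.

```python
from collections import deque

# Hypothetical link graph: page -> pages reachable from it with one click.
LINKS = {
    "/": ["/products", "/login"],
    "/products": ["/products/42"],
    "/products/42": ["/cart"],
    "/cart": ["/checkout"],
}


def path_from_root(links, root, target):
    """Breadth-first search for the shortest click path from root to target.

    Returns the list of pages to visit in order, or None if the target
    cannot be reached from the root.
    """
    parents = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if page == target:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in links.get(page, []):
            if nxt not in parents:  # skip pages already discovered
                parents[nxt] = page
                queue.append(nxt)
    return None
```

A crawler built this way would replay the returned path step by step, so a page such as `/checkout` is requested only after the pages that normally lead to it, rather than being fetched directly.
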

You might also wonder how the crawler copes when sessions are involved, such as in post-authentication crawling. What if the application logs the user out on wrong requests? What if there are CSRF filters on forms? How will Burp navigate then? Well, here is the good news: the new Burp, as I stated earlier, navigates the application just like any user would, and it deploys multiple crawler agents, each of which behaves as a user in a particular role.

Once a crawler agent is authenticated, it collects its session cookies in a cookie jar and, as it navigates across the application, checks whether it has been logged out. If it has, the agent logs back in to the application, and the cookie jar is cleared and filled again so that traversal across the pages stays smooth. The requests that Burp makes are dynamic in nature, so if there are any CSRF tokens in place, it will understand them and formulate the next request accordingly. So, even in a scenario such as the one shown in the following screenshot, Burp will be able to understand it and generate a site structure accordingly:
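
The logout detection and re-login cycle described above can be sketched as follows. This is a simplified model under stated assumptions, not Burp's code: `ToySite` is a hypothetical application whose sessions expire after a fixed number of requests, a `401` status stands in for "the agent has been logged out", and the cookie jar is a plain dictionary.

```python
class ToySite:
    """Hypothetical app whose sessions expire after a fixed number of requests."""

    def __init__(self, max_requests_per_session=2):
        self.max_requests = max_requests_per_session
        self.sessions = {}  # session id -> requests remaining
        self.counter = 0

    def login(self, user, password):
        if password != "secret":
            return None
        self.counter += 1
        sid = f"{user}-{self.counter}"
        self.sessions[sid] = self.max_requests
        return {"session": sid}

    def get(self, path, cookies):
        sid = cookies.get("session")
        if self.sessions.get(sid, 0) <= 0:
            return 401, "please log in"
        self.sessions[sid] -= 1
        return 200, f"content of {path}"


class CrawlerAgent:
    """Agent that keeps one cookie jar and logs back in when logged out."""

    def __init__(self, site, user, password):
        self.site, self.user, self.password = site, user, password
        self.cookie_jar = {}
        self.logins = 0

    def _login(self):
        self.cookie_jar.clear()  # drop stale cookies before refilling the jar
        cookies = self.site.login(self.user, self.password)
        if cookies is None:
            raise RuntimeError("login failed")
        self.cookie_jar.update(cookies)
        self.logins += 1

    def fetch(self, path):
        if not self.cookie_jar:
            self._login()
        status, body = self.site.get(path, self.cookie_jar)
        if status == 401:  # logged out: re-authenticate and retry once
            self._login()
            status, body = self.site.get(path, self.cookie_jar)
        return status, body
```

With a session limit of two requests, an agent that fetches five pages in a row receives every page successfully and logs in three times along the way, without the caller having to handle the expiry.
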

Present-day applications are huge and very complicated. Take the example of a shopping website: such sites are enormous and behave differently depending on the input provided to the application. For example, a shopping cart URL can represent two different states in two different scenarios: when the cart is full and when the cart is empty. It is imperative, yet difficult, for a tool to recognize such a state change and keep a mapping of it. How does Burp handle this? As shown in the following screenshot, Burp navigates through the application as if a real person were doing it. Hence, Burp will understand both of these states and store them in the site map:
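
One way to picture how a tool could keep two states of the same URL apart is to key each page by its URL plus a fingerprint of its content. The sketch below fingerprints a page by its sequence of HTML tags, so an empty cart and a full cart at `/cart` are stored as separate entries. This is an assumption made for illustration; the source does not describe how Burp actually compares page states, and `fingerprint` and `SiteMap` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records the order of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def fingerprint(html):
    """Short hash of the page's tag structure (text content is ignored)."""
    parser = TagSequence()
    parser.feed(html)
    return hashlib.sha256("/".join(parser.tags).encode()).hexdigest()[:12]


class SiteMap:
    """Stores one entry per (URL, structural fingerprint) pair."""

    def __init__(self):
        self.states = {}

    def record(self, url, html):
        """Store the page; return True if this state was not seen before."""
        key = (url, fingerprint(html))
        is_new = key not in self.states
        self.states.setdefault(key, html)
        return is_new
```

Because only the tag structure is hashed, two full carts holding different products count as the same state, while the empty cart, which lacks the item list and checkout form, counts as a different one.
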

Here's a short summary of the Burp crawler:

  • Burp crawls by simulating a real user, unlike a traditional crawler that simply follows the hyperlinks it finds.
  • Burp deploys different agents, each acting in a particular role, to understand the role matrix and help find authorization flaws.
  • Burp can understand multiple different states of the same page and treats them differently.
  • Burp keeps track of how it reaches a particular page right from the root node, thus creating a nearly accurate site structure.
  • Burp also handles session management with the help of cookie jars, logging back in when needed so that the crawl is not interrupted by a logged-out session.
  • Burp also copes with CSRF tokens that change on every request, as it simulates traffic like a real user: it picks up the current CSRF token and includes it in the next request.