Katana: web crawler for offensive security and web exploration

Katana is a CLI tool and Go library for automatically traversing (crawling) all pages of a given website (or websites) to map them out. It can work in two main modes - requests-based and through browser automation (headful or headless). To allow for discovery of API endpoints it can optionally do JavaScript parsing even when running in requests-based mode. Furthermore, Katana can do passive crawling by leveraging pre-crawled data from the Internet Archive Wayback Machine, CommonCrawl and AlienVault OTX. Since mapping out site pages and APIs is useful for security research activities (e.g. bug bounty hunting), Katana is designed to fit into larger automation workflows, especially when used together with other tooling from Project Discovery.
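
For instance, passive crawling is enabled with the -passive flag and can be narrowed down to specific sources with -passive-source - check katana -h for the exact flags your version supports:

$ katana -u https://public-firing-range.appspot.com/ -passive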

On macOS Katana is available through Homebrew and there is also an official Docker image on Docker Hub that includes a Chromium browser.
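
For example, either of the following should give you a working installation (assuming the Homebrew formula is named katana and the Docker image is projectdiscovery/katana):

$ brew install katana

$ docker pull projectdiscovery/katana:latest
$ docker run projectdiscovery/katana:latest -u https://example.com/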

If you have a Go development environment you can install it via the Go CLI:

$ go install github.com/projectdiscovery/katana/cmd/katana@latest

If we simply want to see a list of pages a site has, we only have to pass the site URL via the -u parameter:

$ katana -u https://public-firing-range.appspot.com/

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/							 

		projectdiscovery.io

[INF] Current katana version v1.1.0 (latest)
[INF] Started standard crawling for => https://public-firing-range.appspot.com/
https://public-firing-range.appspot.com/
https://public-firing-range.appspot.com/vulnerablelibraries/index.html
https://public-firing-range.appspot.com/stricttransportsecurity/index.html
https://public-firing-range.appspot.com/urldom/index.html
https://public-firing-range.appspot.com/reverseclickjacking/
https://public-firing-range.appspot.com/reflected/index.html
https://public-firing-range.appspot.com/mixedcontent/index.html
https://public-firing-range.appspot.com/tags/index.html
https://public-firing-range.appspot.com/redirect/index.html
https://public-firing-range.appspot.com/remoteinclude/index.html
https://public-firing-range.appspot.com/insecurethirdpartyscripts/index.html
https://public-firing-range.appspot.com/leakedcookie/index.html
https://public-firing-range.appspot.com/flashinjection/index.html
https://public-firing-range.appspot.com/vulnerablelibraries/jquery.html
...

However, we don't see status codes here and we're not even sure if all of these pages load properly. We can use the -mdc argument with a Project Discovery DSL snippet to filter out only responses with status code 200:

$ katana -u https://public-firing-range.appspot.com/ -mdc 'status_code == 200' 

Katana can also store responses in case we want to separate web crawling from HTML parsing:

$ katana -u https://books.toscrape.com/ -store-response

This decouples the former from the latter, so that you can download all the pages once and then work on code that parses the HTML documents. The above command creates a new directory named katana_response with an index.txt file listing requests, status codes and relative file paths:

$ head index.txt 
katana_response/books.toscrape.com/3060d727707e12334c7134afe04df480af3f0fdf.txt https://books.toscrape.com/ (200 OK)
katana_response/books.toscrape.com/f77b190299d9ad01298054c2b1ce13f35765e4d1.txt https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js (200 OK)
katana_response/books.toscrape.com/9e4c7d8e5630912730d1194fec599ec3f5f8580e.txt https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js (200 OK)
katana_response/books.toscrape.com/7fdd4edbfcbe5d86df9904b18b6747e7cf7f0efa.txt https://books.toscrape.com/static/oscar/js/oscar/ui.js (200 OK)
katana_response/books.toscrape.com/cd148dbefc26230ec0d29104eb0408ae640720d2.txt https://books.toscrape.com/static/oscar/js/bootstrap3/bootstrap.min.js (200 OK)
katana_response/books.toscrape.com/0eb507686604bb3002bdf01dc8d1b917c7352867.txt https://books.toscrape.com/static/oscar/css/datetimepicker.css (200 OK)
katana_response/books.toscrape.com/0a63ac4de25fb4ba06a76137240cc663a716b506.txt https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css (200 OK)
katana_response/books.toscrape.com/a2652354fa625fb0f225b7cd87f035f8f6d6af2c.txt https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html (200 OK)
katana_response/books.toscrape.com/bf2072cbc394277fe024f4992dde7a498b7d6a9f.txt https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html (200 OK)
katana_response/books.toscrape.com/e4f9cda62c7f551a9e54439e5ed27f6e3306902a.txt https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html (200 OK)
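
Since index.txt is plain text, it is easy to post-process with standard Unix tools. For example, to check whether any of the crawled pages returned something other than 200 OK (given the layout shown above):

$ grep -v '(200 OK)' katana_response/index.txt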

Each of the page snapshot files retains the original URL, request/response headers and response payload:

$ head -n 50 books.toscrape.com/f77b190299d9ad01298054c2b1ce13f35765e4d1.txt 
https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js


GET /static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js HTTP/1.1
Host: books.toscrape.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Accept-Encoding: gzip



HTTP/1.1 200 OK
Content-Length: 31995
Accept-Ranges: bytes
Connection: keep-alive
Content-Type: application/javascript
Date: Tue, 20 Aug 2024 09:17:54 GMT
Etag: "63e40de9-7cfb"
Last-Modified: Wed, 08 Feb 2023 21:02:33 GMT
Strict-Transport-Security: max-age=0; includeSubDomains; preload

/**
* Arabic translation for bootstrap-datetimepicker
* Ala' Mohammad <[email protected]>
*/
;(function($){
	$.fn.datetimepicker.dates['ar'] = {
		days: ["الأحد", "الاثنين", "الثلاثاء", "الأربعاء", "الخميس", "الجمعة", "السبت", "الأحد"],
		daysShort: ["أحد", "اثنين", "ثلاثاء", "أربعاء", "خميس", "جمعة", "سبت", "أحد"],
		daysMin: ["أح", "إث", "ث", "أر", "خ", "ج", "س", "أح"],
		months: ["يناير", "فبراير", "مارس", "أبريل", "مايو", "يونيو", "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"],
		monthsShort: ["يناير", "فبراير", "مارس", "أبريل", "مايو", "يونيو", "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"],
		today: "هذا اليوم",
		suffix: [],
		meridiem: [],
		rtl: true
	};
}(jQuery));
/**
 * Bulgarian translation for bootstrap-datetimepicker
 * Apostol Apostolov <[email protected]>
 */
;(function($){
	$.fn.datetimepicker.dates['bg'] = {
		days: ["Неделя", "Понеделник", "Вторник", "Сряда", "Четвъртък", "Петък", "Събота", "Неделя"],
		daysShort: ["Нед", "Пон", "Вто", "Сря", "Чет", "Пет", "Съб", "Нед"],
		daysMin: ["Н", "П", "В", "С", "Ч", "П", "С", "Н"],
		months: ["Януари", "Февруари", "Март", "Април", "Май", "Юни", "Юли", "Август", "Септември", "Октомври", "Ноември", "Декември"],
		monthsShort: ["Ян", "Фев", "Мар", "Апр", "Май", "Юни", "Юли", "Авг", "Сеп", "Окт", "Ное", "Дек"],
		today: "днес",
		suffix: [],

Katana is not primarily designed to extract information from web pages - it is a crawling tool, not a scraping tool. For those in need of a comprehensive web scraping framework, Colly and Scrapy are well-established options. However, it is possible to do regex-based data extraction by putting regular expressions into a YAML file like the following example from the README.md file (simplified for brevity):

- name: email
  type: regex
  regex:
  - '([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'

Once we have fields configured, we can run Katana to extract some data:

$ katana -u https://allbirds.com -f email -flc field-config.yaml 


   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/							 

		projectdiscovery.io

[INF] Current katana version v1.1.0 (latest)
[INF] Started standard crawling for => https://allbirds.com
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
...

This enables Katana to be used for opportunistic contact data (email and/or phone number) scraping across a list of websites (which can be passed to Katana one per line via standard input). For example, one could source a list of companies within a niche by Google dorking or by scraping a business directory website, then run Katana with the above custom field configuration to traverse the list of websites and grab emails whenever they happen to be somewhere on a page. The contacts will most likely be generic to the entire company, but this is significantly cheaper than running the domain list through an enrichment API. Some growth hackers use Scrapebox, a proprietary Windows desktop program, to do this, but Katana requires no GUI environment if you want to run it on a server and is 100% open source.
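
A run across such a list could look like this (companies.txt here is a hypothetical file with one URL per line):

$ cat companies.txt | katana -f email -flc field-config.yaml -silent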

However, regular expressions should be used sparingly for web scraping and one should refrain from trying to parse HTML with them. Consider that an anti-pattern.
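
If you want to properly extract data from pages saved with -store-response, a real HTML parser is the better choice. The following is a minimal sketch (not part of Katana itself) that assumes the snapshot file layout shown earlier - URL line, raw request, raw response - and uses the goquery library to print all link targets from one snapshot file:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Read an entire snapshot file produced by katana -store-response.
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	// The raw HTTP response starts at the status line.
	idx := strings.Index(string(data), "\nHTTP/1.")
	if idx == -1 {
		panic("no HTTP response found in file")
	}
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(string(data)[idx+1:])), nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Parse the response body as HTML and print every link target.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		panic(err)
	}
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Println(href)
		}
	})
}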

If you run Katana with the -js-crawl CLI argument it will download JavaScript files and parse them to find API endpoints:

$ katana -u https://allbirds.com -js-crawl

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/							 

		projectdiscovery.io

[INF] Current katana version v1.1.0 (latest)
[INF] Started standard crawling for => https://allbirds.com
https://allbirds.com
https://www.allbirds.com/js.iterable.com/analytics.js
https://www.allbirds.com/cdn/shop/t/2473/assets/homepage.a1ce3f2b8cdbde6e27e6.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/44.1560197b842f08b3fe77.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/34.390ec96b64c918e7a51f.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/33.f428389d95947b0ea31f.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/32.c205e74564192b41988c.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/35.aec7b6810f22105211fb.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/43.8f8d6489f56adf20dcff.chunk.js
https://www.allbirds.com/cdn/shopifycloud/shopify/assets/shop_events_listener-61fa9e0a912c675e178777d2b27f6cbd482f8912a6b0aa31fa3515985a8cd626.js
https://www.allbirds.com/cdn/shop/t/2473/assets/29.81f74e37a33dc7fbe9dd.chunk.js
https://www.allbirds.com/cdn/shop/t/2473/assets/42.b0afa4f09987aaabeeaf.chunk.js
https://www.allbirds.com/filter.json
https://www.allbirds.com/reviews.json
https://www.allbirds.com/cdn/s/trekkie.storefront.d0db9c6b604f2af4af0875dc118feaf816931b65.min.js
https://www.allbirds.com/graphql.json
...

To take it a bit further and potentially find more APIs, one may add the -jsluice argument that enables JS parsing with the jsluice library (at the cost of increased RAM consumption).
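
For example:

$ katana -u https://allbirds.com -js-crawl -jsluice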

This may be useful when exploring a website for API scraping or security assessment purposes.

By default Katana prints results to standard output. You can use the -silent option to disable the ASCII art logo or -o to provide a file the output should be written to. Furthermore, the -jsonl option formats each result as a line of JSON (JSONL) that can be processed further programmatically.
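
For instance, the following command crawls the site quietly and pretty-prints each JSONL record with jq:

$ katana -u https://books.toscrape.com/ -silent -jsonl | jq .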

When used from Go code, Katana behaves more like a framework than a library - you have to configure it to run the crawl for you, and it will keep calling the callback function you provide for each page it goes through. See the following snippet that was made by extending the sample code the Katana project provides:

package main

import (
	"bytes"
	"log"
	"math"
	"strconv"

	"github.com/goccy/go-graphviz"
	"github.com/goccy/go-graphviz/cgraph"
	"github.com/projectdiscovery/gologger"
	"github.com/projectdiscovery/katana/pkg/engine/standard"
	"github.com/projectdiscovery/katana/pkg/output"
	"github.com/projectdiscovery/katana/pkg/types"
)

func main() {
	g := graphviz.New()
	graph, _ := g.Graph()
	defer func() {
		graph.Close()
		g.Close()
	}()

	nodes := map[string]*cgraph.Node{}

	options := &types.Options{
		MaxDepth:     3,               // Maximum depth to crawl
		FieldScope:   "rdn",           // Crawling Scope Field
		BodyReadSize: math.MaxInt,     // Maximum response size to read
		Timeout:      10,              // Timeout is the time to wait for request in seconds
		Concurrency:  10,              // Concurrency is the number of concurrent crawling goroutines
		Parallelism:  10,              // Parallelism is the number of urls processing goroutines
		Delay:        0,               // Delay is the delay between each crawl requests in seconds
		RateLimit:    150,             // Maximum requests to send per second
		Strategy:     "breadth-first", // Visit strategy (depth-first, breadth-first)
		OnResult: func(result output.Result) { // Callback function to execute for result
			referer := result.Request.Source

			if referer != "" {
				gologger.Info().Msg(referer + " -> " + result.Request.URL)
			} else {
				gologger.Info().Msg(result.Request.URL)
			}

			node, _ := graph.CreateNode(result.Request.URL)
			nodes[result.Request.URL] = node
			prevNode := nodes[referer]

			if prevNode != nil {
				e, _ := graph.CreateEdge("", prevNode, node)
				e.SetLabel(strconv.Itoa(result.Response.StatusCode))
			}
		},
	}
	crawlerOptions, err := types.NewCrawlerOptions(options)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawlerOptions.Close()
	crawler, err := standard.New(crawlerOptions)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawler.Close()
	var input = "https://public-firing-range.appspot.com/"
	err = crawler.Crawl(input)
	if err != nil {
		gologger.Warning().Msgf("Could not crawl %s: %s", input, err.Error())
	}

	var buf bytes.Buffer
	if err := g.Render(graph, "dot", &buf); err != nil {
		log.Fatal(err)
	}

	if err := g.RenderFilename(graph, graphviz.SVG, "katana.svg"); err != nil {
		log.Fatal(err)
	}

}

Here we use a Graphviz wrapper library to keep track of which links were followed between pages as the crawl progressed and to render the resulting graph into an SVG file. The end result of this is admittedly not very interesting, but the above code can be viewed as an example of some broader points:

  1. One must first set up Katana configuration structures with many of the parameters that also apply when Katana is used as a CLI tool.
  2. It is possible to do custom processing in your callback function, such as recording the HTTP metadata and/or content somewhere for further processing.

By rl1987, 2024-10-01