Artificial intelligence, especially large language models such as ChatGPT, is all the rage now. ChatGPT has the distinction of being the fastest-growing product in the history of technology - it reached its first one million users in just a few days. As such, it is the talk of the global village now. There is no shortage of overly excited people posting their takes on how it will change the world for the better and what you should be doing right now to benefit from that.
In the world of desktop software, the concept of a packer is not new. A packer is a tool that takes a binary executable file as input, applies transformations (e.g. compression, encryption, introducing anti-debugging tricks) and outputs a new, modified executable file that is different at the binary level, but retains the functionality of the original program. Some packers, such as UPX, are only meant to make executable files smaller. More advanced packers apply machine code encryption to make reverse engineering harder.
Suppose you are looking to collect pricing data on male footwear from the official website of one of the industry leaders - Nike.com. There’s a product list page that seems like a good place to start, but the infinite scroll feature might seem puzzling to budding web scraper developers. In this post, we will go through developing a Scrapy project for scraping sneaker price data from this website. Some familiarity with Scrapy and web scraping in general is assumed.
Not every website wants to let its data be scraped and not every app wants to allow automation of user activity. If you work in scraping and automation in any capacity, you have certainly dealt with sites that work just fine when accessed through a normal browser but throw captchas or error pages at your bot. There are multiple security mechanisms that can cause this to happen. Today we will do a broad review of automation countermeasures that can be implemented at various levels.
The front page of the Scrapy project provides a basic example of a Scrapy spider:
```python
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
```

In this code snippet, crawling is performed imperatively - a link to the next page is extracted and a new request is generated from it. However, Scrapy also supports another, declarative approach to crawling that is based on setting crawling rules for the spider and letting it follow links without explicit request generation.
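For illustration, here is a minimal sketch of the same crawl expressed declaratively with CrawlSpider and Rule (the selectors and domain are carried over from the snippet above as assumptions, not tested against the live site):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogCrawlSpider(CrawlSpider):
    name = 'blogcrawlspider'
    allowed_domains = ['www.zyte.com']
    start_urls = ['https://www.zyte.com/blog/']

    # Follow pagination links automatically and run parse_item() on each page.
    rules = (
        Rule(LinkExtractor(restrict_css='a.next'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}
```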
Bash is a Linux/UNIX program that reads commands from the user, parses them and executes the appropriate programs through OS-specific APIs. Since it wraps these APIs and provides some extra features on top of them, this kind of program is called a shell. Bash is not merely an interface between the keyboard and exec(2) et al. It is also a scripting language and interpreter.
Today we are going to explore various scripting facilities in Bash to learn about the programmable nature of this shell.
In the previous post we went through using Babel AST transforms to simplify JSFuck-generated unary and binary expressions. This was shown to undo a lot, but not all of the obfuscation. What we still have to do is to deal with certain API and runtime hacks that JSFuck leverages to obfuscate some of the characters that are not covered just by abusing type coercion and atomic components of JS (string and array indexing, logical/arithmetic operations).
Introduction
JSFuck is a prominent JavaScript obfuscation tool that converts JS code into a rather weird-looking form with only six characters: ()+[]!. For example, console.log(1) is turned into this:
([][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(+(+!+[]+[+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+[!+[]+!+[]]+[+[]])+[])[+!+[]]+(![]+[])[!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[+!+[]+[+[]]]+(![]+[+[]]+([]+[])[([][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]]+([][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]]+[])[+!+[]+[!+[]+!+[]+!+[]]]+[+!+[]]+([+[]]+![]+[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]]])[!+[]+!+[]+[+[]]] Screenshot 1
Code is made to be practically unreadable, but still retains its original functionality. What’s the trick here? How can this possibly work?
Fortunately, JSFuck is open source and we can take a look at the source code. We see that no AST-level transformations are performed and that the source code from the user is treated as a string.
Incoming HTTPS traffic can be fingerprinted by server-side systems to derive technical characteristics of client-side systems. One way to do this is TLS fingerprinting, which we have covered before on this blog and which is commonly done by antibot vendors as part of their automation countermeasures suite. But that’s not all they do. Fingerprinting can be done at the HTTP/2 level as well. Let us discuss HTTP/2 and how HTTP/2 fingerprinting works.
In computer science, a control flow graph is a graph that represents the order of code block execution and the transitions between blocks. For the purposes of today’s topic, vertices in such a graph are basic blocks - stretches of code that are executed sequentially and have no branching logic. Each basic block has an entry point and an exit point. Edges in the control flow graph represent transitions between basic blocks.
Control flow graph flattening is an obfuscation technique that introduces a large switch statement within a while-loop, with a basic block at each switch case.
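To illustrate the idea outside of JavaScript, here is a minimal Python sketch (my own example, not the output of any particular obfuscator) of a tiny function before and after flattening - the straight-line logic is replaced by a dispatcher loop over numbered basic blocks:

```python
def original(x):
    y = x + 1
    y = y * 2
    return y


def flattened(x):
    state = 0
    while True:
        # Each branch is one basic block; `state` decides which block runs next.
        if state == 0:
            y = x + 1
            state = 1
        elif state == 1:
            y = y * 2
            state = 2
        elif state == 2:
            return y


assert original(3) == flattened(3) == 8
```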
String concealing is a code obfuscation technique that involves some sort of string constant recomputation (e.g. Base64 encoding or encryption with symmetric ciphers) being introduced into code. Furthermore, obfuscation solutions may introduce some variable and function indirection to further thwart reverse engineering. In this post we will be learning how to deal with both of these obstacles on the way to untold riches.
Let us consider the following JS snippet that we are going to obfuscate with obfuscator.
Introducing constant recomputation is a commonly used code obfuscation technique. Let’s consider the following JavaScript snippet:
```js
const fourtyTwo = 42;
const msg = "The answer is:";

console.log(msg, fourtyTwo);
```

We have one numeric constant (fourtyTwo) and one string constant (msg) that are passed into console.log(). Let us apply constant obfuscation by using obfuscator.io with the “Numbers To Expressions” and “Split Strings” checkboxes turned on.
This yields the following obfuscated version of the above snippet that we put into AST Explorer:
In previous posts about using Babel for JavaScript deobfuscation, we have used the NodePath.replaceWith() method to replace one node with another and NodePath.remove() to remove a single node. Since the AST and its elements are mutable, we can also modify the AST without traversing it. But there is more to learn about AST modification than what we have seen before. We will go through some more AST modification APIs that should prove to be useful when developing JS deobfuscators.
In computer programming, the lexical scope of an identifier (name of a function, class, variable, constant, etc.) is the area within the code where that identifier can be used. Some identifiers have global scope, meaning they can be used in the entire program. Others have a narrower scope that is limited to a single function or a code block between curly braces. Identifier names can be reused between scopes - the conflict is resolved by preferring the innermost scope for a given identifier.
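A quick Python illustration of scoping and shadowing (my own example):

```python
answer = 42  # global scope


def report():
    answer = 7  # shadows the global `answer` inside this function's scope
    print("inner:", answer)


report()                 # prints: inner: 7
print("outer:", answer)  # prints: outer: 42 - the global binding is untouched
```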
When doing JavaScript deobfuscation work at the AST level, one will need to create new parts of the AST to replace the existing parts being processed. We will go through three ways of doing that.
For the sake of an example, let us try to build an AST for the following JS snippet:
```js
debugger;
console.log("!")
debugger;
```

Babel parses this into two DebuggerStatement nodes and one ExpressionStatement node.
We don’t need to worry about File and Program objects and should only focus on the program body that contains the following child nodes:
In previous posts we went through several JavaScript obfuscation techniques and how they could be reversed by applying Abstract Syntax Tree transformations. AST manipulation is a powerful skill that is of particular importance in certain kinds of grayhat programming projects. Earlier, we focused on how exactly the AST could be changed to undo specific kinds of obfuscation. This time, however, we will take a bit broader look into Babel to see how it facilitates the development of AST manipulation code that goes beyond a single AST transformation.
On the front page of the Scrapy framework there’s the following Python snippet:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
```

Note the usage of the yield keyword in this code. This is different from simply returning a value from a method or function. One way to look at this is that yield is a lazy equivalent of returning a value.
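As a standalone illustration of that laziness (not taken from the Scrapy codebase), compare a function returning a list with a generator function using yield - the generator produces values one at a time, only when the caller asks for them:

```python
def eager_squares(n):
    return [i * i for i in range(n)]   # whole list built up front


def lazy_squares(n):
    for i in range(n):
        yield i * i                    # one value per iteration, on demand


print(eager_squares(5))        # [0, 1, 4, 9, 16]
gen = lazy_squares(5)
print(next(gen), next(gen))    # 0 1  - nothing beyond that is computed yet
print(list(gen))               # [4, 9, 16] - remaining values pulled lazily
```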
Yahoo! Finance is a prominent website featuring financial news, press releases, financial reports and quantitative data on various assets. We are going to go through some examples of how data could be scraped from this portal. We are going to scrape two kinds of data: fundamental information on how well various public companies are performing financially, and stock price time series.
But first we need to have a list of stock tickers (symbols) for companies we want to gather information on.
Introduction
I don’t suppose I need to do much explaining on the value of touch typing for developer productivity. Today developers gain additional productivity by using Integrated Development Environments such as PyCharm, Xcode, or Visual Studio that provide features like auto-completion, enhanced code navigation, and integration with compilers, debuggers and static analysers. All of that does not come for free. Running a heavy GUI environment can be resource-intensive. For example, Android Studio is quite infamous for making laptop fans spin like crazy.
The default way of using Scrapy entails running scrapy startproject to generate a bunch of starter code across multiple files. For small-scale scrapers that’s a bit of overkill. It’s far simpler to have a single Python script file that you can run when you want to scrape some data. The CrawlerProcess class in the Scrapy framework enables us to develop such a script.
For the sake of example, let us suppose we are interested in extracting some basic information about Fortune 500 companies from the Fortune 500 website.
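A minimal single-file sketch using CrawlerProcess (the spider, selectors and URL here are placeholders, not the actual Fortune 500 scraper):

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}


# One script: settings, spider and crawl launch all live in the same file.
process = CrawlerProcess(settings={'FEEDS': {'output.json': {'format': 'json'}}})
process.crawl(QuotesSpider)
process.start()  # blocks until crawling is finished
```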
Many developers would agree that source code version control is a great thing that they could not imagine modern software development without. What if it were applied to structured data as well? Yes, technically it is possible to save an SQLite file or an SQL dump into a git repo, but that is rather clunky and outside the intended use case of git. For proper version control, we would want row-level diffing, the ability to undo changes, group them into incremental chunks (commits), have a way to review the proposed changes and many other nice features that we have available in systems like git.
On its own, the SMTP protocol does not do much validation of sender authenticity. One can spoof protocol headers at the SMTP and MIME levels to send a message in another user’s name. That can be problematic, as it makes spam and social engineering attacks easier. To address this problem, some email authentication technologies have been developed. We will discuss the three major ones: SPF, DKIM and DMARC. SPF and DKIM authenticate sender metadata and DMARC extends them to configure email handling in case authentication fails.
Introduction and big picture
Email is a fairly old school technology meant to replicate the postal service digitally. To understand how email works, we must know about network protocols, which specify sets of data formats and data exchange rules for exchanging messages over the network. In the modern internet, the email sending part is conceptually (and sometimes technically) separate from the receiving part. Email sending aspects are formalised in RFCs that describe the SMTP protocol, whereas reception can be done via either the IMAP or POP3 protocol.
So, what is wget? Wget is a prominent command line tool to download stuff from the web. It has a fairly extensive feature set that includes recursive downloading, proxy support, site mirroring, cookie handling and so on. Let us go through some use cases of wget.
To download a single file/page using wget, just pass the corresponding URL into argv[1]:
```
$ wget http://www.textfiles.com/100/crossbow
--2022-11-11 18:24:33--  http://www.textfiles.com/100/crossbow
Resolving www.textfiles.com (www.textfiles.com)... 208.86.224.90
Connecting to www.
```
When doing the recon part of penetration testing or bug bounty hunting engagements, one may want to run various tools (such as port scanners, crawlers, vulnerability scanners, headless browsers and so on) in a VPS environment that will no longer be needed when the task at hand is complete. For large-scale scanning it is highly desirable to spread the workload across many servers. To address these needs, axiom was developed. Axiom is a dynamic infrastructure framework designed for quick setup and teardown of reproducible infrastructure.
Legacy ASP.Net sites are some of the most difficult ones to deal with for web scraper developers due to their archaic state management mechanism that entails hidden form submissions via POST requests. This stuff is not easy to reproduce programmatically. In this post we will discuss a real world example of scraping a site based on some old version of ASP.Net technology.
The site in question is publicaccess.claytoncounty.gov - a real estate record website for Clayton County, Georgia.
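The general pattern (a sketch with requests and Beautiful Soup; the URL and form field names below are placeholders - every ASP.Net site has its own) is to GET the page first, pull out the hidden state fields and echo them back in the POST:

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'https://example.com/Search.aspx'  # placeholder URL

# Step 1: load the page to obtain the hidden ASP.Net state fields.
soup = BeautifulSoup(session.get(url).text, 'html.parser')
form_data = {
    inp['name']: inp.get('value', '')
    for inp in soup.find_all('input', type='hidden')
    if inp.get('name')
}  # typically picks up __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION

# Step 2: add our own form fields and submit the postback.
form_data['ctl00$SearchBox'] = 'Smith'  # hypothetical field name
response = session.post(url, data=form_data)
print(response.status_code)
```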
Sometimes the data being scraped is a little fuzzy - easy to understand for the human mind, but not strict and formal enough to be easily tractable by algorithms. This can get in the way of analysing the data or basing further automation on it. This is where heuristics come in. A heuristic is an inexact, but quick way to solve a problem. We will discuss how the GPT-3 API can be used to implement heuristic techniques to deal with somewhat fuzzy data that might otherwise require a human to take a look at it.
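For example, one might ask the model to normalise a messy, free-form field into something machine-readable. A rough sketch assuming the older pre-1.0 openai Python client and a completion-style model (the prompt wording and model name are illustrative only):

```python
import openai

openai.api_key = 'sk-...'  # your API key

messy_location = 'NYC area (remote ok), sometimes Jersey'

prompt = (
    'Extract the US state abbreviations mentioned in the text below. '
    'Answer with a comma-separated list only.\n\n'
    f'Text: {messy_location}\nAnswer:'
)

completion = openai.Completion.create(
    model='text-davinci-003',
    prompt=prompt,
    max_tokens=16,
    temperature=0,  # deterministic, conservative answers for a heuristic step
)
print(completion.choices[0].text.strip())
```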
One JavaScript obfuscation technique that can be found in the wild is string array mapping. It entails gathering all the string constants from the code into an array and modifying the code so that string literals are accessed by referencing the array at various indices instead of being used directly. Like with previously discussed obfuscation techniques, we will be exploring it in isolation, but on real websites it will most likely be used in combination with other techniques.
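The idea itself is language-agnostic; a tiny before/after sketch in Python (my own example - the real thing targets JavaScript and usually also shuffles or encodes the array):

```python
# Before obfuscation
print("Hello", "world")

# After string array mapping: literals are hoisted into one array
# and referenced by index throughout the code.
_strs = ["world", "Hello"]
print(_strs[1], _strs[0])
```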
We discussed some passive DNS recon techniques, but that’s only half of the story. There is also active DNS reconnaissance, which involves generating actual DNS requests to gather data. Active DNS recon does not rely on information footprints and secondary sources. Thus it enables us to get more up-to-date information on target systems.
So how do we send the DNS queries? We want to control exactly what request we send and bypass DNS caching on the local system, so using getaddrinfo() is out of the question.
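One convenient option is the dnspython library, which lets us build queries explicitly and point them at a resolver of our choosing; a minimal sketch (the domain and nameserver are arbitrary examples):

```python
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ['1.1.1.1']   # bypass the system resolver entirely

# Ask for specific record types directly.
for rtype in ('A', 'MX', 'TXT'):
    try:
        answer = resolver.resolve('example.com', rtype)
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        continue
    for record in answer:
        print(rtype, record.to_text())
```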
DNS is a network protocol and a distributed system that maps human-readable domain names to IP addresses. It can be thought of as a big phone book for internet hosts. Passive DNS recon is the activity of using various data sources to map out a footprint of a site or organisation without ever directly accessing the target or launching DNS requests. That’s basically doing OSINT on DNS names and records that are linked to the target.
The Python requests module is a well established open source library for doing HTTP requests that is widely used in web scraping and other fields. However, it has some limitations. At the time of writing, the requests module only supports HTTP/1.1, even though a significant fraction of sites support the more modern, faster HTTP/2 protocol, and there is no support for asynchronous communication in the requests module.
HTTPX is a newer, more modern Python module that addresses some of the limitations of the requests module (not to be confused with another httpx, a CLI tool for probing HTTP servers).
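A quick sketch of both features (HTTP/2 support requires installing httpx with the http2 extra, i.e. pip install 'httpx[http2]'; the URLs are arbitrary):

```python
import asyncio
import httpx

# Synchronous client negotiating HTTP/2 when the server supports it.
with httpx.Client(http2=True) as client:
    response = client.get('https://www.example.com/')
    print(response.http_version, response.status_code)


# Asynchronous client fetching several pages concurrently.
async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
        return [r.status_code for r in responses]

print(asyncio.run(fetch_all(['https://www.example.com/'] * 3)))
```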
OSINT is the collection and analysis of information from public sources. Nowadays it can largely be automated through web scraping and API integrations. Recon-ng is an open source OSINT framework that features a modular architecture. Each module can be thought of as a pluggable piece of code that can be loaded on an as-needed basis. Most modules are API integrations and scrapers of data sources. Some deal with report generation and other auxiliary tasks. Since recon-ng was developed by and for infosec people, the user interface resembles that of the Metasploit Framework, thus making the learning curve easier for people who are working on the offensive side of cybersecurity.
Curl is a prominent CLI tool and library for performing client-side requests in a number of application layer protocols. Commonly it is used for HTTP requests. However, there is more to curl than that. We will go through some less known features of curl that can be used in web scraping and automation systems.
Debugging
Sometimes things fail and require us to take a deeper look into the technical details of what is happening.
Stable Diffusion is a recently released open source text-to-image AI system that challenges DALL-E by OpenAI. Nowadays OpenAI products are open in name only: aside from client libraries and some other inconsequential things, all the new products by OpenAI (GPT-3, DALL-E 2) are not only proprietary, but also offered in SaaS form only. In contrast with the locked-down, proprietary nature of DALL-E, Stable Diffusion is fully open source and can be self-hosted on a server the end user controls.
DALL-E 2 is a cutting edge AI system for creating and manipulating images based on natural language instructions. Although it is not 100% available to the general public yet, it is possible to apply for access by filling in a form on the OpenAI site. This will place you on a waitlist. I was able to get access after some 20 days. It’s unclear on what basis they choose or prioritise people for access, but my guess is that having spent a few dollars on GPT-3 helped.
Unreachable parts can be injected into JavaScript source as a form of obfuscation. We will go through a couple of simple examples to show how the unreachable code paths can be removed at Abstract Syntax Tree level for reverse engineering purposes.
For the first example, let us consider the following code:
```js
if (1 == 2) console.log("1 == 2");

if (1 == 1) {
  console.log("1 == 1");
} else if (1 == 2) {
  console.
```
To make a Python script parametrisable with some input data, we need to develop a Command Line Interface (CLI). This will enable the end user to run the script with various inputs and provide a way to choose optional behaviors that the script supports. For example, running the script with --silent suppresses the output that it would otherwise print to the terminal, whereas --verbose makes the output extra detailed, with fine-grained technical details included.
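A minimal sketch with argparse from the standard library (the flag names match the examples above; everything else is arbitrary):

```python
import argparse

parser = argparse.ArgumentParser(description='Example scraper CLI')
parser.add_argument('url', help='page to scrape')
parser.add_argument('-o', '--output', default='out.json', help='output file path')
verbosity = parser.add_mutually_exclusive_group()
verbosity.add_argument('--silent', action='store_true', help='suppress output')
verbosity.add_argument('--verbose', action='store_true', help='print extra details')

args = parser.parse_args()
if not args.silent:
    print(f'Scraping {args.url} into {args.output} (verbose={args.verbose})')
```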
ImageMagick is a set of CLI tools and a C library for performing a variety of digital image handling tasks, such as converting between formats, resizing, mirroring, rotating, adjusting colors, rendering text, drawing shapes and so on. In many cases ImageMagick is not being used directly, but exists as part of the hidden substrate of C code that the modern shiny stuff is built upon. For example, it may be used to generate thumbnails of images that users uploaded to a web application.
One needs to use a proxy pool to scrape some of the sites that implement countermeasures against scraping. It might be because of IP-based rate limiting, geographic restrictions, AS-level traffic filtering or even something more advanced such as TLS fingerprinting. When the Scrapy framework is used to implement a scraper, there are several ways to do so:
Setting the HTTPS_PROXY environment variable when running a spider will make the traffic go through the proxy.
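Another common option, useful for per-request control (e.g. when rotating a pool), is setting the proxy in request meta, which the built-in HttpProxyMiddleware picks up; the proxy address below is a placeholder:

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    name = 'proxied'

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/ip',
            # Placeholder proxy URL - credentials can be embedded in it.
            meta={'proxy': 'http://user:pass@127.0.0.1:8888'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info(response.text)
```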
To get a deeper insight into how an Android app works, we may want to convert the binary form inside an APK file into some sort of textual representation that we can read and edit. This may be desirable for working out how to defeat some automation countermeasures (e.g. an HMAC being applied to API requests), finding vulnerabilities in the apps themselves or in the backend systems that service the apps, analysing malicious code and so on.
When scraping the web, one sometimes comes across data being hardcoded in JS snippets. One could be using regular expressions or naive string operations to extract it. However, it is also possible to parse JavaScript code into an Abstract Syntax Tree to extract the hardcoded data in a more structured way than by using simple string processing. Let us go through a couple of examples.
Suppose you were scraping some site and came across JS code similar to the following example from the Meta developer portal:
Since the previous post, I realised that there are more interesting and valuable tools that were not covered and that this warrants a new post. So let’s discuss several of them.
AST Explorer
Sometimes you run into obfuscated code that you need to make sense of and write some code to undo the obfuscation. AST Explorer is a web app that lets you put in code snippets and browse the AST for multiple programming languages and parsing libraries.
Previously on Trickster Dev:
Understanding Abstract Syntax Trees
JavaScript Obfuscation Techniques by Example

When it comes to reverse engineering obfuscated JavaScript code, there are two major approaches:

Dynamic analysis - using a debugger to step through the code and observing its behaviour over time.
Static analysis - analysing the source code without running it, by parsing and examining the code itself.

It is a bad idea to rely exclusively on regular expressions and naive string manipulation to do web scraping.
Sometimes when working on scraping some website you look into JavaScript code and it looks like a complete mess that is impossible to read - no matter how much you squint, you cannot make sense of it. That’s because it has been obfuscated. Code obfuscation is a transformation that is meant to make code difficult to read and reverse engineer. Many websites utilize JavaScript obfuscation to make things difficult for scraper/automation developers.
The following text assumes some knowledge and experience with Scrapy and is meant to provide a realistic example of scraping a moderately difficult site, the kind of work you may be doing as a freelancer in web scraping. This post is not meant for people completely unfamiliar with Scrapy. The reader is encouraged to read some earlier content on the blog that introduces Scrapy from the ground up. This time, we are building up from knowledge about the basics of Scrapy and thus skipping some details.
TLS (Transport Layer Security) is a network protocol that sits between the transport layer (TCP) and application layer protocols (HTTP, IMAP, SMTP and so on). It provides security features such as encryption and authentication to TCP connections, which merely deal with reliably transferring streams of data. For example, a lot of URLs on the modern web start with https:// and you typically see a lock icon by the address bar of your web browser.
Recently Youtube has introduced a small graph in its user interface that visualises a time series of video viewing intensity. Spikes in this graph indicate what parts of the video tend to be replayed often, thus being most interesting or relevant to watch. This requires quite some data to be accumulated and is only available on sufficiently popular videos.
Let us try scraping this data as an exercise in web scraper development.
The first thing that happens when program source code is parsed by a compiler or interpreter is tokenization - cutting the code into substrings that are later organised into a parse tree. However, this tree structure merely represents the textual structure of the code. A further step is syntactic analysis - the activity of converting the parse tree (also known as a CST - concrete syntax tree) into another tree structure that represents the logical (not textual) structure.
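To make this concrete, Python’s standard library exposes exactly this kind of tree for Python code; a quick look at the AST of a one-liner (my own example, unrelated to any obfuscated code):

```python
import ast

tree = ast.parse("total = price * 1.21")
# Prints a nested structure along the lines of
# Module(body=[Assign(targets=[Name('total')], value=BinOp(Name('price'), Mult(), Constant(1.21)))])
print(ast.dump(tree, indent=2))
```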
This post will consist of notes taken from The Bug Hunter’s Methodology: Application Analysis v1 - a talk by Jason Haddix at Nahamcon 2022. These notes are mostly for my own future review, but hopefully other people will find them useful as well.
Many people have been teaching how to inject an XSS payload, but not how to systematically find vulnerabilities in the first place. Jason created an AppSec edition of his methodology when it became large enough to be split into recon and AppSec parts.
Automation systems tend to have a temporal aspect to them, as some action or the entire flow may need to be executed at specific times or at certain intervals. Scrapers, vulnerability scanners and social media bots are examples of things that you may want to run on a schedule. Those using web scraping for lead generation or price intelligence need to relaunch the web scraper often enough to get up-to-date snapshots of data.
Sometimes one would scrape eCommerce product data for the purpose of reselling those products. For example, a retail eCommerce company might be sourcing its products from a distributor that does not provide an easy way to integrate with a Shopify store. This problem can be solved through web scraping. Once data is scraped, it can be imported into the Shopify store.
One way to do that is to wrangle the product dataset into file(s) that heed the Shopify product CSV schema and import it via Shopify store admin dashboard.
The Scrapy framework provides many benefits over regular Python scripts when it comes to developing web scrapers of non-trivial complexity. However, Scrapy by itself does not provide a direct way to integrate your web scraper into a larger system that may need some on-demand scraping (e.g. a price comparison website). ScrapyRT is a Scrapy extension that was developed by Zyte to address this limitation. Just like Scrapy itself, it is trivially installable through PIP. ScrapyRT exposes your Scrapy project (not just spiders, but also pipelines, middlewares and extensions) through an HTTP API that you can integrate into your systems.
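Assuming ScrapyRT is running locally on its default port 9080 and the project has a spider named quotes, a request could look roughly like this (endpoint and parameter names follow the ScrapyRT documentation; the spider name and URL are placeholders):

```python
import requests

response = requests.get(
    'http://localhost:9080/crawl.json',
    params={
        'spider_name': 'quotes',                       # spider from your Scrapy project
        'url': 'https://quotes.toscrape.com/page/1/',  # start URL for this run
    },
)
data = response.json()
print(data.get('status'), len(data.get('items', [])))
```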
Instagram scraping is of interest to the OSINT and growth hacking communities, but can be rather challenging. If we proceed with using browser automation for this purpose, we risk triggering client-side countermeasures. Using the private API is a safer, more performant approach, but has its own challenges. Instagram implements complex API flows that involve HMACs and other cryptographic techniques for extra security. When it comes to implementing Instagram scraping code, one does not simply use mitmproxy or Chrome DevTools to intercept API requests so that they could be reproduced programmatically.
To make HTTP requests with Python we can use the requests module. To send and receive things via a TCP connection we can use stream sockets. However, what if we want to go deeper than this? To parse and generate network packets, we can use the struct module. Depending on how deep in the protocol stack we are working, we may need to send/receive the wire-format buffer through raw sockets. This can be fun in a way, but if this kind of code is being written for research purposes (e.
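As a small taste of struct, here is how one could pack and unpack a made-up fixed-size message header (the field layout is invented purely for illustration):

```python
import struct

# Made-up header: version (1 byte), flags (1 byte), message id (2 bytes),
# payload length (4 bytes) - all in network (big-endian) byte order.
HEADER_FORMAT = '!BBHI'

packed = struct.pack(HEADER_FORMAT, 1, 0b00000010, 4242, 512)
print(packed.hex())                     # raw wire bytes as hex

version, flags, msg_id, length = struct.unpack(HEADER_FORMAT, packed)
print(version, flags, msg_id, length)   # 1 2 4242 512
```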
Introduction and motivation
We live in the postmodern age of hyperreality. The vast majority of information most people receive comes through technological means, without them having any direct access to its source or any way to verify or refute what is written or shown on the screen.
A pretty girl sitting in a private jet might have paid a lot of money to fly somewhere warm or might have paid a company that keeps the plane on the tarmac a smaller amount of money to do a photo shoot.
Amazon Web Services (AWS) is the most prominent large cloud provider in the world, offering what seems to be practically unlimited scalability and flexibility. It is heavily marketed towards startups and big companies alike. There’s even a version of AWS meant for extra-secure governmental use (AWS for Government). There’s an entire branch of DevOps industry that helps setting up software systems on AWS and a lot of people are making money by using AWS in some capacity.
Pyinstaller is a CLI tool, installable through PIP, that compiles Python scripts into executable binaries. Let us go through a couple of examples of using this tool.
The SMTP enumeration script from the previous post can be trivially compiled with this tool by running the following command:

```
$ pyinstaller smtp_enum.py
```

This creates two directories - build/ for intermediate files and dist/ for the results of compilation. However, we find that dist/ contains multiple binary files, whereas it is generally more convenient to compile everything into a single file that statically links all the dependencies.
This post will summarize The Bug Hunter’s Methodology v4.01: Recon edition - a talk at [email protected] 2020 by Jason Haddix, a prominent hacker in the bug bounty community. Taking notes is important when learning new things, and therefore notes were taken for future reference of this material. This methodology represents a breadth-first approach to bounty hunting and is meant to provide a reproducible strategy to discover as many assets related to the target as possible (but make sure to heed the scope!
Colly is a web scraping framework for the Go programming language. The feature set of Colly largely overlaps with that of the Scrapy framework from the Python ecosystem:

Built-in concurrency.
Cookie handling.
Caching of HTTP response data.
Automatic heeding of robots.txt rules.
Automatic throttling of outgoing traffic.

Furthermore, Colly supports distributed scraping out-of-the-box through a Redis-based task queue and can be integrated with Google App Engine. This makes it a viable choice for large-scale web scraping projects.
Although many proxy providers offer data center (DC) proxies fairly cheaply, sometimes it is desirable to make our own. In this post we will discuss how to set up Squid proxy server on cheap Virtual Private Servers from Vultr. We will be using Debian 11 Linux environment on virtual “Cloud Compute” servers.
Let us go through the steps to install Squid through Linux shell with commands that will be put into provisioning script.
Sometimes it is desirable to enumerate all (or as many as possible) email recipients at a given domain. This can be done by establishing an SMTP connection to the corresponding mail server, which can be found via the DNS MX record, and getting the server to verify your guesses for email usernames or addresses.
There are three major approaches to do this:
Using the VRFY request, which is meant to verify whether there is a user with a corresponding mailbox on the server (a sketch of this approach is shown below).
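A rough sketch of the VRFY approach with smtplib (the mail server host and the candidate names are placeholders; many servers disable or fake VRFY responses, so treat the results with care):

```python
import smtplib

MAIL_HOST = 'mail.example.com'   # would normally come from the MX lookup
CANDIDATES = ['admin', 'info', 'jsmith', 'no-such-user']

server = smtplib.SMTP(MAIL_HOST, 25, timeout=10)
server.ehlo()

for name in CANDIDATES:
    code, message = server.verify(name)
    # 250/252 generally mean "exists or cannot be ruled out", 550 means unknown user.
    print(name, code, message.decode(errors='replace'))

server.quit()
```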
Typically Docker is used to encapsulate server-side software in reproducible packages - containers. A certain degree of isolation is ensured between containers. Furthermore, containers can be used as building blocks for systems consisting of multiple software servers. For example, a web app can consist of backend server, database server, frontend server, load balancer, redis instance for caching and so on.
However, what if we want to run desktop GUI apps within Docker containers to use them as components within larger systems?
When developing and operating scraping/automation solutions, we don’t exactly want to focus on systems administration part of things. If we are writing code to achieve a particular objective, making that code run on the VPS or local system is merely a supporting side-objective that is required to make the primary thing happen. Thus it is undesirable to spend too much time on it, especially if we can use automation to avoid repetitive, error-prone activity of installing the required tooling in disposable virtual machines or virtual private servers.
HTTP messages are typically not sent in plaintext in the post-Snowden world. Instead, the TLS protocol is used to provide communications security against tampering and surveillance of communications based on the HTTP protocol. TLS itself is a fairly complex protocol consisting of several sub-protocols, but let us think of it as an encrypted and authenticated layer on top of a TCP connection that also does some server (and optionally client) verification through public key cryptography.
In this post we will take a deeper look into the architecture of not just Scrapy projects, but the Scrapy framework itself. We will go through some key components of Scrapy and will look into how data flows through the system.
Let us look into the following picture from Scrapy documentation: https://docs.scrapy.org/en/latest/_images/scrapy_architecture_02.png
We see the following components:
Engine is the central switchboard for all data that is transferred inside Scrapy when it is running.
Many websites do not provide proper access to information for unauthenticated users. Instead, the data is provided on some client area that is accessible only after the user goes through login flow, possibly by signing up for a paid plan beforehand. How can such websites be scraped with Python?
There are two additional things we have to do when scraping data behind a login:
Set up a requests.Session object with proper cookies by reproducing the login flow programmatically (see the sketch below).
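A generic sketch of that first step (the URLs and form field names are placeholders - each site's login form is different, and some also require a CSRF token scraped from the login page):

```python
import requests

session = requests.Session()

# Reproduce the login form submission; cookies set by the server
# (session id etc.) are stored on the session object automatically.
login_response = session.post(
    'https://example.com/login',          # placeholder login endpoint
    data={'username': 'me@example.com', 'password': 'hunter2'},
)
login_response.raise_for_status()

# Subsequent requests reuse the same cookie jar, so the client area is reachable.
page = session.get('https://example.com/account/orders')
print(page.status_code)
```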
By default, the Scrapy framework provides a way to export scraped data into CSV, JSON, JSONL or XML files, with a possibility to store them remotely. However, we may need more flexibility in how and where the scraped data will be stored. This is the purpose of Scrapy item pipelines. A Scrapy pipeline is a component of a Scrapy project for implementing post-processing and exporting of scraped data. We are going to discuss how to implement data export code in pipelines and provide a couple of examples.
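For instance, a minimal pipeline writing items to a JSON Lines file could look roughly like this (remember to enable it via ITEM_PIPELINES in the project settings; the file name is arbitrary):

```python
import json


class JsonLinesExportPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item (or raise DropItem).
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```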
Scrapy is a prominent Python framework for web scraping that provides a certain kind of project template structure for easier development. However, once a Scrapy project is developed, it may be necessary to deploy it into the cloud for long-running scraping jobs. Scrapy Cloud is a SaaS solution for hosting your Scrapy projects, developed by the same company that created the Scrapy framework. Let us go through an example of using it with an already developed scraper.
You may want to get notified about certain events happening during your scraping/botting operations. Examples might be outages of external systems that your setup depends on, fatal error conditions, scraping jobs being finished, and so on. If you are implementing automation for bug bounty hunting, you certainly want to get notified about new vulnerabilities being found in the target systems being scanned. You may also want to get periodic status updates on long running tasks.
MS Playwright is a framework for web testing and automation. Playwright supports programmatic control of the Chromium and Firefox browsers and also integrates the WebKit engine. Headless mode is supported and enabled by default, thus making it possible to run your automations in environments that have no GUI support (lightweight virtual private servers, Docker containers and so on). Playwright is written in JavaScript, but has official bindings for Python, C#, TypeScript and Java.
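A minimal example with the Python bindings and the synchronous API (the URL is arbitrary; browser binaries need to be fetched first with the playwright install command):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # headless is the default anyway
    page = browser.new_page()
    page.goto('https://example.com/')
    print(page.title())
    page.screenshot(path='example.png')
    browser.close()
```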
Spreadsheets are a mainstay tool for information processing in many domains and are widely used by people in many walks of life. It is fairly common for developers working in scraping and automation to be dealing with spreadsheets for inputting data into custom code, report generation and other tasks. Openpyxl is a prominent Python module for reading and writing spreadsheet files in the Office Open XML format that is compatible with many spreadsheet programs (MS Excel 2010 and later, Google Docs, Open Office, Libre Office, Apple Numbers, etc.).
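Writing and reading a simple sheet takes only a few lines (the file name and contents are arbitrary):

```python
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.title = 'Prices'
ws.append(['product', 'price'])          # header row
ws.append(['Air Zoom Pegasus', 119.99])  # data row
wb.save('prices.xlsx')

# Reading it back
wb2 = load_workbook('prices.xlsx')
for row in wb2['Prices'].iter_rows(min_row=2, values_only=True):
    print(row)
```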
Use Scrapy shell for interactive experimentation
Running scrapy shell gives you an interactive environment for experimenting with the site being scraped. For example, running fetch() with the URL of a page fetches the page and creates a response variable holding a scrapy.Response object for that page. view(response) opens the browser to let you see the HTML that the Scrapy spider would fetch. This bypasses some of the client-side rendering and also lets us detect if the site has some countermeasures against scraping.
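A typical session might go roughly like this (the URL and selector are placeholders):

```python
# In a terminal: scrapy shell
fetch('https://quotes.toscrape.com/')   # populates the `response` variable
response.css('title::text').get()       # quick selector experiments
view(response)                          # open the fetched HTML in a browser
```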
You may have some reason to automatically gather (harvest) software developer emails. That might be SaaS marketing objectives, recruitment or community building. One place online that has a lot of developers is Github. Some of the developers have a public email address listed on their Github profile. Turns out, this information is available through Github’s REST API and can be extracted with just a bit of scripting.
First we need an API token.
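With a token in hand, fetching a profile (and its public email, if any) is a single call to the users endpoint; a sketch with requests (the username and token are placeholders):

```python
import requests

TOKEN = 'ghp_...'          # personal access token, placeholder
USERNAME = 'octocat'       # any GitHub username

response = requests.get(
    f'https://api.github.com/users/{USERNAME}',
    headers={
        'Authorization': f'token {TOKEN}',
        'Accept': 'application/vnd.github+json',
    },
)
profile = response.json()
print(profile.get('login'), profile.get('email'))  # email is None unless made public
```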
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a type of automation countermeasure. It is based on a challenge-response approach that involves asking the user to perform an action that only a human is supposed to be able to perform, such as:

Writing down characters or words from a distorted image.
Performing basic mathematical operations that are presented in a distorted image.
Recognizing a specific object within a set of images, possibly with distortion.
Scrapy is a Python framework for web scraping. One might ask: why would we need an entire framework for web scraping if we can simply code up some simple scripts in Python using Beautiful Soup, lxml, requests and the like? To be fair, for simple scrapers you don’t. However, when you are developing and running a web scraper of non-trivial complexity, the following problems will arise:
Error handling. If you have an unhandled exception in your Python code, it can bring down the entire script, which can cause time and data to be lost.
To get things done, one needs a set of tools appropriate to the task. We will discuss several open source tools that are highly valuable for developers working in scraping and automation.
Chrome DevTools
Let us start with the basics. Chrome has a developer tool panel that you can open by right-clicking on something in the web page and choosing “Inspect Element”, or by going to View -> Developer -> Developer Tools.
Automated gathering of email addresses is known as email harvesting. Suppose we want to gather the email addresses of certain kinds of individuals, such as influencers or content creators. This can be accomplished through certain lesser-known features of Google. For example, the site: operator limits search results to a given domain. Double-quoting something forces Google to provide exact matches on the quoted text. Boolean operators (AND, OR) are also supported.
Google caps the number of results at about 300 per query, but we can mitigate this limitation by breaking down the search space into smaller segments across multiple search queries.
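For instance, a query combining these operators to surface creator emails on a single platform might look like this (the keyword and domains are arbitrary examples):

```
site:instagram.com "fitness coach" ("@gmail.com" OR "@yahoo.com")
```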
When scraping HTML pages, many developers use CSS selectors to find the right elements for data extraction. However, modern HTML is largely based on the XML format, which means there is a better, more powerful way to find the exact elements one needs: XPath. XPath is an entire language created for traversing XML (and, by extension, HTML) element trees. The size and complexity of the XPath 3.1 specification might seem daunting, but the good news is that you need to know very little of it as a web scraper developer.
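A taste of XPath via parsel, the selector library Scrapy uses under the hood (the HTML here is a made-up snippet):

```python
from parsel import Selector

html = '''
<html><body>
  <div class="product"><h2>Air Max</h2><span class="price">149.99</span></div>
  <div class="product"><h2>Pegasus</h2><span class="price">119.99</span></div>
</body></html>
'''

sel = Selector(text=html)
# Select every product name, then the price of one specific product by its name.
print(sel.xpath('//div[@class="product"]/h2/text()').getall())
print(sel.xpath('//div[h2[text()="Pegasus"]]/span[@class="price"]/text()').get())
```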
When it comes to scraping and automation operations, it might be important to control where remote systems see the traffic coming from to evade rate-limiting, captchas and IP bans. This is what we need to use proxies for. Let us talk about individual proxy servers and proxy pools.
A proxy server is a server somewhere in the network that acts as a middleman for network communications. One way this can work is connection-level proxying via the SOCKS protocol.
In the previous post, instructions were provided on how to set up mitmproxy with an iOS device. In this one, we will be going through setting up mitmproxy with an Android device or emulator. If you would prefer to use an Android emulator for hacking mobile apps, I would recommend the Genymotion software, which lets you create pre-rooted virtual devices. The following steps were reproduced with an Android 10 system running in the Genymotion emulator with Google Chrome installed through ADB from some sketchy APK file that was found online.
mitmproxy is an open source proxy server developed for launching man-in-the-middle attacks against network communications (primarily HTTP(S)). mitmproxy enables passive sniffing, active modification and replaying of HTTP messages. It is meant to be used for troubleshooting, reverse engineering and penetration testing of networked software.
We will be setting up mitmproxy with an iOS 15 device for scraping and gray hat automation purposes. One use case of the mitmproxy-iPhone setup is discussed in my previous post about scraping the private API of a mobile app.
Web scraping is a widely known way to gather information from external sources. However, it is not the only way. Another way is API scraping. We define API scraping as the activity of automatically extracting data from reverse engineered private APIs. In this post, we will go through an example of reverse engineering the private API of a mobile app and developing a simple API scraping script that reproduces API calls to extract the exact information that the app is showing on the mobile device.
First, let us presume that we want to develop code to extract structured information from web pages that may or may not be doing a lot of API calls from client-side JavaScript. We are not trying to develop a general crawler like Googlebot. We don’t mind writing some code that would be specific for each site we are scraping (e.g. some Scrapy spiders or Python scripts).
When coming across discussions about web scraper development on various forums online, it is common to hear people saying that they need JavaScript rendering to scrape websites.
In this day and age, JSON is the dominant textual format for information exchange between software systems. JSON is based on key-value pairs. The key is always a string, but values can be objects (dictionaries), arrays, numbers, booleans, strings or nulls. A common problem that software developers run into is that JSON tree structures can be deeply nested, with parts that may or may not be present. This can lead to tedious, awkward code.
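One common remedy is a tiny helper that walks a path of keys/indices and falls back to a default when anything along the way is missing (a standalone sketch, not a specific library):

```python
def dig(data, *path, default=None):
    """Safely walk nested dicts/lists; return `default` if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


doc = {"user": {"profile": {"emails": [{"address": "a@example.com"}]}}}
print(dig(doc, "user", "profile", "emails", 0, "address"))    # a@example.com
print(dig(doc, "user", "settings", "theme", default="dark"))  # dark
```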
Use pandas read_html() to parse HTML tables (when possible)
Some pages you might be scraping might include old-school HTML <table> elements with tabular data. There’s an easy way to extract data from pages like this into Pandas dataframes (although you may need to clean them up afterwards):

```
>>> import pandas as pd
>>> pd.read_html('https://en.wikipedia.org/wiki/Yakutsk')[0]
                        Yakutsk Якутск                      Yakutsk Якутск.1
0  City under republic jurisdiction[1]  City under republic jurisdiction[1]
1               Other transcription(s)               Other transcription(s)
2                              • Yakut                            Дьокуускай
3         Central Yakutsk from the air         Central Yakutsk from the air
4  .
```
Twitch is a popular video streaming platform originally developed for gamers, but now expanding beyond the gaming community. It enables viewers to use chat rooms associated with video streams to communicate among each other and with the streamer. Chatbots for these chat rooms can be used for legitimate use cases such as moderation, polls, ranking boards, and so on. Twitch allows such usage but requires passing a verification process for chatbot software to be used in production at scale.
There is more to using Google than searching by keywords, phrases and natural language questions. Google also has advanced features that can empower users to be extra specific with their searches.
To search for an exact phrase, wrap it in double quotes, for example:
"top secret" To search for documents with a specific file extension, use filetype: operator:
filetype:pdf "top secret' To search for results in a specific domain, use site: operator.
Sometimes, when developing server-side software it is desirable to make it accessible from outside the local network, which might be shielded from incoming TCP connections by a router that performs Network Address Translation. One option to work around this is to use Ngrok - a SaaS app that lets you tunnel out a connection from your network and exposes it to external traffic through a cloud endpoint. However, it is primarily designed for web apps and it would be nice if we didn’t need to rely on a third-party SaaS vendor to make our server software accessible outside our local network.
Howitzer is a SaaS tool that scrapes subreddits for users mentioning given keywords and automates mass direct message sending for growth hacking purposes.
Generally speaking, there are significant difficulties when automating against major social media platforms. However, Reddit is not as hostile towards automation as other platforms and even provides a relatively unrestricted official API for building bots and integrations.
Let us try automating against the Reddit API with Python to build a poor man’s Howitzer.
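A rough sketch with the praw library (the client id/secret, account details, subreddit and keyword are all placeholders; heed Reddit’s API rules and rate limits before messaging anyone):

```python
import praw

reddit = praw.Reddit(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    username='my_bot_account',
    password='password',
    user_agent='howitzer-lite by u/my_bot_account',
)

# Find recent submissions in a subreddit mentioning a keyword.
for submission in reddit.subreddit('python').search('web scraping', sort='new', limit=5):
    print(submission.author, submission.title)
    # Direct messaging an author would be something like:
    # submission.author.message(subject='Hi', message='...')
```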
GPT-3 is a large scale natural language processing system based on deep learning and developed by OpenAI. It is now generally available for developers and curious people.
GPT-3 is the kind of AI that works exclusively on text. Text is both input and output for GPT-3. One can provide questions and instructions in plain English and receive fairly coherent responses. However, GPT-3 is not:
HAL from 2001: A Space Odyssey
Project 2501 from Ghost in the Shell
T-800 from The Terminator
AM from I Have No Mouth and I Must Scream

What GPT-3 is, is a tool AI (as opposed to an agent AI).
PerimeterX is a prominent vendor of anti-bot technology, used by portals such as Zillow, Crunchbase, StockX and many others. Many developers working on web scraping or automation scripts have run into the PerimeterX Human Challenge - a proprietary CAPTCHA that involves pressing and holding an HTML element and does not seem to be solvable by any of the CAPTCHA solving services.
PerimeterX has registered the following US patents:
US 10,708,287B2 - ANALYZING CLIENT APPLICATION BEHAVIOR TO DETECT ANOMALIES AND PREVENT ACCESS
US 10,951,627B2 - SECURING ORDERED RESOURCE ACCESS
US 2021/064685A1 - IDENTIFYING A SCRIPT THAT ORIGINATES SYNCHRONOUS AND ASYNCHRONOUS ACTIONS (pending application)

Let us take a look into these patents to discover the key working principles of PerimeterX bot mitigation technology.
When wandering across the World Wide Web, many netizens have come across pages containing Youtube or Vimeo videos embedded in them. youtube-dl is a prominent tool to download online videos from many sources (not limited to Youtube - see the complete list of supported sites), but can it download the videos even if they are embedded in some third-party website?
Turns out, it can (with a little bit of help from the user).
Trickster Dev
Code level discussion of web scraping, gray hat automation, growth hacking and bounty hunting