Curl is a prominent CLI tool and library for performing client-side requests in a number of application layer protocols. Commonly it is used for HTTP requests. However, there is more to curl than that. We will go through some less known features of curl that can be used in web scraping and automation systems.
Debugging
Sometimes things fail and require us to take a deeper look into technical
details of what is happening. Running curl with --verbose
(or just -v
)
gives us a certain level of technical details regarding connection establishment
and protocol messages being transmitted:
$ curl http://httpbin.org/status/401 -v
* Trying 34.227.213.82:80...
* Connected to httpbin.org (34.227.213.82) port 80 (#0)
> GET /status/401 HTTP/1.1
> Host: httpbin.org
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 401 UNAUTHORIZED
< Date: Sat, 03 Sep 2022 07:28:47 GMT
< Server: gunicorn/19.9.0
< WWW-Authenticate: Basic realm="Fake Realm"
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Credentials: true
< Cache-Control: proxy-revalidate
< Content-Length: 0
< Connection: Keep-Alive
<
* Connection #0 to host httpbin.org left intact
However, we may want to look deeper. There are CLI options that gives us even
more detailed output for troubleshooting. --trace
saves full dump of data
being sent and received into a text file (or standard output if we use -
instead of file path):
$ curl http://httpbin.org/status/401 --trace -
== Info: Trying 34.227.213.82:80...
== Info: Connected to httpbin.org (34.227.213.82) port 80 (#0)
=> Send header, 85 bytes (0x55)
0000: 47 45 54 20 2f 73 74 61 74 75 73 2f 34 30 31 20 GET /status/401
0010: 48 54 54 50 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 HTTP/1.1..Host:
0020: 68 74 74 70 62 69 6e 2e 6f 72 67 0d 0a 55 73 65 httpbin.org..Use
0030: 72 2d 41 67 65 6e 74 3a 20 63 75 72 6c 2f 37 2e r-Agent: curl/7.
0040: 37 39 2e 31 0d 0a 41 63 63 65 70 74 3a 20 2a 2f 79.1..Accept: */
0050: 2a 0d 0a 0d 0a *....
== Info: Mark bundle as not supporting multiuse
<= Recv header, 27 bytes (0x1b)
0000: 48 54 54 50 2f 31 2e 31 20 34 30 31 20 55 4e 41 HTTP/1.1 401 UNA
0010: 55 54 48 4f 52 49 5a 45 44 0d 0a UTHORIZED..
<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 53 61 74 2c 20 30 33 20 53 65 Date: Sat, 03 Se
0010: 70 20 32 30 32 32 20 30 37 3a 33 36 3a 32 35 20 p 2022 07:36:25
0020: 47 4d 54 0d 0a GMT..
<= Recv header, 25 bytes (0x19)
0000: 53 65 72 76 65 72 3a 20 67 75 6e 69 63 6f 72 6e Server: gunicorn
0010: 2f 31 39 2e 39 2e 30 0d 0a /19.9.0..
<= Recv header, 44 bytes (0x2c)
0000: 57 57 57 2d 41 75 74 68 65 6e 74 69 63 61 74 65 WWW-Authenticate
0010: 3a 20 42 61 73 69 63 20 72 65 61 6c 6d 3d 22 46 : Basic realm="F
0020: 61 6b 65 20 52 65 61 6c 6d 22 0d 0a ake Realm"..
<= Recv header, 32 bytes (0x20)
0000: 41 63 63 65 73 73 2d 43 6f 6e 74 72 6f 6c 2d 41 Access-Control-A
0010: 6c 6c 6f 77 2d 4f 72 69 67 69 6e 3a 20 2a 0d 0a llow-Origin: *..
<= Recv header, 40 bytes (0x28)
0000: 41 63 63 65 73 73 2d 43 6f 6e 74 72 6f 6c 2d 41 Access-Control-A
0010: 6c 6c 6f 77 2d 43 72 65 64 65 6e 74 69 61 6c 73 llow-Credentials
0020: 3a 20 74 72 75 65 0d 0a : true..
<= Recv header, 33 bytes (0x21)
0000: 43 61 63 68 65 2d 43 6f 6e 74 72 6f 6c 3a 20 70 Cache-Control: p
0010: 72 6f 78 79 2d 72 65 76 61 6c 69 64 61 74 65 0d roxy-revalidate.
0020: 0a .
<= Recv header, 19 bytes (0x13)
0000: 43 6f 6e 74 65 6e 74 2d 4c 65 6e 67 74 68 3a 20 Content-Length:
0010: 30 0d 0a 0..
<= Recv header, 24 bytes (0x18)
0000: 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 70 Connection: Keep
0010: 2d 41 6c 69 76 65 0d 0a -Alive..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
== Info: Connection #0 to host httpbin.org left intact
If we wanted to skip hexdumps of protocol messages we can use --trace-ascii
instead of --trace
. To augment the debug output with timing information
we can use --trace-time
:
$ curl http://httpbin.org/status/401 --trace-ascii - --trace-time
14:38:54.540384 == Info: Trying 54.147.68.244:80...
14:38:54.573724 == Info: Connected to httpbin.org (54.147.68.244) port 80 (#0)
14:38:54.573928 => Send header, 85 bytes (0x55)
0000: GET /status/401 HTTP/1.1
001a: Host: httpbin.org
002d: User-Agent: curl/7.79.1
0046: Accept: */*
0053:
14:38:55.131822 == Info: Mark bundle as not supporting multiuse
14:38:55.131875 <= Recv header, 27 bytes (0x1b)
0000: HTTP/1.1 401 UNAUTHORIZED
14:38:55.131907 <= Recv header, 37 bytes (0x25)
0000: Date: Sat, 03 Sep 2022 07:38:55 GMT
14:38:55.131932 <= Recv header, 25 bytes (0x19)
0000: Server: gunicorn/19.9.0
14:38:55.131958 <= Recv header, 44 bytes (0x2c)
0000: WWW-Authenticate: Basic realm="Fake Realm"
14:38:55.131983 <= Recv header, 32 bytes (0x20)
0000: Access-Control-Allow-Origin: *
14:38:55.132004 <= Recv header, 40 bytes (0x28)
0000: Access-Control-Allow-Credentials: true
14:38:55.132028 <= Recv header, 33 bytes (0x21)
0000: Cache-Control: proxy-revalidate
14:38:55.132050 <= Recv header, 19 bytes (0x13)
0000: Content-Length: 0
14:38:55.132070 <= Recv header, 24 bytes (0x18)
0000: Connection: Keep-Alive
14:38:55.132091 <= Recv header, 2 bytes (0x2)
0000:
14:38:55.132122 == Info: Connection #0 to host httpbin.org left intact
This can be used with --trace
, --trace-ascii
and --verbose
.
Dictionary lookups with DICT protocol
DICT (RFC 2229) is a application level protocol for performing dictionary lookups. It can be used via curl like this:
$ curl dict://dict.org/d:linux
220 dict.dict.org dictd 1.12.1/rf on Linux 4.19.0-10-amd64 <auth.mime> <[email protected]>
250 ok
150 1 definitions retrieved
151 "linux" wn "WordNet (r) 3.0 (2006)"
Linux
n 1: an open-source version of the UNIX operating system
.
250 ok [d/m/c = 1/0/30; 0.000r 0.000u 0.000s]
221 bye [d/m/c = 0/0/0; 0.000r 0.000u 0.000s]
This may seem like it’s nothing to write home about. After all, we can
search Google with define:
and the word we want to lookup. Scraping
Google SERPs either directly or through some SaaS API is not hard. However
this provides an example of a fairly obscure network protocol that
curl has support for.
Transferring files via (S)FTP
Let us something bit more useful now. Curl can be used to transfer files via FTP, SFTP and FTPS protocols. For example, we can use curl to list files on FTP server and download them:
$ curl --list-only ftp://ftp.sunet.se
HEADER.html
Public
about
cdimage
conspiracy
debian
debian-cd
favicon.ico
images
mirror
pub
releases
robots.txt
tails
ubuntu
$ curl --list-only ftp://ftp.sunet.se/debian/
README
README.CD-manufacture
README.html
README.mirrors.html
README.mirrors.txt
dists
doc
extrafiles
indices
ls-lR.gz
pool
project
tools
zzz-dists
$ curl ftp://ftp.sunet.se/debian/README -o README
If we wanted to upload a local file to remote path at FTP server we could use
-T
or --upload-file
.
However plain text anonymous servers are going out of fashion within last couple decades. When we have SSH access to a remote server we can also use SFTP - a more modern file transfer protocol that is built upon the foundation of SSH. This is not merely FTP over SSH, but a separate protocol designed from ground up by IETF. It is also more generalised than the legacy SCP protocol.
To use SFTP with curl, we would use the same URL format, except we would have
sftp://
at the beginning. If username/password auth is required we can
pass the credentials in colon-separated form via --user
option.
If we are trying to connect to FTPS (FTP over TLS) server we would use ftps://
for the URLs.
Sending and receiving email
Curl also supports SMTP protocol for email sending. Let us try sending an email via Sendgrid. First we need to prepare a text file with some email headers and content. This is what we save into email.txt:
From: No Reply <[email protected]>
To: Test User <[email protected]>
Subject: Test email
A specter is haunting the modern world, the specter of crypto anarchy.
A command to send this email is the following:
$ curl --insecure --ssl smtp://smtp.sendgrid.com:587 --user "apikey:[REDACTED]" --mail-rcpt "[email protected]" --mail-from "[email protected]" --upload-file email.txt
The redacted part is Sendgrid API key that we use as SMTP password.
Curl can be used to send emails, but what about receiving them? This is also possible via POP3 and IMAP protocols.
Let us first list message IDs and sizes via POP3 protocol:
$ curl pop3://pop3.rambler.ru --user "[email protected]:[REDACTED]"
1 4388
We have a single message with ID 1 and size 4388. It can be downloaded by using the message ID as URL path:
$ curl pop3://pop3.rambler.ru/1 --user "[email protected]:[REDACTED]"
Things are a bit different with IMAP, as IMAP protocol supports different mailboxes that we can list:
$ curl --ssl imap://imap.rambler.ru/ --user "[email protected]:[REDACTED]"
* LIST (\HasNoChildren \UnMarked \Sent) "/" SentBox
* LIST (\HasNoChildren \Junk) "/" Spam
* LIST (\HasNoChildren \Marked \Trash) "/" Trash
* LIST (\HasNoChildren \Drafts) "/" DraftBox
* LIST (\HasNoChildren) "/" INBOX
To check the status of mailbox, let us send examine
IMAP command:
$ curl --ssl "imap://mail.rambler.ru/" -X "examine INBOX" --user "[email protected]:[REDACTED]"
* FLAGS (\Answered \Flagged \Deleted \Seen \Draft)
* OK [PERMANENTFLAGS ()] Read-only mailbox.
* 2 EXISTS
* 0 RECENT
* OK [UIDVALIDITY 1644072147] UIDs valid
* OK [UIDNEXT 20] Predicted next UID
* OK [HIGHESTMODSEQ 37] Highest
To get a single message at specific index, query it like this:
$ curl --ssl "imap://mail.rambler.ru/INBOX;MAILINDEX=1" --user "[email protected]:[REDACTED]"
libcurl and it’s bindings
Curl is not only a CLI tool, but also a C library. If we have a curl snippet we can
add --libcurl
with file path to generate some boilerplate C code:
$ curl http://httpbin.org/status/401 --libcurl example.c
It will generate something like this:
/********* Sample code generated by the curl command line tool **********
* All curl_easy_setopt() options are documented at:
* https://curl.se/libcurl/c/curl_easy_setopt.html
************************************************************************/
#include <curl/curl.h>
int main(int argc, char *argv[])
{
CURLcode ret;
CURL *hnd;
hnd = curl_easy_init();
curl_easy_setopt(hnd, CURLOPT_BUFFERSIZE, 102400L);
curl_easy_setopt(hnd, CURLOPT_URL, "http://httpbin.org/status/401");
curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
curl_easy_setopt(hnd, CURLOPT_USERAGENT, "curl/7.85.0");
curl_easy_setopt(hnd, CURLOPT_MAXREDIRS, 50L);
curl_easy_setopt(hnd, CURLOPT_HTTP_VERSION, (long)CURL_HTTP_VERSION_2TLS);
curl_easy_setopt(hnd, CURLOPT_FTP_SKIP_PASV_IP, 1L);
curl_easy_setopt(hnd, CURLOPT_TCP_KEEPALIVE, 1L);
/* Here is a list of options the curl code used that cannot get generated
as source easily. You may choose to either not use them or implement
them yourself.
CURLOPT_WRITEDATA set to a objectpointer
CURLOPT_INTERLEAVEDATA set to a objectpointer
CURLOPT_WRITEFUNCTION set to a functionpointer
CURLOPT_READDATA set to a objectpointer
CURLOPT_READFUNCTION set to a functionpointer
CURLOPT_SEEKDATA set to a objectpointer
CURLOPT_SEEKFUNCTION set to a functionpointer
CURLOPT_ERRORBUFFER set to a objectpointer
CURLOPT_STDERR set to a objectpointer
CURLOPT_HEADERFUNCTION set to a functionpointer
CURLOPT_HEADERDATA set to a objectpointer
*/
ret = curl_easy_perform(hnd);
curl_easy_cleanup(hnd);
hnd = NULL;
return (int)ret;
}
/**** End of sample code ****/
If you are not working in fairly deep systems programming you may not be willing to code in C. However, there’s libcurl bindings available in a lot of languages:
- C++ - curlpp
- Go - go-curl
- Haskell - curl Haskell package
- Java - curl-java
- JavaScript - node-libcurl
- PHP - PHP curl library
- Python - PyCURL
- Rust - curl-rust
If you merely need to do some basic HTTP requests using libcurl is likely to be an overkill. For example, Python requests module is far easier to use than PyCURL. However, if you’re doing some nonstandard stuff libcurl might be a good abstraction layer to base your code upon. For example, once I had to integrate proxy servers that require TLS handshake to be performed before HTTP CONNECT message is sent by the client. Using such proxy servers is possible with PyCURL.