Reproducible Linux environments with Vagrant and Terraform

When developing and operating scraping/automation solutions, we don't exactly want to focus on the systems administration side of things. If we are writing code to achieve a particular objective, making that code run on a VPS or a local system is merely a supporting side-objective that is required to make the primary thing happen. Thus it is undesirable to spend too much time on it, especially if we can use automation to avoid the repetitive, error-prone activity of installing the required tooling on disposable virtual machines or virtual private servers. Luckily for us, there are DevOps tools developed for this exact purpose.

For the sake of this example, we will be setting up Node.js with the n8n workflow automation platform.

To reproduce a virtual machine with a Linux OS on a development machine, we will use the Vagrant VM management software with the VirtualBox backend.

To reproduce the same environment on a Digital Ocean VPS, we will use Terraform, an infrastructure-as-code tool.

Vagrant

First, we need to set up Vagrant. If you use any of the major desktop operating systems (Windows, macOS, Linux), you should be able to find an installer on the Vagrant Downloads page. You may also want to check if it is available through a package manager on your system.

Likewise, VirtualBox installers are available on the VirtualBox Downloads page.
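
To confirm that both tools are installed and reachable from your shell, you can check their versions (a quick sanity check; the exact version numbers will depend on what you installed):

$ vagrant --version
$ VBoxManage --version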

Once both are installed, we are ready to start setting up a configuration for our VM. We go to the Vagrant box search page - a searchable directory of pre-made virtual machine images that we can use. Since Vagrant works best with Linux, most of these will be Linux-based. For the sake of this example, we choose generic/debian11, as it provides a fairly vanilla Debian 11 system that is also readily available on cloud providers.
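
If you like, you can pre-download the box before bringing the VM up for the first time (otherwise Vagrant fetches it automatically on first use):

$ vagrant box add generic/debian11 --provider virtualbox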

Now we need to create a Vagrantfile. This will be the central configuration file for the Vagrant environment we are setting up. Technically, Vagrantfiles are written in Ruby, but don't worry if you don't know Ruby - it is not actually needed in all but the most advanced use cases of Vagrant.

On the Vagrant box page, there is a very basic example snippet:

Vagrant.configure("2") do |config|
  config.vm.box = "generic/debian11"
end

However, this is not enough for us. Instead, we will generate a new Vagrantfile from a more extensive template by running vagrant init generic/debian11 in the root directory of our project. This creates the following Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box. You can search for
  # boxes at https://vagrantcloud.com/search.
  config.vm.box = "generic/debian11"

  # Disable automatic box update checking. If you disable this, then
  # boxes will only be checked for updates when the user runs
  # `vagrant box outdated`. This is not recommended.
  # config.vm.box_check_update = false

  # Create a forwarded port mapping which allows access to a specific port
  # within the machine from a port on the host machine. In the example below,
  # accessing "localhost:8080" will access port 80 on the guest machine.
  # NOTE: This will enable public access to the opened port
  # config.vm.network "forwarded_port", guest: 80, host: 8080

  # Create a forwarded port mapping which allows access to a specific port
  # within the machine from a port on the host machine and only allow access
  # via 127.0.0.1 to disable public access
  # config.vm.network "forwarded_port", guest: 80, host: 8080, host_ip: "127.0.0.1"

  # Create a private network, which allows host-only access to the machine
  # using a specific IP.
  # config.vm.network "private_network", ip: "192.168.33.10"

  # Create a public network, which generally matched to bridged network.
  # Bridged networks make the machine appear as another physical device on
  # your network.
  # config.vm.network "public_network"

  # Share an additional folder to the guest VM. The first argument is
  # the path on the host to the actual folder. The second argument is
  # the path on the guest to mount the folder. And the optional third
  # argument is a set of non-required options.
  # config.vm.synced_folder "../data", "/vagrant_data"

  # Provider-specific configuration so you can fine-tune various
  # backing providers for Vagrant. These expose provider-specific options.
  # Example for VirtualBox:
  #
  # config.vm.provider "virtualbox" do |vb|
  #   # Display the VirtualBox GUI when booting the machine
  #   vb.gui = true
  #
  #   # Customize the amount of memory on the VM:
  #   vb.memory = "1024"
  # end
  #
  # View the documentation for the provider you are using for more
  # information on available options.

  # Enable provisioning with a shell script. Additional provisioners such as
  # Ansible, Chef, Docker, Puppet and Salt are also available. Please see the
  # documentation for more information about their specific syntax and use.
  # config.vm.provision "shell", inline: <<-SHELL
  #   apt-get update
  #   apt-get install -y apache2
  # SHELL
end

In the initial configuration, only config.vm.box is set and the rest is commented out. Read the comments to get an idea of what can be configured for the virtual machine.

First, we want to perform provisioning with a shell script. However, we will not be using the inline syntax shown in the comments; we will create a separate shell script file instead. This is because we want to be able to reuse the provisioning code in the Terraform config that we will work on later, without introducing code duplication. Thus we replace the provisioning section with the following line:

  config.vm.provision "shell", path: "provision.sh"

Now we have to develop the shell script that will install Node.js, n8n and all the dependencies we need to run n8n, and also launch n8n as a daemon.

The contents of provision.sh file are fairly straightforward:

#!/bin/bash

set -x

# Refresh the APT package lists.
apt-get update

# Install Node.js (plus build tools for native addons) from the NodeSource distribution.
curl -fsSL https://deb.nodesource.com/setup_17.x -o /tmp/install_node.sh
bash /tmp/install_node.sh
apt-get install -y gcc g++ make nodejs

# Install the Yarn package manager from its official APT repository.
curl -sL https://dl.yarnpkg.com/debian/pubkey.gpg | gpg --dearmor | sudo tee /usr/share/keyrings/yarnkey.gpg >/dev/null
echo "deb [signed-by=/usr/share/keyrings/yarnkey.gpg] https://dl.yarnpkg.com/debian stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
apt-get update
apt-get install -y yarn

# Install n8n and the PM2 process manager globally via npm.
npm install n8n -g
npm install pm2 -g

# Launch n8n under PM2 and register it to start on boot.
pm2 start n8n
pm2 startup
pm2 save

First, we run apt-get update so that the APT system fetches up-to-date information about Debian packages. Next, we proceed with the Node.js installation. Since the Debian package of Node.js is not very well maintained, we use an alternative approach: installing the NodeSource binary distribution, which involves fetching a shell script that updates the APT configuration, running it, and then installing Node.js from the third-party APT repository. We use a similar approach to install the Yarn package manager. There is no explicit step to install npm, as it comes with the nodejs APT package.

Now we can use npm to install n8n, which we do with the command npm install n8n -g. Since we want to run n8n as a daemon, we also install the PM2 process manager and run the relevant commands to set up n8n as a background service process. PM2 makes sure that n8n is relaunched when the Linux system restarts.

Running vagrant up in the directory with the Vagrantfile creates a virtual machine with Debian 11 and runs our provisioning script. Running vagrant ssh gives us shell access to the Linux environment that was provisioned. We can verify that n8n is indeed running:

vagrant@debian11:~$ ps aux | grep n8n
root         907  7.5  9.9 11538924 201832 ?     Ssl  12:28   0:08 node /usr/bin/n8n
vagrant     1592  0.0  0.0   6152   708 pts/0    S+   12:30   0:00 grep --color=auto n8n
vagrant@debian11:~$ curl http://127.0.0.1:5678 -v
*   Trying 127.0.0.1:5678...
* Connected to 127.0.0.1 (127.0.0.1) port 5678 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:5678
> User-Agent: curl/7.74.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Access-Control-Allow-Origin: http://localhost:8080
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Methods: GET, POST, OPTIONS, PUT, PATCH, DELETE
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, sessionid
< Content-Type: text/html; charset=utf-8
< Content-Length: 1201
< ETag: W/"4b1-/hUzA+Pj4gDyZovqj340mKmXXjQ"
< Vary: Accept-Encoding
< Date: Fri, 25 Mar 2022 12:35:55 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="icon" href="/favicon.ico"><script>window.BASE_PATH = "/";</script><title>n8n.io - Workflow Automation</title><link href="/js/chunk-2d2073c1.f0370751.js" rel="prefetch"><link href="/js/chunk-2d22d3e6.388c0a4c.js" rel="prefetch"><link href="/js/chunk-4301fce8.d032a636.js" rel="prefetch"><link href="/js/chunk-b1e1f7c0.92d036f0.js" rel="prefetch"><link href="/css/app.287bd9f8.css" rel="preload" as="style"><link href="/css/chunk-vendors.29499615.css" rel="preload" as="style"><link href="/js/app.070a8ea4.js" rel="preload" as="script"><link href="/js/chunk-vendors.39458c07.js" rel="preload" as="script"><link href="/css/chunk-vendors.29499615.css" rel="stylesheet"><link href="/css/app.287bd9f8.css" rel="stylesheet"></head><body><noscript><strong>We're sorry but the n8n Editor-UI doesn't work properly without JavaScript enabled. Pl* Connection #0 to host 127.0.0.1 left intact
ease enable it to continue.</strong></noscript><div id="app"></div><script src="/js/chunk-vendors.39458c07.js"></script><script src="/js/app.070a8ea4.js"></script></body></html>

We can also verify the presence of Node.js tooling:

vagrant@debian11:~$ node --version
v17.8.0
vagrant@debian11:~$ yarn --version
1.22.18
vagrant@debian11:~$ npm --version
8.5.5
vagrant@debian11:~$ node
Welcome to Node.js v17.8.0.
Type ".help" for more information.
> .exit
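
Since n8n was started through PM2 in the provisioning script (which runs as root), we can also ask PM2 about the process state via sudo - a quick sanity check; the exact output will vary:

vagrant@debian11:~$ sudo pm2 status
vagrant@debian11:~$ sudo pm2 show n8n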

However, we cannot access n8n through a desktop web browser on the host system. That's because we did not configure port forwarding in our Vagrantfile.

We edit the port forwarding section to add the following line:

  config.vm.network "forwarded_port", guest: 5678, host: 5678

To reboot the VM and apply this change, we run vagrant reload. After doing so, we can access the n8n frontend at http://localhost:5678/.
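
We can also double-check this from the host side without opening a browser - a plain curl request against the forwarded port should now reach n8n:

$ curl -I http://localhost:5678/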

There is one more thing we need to set up in the Vagrant configuration. You may want to edit your source code with Vim or another text editor on the host machine and run it in the guest machine without going through the hassle of transferring files over SFTP. Vagrant supports synced folders that let us mirror a source code directory between the host and guest systems. Edit the part of the Vagrantfile that mentions synced folders to add the following line:

  config.vm.synced_folder ".", "/vagrant"

This mirrors the current source code directory on the host machine to /vagrant in the guest machine. To make this change take effect, we need to run vagrant reload again. However, depending on the exact circumstances, this may not work right away. To fix it, you may need to install the vbguest plugin and activate it by running the following commands:

$ vagrant plugin install vagrant-vbguest
$ vagrant vbguest
$ vagrant reload
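
Once the VM is back up, we can verify that the synced folder works by running a one-off command in the guest from the host shell (vagrant ssh -c executes a single command over SSH):

$ vagrant ssh -c "ls -la /vagrant"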

To temporarily shut down your Vagrant VM, run:

$ vagrant halt

To delete the VM, run:

$ vagrant destroy

The complete Vagrantfile, with comments removed for brevity, is merely a few lines long:

Vagrant.configure("2") do |config|
  config.vm.box = "generic/debian11"
  config.vm.network "forwarded_port", guest: 5678, host: 5678
  config.vm.synced_folder ".", "/vagrant"
  config.vm.provision "shell", path: "provision.sh"
end

With less than 30 lines of code we were able to set up an entire environment for experimenting with n8n automation workflows in a reproducible way. This enables us to delete the VM when it is no longer needed, and quickly recreate it from the Vagrantfile and provisioning script when we need it again.

Terraform

Terraform is another DevOps tool created by HashiCorp - the company that developed Vagrant. It reads declarative configuration files written in HCL (HashiCorp Configuration Language) and performs infrastructure setup for us. We will use it to provision a Digital Ocean VPS with the same environment we created in the Vagrant VM. The following steps will use the Terraform CLI tool, which can be found on the Terraform Downloads page.

There are some prerequisites before we start. You will need a Digital Ocean account with a payment method added. Furthermore, you will need to register your SSH key on your Digital Ocean account if you haven't already (this is generally a convenient thing to have, as you won't need to type in a password when logging into servers from your local machine). Lastly, you will need to generate a Digital Ocean API token and save it to do_token.txt (make sure there's no whitespace!).
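
For example, assuming your public key lives at ~/.ssh/id_rsa.pub (adjust the path to your actual keypair), the commands below save the token without a trailing newline and print the MD5 fingerprint in the colon-separated format that Digital Ocean uses for registered keys (drop the MD5: prefix when copying it):

$ printf '%s' 'dop_v1_...your API token...' > do_token.txt
$ chmod 600 do_token.txt
$ ssh-keygen -E md5 -lf ~/.ssh/id_rsa.pub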

When these matters are taken care of, we can start writing our main.tf file that will contain the configuration for the server. We start by setting up the Digital Ocean Terraform provider (an equivalent of a software library that abstracts away an external API) with the following lines:

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = file("do_token.txt")
}

Next, we create a resource (an equivalent of an object) for the Digital Ocean droplet we want created:

resource "digitalocean_droplet" "n8n" {
  image     = "debian-11-x64"
  name      = "n8n"
  region    = "sfo3"
  size      = "s-1vcpu-1gb"
  user_data = file("provision.sh")
  ssh_keys  = ["[REDACTED]"]
}

There are several things of interest here. We set the image argument to debian-11-x64, as this is the same Linux distribution that we had in our Vagrant machine, so it is reasonable to expect that our provisioning script will work here as well. We set size to a value that corresponds to the $5/month droplet, as we don't expect to use many resources. I find that in many cases $5/month droplets are more than enough for the computational needs of the scraping/automation scripts that I develop. Next, we set ssh_keys to the fingerprint of the SSH key that will be used to access the droplet, making sure it is the same keypair that is registered on the Digital Ocean account and present on our local machine. Lastly, we set user_data to the contents of our provisioning script, so that it is executed when the server is first set up.
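
If you have the doctl CLI installed and authenticated, one way to look up the fingerprints of keys already registered on your account is the command below (the fingerprints are also visible in the Digital Ocean control panel):

$ doctl compute ssh-key list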

Lastly, we create an output for getting the IP address of the server after setup:

output "server_ip" {
  value = digitalocean_droplet.n8n.ipv4_address
}

To actually install our environment in the cloud, there are three steps. First, we run terraform init to initialize the working directory and download the required providers. Next, we run terraform plan -out=tf.plan to compute an execution plan that is both shown to us and saved into the file tf.plan. Lastly, we run terraform apply tf.plan to actually proceed with the installation. A few minutes later, we can access n8n over HTTP on port 5678 at the IP address that is printed at the end of the installation process.
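
Putting it together, the full sequence on the command line looks like this (the last step assumes logging in as root, which is the default user on Digital Ocean Debian images):

$ terraform init
$ terraform plan -out=tf.plan
$ terraform apply tf.plan
$ terraform output -raw server_ip
$ ssh root@$(terraform output -raw server_ip)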

Conveniently for us, Digital Ocean does not set up any default firewall rules that would limit ingress traffic. However, if you work on an equivalent setup with AWS EC2, you would need to make sure that a security group is configured accordingly. Furthermore, if you are going to run n8n in the cloud for non-trivial amounts of time, you may want to set up HTTP authentication by setting the N8N_BASIC_AUTH_ACTIVE environment variable to true in the provision.sh file and providing a username and password via the N8N_BASIC_AUTH_USER and N8N_BASIC_AUTH_PASSWORD variables (this has been skipped in the above code for simplicity).
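
One way to wire this in - based on the basic auth environment variables n8n supported at the time of writing, with placeholder credentials you should replace - is to export the variables in provision.sh right before the pm2 start n8n line, so that PM2 captures them into the n8n process environment:

export N8N_BASIC_AUTH_ACTIVE=true
export N8N_BASIC_AUTH_USER=admin        # placeholder username
export N8N_BASIC_AUTH_PASSWORD=changeme # placeholder password

pm2 start n8n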

When we no longer need this environment, we can run terraform destroy to remove it.

The complete main.tf file is as follows:

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = file("do_token.txt")
}

resource "digitalocean_droplet" "n8n" {
  image     = "debian-11-x64"
  name      = "n8n"
  region    = "sfo3"
  size      = "s-1vcpu-1gb"
  user_data = file("provision.sh")
  ssh_keys  = ["[REDACTED]"]
}

output "server_ip" {
  value = digitalocean_droplet.n8n.ipv4_address
}

Since the new environment is now running on a remote server, it is not really feasible to set up a shared directory with it. However, if we wanted to upload some files to the server, we could use Terraform's file provisioner to do so.
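
Alternatively, since we already expose the server IP as a Terraform output, we can simply copy files over SSH from the local machine once the droplet is up (the workflows/ directory here is just a hypothetical example):

$ scp -r ./workflows root@$(terraform output -raw server_ip):/root/workflows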

A $5/month Digital Ocean droplet costs about $0.007 per hour. If we launch some automation workflow in the evening and let it run overnight for 10 hours, we have spent merely $0.07! Since Terraform enables us to prepare the configuration once and set up/destroy the environment as many times as we like, we can save money that we would otherwise be spending on keeping unused infrastructure running in the cloud. Furthermore, we save the time and effort we would spend recreating the environment by hand and possibly making mistakes along the way.

By rl1987, 2022-03-25