It works with the macOS Terminal.app as well as on Ubuntu Linux 14.04 (or newer), and the whole source code can be found in the following GitHub Gist: Unix Shell-Script to crawl a list of website URLs using curl

After running the script, you will find a logfile created in the same directory as the script, with all the relevant output.

Setup

Create the following files in a directory of your choice:

/any/directory/to/
    curl-crawler
    urls.txt

The curl-crawler Bash Shell script

You can find the source code for the curl-crawler script in the following GitHub Gist – copy its contents into the curl-crawler file. Make sure that there is an empty last line in the script file.
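
The full script lives in the Gist linked above. As a rough orientation only, a heavily simplified sketch of what such a script does – using the configuration variable names described further below – could look like this (the actual Gist is more complete and also handles the optional email notification):

#!/usr/bin/env bash
# Heavily simplified sketch – the real curl-crawler Gist is more complete.

script=$(basename "$0")
timezone="Europe/Zurich"          # timezone used to format the date/time output
logfile="$script.log"             # name of the logfile to be written
urlfile="$1"                      # the urls.txt file passed as the only argument

while IFS= read -r line; do
    [ -z "$line" ] && continue    # skip empty lines
    now=$(TZ="$timezone" date "+%Y-%m-%d %H:%M:%S")
    # print the HTTP status code and the effective (final) URL, discard the body
    curlresult=$(curl -sSL -w '%{http_code} %{url_effective}' "$line" -o /dev/null)
    echo "$now $curlresult" >> "$logfile"   # append, never overwrite
done < "$urlfile"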

To make the shell script executable, run the following command on its file (adjusting the directory path, of course):

$ sudo chmod +x /any/directory/to/curl-crawler

List of website URLs to check

In the urls.txt file, add one website address per line. There is an example in the following GitHub Gist. Make sure that there is an empty last line in the file.
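
For illustration, a urls.txt could look like this (placeholder addresses, followed by an empty last line):

https://www.example.com/
https://www.example.org/about
https://blog.example.net/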

Configuration

Inside the curl-crawler script, the following configuration options can be adjusted:

  1. timezone="Europe/Zurich" – define the desired timezone for formatting date/time information accordingly
  2. logfile="$script.log" – modify the name of the logfile to be written
  3. mailto="your@email.com" – if required, uncomment this line and change it to a valid email address where the log should be sent
  4. mailsubj="$script log from $now" – modify the email subject for the notification message containing the log output
  5. If the website URLs require HTTP Basic authentication, you can modify the line `curlresult=curl...` as follows
    (see the curl man page for all the details; a fuller sketch follows right after this list):
curl -sSL -w '%{http_code} %{url_effective}' -u "login:password" $line -o /dev/null
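
Placed into the assignment mentioned in point 5, the modified line would look roughly like this (login and password are placeholders). If you prefer not to hardcode the credentials in the script, curl can also read them from a ~/.netrc file via its --netrc option:

# Basic Auth credentials directly on the command line (placeholders)
curlresult=$(curl -sSL -w '%{http_code} %{url_effective}' -u "login:password" "$line" -o /dev/null)

# Alternative: credentials from ~/.netrc
# (a line such as:  machine www.example.com login myuser password mypass)
curlresult=$(curl -sSL -w '%{http_code} %{url_effective}' --netrc "$line" -o /dev/null)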

Running the curl-crawler script

The script requires one input: the file containing the website URLs to be checked.
Run it as follows:

$ /any/directory/to/curl-crawler /any/directory/to/urls.txt

Or – if you wish to run the curl-crawler script regularly – add it to your crontab as follows:

$ crontab -e
0 * * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"

By the way: this website can come in handy for configuring the timing of your crontab job: crontab.guru
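
For instance, a few common schedule variants for the same job would be:

# every hour, at minute 0 (as above)
0 * * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"
# every 15 minutes
*/15 * * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"
# once a day at 06:30
30 6 * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"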

Result after the script was executed

Once the curl-crawler script has been executed, you can find all its output in the newly created curl-crawler.log. By the way: when the script is run again, it will append the new output to the existing logfile – not overwrite it.
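
Just to give an idea, assuming the -w format shown in the Configuration section, the appended log lines might look roughly like this (timestamps and addresses are made up; 000 indicates that curl could not reach the host):

2024-03-15 14:00:01 200 https://www.example.com/
2024-03-15 14:00:02 404 https://www.example.org/about
2024-03-15 14:00:04 000 https://unreachable.example.net/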

 


