It works with the macOS Terminal.app as well as on Ubuntu Linux 14.04 (or newer), and the whole source code can be found in the following GitHub Gist: Unix Shell-Script to crawl a list of website URLs using curl.
After running the script, you will find a logfile created in the same directory as the script, with all the relevant output.
Setup
Create the following files in any directory:

/any/directory/to/curl-crawler
/any/directory/to/urls.txt
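For example, you could create the two (still empty) files like this:

$ mkdir -p /any/directory/to
$ touch /any/directory/to/curl-crawler /any/directory/to/urls.txt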
The curl-crawler Bash Shell script
You can find the source code for the curl-crawler script in the following GitHub Gist – copy its contents into the curl-crawler file. Make sure that there is an empty last line in the script file.
In order to be able to execute the shell script, run the following command on its file (adjusting the directory path, of course):
$ sudo chmod +x /any/directory/to/curl-crawler
List of website URLs to check
In the urls.txt file, you add one website address per line. There is an example in the following GitHub Gist. Make sure that there is an empty last line in the file.
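For illustration, a minimal urls.txt could look like this (placeholder addresses – the actual example is in the Gist):

https://www.example.com/
https://www.example.org/blog/
https://www.example.net/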
Configuration
Inside the curl-crawler script, the following configurations can be made:

- timezone="Europe/Zurich" – define the desired timezone for formatting date/time information accordingly
- logfile="$script.log" – modify the name of the logfile to be written
- mailto="your@email.com" – if required, uncomment this line and change it to a valid email address where the log should be sent
- mailsubj="$script log from $now" – modify the email subject for the notification message containing the log output
- If the website URLs require basic-auth authentication, you can modify the line `curlresult=curl...` as follows (see the curl manpage for all the details; a sketch of this line in context follows below):

curl -sSL -w '%{http_code} %{url_effective}' -u "login:password" $line -o /dev/null
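For orientation, here is a minimal sketch of how that modified line might sit inside the script's URL loop. This is an assumption for illustration – the loop structure is hypothetical and the $curlresult/$line variable names are taken from the snippet above; the Gist contains the authoritative version:

# Hypothetical sketch of the script's main loop, with basic auth added.
# The exact structure in the Gist may differ.
while read -r line; do
  # capture the HTTP status code and effective URL, discard the response body
  curlresult=$(curl -sSL -w '%{http_code} %{url_effective}' -u "login:password" "$line" -o /dev/null)
  echo "$curlresult"
done < "$1"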
Running the curl-crawler script
In order to execute the script properly, it requires one argument: the file containing the website URLs to be checked. Do as follows:
$ /any/directory/to/curl-crawler /any/directory/to/urls.txt
Or – if you wish to have the curl-crawler script run regularly – add it to your crontab:

$ crontab -e

0 * * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"
By the way: this website can come in handy for configuring the timing of your crontab job properly: crontab.guru
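For example, to run the check once a day at 06:00 instead of every hour, the crontab entry would look like this (the five fields are minute, hour, day of month, month, and day of week):

0 6 * * * /any/directory/to/curl-crawler "/any/directory/to/urls.txt"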
Result after the script was executed
Once the curl-crawler script has been executed, you can find all the output and what it was doing in the separately created curl-crawler.log file. By the way: when the script is run again, it will append the new output to the existing logfile – it will not overwrite it.
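That append behaviour corresponds to Bash's >> redirection. A minimal sketch of the pattern (the actual variable names in the Gist may differ):

# ">>" appends to the logfile; a single ">" would overwrite it
echo "$now $curlresult" >> "$logfile"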