
Creating a page hit counter with nginx and bash

Tags: webdev svson tutorial

So earlier today I posted about the idea of adding a hit counter to this webpage. I did it, and now I'm going to try my hand at writing more tutorial-ish posts and show you how :).

I didn’t want to use one of those weird hit counter services, because where’s the fun in that.

As with all technical problems, I split "how do I add a hit counter?" into smaller problems that I solved one at a time:

  • How to tell how many people have visited the page?
    • No client-side JavaScript or additional requests.
    • Update it once per day.
  • How to display this count on the site, as the site is a static page?

To solve these problems I decided to use shell scripting, as it already had all the functionality I needed.

Counting the number of visitors

I decided to use the nginx access logs for the visitor info, as the server already has that info. It didn't make much sense to me to add a hit-counter service that gets some request proxied to it on every page load (e.g. the CSS request), because then I'd have to run another service :).

A couple of example log entries are listed below. They follow this format:

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent"';
127.0.0.1 - - [06/Jun/2021:01:13:12 +0000] "GET /favicon.ico HTTP/1.1" 200 711 "https://svson.xyz/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] "GET /feedback HTTP/1.1" 301 185 "https://svson.xyz/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] "GET /feedback/ HTTP/1.1" 200 2267 "https://svson.xyz/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] "GET /css/style.min.css HTTP/1.1" 304 0 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] "GET /cc.png HTTP/1.1" 304 0 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:13:20 +0000] "POST /api/feedback HTTP/1.1" 404 169 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:14:25 +0000] "POST /api/feedback HTTP/1.1" 404 169 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:14:31 +0000] "GET /40x HTTP/1.1" 404 169 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:14:34 +0000] "GET /about HTTP/1.1" 301 185 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
127.0.0.1 - - [06/Jun/2021:01:14:34 +0000] "GET /about/ HTTP/1.1" 200 4958 "https://svson.xyz/feedback/" "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"

So we’ll want to:

  1. Filter out all easy-to-filter bots (no point in having fake visitors),
    • Check the user-agent field for that and filter out common bot identifiers.
  2. Get all unique IP values from the list.

So let’s construct a pipeline!

To dump the logfile into the pipeline I'll use zcat, as the logfiles are gzipped.
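
For example, to take a quick peek at the decompressed log:

zcat svson.log.gz | head -n 2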

For the robot detection we'll need to print each line that doesn't contain any robot identifiers. awk is a perfect tool for this job, as it allows us to regex-match one field and print another, so we're already solving the next problem as well (printing out the IPs of the non-robots).

Scrolling through the logs and noting down the robot user-agents I came up with the following regex:

bot|spider|mapping|inspect|curl|python|zgrab|l9|wget|http[-]?client

Rather simple, but there's no point in getting too fancy, since it's not that imperative that we have 100% truthful results ;). Using that regex on the user-agent field¹ results in the following pipeline:

zcat svson.log.gz \
    | awk -F'"' '{ if (tolower($6) !~ /bot|spider|mapping|inspect|curl|python|zgrab|l9|wget|http[-]?client/) { print $1 } }'

The added awk command uses double quotes (") as the field separator (-F) and:

  1. Takes the user-agent value ($6, sixth field),
  2. Turns it into lower-case,
  3. Checks that it doesn’t match our bot regex,
  4. Prints out the IP address ($1, first field).
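
If the field numbering seems odd: with " as the separator, the odd fields are the parts between the quoted sections and the even fields are the quoted values themselves, which is how the user-agent ends up as $6. A quick sanity check you can run on your own logs:

zcat svson.log.gz | awk -F'"' '{ print "UA:", $6; exit }'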

Since my nginx log format is a bit bad for this kind of usage (not all of the fields are quoted), the IP address field contains some extra junk at the end. I didn't want to delimit by space, because the user-agents contain spaces as well :(.

The output for the pipeline is:

127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] 
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] 
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] 
127.0.0.1 - - [06/Jun/2021:01:13:16 +0000] 

So now we'll need to cut everything off after the first space, sort, keep only the unique lines and count what's left. Since it's pretty simple, I'll take you through it with haste. The resulting pipeline is:

zcat svson.log.gz \
    | awk -F'"' '{ if (tolower($6) !~ /bot|spider|mapping|inspect|curl|python|zgrab|l9|wget|http[-]?client/) { print $1 } }' \
    | cut -f1 -d' ' -- \
    | sort -n \
    | uniq \
    | wc -l

The cut command extracts the first space-delimited field. Then comes sort, which orders the lines numerically so that identical addresses end up next to each other, and passes them on to uniq, which collapses the duplicates so each address appears only once. Finally the output gets piped into wc, which is instructed to count the lines and thus returns a nice number (the number of visitors!) we can use.
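
If the deduplication step seems opaque, here's a toy run with three requests from two addresses. Note that uniq only collapses adjacent duplicates, which is why the sort has to come first:

printf '1.2.3.4\n5.6.7.8\n1.2.3.4\n' | sort | uniq | wc -l

This prints 2: two unique visitors.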

Displaying it on the page

Now that we have the number of visitors, we'll need some way to show it on the page! Putting it on the site as plain text would be pretty boring, as then we wouldn't have the cool counter graphics :^).

So we need a way to create images programmatically, for which ImageMagick is the perfect tool!

ImageMagick is a really awesome tool and has enough functionality to generate the whole counter from the command line, but I find it a bit esoteric to use and frankly not very fun. Since I have the freedom to not do it from scratch, I'll use base counter images (digits 0–9), which I'll glue together with ImageMagick.
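
For the curious, the from-scratch route would look something like this minimal sketch, using ImageMagick's label: pseudo-format (the font and colors are arbitrary picks for illustration, assuming the font is available on your system):

convert -background black -fill red \
    -font DejaVu-Sans-Mono -pointsize 48 \
    label:'000123' counter.png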

I found this nice seven-segment LED graphic on Openclipart by JayNick, which I'll use as the base graphic. I edited the provided SVG file into every digit and then rasterized each one to a PNG file. So now I have a folder with each digit in a separate PNG file (0.png, 1.png and so forth). I created two zero-digit graphics: one with no segments lit for the "fill" part of the number (the leading zeros in something like 000123) and one with the 0 lit for zeros inside the count itself (like the middle digit of 102).
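
If you want to script the rasterization step, ImageMagick can convert the SVGs directly. A minimal sketch, assuming the edited files are named 0.svg through 9.svg plus 0_alt.svg for the unlit zero (the naming is my assumption):

# Rasterize each digit SVG to a transparent PNG
for d in 0 1 2 3 4 5 6 7 8 9; do
    convert -background none "$d.svg" "$d.png"
done
convert -background none 0_alt.svg 0_alt.png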

(Image: the digit graphics)

ImageMagick has a lot of usage examples and I found the right one for this job. To create a graphic for the number 000123 the command is:

convert 0.png 0.png 0.png 1.png 2.png 3.png +append out.png

Putting it all together

I put it all together into a neat shell script, which I can then call from a cronjob. The script uses some bash-isms, so it's not POSIX shell compliant.

#!/bin/bash

# Digit images source directory
img_dir="/srv/svson/counter"
# Input logfile
logfile="svson.log.gz"
# Output destination
destination="hitcounter.webp"
# Number of digits in the output graphic
hitcounter_digits=6
# Height in pixels of the output graphic
height=50

# Get the hit-count
hitcount=$(zcat "$logfile" \
    | awk -F'"' '{ if (tolower($6) !~ /bot|spider|mapping|inspect|curl|python|zgrab|l9|wget|http[-]?client/) { print $1 } }' \
    | cut -f1 -d' ' -- \
    | sort \
    | uniq \
    | wc -l
)

# For each character of the hit-count string, append the
# image path for that digit to a buffer
read_idx=0
while read -rn1 c; do
    if [ -n "$c" ]; then
        read_idx=$((read_idx+1))
        hitcounters="$hitcounters $img_dir/$c.png"
    fi
done <<< "$hitcount"

# If we didn't have at least $hitcounter_digits digits then
# prepend alt. zeros to the buffer
while [ $read_idx -lt $hitcounter_digits ]; do
    read_idx=$((read_idx+1))
    hitcounters="$img_dir/0_alt.png $hitcounters"
done

# NOTE: don't quote $hitcounters here, as the filenames
# are supposed to be spread
convert $hitcounters +append \
    -quality 40 \
    -resize x"$height" \
    "$destination"

Since my nginx is configured to respond with a 30-day expiry duration for images and other media, I'll have to add an exception to the configuration file for the hitcounter. The script will run from a cronjob once a day, so I'll set the expiry time to match. The generated graphics files are fairly small (around 800 bytes), so it's not a huge issue.

location = /hitcounter.webp {
        expires 1d;
        add_header Cache-Control "public";
}
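
After adding the exception, test the configuration and reload nginx (the reload command assumes a systemd-based setup):

sudo nginx -t && sudo systemctl reload nginx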

Finally, add the script as a cronjob (here running every day at 23:00) with crontab -e:

0 23 * * * /srv/svson/counter/generate.sh

The page counter is a bit of bloat, but I felt like adding it for now :). If you're wondering how it looks, scroll down to the footer of this site and take a look.


PS: Had to rename hitcounter to footcounter on the actual page, as one of my uBlock Origin filters was removing it (and the digits photo on this page, as that also matched due to the directory) :)).


  1. Matching the whole line resulted in a couple of matches coming in from my image filenames (*-BOTtom.webp), so I immediately jumped to matching user-agents. Now that some time has passed and I'm writing this, I had the realization that to reach the page containing one of those images the visitor had to request at least the CSS (which doesn't match the regex), so the whole user-agent-only matching was rather pointless :)). ↩︎