Timothy Neilen

Parsing nginx logs using awk

Out of the box, nginx writes its access log through ngx_http_log_module using the predefined combined format:

log_format combined '$remote_addr - $remote_user [$time_local] '
            '"$request" $status $body_bytes_sent '
            '"$http_referer" "$http_user_agent"';

The combined format is built from the following variables:

  • $remote_addr – client IP address
  • $remote_user – username supplied with basic authentication, usually empty and logged as "-"
  • [$time_local] – server timestamp in the local time zone
  • “$request” – full request line: HTTP method, path and protocol
  • $status – HTTP response status code
  • $body_bytes_sent – size of the response body sent to the client (in bytes)
  • “$http_referer” – referring URL (if present)
  • “$http_user_agent” – client user agent

Refer to ngx_http_core_module for more information.

Using awk, we can then break up each line of the log file into “columns”. By default, awk splits each line on whitespace, which lets us easily report on things such as HTTP response codes:

awk '{print $9}' access.log | sort | uniq -c | sort -rn
126570 200
   4144 404
   1620 304
   1416 302
    891 301
    302 499
      5 206
      3 405
      1 403
      1 "-"
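
To see why the status code lands in $9, it can help to print the fields of a single line. The following is a minimal sketch using a made-up log line; the IP address, path and user agent are hypothetical and not taken from a real log, and show_fields is just an illustrative helper name. Because [$time_local] and “$request” contain spaces, each of them spans more than one awk field.

```shell
# Hypothetical log line in the combined format (all values invented for illustration).
line='203.0.113.7 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'

# show_fields LINE: print the whitespace-separated fields used in this post.
show_fields() {
    printf '%s\n' "$1" | awk '{
        print "ip=" $1
        print "time=" $4 " " $5      # the timestamp is split across two fields
        print "method=" $6           # still carries the opening double quote
        print "path=" $7
        print "status=" $9
        print "bytes=" $10
    }'
}

show_fields "$line"
```

A request that nginx could not parse is typically logged as "-", which takes up one field instead of three and shifts everything after it to the left. That is a likely explanation for the single "-" entry in the status counts above.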

These counts show 4144 404 “Not Found” responses. To find out which URLs are returning them, we can use awk again:

awk '$9 == 404 {print $7}' access.log | sort | uniq -c | sort -rn
   3343 /wp-content/themes/eNurse/assets/en_background1.jpg
    307 /grassblade-lrs/xAPI/statements
     53 /eshop/clearance-products
     38 /core/assets/frontend/layout/css/fancybox_overlay.png
     24 /core/assets/frontend/layout/css/[email protected]
     24 /core/assets/frontend/layout/css/[email protected]
     13 /my-account/favicon.ico
     11 /courses/favicon.ico
      8 /learninghub/favicon.ico
      7 /eshop/clearance-products?p=3
      7 /apple-touch-icon.png
      6 /professional-file/professional-file-enurse-cpd/favicon.ico
      6 /blogsarticles/favicon.ico
      6 /apple-touch-icon-precomposed.png.
      ...

To filter on a different response code, change the value in the awk command. The example below returns all permanently redirected pages, based on a 301 response code:

awk '$9 == 301 {print $7}' access.log | sort | uniq -c | sort -rn
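
Rather than editing the command for every response code, the code can be passed in with awk's -v option. This is a rough sketch, not a polished tool: paths_for_status is a hypothetical helper name, and the sample log is invented so the example runs on its own. It compares $9 with == rather than a regular expression, so only that exact code matches.

```shell
# Build a small hypothetical log so the sketch is self-contained.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
203.0.113.7 - - [10/Oct/2023:13:55:36 +1000] "GET /a.png HTTP/1.1" 404 162 "-" "Mozilla/5.0"
203.0.113.8 - - [10/Oct/2023:13:55:37 +1000] "GET /a.png HTTP/1.1" 404 162 "-" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2023:13:55:38 +1000] "GET /old HTTP/1.1" 301 178 "-" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2023:13:55:39 +1000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
EOF

# paths_for_status FILE CODE: count the requested paths that returned CODE.
paths_for_status() {
    awk -v code="$2" '$9 == code { print $7 }' "$1" | sort | uniq -c | sort -rn
}

paths_for_status "$sample_log" 404
paths_for_status "$sample_log" 301
```

Against a real log, the call would look like paths_for_status access.log 404.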

Another neat way to analyse the logs is to determine the country of origin by looking up the busiest client IP addresses with geoiplookup from the geoip package:

awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -n 25 | awk '{ printf("%5d\t%-15s\t", $1, $2); system("geoiplookup " $2 " | cut -d \\: -f2 ") }'
 3213   174.129.242.xxx   US, United States
 1582   203.37.226.xxx    AU, Australia
  960   101.175.136.xxx   AU, Australia
  907   58.96.63.xxx      AU, Australia
  883   221.121.132.xxx   AU, Australia
  807   111.69.116.xxxx   NZ, New Zealand
  733   118.208.154.xxx   AU, Australia
  684   1.156.134.xxx     AU, Australia
  665   1.41.98.xxx       AU, Australia
  657   106.69.55.xxx     AU, Australia
  649   58.7.198.xxx      AU, Australia
  620   121.212.60.xxx    AU, Australia
  614   211.31.50.xxx     AU, Australia
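
The per-address counting can also be done inside awk with an associative array, which is a different technique from the sort | uniq -c pipeline used above. The sketch below assumes a small invented log and a hypothetical helper name, top_ips; the geoiplookup step is left out so that it runs without the geoip package installed.

```shell
# Build a small hypothetical log so the sketch is self-contained.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
203.0.113.9 - - [10/Oct/2023:13:55:36 +1000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
203.0.113.7 - - [10/Oct/2023:13:55:37 +1000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2023:13:55:38 +1000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.8 - - [10/Oct/2023:13:55:39 +1000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.7 - - [10/Oct/2023:13:55:40 +1000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2023:13:55:41 +1000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# top_ips FILE N: print the N busiest client addresses as "count<TAB>address".
top_ips() {
    # hits[] is keyed by client address; the END block dumps the tallies.
    awk '{ hits[$1]++ } END { for (ip in hits) printf "%d\t%s\n", hits[ip], ip }' "$1" |
        sort -k1,1nr -k2,2 | head -n "$2"
}

top_ips "$sample_log" 2
```

awk does not guarantee any order when looping over an array, so the final sort is still needed to rank the addresses.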