2016-10-10

Working with the MySQL slow query log on RDS with the AWS Go SDK

In order to let you submit RDS logs directly to Honeycomb, I've been exploring how to use the go RDS SDK to tail the mysql slow query log. It's fascinating the things I've discovered that I didn't find in the docs. (All of this is probably there somewhere, I just failed to find it.)
  • The docs say the logs rotate every hour but they don't say it's on the hour or what the rotating style is. The current log (slowquery.log) is rotated to slowquery.log.# where # is the current hour (0-23) UTC. So at 18:00 UTC, the file slowquery.log.18 is written with the logs from 17:00 to 18:00 (almost. more later)
  • When asking for a portion of the log file, you can hand it an index from which to start reading and the number of lines you would like to retrieve (both optional). You get back a struct with a LogFileData *string containing log lines and a Marker *string containing the index for the next segment of the log file.  The Marker looks like #:# where the first number is the hour and the second is the distance in to the file.
Asking for current data is rather straightforward. First ask for one line of  the current log file with no marker. You'll get back the most recent line (which you throw away) and the current marker. Wait a bit, then ask for the current file with the marker, and you'll get all lines since that first one. Repeat.

There are two places things get interesting: log rotation (aka the top of the hour) and asking for older data.

Let's say it's 17:58 UTC and you're happily tailing content, fetching additional lines every 5 seconds. You get to 18:00:02 or so and the marker switches from 18:xxxx to 19:0 and suddenly you're not getting any new data anymore. You wait and wait and ... at 18:05:00, you start getting data! but it's from 18:05:00; the data from between 18:00 and 18:05 is missing.

You take the last marker you got from your tail before data stopped flowing (with the 18: prefix) and use that marker as an index into slowquery.log.18. This gets you the data from between 18:00:02 and 18:05:00, then stops.  Or! Once some amount of time (not necessarily the full 5 minutes) you can ask the slowquery.log for data since the 18:offset marker and you'll get the data that's missing.

Back you go to tailing the slowquery.log.

Ok, so here's the next fun part - you can hand a marker to either the previous hour or the current hour to AWS when asking for the slowquery.log and it will give you data since then. If you do this after the 5 minute log rotation period has ended, you get all the intermediate data too! However, if you ask it for data where the number before the colon is not the current or previous hour, it ignores that and hands you back current data at the offset specified by the second half of the marker.

You can ask for data from older files (the numbered log files) and pass them a marker, but in that case it always ignores the hour portion of the marker. It will happily return the hour portion you gave it, but it doesn't affect the data you get. In other words, when asking for data from slowquery.log.8 passing it a marker of 12:123456 and 4:123456 are both treated as though they were 8:123456.

To poke around this for yourself, you have to have RDS configured to write the slow query log to a file. If that's the case, grab the DB instance identifier aws rds describe-db-instances --output table --region us-east-1 | grep DBInstanceIdentifier to list identifiers) and the log file aws rds describe-db-log-files --db-instance-identifier <ID> --output table --region <region> | sort -n to show log files.

Once you've got your instance identifier, log file, and region handy, download this go code[1] and give it a whirl:
go run tailrdslog.go --identifier myrdsinstanceid
Try passing it different markers and log files
go run tailrdslog.go -i prod-mysql -l slowquery/mysql-slowquery.log.10 -m 14.123456
and see what happens!

It is especially interesting to run this across an hour boundary to see what happens as the log rolls over. The script attempts to grab some of the data hiding in the previous log until everything to return to normal at 5 past the hour.

It will output a line every 5s (the default polling interval) with
  • the current time
  • the starting marker
  • the length of the content downloaded
  • the ending marker
  • the first timestamp encountered in the slow query log
  • the last timestamp encountered in the slow query log
  • (optional) sleep duration before the next poll
Oct  7 21:35:43 starting from 22:20523279, got 37424 content, next section from 22:20560703. first time is 21:35:39, last time is 21:35:43 sleeping 5 sec...
Oct  7 21:35:49 starting from 22:20560703, got 59056 content, next section from 22:20619759. first time is 21:35:43, last time is 21:35:48 sleeping 5 sec...
Oct  7 21:35:54 starting from 22:20619759, got 54649 content, next section from 22:20674408. first time is 21:35:48, last time is 21:35:53 sleeping 5 sec...
Oct  7 21:35:59 starting from 22:20674408, got 42533 content, next section from 22:20716941. first time is 21:35:54, last time is 21:35:59 sleeping 5 sec...
One nice thing to notice - the starting marker plus the content length gives you the ending marker. \o/

[1] I know, and I'm sorry. It's terrible code. It's sort of OK because it's scratch code just used to explore the API. But I should be giving the world better examples. It doesn't check pointers aren't nil before derefenencing them. The control flow is awkward. Ifs are nested too deeply. Code is repeated. I encourage you to do none of those things when you cargo cult snippets in to your production application. :)