2017 Sep 27

Or, yet another reason why PostgreSQL is amazing...

Recently (very recently), we suffered some performance degradation during a traffic volume spike with Wicket. As with most modern web applications, we use a variety of technologies in Wicket, including Ruby on Rails, Sidekiq, React, Redis, Elasticsearch and PostgreSQL, all hosted on AWS.

When a performance problem arises across disparate services, it's time to pull out the checklist:

  1. Have you made any changes recently? Code deployments, configuration changes, etc. If so, it's time to roll back and review. If not, move on...
  2. Check monitoring tools, including the AWS status dashboard, Cloudwatch, and any APM tools (New Relic / Datadog). If any of these indicates a potential source, hopefully you've already been notified automatically. If not, move on...
  3. Start walking through services, log files, process monitoring tools (top/htop). This is the most unwelcome position to be in.

Luckily for us, item 1 didn't apply, so we could skip it immediately. Item 2 should have applied in this scenario, but our Cloudwatch alarms were not sensitive enough to alert us (more on that below). So, we were in an unfortunate position -- the dreaded item 3.

Digging through servers and log files etc. was netting nothing substantive. We use Sentry (which is awesome) to capture error events and alert on them, and nothing was coming through that would explain what we were experiencing and what was being reported by users.

So, we're off the checklist now. Adding to the confusion was the fact that the issue appeared to be self-healing. It wasn't persistent and performance would randomly bounce back to normal levels, even while the application was seeing the same high volume of requests.

Also, past experience biased us toward spending more time on the application servers, as the symptoms pointed to a CPU-exhausted application -- fairly common with Rails, and something we've experienced in the past.

At this point, we went back to Cloudwatch and AWS monitoring. We had ruled out the application servers above, so we pulled up the Elasticache (Redis) and RDS (PostgreSQL) instance monitoring dashboards and zoomed out to align the graphs with the times when performance was reported as degraded. At those points, we were seeing 100% CPU spikes in RDS, but not sustained for long enough to trigger our Cloudwatch alarm threshold. Easy to tweak the threshold, not so easy to figure out where these spikes were coming from.

Finally, we at least knew that the issue was in RDS / PostgreSQL. We doubled the instance size to smooth out the performance bumps so we could look at optimization options.

Optimizing PostgreSQL on RDS

At the time, it seemed like a sensible first step to optimize the PostgreSQL environment on RDS. PostgreSQL is configured via settings in a postgresql.conf file. In RDS, these values are represented via a parameter group which is assigned to a database instance and applied when the instance is started (or rebooted).

We were using the default parameter group for PostgreSQL on RDS. A number of the parameters were sensible and calculated automatically based on the RDS instance resource sizing (handy if you're resizing regularly). For example, RDS provides the following DSL that sets a parameter based on the instance size (these are just two examples of the many configuration options available):

  Parameter             Setting
  maintenance_work_mem  GREATEST({DBInstanceClassMemory/16384},65536)
  max_connections       LEAST({DBInstanceClassMemory/9531392},5000)

For maintenance_work_mem, RDS evaluates DBInstanceClassMemory/16384 and uses the greater of that result and 65536; for max_connections, it takes the lesser of DBInstanceClassMemory/9531392 and 5000.
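
As a worked example, take a hypothetical instance with 16 GiB of memory, i.e. DBInstanceClassMemory = 17179869184 bytes (the instance size here is illustrative, not our actual one). The arithmetic behind both expressions can be checked directly in psql:

-- Hypothetical 16 GiB instance: DBInstanceClassMemory = 17179869184 bytes.
-- maintenance_work_mem: the calculated value wins over the 65536 floor.
SELECT GREATEST(17179869184 / 16384, 65536) AS maintenance_work_mem;  -- 1048576

-- max_connections: the calculated value stays well under the 5000 cap.
SELECT LEAST(17179869184 / 9531392, 5000) AS max_connections;         -- 1802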

There's a great tool called pgTune which takes a number of system parameters (e.g. DB Version, OS type, DB Type, RAM and Number of Connections) and generates some configuration recommendations.

When evaluating the calculated values provided by RDS above, we noticed that the RDS configuration is a bit conservative, particularly the following parameters:

  • maintenance_work_mem
  • work_mem
  • max_wal_size

So, we created a copy of the default RDS parameter group (the default group itself can't be modified), changed those parameters to match the values suggested by pgTune, assigned the new group to our PostgreSQL instance, and rebooted it.
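
Once the instance is back up, it's worth confirming from psql that the new values actually took effect:

-- Verify the modified parameters were picked up after the reboot.
SHOW maintenance_work_mem;
SHOW work_mem;
SHOW max_wal_size;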

Sadly, this led to only marginal performance improvements on an instance whose capacity we had already doubled.

Without a more in-depth examination of what was happening in PostgreSQL, we weren't going to solve this. We needed pg_stat_statements.

The pg_stat_statements module provides a means for tracking execution statistics of all SQL statements executed by a server.

This wonderful PostgreSQL extension provides statistics for all statements sent to a PostgreSQL database. By storing the statements and various metrics associated with their execution, you can find out which queries are not performing well, where indices are missing or not used, etc. It's truly a wealth of information.

To install it, we modified the following RDS parameter group values:

  Parameter                   Setting
  shared_preload_libraries    pg_stat_statements
  track_activity_query_size   2048
  pg_stat_statements.track    all

Once these are set, and PostgreSQL restarted (minor downtime), hop onto psql and run the following:

CREATE EXTENSION pg_stat_statements;

... which adds the pg_stat_statements view. Combined with PostgreSQL's built-in statistics views, that gives us everything used below:

  • pg_stat_all_tables (built-in)
  • pg_stat_user_indexes (built-in)
  • pg_stat_statements (added by the extension)
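
A quick sanity check that the extension is active and collecting data is to query the new view directly:

-- Confirm pg_stat_statements is installed and statements are being tracked.
SELECT count(*) AS tracked_statements
FROM   pg_stat_statements;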

pg_stat_statements to_the_rescue

Here are some helpful queries to pull useful information out of pg_stat_statements and the related statistics views to see how the database is being used. There's a lot more that can be done with this information, but these three really helped us understand what PostgreSQL was doing with the statements we were sending it.

Index utilization

SELECT relname,
       seq_scan-idx_scan AS too_much_seq,
       CASE
       WHEN seq_scan-idx_scan > 0 THEN
         'Missing Index?'
       ELSE 'OK' 
       END, 
       pg_relation_size(relname::regclass) AS rel_size, 
       seq_scan, 
       idx_scan
FROM   pg_stat_all_tables
WHERE  schemaname='public'
         AND pg_relation_size(relname::regclass) > 80000
ORDER BY too_much_seq DESC;

This provides some comparative statistics on how often sequential table scans are used vs. index scans. Understanding what these are is extremely important to understanding how modern RDBMSs work, but too heady a topic to cover in this post. If you're interested, here's a pretty comprehensive cheatsheet.

Back to the query result from above. It will indicate whether tables are missing indices and whether sequential scans are invoked more frequently than they should be (which isn't always a bad thing). Using this information, you could examine the tables relying a bit too much on sequential scans and possibly add some additional indices to improve your statement execution times.
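
For example, if the query above flags a table that's frequently filtered on an un-indexed column, an index built concurrently addresses it without blocking writes. The table and column names here are purely illustrative, not from our schema:

-- Hypothetical example: index a column that currently forces sequential scans.
-- CONCURRENTLY builds the index without taking a lock that blocks writes.
CREATE INDEX CONCURRENTLY index_orders_on_customer_id
    ON orders (customer_id);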

Unused indices

SELECT indexrelid::regclass AS index,
       relid::regclass AS table,
       'DROP INDEX ' || indexrelid::regclass || ';' AS drop_statement
FROM   pg_stat_user_indexes
JOIN   pg_index
USING  (indexrelid)
WHERE  idx_scan = 0
         AND indisunique is false;

This will output the list of indices that are never used, along with their tables and a ready-made DROP INDEX statement. It's important to run this on a production database instance, as a non-production environment may not see enough statement activity to correctly indicate whether indices are used or not.

If there are too many indices on write-heavy tables, it can potentially cause CPU spikes, as PostgreSQL has additional work to do after each write to keep the indices up to date. If the tables are not write-heavy, then there is little to gain from issuing the suggested DROP INDEX statements either.
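
If you do act on one of the generated drop statements, dropping the index concurrently avoids blocking other sessions. The index name below is illustrative only:

-- Hypothetical example: drop an index the query above reported as unused.
-- CONCURRENTLY avoids holding a lock that blocks reads and writes on the table.
DROP INDEX CONCURRENTLY IF EXISTS index_orders_on_legacy_status;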

Poorly performing statements (queries), AKA the silver bullet

SELECT substring(query, 1, 50) AS short_query,
       round(total_time::numeric, 2) AS total_time,
       calls,
       round(mean_time::numeric, 2) AS mean,
       round(
         (
           100 * total_time / sum(total_time::numeric)
           OVER ()
         )::numeric, 2
       ) AS percentage_cpu
FROM   pg_stat_statements
ORDER BY total_time DESC LIMIT 20;

The most useful of them all (and ultimately, the information that led us to a resolution) is this query. Courtesy of this post, it generates an easy-to-consume table of queries with a number of useful metrics, including each statement's share of the total query execution time (a reasonable proxy for its share of CPU).

Here are the top two results:

  short_query                                          total_time   calls    mean   percentage_cpu
  SELECT  DISTINCT "insurance_options".* FROM "insur   42180492.76   82432  511.70            76.36
  SELECT DISTINCT COUNT(DISTINCT "insurance_options"   11560639.58  144502   80.00            20.93

Those first two results were eye-opening. Those two queries alone were responsible for over 96% of the CPU time in PostgreSQL. Since we were seeing CPU spikes on RDS, they were suspects number 1 and 2 to investigate further and optimize.
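
Since short_query is truncated to 50 characters, the next step is to pull the full text of each suspect so it can be matched back to application code and EXPLAINed. Something like this works (note that on PostgreSQL 13+ the column is named total_exec_time):

-- Fetch the full text of the most expensive statements for further analysis.
SELECT query,
       calls,
       round(total_time::numeric, 2) AS total_time
FROM   pg_stat_statements
ORDER BY total_time DESC
LIMIT  2;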

Optimization and resolution

Some well-tuned code reviews (text editor finds) located a few places that might be responsible for these queries. Tacking .explain onto the ActiveRecord invocations and visualizing the plans showed just how inefficient they were. Not only that, we were using .any? on the ActiveRecord::Relation, which accounted for the second, nearly identical query that invoked COUNT on the same inefficient result set.

Combining DISTINCT, some inefficient joins and a redundant COUNT on the same result set resulted in queries that were ripe for optimization. We rewrote them to drop the dependency on the COUNT query entirely (dropping the .any?) and ended up with two much smaller queries that did not require the inefficient joins at all.
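
After deploying the rewritten queries, it's worth resetting the collected statistics so the next look at pg_stat_statements reflects only the new behaviour; the extension ships a function for exactly that:

-- Discard the statistics gathered so far so post-deploy numbers start clean.
SELECT pg_stat_statements_reset();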

Ultimately, the results speak for themselves:

AWS Monitoring Screenshot