Recently (very recently), we suffered some performance degradation in Wicket during a traffic spike. As with most modern web applications, we use a variety of technologies in Wicket, including Ruby on Rails, Sidekiq, React, Redis, Elasticsearch and PostgreSQL, all hosted on AWS.
When a performance problem arises across disparate services, it's time to pull out the checklist:
1. Have you made any changes recently? Code deployments, configuration changes, etc. If so, it's time to roll back and review. If not, move on...
2. Check monitoring tools, including the AWS status dashboard, CloudWatch, and any APM tools (New Relic / Datadog). If any of these indicate a potential source, hopefully you've been notified automatically. If not, move on...
3. Start walking through services, log files, and process monitoring tools (top/htop). This is the most unwelcome position to be in.
Luckily for us, item 1 didn't apply, so we could skip it immediately. Item 2 should have caught this, but our CloudWatch alarms weren't sensitive enough to alert us (more on that below). So we were left in the unfortunate position of the dreaded item 3.
Digging through servers, log files and the like netted nothing substantive. We use Sentry (which is awesome) to capture error events and alert on them, and nothing was coming through that would explain what we were experiencing or what users were reporting.
So, we're off the checklist now. Adding to the confusion was the fact that the issue appeared to be self-healing. It wasn't persistent and performance would randomly bounce back to normal levels, even while the application was seeing the same high volume of requests.
Past experience also biased us toward spending more time on the application servers, as the symptoms looked like a CPU-exhausted application -- fairly common with Rails, and something we've run into before.
At this point, we went back to CloudWatch and AWS monitoring. Having ruled out the application servers above, we pulled up the ElastiCache (Redis) and RDS (PostgreSQL) instance monitoring dashboards and zoomed out to align the graphs with the times when performance was reported as degraded. At those points, we were seeing 100% CPU spikes in RDS, but not sustained long enough to trigger our CloudWatch alarm threshold. Easy to tweak the threshold; not so easy to figure out where these spikes were coming from.
Finally, we at least knew that the issue was in RDS / PostgreSQL. We doubled the instance size to smooth out the performance bumps while we looked at optimization options.
Optimizing PostgreSQL on RDS
At the time, it seemed like a sensible first step to optimize the PostgreSQL environment on RDS. PostgreSQL is configured via settings in a postgresql.conf file. In RDS, these values are represented via a parameter group which is assigned to a database instance and applied when the instance is started (or rebooted).
We were using the default parameter group for PostgreSQL on RDS. A number of the parameters are sensibly calculated automatically based on the RDS instance's resource sizing (handy if you're resizing regularly). For example, RDS provides the following DSL for setting a parameter based on the instance size (these are just two examples of the many configuration options available):
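The two parameters discussed below look roughly like this in the parameter group (reconstructed here from memory of the default group; the exact divisors vary by engine version, so check your own):

```
maintenance_work_mem = GREATEST({DBInstanceClassMemory/16384},65536)
max_connections      = LEAST({DBInstanceClassMemory/9531392},5000)
```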
RDS applies each parameter (maintenance_work_mem and max_connections here) by taking the greater (GREATEST) or lesser (LEAST) of the two arguments: the first value, calculated from the instance's memory (DBInstanceClassMemory/16384), or the fixed fallback (65536).
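To make that arithmetic concrete, here's a quick sketch (assuming a hypothetical instance with 8 GiB of RAM; RDS substitutes DBInstanceClassMemory in bytes):

```ruby
# Hypothetical 8 GiB instance; RDS supplies DBInstanceClassMemory in bytes.
db_instance_class_memory = 8 * 1024**3

# GREATEST({DBInstanceClassMemory/16384}, 65536) -- maintenance_work_mem, in kB
maintenance_work_mem = [db_instance_class_memory / 16384, 65536].max

# LEAST({DBInstanceClassMemory/9531392}, 5000) -- max_connections
max_connections = [db_instance_class_memory / 9531392, 5000].min

puts maintenance_work_mem # 524288 (kB, i.e. 512 MB)
puts max_connections      # 901
```

On a bigger instance the memory-derived value wins; on a tiny one, the 65536 kB floor kicks in instead.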
There's a great tool called pgTune that takes a number of system parameters (e.g. DB version, OS type, DB type, RAM and number of connections) and generates configuration recommendations.
When evaluating the calculated values provided by RDS above, we noticed that the RDS configuration is a bit conservative. Particularly, the following:
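As an illustration only (these are hypothetical values for an 8 GB instance with a web-style workload -- run pgTune against your own specs), the kind of gap we mean looks like this:

```
# parameter               RDS default           pgTune suggestion
effective_cache_size      ~1/2 of RAM       ->  ~3/4 of RAM
work_mem                  4MB (PG default)  ->  ~20MB
```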
So, we created a custom RDS parameter group (the default group can't be modified directly), set the parameters to match the values provided by pgTune, applied the group to our PostgreSQL instance, and rebooted it.
Sadly, this led to only marginal performance improvements on an instance whose capacity we had already doubled.
Without a more in-depth examination of what was happening in PostgreSQL, we weren't going to solve this. We needed pg_stat_statements.
The pg_stat_statements module provides a means for tracking execution statistics of all SQL statements executed by a server.
This wonderful PostgreSQL extension provides statistics for all statements sent to a PostgreSQL database. By storing the statements and various metrics associated with their execution, you can find out which queries are not performing well, where indices are missing or not used, etc. It's truly a wealth of information.
To install it, we modified the following RDS parameter group values:
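The standard settings for enabling the extension look like this (parameter names are from the PostgreSQL docs; confirm the exact values against your RDS engine version's parameter group):

```
shared_preload_libraries = pg_stat_statements
pg_stat_statements.track = ALL
```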
Once these are set and PostgreSQL is restarted (minor downtime), hop into psql and run the following:
CREATE EXTENSION pg_stat_statements;
... which makes the pg_stat_statements view available (along with a pg_stat_statements_reset() function).
Here are some helpful queries that pull useful information out of pg_stat_statements to show how the database is being used. There's a lot more that can be done with this information, but these three really helped us understand what PostgreSQL was doing with the statements we sent it.
```sql
SELECT relname,
       seq_scan - idx_scan AS too_much_seq,
       CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END,
       pg_relation_size(relname::regclass) AS rel_size,
       seq_scan,
       idx_scan
FROM pg_stat_all_tables
WHERE schemaname = 'public'
  AND pg_relation_size(relname::regclass) > 80000
ORDER BY too_much_seq DESC;
```
This provides some comparative statistics on how often sequential table scans are used vs. index scans. Understanding what these are is extremely important to understanding how modern RDBMSs work, but too heady a topic to cover in this post. If you're interested, here's a pretty comprehensive cheatsheet.
Back to the query result from above. It indicates whether tables are missing indices and whether sequential scans are invoked more frequently than they should be (which isn't always a bad thing). Using this information, you can examine the tables relying a bit too heavily on sequential scans and possibly add indices to improve your statement execution times.
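For example, if the query above flagged a hypothetical orders table that's frequently filtered by customer_id (table and column names are made up here), a new index could be added without blocking writes:

```sql
-- CONCURRENTLY builds the index without locking out writes on the table;
-- note that it can't be run inside a transaction block.
CREATE INDEX CONCURRENTLY index_orders_on_customer_id
  ON orders (customer_id);
```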
```sql
SELECT indexrelid::regclass AS index,
       relid::regclass AS table,
       'DROP INDEX ' || indexrelid::regclass || ';' AS drop_statement
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND indisunique IS false;
```
This outputs the list of indices that are never used, along with ready-made DROP INDEX statements. It's important to run this on a production database instance, as a non-production environment may not see enough statement activity to correctly indicate whether indices are used or not.
Too many indices on write-heavy tables can potentially cause CPU spikes, as PostgreSQL has additional work to do after statement execution to maintain those indices. Conversely, if the tables aren't write-heavy, there's little to no improvement to be gained by issuing the DROP INDEX recommendations.
Poorly performing statements (queries), AKA the silver bullet
```sql
SELECT substring(query, 1, 50) AS short_query,
       round(total_time::numeric, 2) AS total_time,
       calls,
       round(mean_time::numeric, 2) AS mean,
       round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
```
The most useful of them all (and ultimately, the information that led us to a resolution) is this query. Courtesy of this post, it generates an easy-to-consume table of queries with a number of useful metrics, including each statement's share of total execution time (percentage_cpu).
Here are two examples:

| short_query | total_time | calls | mean | percentage_cpu |
| --- | --- | --- | --- | --- |
| SELECT DISTINCT "insurance_options".* FROM "insur | 42180492.76 | 82432 | 511.70 | 76.36 |
| SELECT DISTINCT COUNT(DISTINCT "insurance_options" | 11560639.58 | 144502 | 80.00 | 20.93 |
Those first two results were eye-opening: together, those two queries were responsible for roughly 97% of the CPU time in PostgreSQL. Since we were seeing CPU spikes on RDS, they became suspects one and two to investigate further and optimize.
Optimization and resolution
Some targeted code review (i.e. text-editor searching) located a few places that might be responsible for these queries. Appending .explain to the ActiveRecord invocations and visualizing the output showed just how inefficient they were. On top of that, we were calling .any? on the ActiveRecord::Relation, which accounted for the second, nearly identical query invoking COUNT on the same inefficient result set.
Combining DISTINCT, some inefficient joins, and a redundant COUNT over the same result set made these queries ripe for optimization. We rewrote them to drop the COUNT query entirely (removing the .any?) and to issue two much smaller queries that didn't require the inefficient joins at all.
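A sketch of the shape of the change (model, association, and helper names here are hypothetical; the real code differs):

```ruby
# Before (hypothetical): calling .any? on the relation fires a second,
# nearly identical SELECT DISTINCT COUNT(...) over the same expensive joins.
options = InsuranceOption.joins(:plans).distinct
render_options(options) if options.any?

# After: a smaller, join-free query with the COUNT dropped entirely --
# load the records once and check presence in memory.
options = InsuranceOption.where(plan_id: relevant_plan_ids).to_a
render_options(options) if options.present?
```

The key point is that .any? on an unloaded relation issues its own COUNT query, so checking emptiness after loading the records avoids paying for the expensive query twice.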
Ultimately, the results speak for themselves: