How fast are your queries? Make PostgreSQL work transparently with pg_stat_statements
Too long; didn't read
Analyze what Postgres is doing by enabling the shared library pg_stat_statements: it will give you useful statistics about your queries!
The problem
I recently wrote about streams and how effective they are at manipulating large amounts of data. A typical flow takes data from a source, filters it and then stores it in some sink. The storage part involves a database, and in many cases that is a relational database. One popular implementation is PostgreSQL.
Generally speaking, when you integrate a component into your application, you need to monitor that it works as you expect. When manipulating data as streams, your bottlenecks are likely the source or the sink, and both may be relational databases under your control!
The reason we care about this is our users' time: every second of computation we shave off saves minutes, if not hours, of our users' time (depending on how widely adopted our software is).
So what do you do if your stream or application runs fast right up until it reaches your PostgreSQL storage?
It is a problem indeed
Databases are everywhere. Often you work with them through frameworks like Ruby on Rails, which make storing data easy. Even if you write bare SQL yourself, it pays to monitor how your database is behaving under the hood.
Oftentimes shaving off seconds of computation is just an index away: ah, if only you knew which query was slow!
And there is a solution
PostgreSQL comes with a useful extension for collecting statistics about queries: pg_stat_statements. It ships as a shared library that the PostgreSQL server has to preload at startup, before you connect with psql, the PostgreSQL command line client.
You can request the preload in postgresql.conf, the database configuration file, or inline when starting the server process:
postgres -c shared_preload_libraries='pg_stat_statements' ...
The reason you cannot enable it afterwards is that the server has to be instrumented at startup to collect statistics. That is also why it is disabled by default: instrumenting the database adds overhead to every query, making your database slower than it could be.
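For reference, this is what the configuration file route could look like; the two pg_stat_statements.* settings are optional and shown here with illustrative values:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000   # number of distinct statements to keep statistics for
pg_stat_statements.track = all   # also track statements executed inside functions
After editing postgresql.conf, restart the server for the preload to take effect.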
Once you are in the SQL REPL, you can create the extension, which exposes the view that collects statistics:
create extension pg_stat_statements;
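A quick sanity check, assuming everything is wired up correctly, is to query the view right away; if the library was not preloaded, PostgreSQL raises an error instead of returning a count:
select count(*) from pg_stat_statements;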
At this point you should run some queries. Ideally, you will run a performance test of the application that relies on the database, to mimic normal usage of your software. That way, you can pinpoint the queries that consume most of your application users' time.
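If you do not have a realistic test at hand, a rough stand-in is pgbench, the benchmarking tool that ships with PostgreSQL; a minimal sketch, assuming a database named mydb:
pgbench -i mydb            # create and populate the benchmark tables
pgbench -c 10 -T 60 mydb   # run 10 concurrent clients for 60 seconds
Keep in mind that pgbench runs its own synthetic workload, so it tells you how the server behaves under load rather than how your own queries perform.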
After you run your test, just query the statistics view like so:
select substring(query, 1, 100) as query, calls, total_time, max_time, mean_time, rows from pg_stat_statements order by calls desc;
I order the statistics by the number of calls a query received during my test. Ordering by total_time makes a lot of sense too, if you want to do it differently. Note that I also truncate the query text to 100 characters to keep it readable; just remove the substring if you want the raw query.
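For example, a variant that surfaces the most expensive queries first could look like this (a sketch, limited to the top 10):
select substring(query, 1, 100) as query, calls, total_time, mean_time, rows from pg_stat_statements order by total_time desc limit 10;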
An example of the output of the query above is this (times are in milliseconds):
query | calls | total_time | max_time | mean_time | rows |
---|---|---|---|---|---|
BEGIN | 13604 | 6.81200000000054 | 0.023 | 0.000500735077918257 | 0 |
COMMIT | 13604 | 7.23400000000058 | 0.032 | 0.000531755366068805 | 0 |
someQuery1 | 6678 | 367.142000000007 | 0.398 | 0.0549778376759509 | 6678 |
someQuery2 | 6000 | 85.3680000000014 | 0.232 | 0.014228 | 6000 |
someQuery3 | 4284 | 174.814999999998 | 0.247 | 0.0408064892623715 | 4284 |
someQuery4 | 2098 | 6.47599999999978 | 0.04 | 0.00308674928503337 | 0 |
someQuery5 | 2098 | 6.98699999999979 | 0.026 | 0.00333031458531935 | 0 |
someQuery6 | 1882 | 639.127 | 2.367 | 0.339599893730074 | 1882 |
someQuery7 | 899 | 560.159 | 2.225 | 0.623091212458286 | 920 |
someQuery8 | 882 | 9.41099999999997 | 0.147 | 0.0106700680272109 | 882 |
You may notice in the output above that the problematic queries seem to be someQuery6, someQuery7, someQuery1 and someQuery3, because they take longer (see their total_time) than anything else. All of them looked like perfectly reasonable queries to write, naturally. Monitoring rescued me!
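If you prefer hard numbers over eyeballing the table, a window function can express each query's share of the total time; a sketch along the lines of the query above:
select substring(query, 1, 100) as query, calls, total_time, round((100 * total_time / sum(total_time) over ())::numeric, 2) as percent from pg_stat_statements order by total_time desc limit 10;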
If you need to rerun a test, clean the collected statistics first so that your analysis starts from a clean slate. You can do it like this:
select pg_stat_statements_reset();
One more trick for you! If you want to know which plan PostgreSQL is going to use for your query, you can run explain. For example, say you have a query like this:
select id, content from testtable;
If you run explain on it you get the query plan:
explain select id, content from testtable;
QUERY PLAN |
---|
Seq Scan on testtable (cost=0.00..22.70 rows=1270 width=36) |
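The plan above is a sequential scan over the whole table. If one of the slow queries you spotted filters on a column, this is where the index mentioned earlier comes in; a sketch, assuming a hypothetical filter on the content column:
create index testtable_content_idx on testtable (content);   -- hypothetical index name
explain analyze select id, content from testtable where content = 'some value';
explain analyze actually executes the query and prints the real timings next to the planner's estimates, so you can check whether the planner switched from a sequential scan to an index scan.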
Conclusion
So start your PostgreSQL with one small option, run a performance test, and see where your preferred database spends its time. Every second you save makes somebody happy!
Happy monitoring!