So here is my situation. I have 2 tables with the following DDL.
As you can see, the tags table is a simple value-id-table. The second table represents a join table between pages and tags.
The goal of my Query should be to get the most used tag from the join table. Only the first x-Rows are of interest to me. To get there I used a simple limit command. So just for comparison here a simple query of the join table without the actual values.
This is just a statement to get you the picture of cost for a simple query (without fetching any actual values).
To make this query useful I needed to add the values. All the values will be joined through the tags table.
Here the first implementation I came up with.
As you can see, simply joining the table makes this query quite complex. The part which consumes most of the cost is the more complicated group by clause. Now the execution engine has to join these tables and then sort all values by id and string (mostly the value is the important part).
To avoid this there only could be one solution – remove the join. With removing the join there comes the question how to get the values from the second table. One way to do this would be to use the program (in my case a php web application) to query again for every line of the result set.
Another way to approach this would be to do a sub-select in the select section. This way you don’t have the additional round trip of doing it in the application. Another advantage would be that the database would only do these sub-selects for the actually returning rows (with respect of the limit).
So here the query I came up with (with the query execution plan)
As you can see i still costs a lot. It is still 3 times more expensive then doing it without the values. On the other hand the cost is only a fourth of the cost of the join. This is mostly owed to the limit clause. The join has no way of knowing that it would be enough to run the limit without the join and later join the values. So far I found no way to tell postgres to do this more efficient.
So the simplest solution for that would be to do sub-queries. With that, the limit clause will be honored.
So as this example shows, it is always a good idea to try different approaches to one and the same query. Often you can see lots of differences in the execution plan which can have a major impact on performance.