By joanolo


2017-01-06 23:32:53 8 Comments

Queries like SELECT * FROM t ORDER BY CASE WHEN _parameter='a' THEN column_a END, CASE WHEN _parameter='b' THEN column_b END are possible, but: Is this a good practice?

It is common to use parameters in the WHERE part of queries, and to have some computed columns in the SELECT part, but not that common to parameterize the ORDER BY clause.

Let's say we have an application that lists pre-owned cars (à la Craigslist). The list of cars can be sorted by price or colour. We have one function that, given a certain number of parameters (say, price range, colour, and sorting criterion), returns a set of records with the results.

To make it concrete, let's assume cars are all in the following table:

CREATE TABLE cars
(
  car_id serial NOT NULL PRIMARY KEY,  /* arbitrary anonymous key */
  make text NOT NULL,       /* unnormalized, for the sake of simplicity */
  model text NOT NULL,      /* unnormalized, for the sake of simplicity */
  year integer,             /* may be null, meaning unknown */
  euro_price numeric(12,2), /* may be null, meaning seller did not disclose */
  colour text               /* may be null, meaning unknown */
) ;

The table would have indices for most columns...

CREATE INDEX cars_colour_idx
  ON cars (colour);
CREATE INDEX cars_price_idx
  ON cars (euro_price);
/* etc. */

And have some commodity enumerations:

CREATE TYPE car_sorting_criteria AS ENUM
   ('price',
    'colour');

... and some sample data

INSERT INTO cars (make, model, year, euro_price, colour)
VALUES
    ('Ford',   'Mondeo',   1990,  2000.00, 'green'),
    ('Audi',   'A3',       2005,  2500.00, 'golden magenta'),
    ('Seat',   'Ibiza',    2012, 12500.00, 'dark blue'),
    ('Fiat',   'Punto',    2014,     NULL, 'yellow'),
    ('Fiat',   '500',      2010,  7500.00, 'blueish'),
    ('Toyota', 'Avensis',  NULL,  9500.00, 'brown'), 
    ('Lexus',  'CT200h',   2012, 12500.00, 'dark whitish'), 
    ('Lexus',  'NX300h',   2013, 22500.00, NULL) ;

The kind of queries we are going to make are of the style:

SELECT
    make, model, year, euro_price, colour
FROM
    cars
WHERE
    euro_price between 7500 and 9500 
ORDER BY
    colour ;

We would like to have queries of this style in a function:

CREATE or REPLACE FUNCTION get_car_list
   (IN _colour    text, 
    IN _min_price numeric, 
    IN _max_price numeric, 
    IN _sorting_criterium car_sorting_criteria) 
RETURNS record AS
$BODY$
      SELECT
          make, model, year, euro_price, colour
      FROM
          cars
      WHERE
           euro_price between _min_price and _max_price
           AND colour = _colour
      ORDER BY
          CASE WHEN _sorting_criterium = 'colour' THEN
            colour
          END,
          CASE WHEN _sorting_criterium = 'price' THEN
            euro_price
          END 
$BODY$
LANGUAGE SQL ;

Instead of this approach, the SQL in this function could be generated dynamically (in PL/pgSQL) as a string, and then EXECUTEd.
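A minimal sketch of that dynamic variant (assuming the table and enum above; note that only the identifier is concatenated, while the values stay bound via USING):

```sql
-- Sketch only: a PL/pgSQL variant that builds the ORDER BY column as a string.
CREATE OR REPLACE FUNCTION get_car_list_dyn
   (IN _colour    text,
    IN _min_price numeric,
    IN _max_price numeric,
    IN _sorting_criterium car_sorting_criteria)
RETURNS SETOF cars AS
$BODY$
BEGIN
   RETURN QUERY EXECUTE
      'SELECT * FROM cars
       WHERE euro_price BETWEEN $1 AND $2
         AND colour = $3
       ORDER BY ' ||
      CASE _sorting_criterium       -- map enum value to a real column name
         WHEN 'price'  THEN 'euro_price'
         WHEN 'colour' THEN 'colour'
      END
   USING _min_price, _max_price, _colour;
END
$BODY$
LANGUAGE plpgsql;
```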

We can sense a few limitations, advantages and disadvantages with either approach:

  1. Within a function, finding the query plan for a given statement is difficult (if possible at all). Yet we tend to put queries into functions precisely when they will be used often.
  2. Errors in static SQL will be caught (mostly) when the function is compiled or when it is first called.
  3. Errors in dynamic SQL will (mostly) only be caught after the function is compiled and every execution path has been exercised (i.e.: the number of tests to perform on the function can be really high).
  4. A parametric query like the one shown will probably be less efficient than a dynamically generated one; on the other hand, a dynamic query has to be parsed, planned and optimized on every single execution (which could impact efficiency in the opposite direction).

Question:

How to "get the best of both worlds" (if possible)? [Efficiency + Compiler checks + Debugging easily + Optimizing easily]

NOTE: this is to be run on PostgreSQL 9.6.

4 comments

@Erwin Brandstetter 2017-01-17 01:21:47

General answer

First, I want to address ambiguity in the premise:

It is common to use parameters in the WHERE part of queries, and to have some computed columns in the SELECT part, but not that common to parameterize the ORDER BY clause.

Computed columns in the SELECT part are hardly ever relevant for the query plan or performance. But "in the WHERE part" is ambiguous.

It's common to parameterize values in the WHERE clause, which works for prepared statements. (And PL/pgSQL works with prepared statements internally.) A generic query plan often makes sense regardless of provided values. That is, unless tables have a very uneven data distribution, but since Postgres 9.2 PL/pgSQL re-plans queries a couple of times to test whether the generic plan seems good enough.

But it's not as common to parameterize whole predicates (including identifiers) in the WHERE clause, which is impossible with prepared statements to begin with. You need dynamic SQL with EXECUTE, or you assemble the query strings in the client.

Dynamic ORDER BY expressions are somewhere in between both. You can do it with a CASE expression, but that's very hard to optimize in general. Postgres might use indexes with a plain ORDER BY, but not with CASE expressions hiding the eventual sort order. The planner is smart, but not an AI. Depending on the rest of the query (ORDER BY may be relevant for the plan or not - it's relevant in your example), you may end up with a sub-optimal query plan all the time.
Plus, you add the minor cost of the CASE expression(s), and in your particular example that of multiple useless ORDER BY columns.
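Schematically, the difference the planner sees (a sketch against the cars table from the question):

```sql
-- A plain ORDER BY can be served by an index on (euro_price):
SELECT * FROM cars ORDER BY euro_price;

-- A CASE expression hides the sort column, so the planner must sort explicitly
-- (_sorting_criterium stands in for the function parameter):
SELECT * FROM cars
ORDER BY CASE WHEN _sorting_criterium = 'price' THEN euro_price END;
```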

Typically, dynamic SQL with EXECUTE is faster or much faster for this.

Maintainability should not be a problem if you maintain a clear and readable code format in your function body.

Fix demo function

The function in the question is broken. The return type is defined to return an anonymous record:

RETURNS record AS

But the query actually returns a set of records; it would have to be:

RETURNS SETOF record AS

But that's still unhelpful. You'd have to provide a column definition list with every call. Your query returns columns of well-known type. Declare the return type accordingly! I am guessing here, use actual data types of returned columns / expressions:

RETURNS TABLE (make text, model text, year int, euro_price numeric, colour text) AS

I use the same column names for convenience. Columns in the RETURNS TABLE clause are effectively OUT parameters, visible in every SQL statement in the body (but not inside EXECUTE). So table-qualify columns in queries in the function body to avoid possible naming conflicts. The demo function would work like this:

CREATE or REPLACE FUNCTION get_car_list (
    _colour            text, 
    _min_price         numeric, 
    _max_price         numeric, 
    _sorting_criterium car_sorting_criteria) 
  RETURNS TABLE (make text, model text, year int, euro_price numeric, colour text) AS
$func$
      SELECT c.make, c.model, c.year, c.euro_price, c.colour
      FROM   cars c
      WHERE  c.euro_price BETWEEN _min_price AND _max_price
      AND    c.colour = _colour
      ORDER  BY CASE WHEN _sorting_criterium = 'colour' THEN c.colour     END
              , CASE WHEN _sorting_criterium = 'price'  THEN c.euro_price END;
$func$  LANGUAGE sql;
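
A call against the sample data might look like this (a hypothetical invocation, not from the original answer):

```sql
SELECT * FROM get_car_list('brown', 7500, 9500, 'price');
--  make   | model   | year | euro_price | colour
--  Toyota | Avensis |      |    9500.00 | brown
```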

Do not confuse the RETURNS key word in the function declaration with the plpgsql RETURN command, as Evan did in his answer.

General difficulty of the example query

Predicates on some columns (even worse: range predicates) combined with ORDER BY on other columns, that's already hard to optimize. But you mentioned in a comment:

Actual result sets can be in the order of several thousand rows (and thus, sorted and paginated in smaller chunks server-side)

So you'll add LIMIT and OFFSET to these queries, returning the n "best" matches first. Or some smarter pagination technique.

You need a matching index to make this fast. I don't see how this could possibly work with CASE expressions in ORDER BY.

Consider:
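One way this could go (a sketch with hypothetical names, not the answer's original code): pass the sort column through format() with %I, so the planner sees a plain column reference and a matching index can serve the top-N sort:

```sql
-- Sketch: dynamic ORDER BY via format(); %I safely quotes the identifier.
CREATE OR REPLACE FUNCTION get_car_page(
    _min_price numeric,
    _max_price numeric,
    _sort_col  text,   -- e.g. 'euro_price' or 'colour'
    _limit     int)
  RETURNS SETOF cars AS
$func$
BEGIN
   RETURN QUERY EXECUTE format(
      'SELECT * FROM cars
       WHERE  euro_price BETWEEN $1 AND $2
       ORDER  BY %I
       LIMIT  %s', _sort_col, _limit)
   USING _min_price, _max_price;   -- values stay parameterized
END
$func$  LANGUAGE plpgsql;
```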

@joanolo 2017-01-17 20:04:11

The most comprehensive answer, also pointing out a few of the mistakes I made when simplifying my real use case. The conclusion is clear, though: go for dynamic SQL and forget about parametric ordering; that's a corner case in practice, and DB languages are not optimized for it. To make things funnier, my real scenario tends to involve full-text queries sorted by ts_rank, which is a real performance hit, no matter what efforts we make to avoid computing it where possible.

@Lucas 2017-01-07 09:08:07

A CASE expression in the ORDER BY of a function does not make sense in the described situation.

Indexes will not be used for the ORDER BY clause that way: the database engine has to evaluate the CASE expression for every row and then sort. And this technique does not extend to dynamic joins and filters.

I would go with a dynamic SQL generation...

Generating dynamic SQL in a PostgreSQL function

You can do something like this:

  create or replace function fn_test(_sort text) returns text as $$
  declare
      xsql text;
      xresult text;
  begin
      -- aggregate over an ordered subquery, so the sort actually
      -- applies to the rows being aggregated
      xsql := 'select array_to_json(array_agg(c)) from (select * from cars';

      if _sort = 'colour' then
          xsql := xsql || ' order by colour';
      elsif _sort = 'model' then
          xsql := xsql || ' order by model';
      else
          raise exception 'invalid parameter: sort option=% is invalid', _sort;
      end if;

      xsql := xsql || ') c';

      execute xsql into xresult;

      return xresult;
  end
  $$ language plpgsql;

I do not usually write dynamic SQL inside PL/pgSQL for something so particular to the presentation layer; I prefer to leave that SQL building to PHP or Java. But for a lot of other things I do write plenty of dynamic SQL inside PL/pgSQL, mostly for partitioning, database maintenance, workflow implementation and data-consistency controls.

I have found this policy leaves the code cleaner, makes better use of indexes and scales to more complicated dynamic SQL. Index usage is critical in my world because I work with analytics in databases with several billion records and need fast responses.

Additional notes on why it is more common to generate dynamic SQL outside the database (following comments)

That happens because most dynamic SQL is only needed when a table name or a column name is dynamic; for everything else, parameterized queries will do fine. Dynamic SQL is needed for most reports, not only because you will need to choose the sort column, but because most of the time you will need to dynamically include filters and columns. Many developers keep reporting SQL in the report programs in the presentation layer, or in the batches that generate them, in which case the dynamic SQL is generated outside the database as well.
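A sketch of such conditional filter building (shown in PL/pgSQL here, though the same concatenation logic applies in PHP or Java; the names are hypothetical):

```sql
-- Sketch: filters appended only when given; values stay bound via USING.
CREATE OR REPLACE FUNCTION car_report(_colour text, _max_price numeric)
  RETURNS SETOF cars AS
$func$
DECLARE
   v_sql text := 'SELECT * FROM cars WHERE true';
BEGIN
   IF _colour IS NOT NULL THEN
      v_sql := v_sql || ' AND colour = $1';
   END IF;
   IF _max_price IS NOT NULL THEN
      v_sql := v_sql || ' AND euro_price <= $2';
   END IF;
   RETURN QUERY EXECUTE v_sql USING _colour, _max_price;
END
$func$ LANGUAGE plpgsql;
```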

@bentaly 2017-01-11 19:18:01

I would use dynamic SQL in this case.
The dynamic query is less complex and may yield a better execution plan.
From my experience, any gain from plan caching or shorter parse time, is negligible next to that.

I used to follow the rule of "Use dynamic SQL only if static SQL won't work". Today as a rule of thumb, I usually choose it according to the clause of the SQL query I need flexibility in:

  • WHERE- Static SQL
  • FROM, ORDER BY, GROUP BY - Dynamic SQL
  • SELECT - Alternate

True, it's a bit harder to debug, but using some good practices with it should help reduce the gap.
For example, in plpgsql you might use a $ notation + REPLACE + RAISE NOTICE:

-- Write the entire query using $ for replacements.
-- Don't use || operator.
-- This makes the dynamic query clearer to read and easier to maintain and debug.
v_sql:= 'SELECT <some columns> FROM tbl ORDER BY $sorting$';

v_sql:= REPLACE(v_sql,
  '$sorting$',
  CASE condition
    WHEN value1 THEN 'colX'
    WHEN value2 THEN 'colY'
  END);

-- Debug the query when running
RAISE NOTICE '%', v_sql;
RETURN QUERY EXECUTE v_sql;  

You should be able to trap any syntax errors with the first runs.
Logical errors will be just as easy/hard to find, as with static SQL.
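Put together into a complete (hypothetical) function for the cars example from the question:

```sql
-- Sketch: the $sorting$ placeholder technique in a complete function.
CREATE OR REPLACE FUNCTION get_cars_sorted(_sort car_sorting_criteria)
  RETURNS SETOF cars AS
$func$
DECLARE
   v_sql text;
BEGIN
   -- whole query written up front; $sorting$ marks the dynamic part
   v_sql := 'SELECT * FROM cars ORDER BY $sorting$';

   v_sql := REPLACE(v_sql, '$sorting$',
      CASE _sort
         WHEN 'colour' THEN 'colour'
         WHEN 'price'  THEN 'euro_price'
      END);

   RAISE NOTICE '%', v_sql;      -- debug: print the final query
   RETURN QUERY EXECUTE v_sql;
END
$func$ LANGUAGE plpgsql;
```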

@bentaly 2017-01-15 12:34:26

Right you are @joanolo - Added.

@Evan Carroll 2017-01-10 23:57:21

Three points that I would raise,

  1. This is a very basic query, even in your applied version. Create a VIEW for it. Have your users customize the WHERE conditions using the VIEW. Functions are black boxes to the query planner. It's horrible to use them inside other functions; only SQL functions are inlined. And dynamic functions do not get cached plans.
  2. If you want to continue to use plpgsql, use RETURNS QUERY (or RETURNS QUERY EXECUTE) not RETURNS SETOF. There is no reason to use RETURNS SETOF with a sort. It has to be buffered anyway, afaik. You will encounter problems with either one if your resultset is greater than work_mem.
  3. I'm not sure what your app is written in. I assume web. I was in the automotive industry for a decade though, made lots of things like Craigslist and posting tools for Craigslist. "Don't sort stuff for users in the database" is generally a good rule of thumb. There is no reason for it. Drop that stuff out in JSON, and let them handle it in the browser. Unless you're showing more than 1000 rows, it's not even worth thinking about. Just consider the round trip time from a cell phone tower. You'll never get that time back wondering about this problem.
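Point 1 might be sketched like this (assuming the cars table from the question):

```sql
-- Sketch: a VIEW the client queries directly, supplying its own
-- WHERE and ORDER BY; the planner sees plain SQL and can use indexes.
CREATE VIEW car_listings AS
SELECT car_id, make, model, year, euro_price, colour
FROM   cars;

-- Client-side usage: filters and sort order stay in the query itself.
-- SELECT * FROM car_listings
-- WHERE  euro_price BETWEEN 7500 AND 9500
-- ORDER  BY colour;
```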

Moving forward, I'd even consider wrapping a service like PostgREST, which handles arbitrary ordering thoroughly.

If you care where nulls are sorted, add nullsfirst or nullslast:

GET /people?order=age.nullsfirst
GET /people?order=age.desc.nullslast

@joanolo 2017-01-11 00:54:43

Good points. The example is just an SQL function, not PL/pgSQL. That's a heavily simplified, made-up version of the real scenario, just to concentrate on possible techniques to allow for parametric sorting (and filtering). There's more than one app consuming queries like this, via REST APIs, and they're PHP, ASP.Net, Node.JS, Python, ... Actual result sets can be in the order of several thousand rows (and thus, sorted and paginated in smaller chunks server-side). They're normally the result of queries involving many more parameters, and full-text searches.

@joanolo 2017-01-11 01:00:14

Have never tried PostgREST, looks really (!) interesting and worth looking at.

@Erwin Brandstetter 2017-01-17 01:37:40

There is no RETURNS QUERY in plpgsql. See explanation: dba.stackexchange.com/questions/35746/…
