2015 Jun 24

A brief discussion of RDBMS history, and the benefits of adhering to SQL standards when using some of the advanced functionality that PostgreSQL provides.

We work with databases all the time. We're constantly stuffing large amount of data into them, ranging from page content to meta-data, numbers and more.

We worry about backing them up, testing out of Disaster Recovery plans (we all have them right?), and optimizing them to ensure that they are performant for the basic create, read, update and delete (CRUD) operations that they're primarily used for.

Every once and awhile though, we get the opportunity to push them and use them for their primary purpose. I'm always amazed at the power that resides under our fingers that we rarely exploit.

Well, here at Industrial we recently had a project with some unique requirements that allowed us to push the database tier a bit more than usual.

Some History

In the beginning there was chaos.

In the early 70s, there were a lot of brilliant minds wrestling with the challenge of data storage and, in particular, how to relate one set of data to another. A lot of research was done at a number of large organizations such as IBM as well as universities (of course) and early interations of what would be termed "Relational Database Management Systems" (RDMS) were born.

With all of these various RDMS technologies appearing on the scene, a new language was needed to query these tools and provide the means to define and/or manipulate the relational datasets it returned. In the mid-70's, structured query language (SQL) assumed that role and adoption exploded in the late 70's and early 80's. It became so ubiquitous, SQL was officially turned into an ANSI standard in 1986 and an ISO standard in 1987.

Since that time, the SQL standard has evolved and adopted a versioning approach to its feature set which various RDMSs are free to implement wholly, or piecemeal as they deem appropriate.

It is likely the most used language in the world.

Relational Database Management Systems

In the 40 odd years since the first RDMS applications hit the scene, there has been a lot of changes. Many companies, technologies, tools and man-hours have been spent extending the SQL standard and implementing these changes in the various tools. With many players, some of whom have strong opinions, financial incentives and institutional investments in their implementation, the adoption of the SQL standard has varied widely over the years.

Here's a very small subset of the players in the RDMS space, but the most common/popular options:

  • Oracle - Largest proprietary player in the space with a very, very large install base. Primarily aimed at large enterprise customers and carries a pretty hefty price-tag to go along with it. Very capable product/platform and it is the underpinning of some of the largest applications available today.
  • DB2 - IBMs proprietary RDMS product, aimed at the same customers that Oracle targets. Very capable RDMS, if a bit less feature-filled as the Oracle product line.
  • Microsft SQL Server - Microsoft's proprietary RDMS product. It is primarily aimed at medium-to-large customers, but a free SQL Server Express product is available for smaller installs and applications. It is the database backend for Azure so it certainly scales. Often very compelling for larger clients given the substantial discounts available for Microsofts' volume license customers.
  • MySQL - The most widely used RDBMS in the world. Open source, but wholly owned by Oracle after Oracle aquired Sun Microsystems in 2009 (who previously acquired MySQL). This upset a few folks, who have forked the MySQL project into alternatives such as Drizzle and MariaDB. While Oracle has maintained MySQL, the feature completeness has waned and adoption of the SQL-92+ standards has been sluggish.
  • PostgreSQL - Probably the youngest option on this list, but it's still almost 19 years old. It is also another Open Source alternative, and very opinionated as well. In this case, the developers (there are many) are strict adherents to the SQL standard and, at the time of this writing, it is the only RDBMS that almost fully supports SQL-2011.

PostgreSQL

The initial release of PostgreSQL was in July 8, 1996. It is cross-platform, so it runs on Windows, Linux, Solaris and MacOSX. It is developed by the PostgreSQL Global Development Group, which is comprised of developers/contributors from all over the world and sponsored by many organizations.

If you're hard-up to find some bedtime reading material, I recommend walking through some of the SQL standards documents on Wikipedia. SQL standards are named according to the year that they're ratified, so you can see specific pages for SQL-92, SQL:1999, SQL:2003 and the latest, SQL:2011.

It's important to point this out as PostgreSQL adopts these standards as they're ratified. The vast majority of the features/changes detailed in these standards have not been implemented in MySQL and while the community has provided plugins to support these features, the core product still lacks features in the SQL-92 standard.

Here's a few of these features, with examples, that are available in PostgreSQL that exhibit the capabilities of the platform over both its proprietary and open source alternatives.

Window Functions

As someone once said, there was SQL before window functions, and SQL after window functions. They're incredibly powerful, but fairly easy to implement. Pre-window functions, standard aggregate functions, such as AVG, SUM would be used with sub-selects to get similar data, but rows would still be grouped into single output rows.

With window functions, each row retains their separate identity.

The PostgreSQL documentation provides a lot more in-depth details on window functions. Here's a quick example though:

SELECT  p.id AS id,
        p.sku AS number,
        p.cost
        NTILE(4) OVER (ORDER BY o.cost) AS quartile
FROM    products p

Very simple example, but the result set from this query would separate the products list into quartiles which is a useful tool for separating quarters based on the cost to produce these items (in this example).

Advanced Data Types

Oh boy. One of the most amazing advances in databases is to break out of the standard data types that we all know and love. Varchars, text, integers, blobs etc. are all excellent, but limited when you want more from your data.

For example, PostgreSQL supports the following:

Geometric Types - Two dimentional spatial objects and associated functions for 'scaling, translation, rotation, and determining intersections' which is simply incredible.

Network Address Types - These types store network address data such as IPv4, IPv6 and MAC and offers error checking and a lot of special operators and functions to work with this data.

UUID Types - UUIDs (or GUIDs) are Universally Unique Identifiers which take you out of your own database when it comes to referential integrity. UUIDs can be used to ensure uniqueness within distributed systems rather than sequence ids (1, 2, 3, 4 etc.) which are linked solely to singular RDBMS environments.

hstore Types - Finally, my favourite advanced data-type in PostgreSQL is the hstore. Often, we develop data-models that are not one-size fits all models that need to be constantly augmented based on client requirements etc. Hstore helps, as these are native key/value stores (much like JSON) that can provide this 'semi-structured' data, rather than continually adding NULL-value columns.

Materialized Views

Reporting folks everywhere will love materialized views. They're very similar to views which are essentially stored query result sets that can be queried themselves. The difference is that materialized views are generated point-in-time, which is great for reporting cycles without the overhead of generating temporary tables and re-populating them.

Here's an example:

CREATE MATERIALIZED VIEW mv_product_tax_summary AS
  WITH orders AS (
    SELECT  o.id      AS id,
            o.total   AS total,
            (SELECT t.total FROM taxes t WHERE o.id = taxes.id AND t.year = 2015) AS 2015_taxes,
            (SELECT t.total FROM taxes t WHERE o.id = taxes.id AND t.year = 2014) AS 2014_taxes,
    FROM    orders
  )
  SELECT  p.id AS id,
          p.name_en AS name_en,
          o.total AS order_total,
          o.2015_taxes AS order_2015_taxes,
          o.2014_taxes AS order_2015_taxes
  FROM    products p, orders o
  WHERE   p.order_id = o.id

At the time this view is generated, the data will be stored so future access to this view will be consistent. Then, in order to update this view, simply issue the following:

REFRESH MATERIALIZED VIEW mv_product_tax_summary

This will update the view with the latest results of that SELECT.

Common Table Expressions

In the above example, I also included another really powerful feature of PostgreSQL (which is also a SQL standard). The WITH common table expression. The example provided is pretty contrived, but common table expressions exist to de-clutter SQL. They're generate temporary tables that exist solely for the lifetime of a single query and can encapsulate logic specific to their function rather than spreading it across the overall query itself.

Conclusion

As prefaced, there is always a time to ask not what you can do for your database platform, but what it can do for you. While it's not new, the power and flexibility provided by PostgreSQL is incredible. The standards to which is developed against and adheres is a true testament to the open source community.

Not only is it great and the standard CRUD functionality, applying even one or two of the advanced features listed above exposes some of the true power of the tools that sit under the hood of most of our applications, without us even being aware of it (DBAs aside).