Saturday, May 11, 2024

Leveraging PostgreSQL statistics for min() and max() approximation.

Sometimes you have a big table (terabytes or more) and you want to know the minimum, maximum, or average value of some column. This can happen when you prepare your training and testing datasets for your latest Machine Learning (a.k.a. AI) project!

Let's assume we have a big table with the following structure.

>>> \d my_table
                Table "public.my_table"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 userid      | integer |           |          |
 date        | bigint  |           |          |
 description | text    |           |          |
 body        | text    |           |          |
 id          | uuid    |           | not null |
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)

In my test environment, the table is 12 TB and holds approximately 3 billion rows.

Method 1. Full table scan

Let's imagine you take the simplest approach:

>>> SELECT min(date) from my_table;
... sound of grass growing ...
...
      min
---------------
 1651924922126
(1 row)

Time: 878291.856 ms (14:38.292)

This sledgehammer method works but requires real patience if your data is really big. Such a query can take minutes or hours to complete. Here it finished in just under 15 minutes, as the dataset is only 12 TB.

Method 2. Indexes

You can create indexes on the columns you need. This will speed up *some aggregates*, including min() and max().
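As a rough sketch, the index used here could be created like this (assuming a plain btree index and a hypothetical name my_table_date_idx; CONCURRENTLY avoids locking out writes, but building it on a 12 TB table still takes a long time and extra disk space):

>>> CREATE INDEX CONCURRENTLY my_table_date_idx ON my_table (date);

With the index on the date column in place, the same query from Method 1 returns almost instantly: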

>>> SELECT min(date) FROM my_table;
      min
---------------
 1651924922126
(1 row)

Time: 2.177 ms

That was fast! Unfortunately, there are many situations where you can't create indexes, or the approach is simply not practical: you don't own the system, there are too many columns to index, or you don't have enough disk space for the indexes.

Method 3. Sampling

You can use the TABLESAMPLE clause to read only a portion of the table, like 0.05% or 0.1% in the following examples:

>>> SELECT min(date) AS samp_min_date FROM my_table tablesample system(0.05);
 samp_min_date
---------------
 1651941070954
(1 row)

Time: 12137.988 ms (00:12.138)

>>> SELECT min(date) FROM my_table tablesample system(0.1);
      min
---------------
 1651931602939
(1 row)

Time: 24582.320 ms (00:24.582)

That's much faster than a full table scan. Sampling with TABLESAMPLE is nice and good enough in many cases.
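If page-level SYSTEM sampling is too uneven for your data, there is also the BERNOULLI method, which samples individual rows instead of whole pages. Keep in mind it still visits every page, so on a table this size it loses most of the speed advantage (a sketch, no timing shown):

>>> SELECT min(date) FROM my_table TABLESAMPLE BERNOULLI (0.1);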

Method 4. Database statistics

If you know that relational database systems (RDBMS) keep statistics on attributes (columns), you can use those statistics to get a similar or even better approximation almost instantly, without scanning through tons of data!

PostgreSQL keeps such statistics too, and you can access them in the pg_stats system view. Here we use the pg_stats.histogram_bounds column, which is of the anyarray type and hence needs special casting to be useful.
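If you first want to see what the collected statistics look like for a single column, a quick peek like this works (a sketch; output omitted; these columns are filled in by the last ANALYZE run):

>>> select null_frac, n_distinct, most_common_vals, histogram_bounds
  from pg_stats
 where tablename = 'my_table' and attname = 'date';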

For any numeric column (integer, float, etc.), this expression yields an approximate minimum value for the whole dataset: (histogram_bounds :: text :: numeric[])[1] from pg_stats where tablename='my_table' and attname='my_column'

Examples - the same expression works for both the date and userid columns:

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_date
  from pg_stats
 where tablename = 'my_table' and attname = 'date';
  
 stat_min_date
---------------
 1651932598261
(1 row)

Time: 1.297 ms

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_userid
  from pg_stats
 where tablename = 'my_table' and attname = 'userid';
 
 stat_min_userid
-----------------
            2038
(1 row)

Time: 1.237 ms
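The same trick gives an approximate max(): it is simply the last histogram bound. A sketch of the symmetric query (same casting, with array_upper picking the last element; output omitted):

>>> select (histogram_bounds :: text :: numeric[])[array_upper(histogram_bounds :: text :: numeric[], 1)] as stat_max_date
  from pg_stats
 where tablename = 'my_table' and attname = 'date';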

That's really fast; the results are returned almost instantly, even without an index on the respective columns!

But how accurate is it? We need to remember that any statistically sampled result is not perfectly precise. In my table, the actual value of min(userid) is zero, but here you see 2038.

Still, it is roughly the right order of magnitude. If you want more precise statistics, you can ask PostgreSQL to sample your data more thoroughly by raising the default_statistics_target session parameter and running ANALYZE again; the results should then be closer to reality:

>>> set default_statistics_target to 1000;
SET
Time: 0.426 ms
     
>>> analyze verbose my_table ( userid );
INFO:  analyzing "public.my_table"
INFO:  "my_table": scanned 300000 of 53667393 pages, containing 16709008 live rows and 0 dead rows; 300000 rows in sample, 2989096330 estimated total rows
ANALYZE
Time: 128758.921 ms (02:08.759)

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_userid
  from pg_stats
 where tablename = 'my_table' and attname = 'userid';

 stat_min_userid
-----------------
             262
(1 row)

Time: 1.904 ms
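If you want the higher target to stick for just one column rather than the whole session, a per-column statistics target is an alternative (a sketch, reusing the same target of 1000; note this is a persistent table setting, so it also affects future autovacuum-driven ANALYZE runs):

>>> ALTER TABLE my_table ALTER COLUMN userid SET STATISTICS 1000;
>>> ANALYZE my_table ( userid );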

Thanks for reading!