Saturday, May 11, 2024

Leveraging PostgreSQL statistics for min() and max() approximation.

Sometimes you have a big table (terabytes or more) and you want to know the minimum, maximum, or average value of some column. This can happen when you prepare your training and testing datasets for your latest Machine Learning (a.k.a. AI) project!

Let's assume we have this big table with the following structure.

>>> \d my_table
                Table "public.my_table"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 userid      | integer |           |          |
 date        | bigint  |           |          |
 description | text    |           |          |
 body        | text    |           |          |
 id          | uuid    |           | not null |
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)

In my test environment, the table is 12 TB in size and has approximately 3 billion rows.

Method 1. Full table scan

Let's imagine you take the simplest approach:

>>> SELECT min(date) from my_table;
... sound of grass growing ...
...
      min
---------------
 1651924922126
(1 row)

Time: 878291.856 ms (14:38.292)

This sledgehammer method works but requires real patience if your data is really big. Such a query can take minutes or hours to complete. Here it took just under 15 minutes, as the dataset is only 12 TB.

Method 2. Indexes

You can create indexes on all the columns you need. This will speed up *some aggregates*, including min() and max().
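
As a minimal sketch (the index name is my own choice; CONCURRENTLY avoids blocking writes on a live table), the index could be created like this:

CREATE INDEX CONCURRENTLY my_table_date_idx ON my_table (date);

With that index on the date column in place, the same min() query is answered straight from the index: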

>>> SELECT min(date) FROM my_table;
      min
---------------
 1651924922126
(1 row)

Time: 2.177 ms

That was fast! Unfortunately, there are many situations where you can't create indexes or where the approach is not practical (either you don't own the system, there are too many columns to be indexed, or you don't have enough disk space for the indexes).

Method 3. Sampling

You can use the TABLESAMPLE clause to select only a portion of the table, like 0.05% or 0.1% in the following examples:

>>> SELECT min(date) AS samp_min_date FROM my_table tablesample system(0.05);
 samp_min_date
---------------
 1651941070954
(1 row)

Time: 12137.988 ms (00:12.138)

>>> SELECT min(date) FROM my_table tablesample system(0.1);
      min
---------------
 1651931602939
(1 row)

Time: 24582.320 ms (00:24.582)

That's much faster than a full table scan. Sampling with TABLESAMPLE is nice and good enough in many cases.
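
Note that SYSTEM sampling picks whole random pages, so physically clustered data can skew the result. If that is a concern, row-level BERNOULLI sampling is an alternative (a sketch below; slower, since it still visits every page, but statistically more uniform):

SELECT min(date) AS samp_min_date
  FROM my_table TABLESAMPLE BERNOULLI (0.1);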

Method 4. Database statistics

If you know that relational database systems keep statistics on table columns, you can use them to achieve even better precision without scanning through tons of data!

PostgreSQL keeps such statistics too, and you can access them in the pg_stats system view. Here we use the pg_stats.histogram_bounds column, which is of the anyarray type and hence needs special casting to be useful.

For any numeric column (integer, float, etc.), this expression will yield the approximated minimum value over the whole dataset: (histogram_bounds :: text :: numeric[])[1] from pg_stats where tablename='my_table' and attname='my_column'

Examples - the same expression works for the date and userid columns:

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_date
  from pg_stats
 where tablename = 'my_table' and attname = 'date';
  
 stat_min_date
---------------
 1651932598261
(1 row)

Time: 1.297 ms

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_userid
  from pg_stats
 where tablename = 'my_table' and attname = 'userid';
 
 stat_min_userid
-----------------
            2038
(1 row)

Time: 1.237 ms

That's really fast - the results are returned almost instantly, even without an index on the respective columns!
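
The approximate maximum works the same way: the histogram bounds span the sampled values from lowest to highest, so the last array element is the max() counterpart of the first one. A sketch using the same casting trick:

select (histogram_bounds :: text :: numeric[])[array_upper(histogram_bounds :: text :: numeric[], 1)] as stat_max_date
  from pg_stats
 where tablename = 'my_table' and attname = 'date';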

But how accurate is it? We need to remember that any statistically sampled result is not perfectly precise. In my table, the actual value of min(userid) is zero, but here you see 2038.

But it's in the right order of magnitude. If you want more precision, you can ask PostgreSQL to analyze your data with a larger sample by raising the default_statistics_target session parameter and running ANALYZE; the results should then be closer to reality:

>>> set default_statistics_target to 1000;
SET
Time: 0.426 ms
     
>>> analyze verbose my_table ( userid );
INFO:  analyzing "public.my_table"
INFO:  "my_table": scanned 300000 of 53667393 pages, containing 16709008 live rows and 0 dead rows; 300000 rows in sample, 2989096330 estimated total rows
ANALYZE
Time: 128758.921 ms (02:08.759)

>>> select (histogram_bounds :: text :: numeric[])[1] as stat_min_userid from pg_stats where tablename = 'my_table' and attname = 'userid';
 stat_min_userid
-----------------
             262
(1 row)

Time: 1.904 ms

Thanks for reading!

Tuesday, November 15, 2022

Detect and remove duplicated indexes in PostgreSQL

In Postgres, you can create multiple indexes on the same table with an identical list of columns. While this is useful during maintenance (you can create a new index and drop the old one, for example), it is NOT advisable in the long term.
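
For illustration, a pair of duplicates could look like this (table and index names below are made up):

CREATE INDEX orders_customer_idx  ON orders (customer_id);
CREATE INDEX orders_customer_idx2 ON orders (customer_id);

Both indexes cover the same column with the same access method, so one of them only wastes disk space and slows down writes.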

Here is a catalog query which will detect and print duplicate indexes:


WITH dup AS (
    -- groups of indexes sharing table, key columns, access method
    -- and (optional) partial-index predicate
    SELECT x.indrelid, x.indkey, i.relam, md5(x.indpred::text)
    FROM pg_index x
    JOIN pg_class i ON i.oid = x.indexrelid
    GROUP BY 1, 2, 3, 4
    HAVING count(*) > 1
)
SELECT
    n.nspname AS schemaname,
    c.relname AS tablename,
    i.relname AS indexname,
    /*am.amname, x.indnatts, x.indkey,
    ARRAY(
        SELECT pg_get_indexdef(x.indexrelid, k + 1, true)
        FROM generate_subscripts(x.indkey, 1) AS k
        ORDER BY k
    ) AS indkeyname,
    x.indpred, */
    pg_get_indexdef(x.indexrelid) AS indexdef,
    pg_size_pretty(c.relpages::int8 * current_setting('block_size')::int) AS tablesize,
    pg_size_pretty(i.relpages::int8 * current_setting('block_size')::int) AS indexsize
FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_am AS am ON i.relam = am.oid
JOIN dup ON (
    dup.indrelid = x.indrelid
    AND dup.indkey = x.indkey
    AND dup.relam = i.relam
    AND dup.md5 IS NOT DISTINCT FROM md5(x.indpred::text))
ORDER BY
    n.nspname, c.relname, dup.md5 NULLS FIRST, i.relname
;
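
Once you have decided which copy to keep, dropping the redundant one is a one-liner (the index name below is hypothetical; CONCURRENTLY avoids taking a blocking lock on the table):

DROP INDEX CONCURRENTLY orders_customer_idx2;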

Tuesday, October 18, 2022

k8grep: Yet Another Ultimate Kubernetes Tool

Hello!

Maybe you're struggling to find some IP address in your Kubernetes cluster? [*]

Or you want to find out what images your deployments use? [**]

Or you need to know which configmaps or secrets have anything to do with Redis? [***]

Here's a powerful bash function to find almost ANY information about your current Kubernetes cluster:

k8grep () {
    # usage: k8grep OBJTYPE GREP_ARGS...
    local objtype=$1
    shift 1
    local grep_args="$@"
    [ "$objtype" -a "$grep_args" ] || {
        echo "# usage: k8grep OBJTYPE GREP_ARGS"
        echo "# example: k8grep pod -n -m 5 postgres"
        return
    }
    # list every object of the given type, then grep each object's YAML;
    # $grep_args stays unquoted on purpose so multiple grep options split into words
    kubectl get --no-headers "$objtype" --all-namespaces |
        while read -r namespace object junk; do
            echo "# kubectl get -n $namespace $objtype $object"
            kubectl get "$objtype" -n "$namespace" "$object" -o yaml | grep $grep_args
        done
}

See it in action. This is what I get after running: k8grep configmap 'REDIS_\w\+' -n -1

[Screenshot: k8grep in action]

By the way, the answers to the questions above (LOL, dare I compete with StackOverflow?) are:

How to find any IP in your Kubernetes cluster?

k8grep endpoints 10.11.12.134

How do I find out what images my k8s deployments use?

k8grep deployment -n "image: "

How do I find which config maps and secrets contain anything about Redis?

for t in configmap secret; do k8grep $t -m1 -n -i redis; done

Tuesday, May 19, 2020

The Shortest FOSS (Free / Open Source Software) License

Now and then, you may wonder what the shortest and simplest possible Open Source license is. Maybe you are contributing some Free Software to the world for your fun, glory and profit, and you want to save precious bytes in your project directory?

The Software Package Data Exchange® (SPDX®) publishes a list of (probably all) Open Source licenses. The page specifies the license name, codename, full text, whether it's approved by the Open Source Initiative, and whether it's recognized by the Free Software Foundation as a Free/Libre license.

I was curious which one is the most terse, so I wrote a little Python script to scrape the data from the SPDX page, parse it, and save it to CSV.

The script is here: spdx_licenses.py.

Here are the winners.

1. The shortest Open Source license is the diffmark license - it's only 17 words and 88 characters long, but it's neither FSF-approved nor OSI-approved. Personally, I love it. The credit for the license text should probably go to Václav Bárta (http://www.mangrove.cz/diffmark/). Use it if you are brave. Quoting in full:
1. you can do what you want with it 2. I refuse any responsibility for the consequences
2. The shortest FSF-approved Open Source license is the FSF All Permissive License. It is 34 words, 224 characters long. Quote:
Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.
3. The shortest OSI-approved Open Source license is the Fair License. It is 36 words, 229 characters long. Quote:
<Copyright Information> Usage of the works is permitted provided that this instrument is retained with the works, so that any entity that uses the works is notified of this instrument. DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.
4. The shortest Open Source license which is both FSF-approved and OSI Approved is the ISC License. It is 128 words, 819 characters long. Use it to get your back covered against most lawyers. Quoting:
ISC License Copyright (c) 2004-2010 by Internet Systems Consortium, Inc. ("ISC") Copyright (c) 1995-2003 by Internet Software Consortium Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

You may also wonder about the losers. Unsurprisingly, one of the longest licenses is the famous GNU General Public License v3.0 (over 5650 words, 38000 characters) - but it's beaten by the Adaptive Public License 1.0, which is over 7100 words and 52500 characters! Whoa - pretty long for a free software license.

Thursday, October 4, 2018

Linux, SQLCMD and Vim

Hello everyone!

If you came here, you probably already know what the SQL Server SQLCMD utility is, and you also know (and either love or hate) the Vim text editor.

Of course, nobody in their right mind would use SQLCMD for day-to-day work on Windows (we have GUIs: Management Studio, etc.).

Yes - I'm one of those guys who stick to the GNU/Linux operating system and generally prefer a command line interface (CLI) over GUIs.

So let's briefly explain the problem that I had:

SQLCMD's interactive mode works (kudos to MS, you can create a working Unix program!), but on the output side it does not have any paging or text formatting capabilities. It does not shine on the input side either, ignoring one of the best inventions of the GNU/Linux world: the readline library.

Basically, this means that working with SQLCMD in interactive mode for anything serious is a mistake. Authoring SQL or reading SQL output is too inconvenient, unless all your queries look like this:

SELECT oneoneone=1, two=3
GO

So, the process for working on any complex SQL query in your CLI text editor is:

  1. write your SQL query
  2. save the SQL file
  3. run the SQL file in sqlcmd, either in interactive mode (using the :r command) or in batch mode (using the "-i" option).
  4. try to browse through the output on screen
  5. decide that the output is too wide and line wrapping makes it unreadable, then switch to the "-o" option for file output
  6. open the output file in any decent text pager / editor (like vim or less)
  7. Inspect the results. Go back to point 1.
Naturally, you could automate this using a bash script, along the lines of:

/opt/mssql-tools/bin/sqlcmd <other options> -i my.sql | less


But my goal was to reduce development loop to this:
  1. write your SQL query
  2. run the SQL and see output in any decent text pager / editor
  3. Inspect the results. Go back to point 1.
Long story short - here is the solution for Vim. You can add it to your .vimrc and see for yourself.

Usage: while editing an SQL file in normal mode, press the letter M; the query is executed via SQLCMD and its results appear in a bottom window pane.

function! RunBufferInSQLCMD()
  " Run the current buffer through sqlcmd and show the output in a split.
  let s:file = expand('%:p')
  if strlen(s:file) > 0
    silent write
    " Connection settings - adjust these to your environment.
    let s:sqlcmd = '/opt/mssql-tools/bin/sqlcmd'
    let s:sqlsrv = 'your.database.windows.net'
    let s:sqldb = 'YourDB'
    let s:sqlusr = 'bobadmin@yours'
    let s:sqlpwd = 'ehkm'
    " Temporary file that will receive the query output.
    let s:ofile = substitute(system('mktemp'), "\n", '', '')
    let s:mycmd = s:sqlcmd
                  \ . ' -S "' . s:sqlsrv
                  \ . '" -d "' . s:sqldb
                  \ . '" -U "' . s:sqlusr
                  \ . '" -P "' . s:sqlpwd
                  \ . '" -i "' . s:file
                  \ . '" -o "' . s:ofile
                  \ . '"'
    let s:myoutput = system(s:mycmd)
    if strlen(s:myoutput) > 0
      echoerr 'Command output: ' . s:myoutput
    endif
    " Open the results in a bottom window pane.
    execute 'silent rightbelow split ' . s:ofile
  else
    echoerr 'Not a file!'
  endif
endfunction

nnoremap M :call RunBufferInSQLCMD()<cr>



That's it for today. Nice Vimming!


Tuesday, September 25, 2018

pg_terminator: tiny utility for PostgreSQL DBAs

Hello everyone!

depesz just released a new useful tool: pg_terminator.

I believe it will be useful in every Pg DBA's toolbox.

It requires just Ruby and the ruby-pg gem.

You run it like this:

$ pg_terminator etc/pg_terminator/webdb.yaml

and the YAML config looks like this:

log: /tmp/pg_terminator.webdb.log
min_sleep: 1
database:
  dbname: webdb
  host: webdbsrv1
rules:
- name: long-webapp-txn
  action: transaction
  threshold: 180
  type: transaction
  condition: application_name ~ 'webapp' and state <> 'idle'



Here is an example of pg_terminator in action:
2018-09-25 13:01:27 Sleeping for 20 seconds
2018-09-25 13:01:47 Sleeping for 20 seconds
2018-09-25 13:02:07 Sleeping for 15.0 seconds
2018-09-25 13:02:22 Terminated a query (pid: 2242) because of rule #1: long-txn
2018-09-25 13:02:22 Sleeping for 37.987898 seconds



Sunday, September 23, 2018

Neo4j LOAD CSV performance testing

Hello everyone!

Since I got my hands on the Neo4j graph database, I am going to delve into its performance. I'll start with benchmarking basic operations, including data ingestion.

So, let's copy some tabular data into Neo4j. The first step is obviously reading and parsing the input files.

In case you don't know, the standard command for reading flat files is LOAD CSV. For example, to load data from a tab-separated file and print only the 1st and 7th fields, you run:

LOAD CSV FROM 'file:///input.txt' AS line
FIELDTERMINATOR '\t'
RETURN line[0], line[6];

I decided to test this basic load operation in several variants, checking for factors that *could* have an impact on load performance:
  • Whether the input file is compressed or not
  • Whether the WITH HEADERS clause is used (each row is returned as a map / dictionary)
  • Whether the load is followed by a filter (WHERE clause)
  • Whether the load is followed by a simple MATCH rule (which actually does not match any existing paths)

Here are the raw results:

Gzipped Headers Filter Match Runtime#1 Runtime#2 Runtime#3 Runtime#avg CypherCommand
False False False False 13,99 14,70 14,13 14,28 LOAD CSV FROM 'file:///locations.csv' AS csv RETURN count(*)
False False False True 14,26 14,57 14,60 14,48 LOAD CSV FROM 'file:///locations.csv' AS csv MATCH (x:X {x: csv[5]}) RETURN count(*)
False False True False 16,59 16,92 17,06 16,86 LOAD CSV FROM 'file:///locations.csv' AS csv WITH csv WHERE csv[5] IS NOT NULL RETURN count(*)
False False True True 17,81 16,85 18,46 17,71 LOAD CSV FROM 'file:///locations.csv' AS csv WITH csv WHERE csv[5] IS NOT NULL MATCH (x:X {x: csv[5]}) RETURN count(*)
False True False False 22,22 20,53 20,74 21,16 LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS csv RETURN count(*)
False True False True 20,62 21,08 21,30 21,00 LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS csv MATCH (x:X {x: csv.city}) RETURN count(*)
False True True False 22,21 22,25 22,69 22,38 LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS csv WITH csv WHERE csv.city IS NOT NULL RETURN count(*)
False True True True 22,92 22,89 23,27 23,02 LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS csv WITH csv WHERE csv.city IS NOT NULL MATCH (x:X {x: csv.city}) RETURN count(*)
True False False False 13,85 14,17 14,74 14,25 LOAD CSV FROM 'file:///locations.csv.gz' AS csv RETURN count(*)
True False False True 14,06 14,51 14,87 14,48 LOAD CSV FROM 'file:///locations.csv.gz' AS csv MATCH (x:X {x: csv[5]}) RETURN count(*)
True False True False 16,62 16,66 16,97 16,75 LOAD CSV FROM 'file:///locations.csv.gz' AS csv WITH csv WHERE csv[5] IS NOT NULL RETURN count(*)
True False True True 17,08 16,90 17,41 17,13 LOAD CSV FROM 'file:///locations.csv.gz' AS csv WITH csv WHERE csv[5] IS NOT NULL MATCH (x:X {x: csv[5]}) RETURN count(*)
True True False False 20,31 20,07 20,76 20,38 LOAD CSV WITH HEADERS FROM 'file:///locations.csv.gz' AS csv RETURN count(*)
True True False True 20,78 20,40 21,52 20,90 LOAD CSV WITH HEADERS FROM 'file:///locations.csv.gz' AS csv MATCH (x:X {x: csv.city}) RETURN count(*)
True True True False 22,03 22,03 22,79 22,28 LOAD CSV WITH HEADERS FROM 'file:///locations.csv.gz' AS csv WITH csv WHERE csv.city IS NOT NULL RETURN count(*)
True True True True 22,55 22,65 23,34 22,85 LOAD CSV WITH HEADERS FROM 'file:///locations.csv.gz' AS csv WITH csv WHERE csv.city IS NOT NULL MATCH (x:X {x: csv.city}) RETURN count(*)




Conclusions?

  1. Gzipped input does not hurt LOAD performance. Apparently, gzip decompression is so fast that it does not affect anything; gzipped input even seems marginally faster.
  2. The WITH HEADERS option adds significant overhead (40-50%) to the LOAD operation.
  3. The WHERE clause adds a small overhead, but not as big as WITH HEADERS.
  4. The overhead of the negative MATCH after LOAD CSV is negligible.

Executive summary:
  • Gzip adds 0% to the cost
  • Negative MATCH adds 1.40% to the cost
  • Simple WHERE adds 18% - 20% to the cost
  • WITH HEADERS option adds 43% - 48% to the cost.


Test conditions:

- default configuration for neo4j
- OS: Ubuntu 18.04
- Linux kernel 4.15
- Oracle Java 8
- Dell E6440 laptop
- Intel i5-4200M CPU
- 4 GB RAM: SODIMM DDR3 Synchronous 1600 MHz (0.6 ns)
- SSD: PLEXTOR PX-128M5

 [23:51:25][filip@lap2:~]$ uname -a
Linux lap2 4.15.0-34-generic #37-Ubuntu SMP Mon Aug 27 15:21:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[23:51:38][filip@lap2:~]$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
[23:51:42][filip@lap2:~]$ LC_ALL=C lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz
Stepping:            3
CPU MHz:             1313.826
CPU max MHz:         3100.0000
CPU min MHz:         800.0000
BogoMIPS:            4988.50
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            3072K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d