my row in your table: Which Flat-File Format is the Best for Your Data Processing in PostgreSQL (or anywhere)

Good day everyone.

If you are dealing with data, sooner or later you will deal with flat data files.

Now - there are some competing options for transfer format. Human world will prefer rich / hierarchical file formats, markup languages, Excel etc etc.

But the machines will happily consume raw arrays of bits.

Here I'm going to compare elementary processing performance for two most common text file formats used while pushing data in/out of PostgreSQL: the native COPY format and the CSV format.

The COPY format is more compact and machine-efficient, while the CSV is only a little more complex, but as we will see this makes a difference.

Let's go to the real use case.

USE CASE

We have a table with 230 k different email templates. This content is similar characteristics as html. xml, or text document data, about 1000 characters wide plus some extra attributes.

I will measure input/output performance, starting with...

EXPORT TIMING

testdb=# copy mailtemplates to '/dev/null';
COPY 231674
Time: 4183.323 ms (00:04.183)
testdb=# copy mailtemplates to '/dev/null';
COPY 231674
Time: 3326.553 ms (00:03.327)

testdb=# copy mailtemplates to '/dev/null' with (format csv, header on);
COPY 231674
Time: 3714.943 ms (00:03.715)
testdb=# copy mailtemplates to '/dev/null' with (format csv, header on);
COPY 231674
Time: 3707.985 ms (00:03.708)

OK so we got some numbers, and quick calculation shows, that

CSV EXPORT IS 11% SLOWER:

testdb=# SELECT 3707.985 / 3326.553;
?column?
--------------------
1.1146628356740446
(1 row)

Now, let's test the...

IMPORT TIMING

Table was truncated before each test.

testdb=# create unlogged table public.temptable (like mailtemplates including constraints including indexes);
CREATE TABLE

testdb=# copy temptable from '/tmp/cm';
COPY 231674
Time: 16834.553 ms (00:16.835)
testdb=# copy temptable from '/tmp/cm';
COPY 231674
Time: 15490.150 ms (00:15.490)
testdb=# copy temptable from '/tmp/cm.csv' with (format csv, header on);
COPY 231674
Time: 19058.464 ms (00:19.058)
testdb=# copy temptable from '/tmp/cm.csv' with (format csv, header on);
COPY 231674
Time: 18825.862 ms (00:18.826)

So,

CSV IMPORT IS 21 % SLOWER

(because even without indexes, the parser has ~ 20% more work to do)

testdb=# SELECT 18825.862 / 15490.150;
?column?
------------------------
1.21534407349186418466
(1 row)

Also, COPY has following advantages:

LINE COUNT FOR PG FORMAT IS EXACTLY EQUAL TO ROW COUNT

-bash-4.2$ wc -l /tmp/cm*
231674 /tmp/cm
6973810 /tmp/cm.csv

and...

FILE SIZE: CSV FORMAT IS 3 % BIGGER

-bash-4.2$ ls -lh /tmp/cm*
-rw-r--r-- 1 postgres postgres 828M Dec 15 12:33 /tmp/cm
-rw-r--r-- 1 postgres postgres 850M Dec 15 12:33 /tmp/cm.csv

That's all for now :-).
I think this was kind of obvious but 11 or 20% is definitely a big cost increase, so let's be ecological and do not waste energy when not needed.

my row in your table

piątek, 15 grudnia 2017

Which Flat-File Format is the Best for Your Data Processing in PostgreSQL (or anywhere)