Optimisations possible due to float representation rather than WKT #1

@martindurant

Taking the Nova Scotia data as an example, I have done a little digging for fun. My processing here is with awkward, whose API I find very convenient; it uses arrow internally for parquet IO.

File sizes:
336M ns-water-water_line-interleaved.parquet
352M ns-water-water_line-wkb.parquet
336M ns-water-water_line.parquet
240M ns-water-water_line_new.parq
185M ns-water-water_line_new40.parq

  • "new" is created from ns-water-water_line with no changes, except to turn on byte-stream-splitting. This groups the bytes of the doubles by position: all the high bytes together, then the next bytes, and so on down to the low bytes. This makes the blocks much more uniform and leads to better compression (still using Zstd with default options)
  • "new40" also applies bitround with 40 bits of mantissa. This is in principle lossy, but for this dataset preserves the full resolution. I did not diagnose how low this number could go
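The two transformations above can be illustrated with a minimal, standard-library-only sketch. This is not the parquet writer's implementation, and the issue does not say which bitround implementation was used; the function names (`byte_stream_split`, `byte_stream_join`, `bitround`) and the sample coordinates are hypothetical, chosen only to show the idea.

```python
import struct
import zlib


def byte_stream_split(values):
    """Group byte k of every little-endian float64 together, for k = 0..7."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return b"".join(raw[k::8] for k in range(8))


def byte_stream_join(data):
    """Inverse of byte_stream_split: re-interleave the 8 byte streams."""
    n = len(data) // 8
    raw = bytearray(len(data))
    for k in range(8):
        raw[k::8] = data[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}d", bytes(raw)))


def bitround(value, keepbits):
    """Round a finite float64 to `keepbits` of its 52 mantissa bits.

    Adds half of the dropped range to the raw bit pattern, then zeroes the
    dropped bits. NaN is not handled in this sketch.
    """
    drop = 52 - keepbits
    if drop <= 0:
        return value
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    half = 1 << (drop - 1)
    mask = ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    bits = (bits + half) & mask
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


# Made-up, slowly varying longitudes standing in for real coordinates.
coords = [-63.5 + i * 1e-5 for i in range(1000)]
plain = struct.pack(f"<{len(coords)}d", *coords)
split = byte_stream_split(coords)
rounded_split = byte_stream_split([bitround(c, 40) for c in coords])

# Compare compressed sizes (zlib here only because it is in the stdlib;
# the files in this issue use Zstd).
print(len(zlib.compress(plain)),
      len(zlib.compress(split)),
      len(zlib.compress(rounded_split)))
```

The split is lossless (`byte_stream_join` recovers the input exactly), whereas `bitround` zeroes the trailing mantissa bits, so the low-order byte streams become highly repetitive and compress further.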

Time to read, with local SSD

%timeit arr = ak.from_parquet("ns-water-water_line-wkb.parquet")
617 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this does not include parsing of the text

%timeit arr = ak.from_parquet("ns-water-water_line-interleaved.parquet")
1.02 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit arr = ak.from_parquet("ns-water-water_line.parquet")
871 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit arr = ak.from_parquet("ns-water-water_line_new.parq")
566 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit arr = ak.from_parquet("ns-water-water_line_new40.parq")
580 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For remote storage, the difference in file sizes would have mattered more, since the time to fetch the bytes grows with file size.
