Taking the Nova Scotia data as an example, I have done a little digging for fun. My processing here is with awkward, whose API I find very convenient, though it uses Arrow internally for Parquet IO anyway.
File sizes:

```
336M ns-water-water_line-interleaved.parquet
352M ns-water-water_line-wkb.parquet
336M ns-water-water_line.parquet
240M ns-water-water_line_new.parq
185M ns-water-water_line_new40.parq
```
- "new" is created from ns-water-water_line with no changes except turning on byte-stream-split encoding. This stores all the first bytes of the doubles together, then all the second bytes, and so on. The resulting byte streams are much more uniform, which leads to better compression (still using Zstd with default options).
- "new40" additionally applies bitrounding, keeping 40 bits of mantissa. This is lossy in principle, but for this dataset it preserves the full resolution. I did not investigate how low this number could go.
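To illustrate why byte-stream-split helps, here is a minimal pure-Python sketch (not pyarrow's actual implementation; the slowly varying synthetic coordinates are my assumption standing in for real geometry). It regroups the bytes of a run of doubles by byte position and compares compressed sizes:

```python
import struct
import zlib

# Synthetic, slowly varying coordinates (an assumption, not the real data).
values = [45.0 + i * 1e-6 for i in range(10_000)]
raw = b"".join(struct.pack("<d", v) for v in values)

# Byte-stream-split: collect byte 0 of every double, then byte 1, and so on.
# The near-constant exponent and high-mantissa bytes end up in long uniform
# runs, which a general-purpose compressor handles very well.
split = b"".join(raw[pos::8] for pos in range(8))

plain_size = len(zlib.compress(raw, 6))
split_size = len(zlib.compress(split, 6))
print(plain_size, split_size)  # the split layout compresses noticeably better
```

The transform is lossless and size-preserving; only the byte order changes, so the reader just inverts the regrouping.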
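A minimal sketch of the bitround idea (my own illustration; real codecs such as numcodecs' BitRound differ in detail): keep only the top `keep_bits` of the 52-bit float64 mantissa, rounding to nearest, so the zeroed tail bits compress away almost for free.

```python
import struct

def bitround(x: float, keep_bits: int = 40) -> float:
    """Keep the top keep_bits of the 52-bit mantissa, rounding to nearest.

    Illustrative sketch only, for finite positive/negative normals; not the
    exact codec used in the experiment above.
    """
    drop = 52 - keep_bits
    if drop <= 0:
        return x
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Add half of the dropped range, then zero the dropped bits.
    bits = (bits + (1 << (drop - 1))) & ~((1 << drop) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

v = 45.123456789
print(v, bitround(v))  # the rounded value differs only far below 1e-10
```

With 40 kept bits the worst-case error at this magnitude is about 2^-36 of an exponent step, which is why full resolution can survive for coordinate-like data.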
Time to read, from local SSD:

```
%timeit arr = ak.from_parquet("ns-water-water_line-wkb.parquet")
617 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this does not include parsing of the text
%timeit arr = ak.from_parquet("ns-water-water_line-interleaved.parquet")
1.02 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line.parquet")
871 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line_new.parq")
566 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line_new40.parq")
580 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
For remote storage, the difference in file sizes would matter even more.
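Back-of-envelope, assuming a hypothetical 100 MB/s link (my assumption, not a measurement): transfer time alone would separate the files far more than the decode times above.

```python
# Hypothetical link speed; the sizes (MB) are the ones listed above.
throughput_mb_s = 100
sizes_mb = {"wkb": 352, "original": 336, "new": 240, "new40": 185}
transfer_s = {name: mb / throughput_mb_s for name, mb in sizes_mb.items()}
print(transfer_s)  # roughly 3.5 s for wkb vs 1.9 s for new40
```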