Taking the Nova Scotia data as an example, I have done a little digging for fun. My processing here is with awkward, whose API I find very convenient, though it uses Arrow internally for Parquet IO anyway.
File sizes:

```
336M ns-water-water_line-interleaved.parquet
352M ns-water-water_line-wkb.parquet
336M ns-water-water_line.parquet
240M ns-water-water_line_new.parq
185M ns-water-water_line_new40.parq
```
- "new" is created from ns-water-water_line with no changes except turning on byte-stream-split encoding. This stores all the first bytes of the doubles together, then all the second bytes, and so on. The resulting byte streams are much more uniform, which leads to better compression (still using Zstd with default options).
- "new40" additionally applies bitrounding, keeping 40 bits of mantissa. This is lossy in principle, but for this dataset it preserves the full resolution. I did not investigate how low this number could go.
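To illustrate why byte-stream-split helps, here is a minimal pure-Python sketch (not pyarrow's actual implementation; the slowly varying synthetic coordinates are my assumption standing in for real geometry). It regroups the bytes of a run of doubles by byte position and compares compressed sizes:

```python
import struct
import zlib

# Synthetic, slowly varying coordinates (an assumption, not the real data).
values = [45.0 + i * 1e-6 for i in range(10_000)]
raw = b"".join(struct.pack("<d", v) for v in values)

# Byte-stream-split: collect byte 0 of every double, then byte 1, and so on.
# The near-constant exponent and high-mantissa bytes end up in long uniform
# runs, which a general-purpose compressor handles very well.
split = b"".join(raw[pos::8] for pos in range(8))

plain_size = len(zlib.compress(raw, 6))
split_size = len(zlib.compress(split, 6))
print(plain_size, split_size)  # the split layout compresses noticeably better
```

The transform is lossless and size-preserving; only the byte order changes, so the reader just inverts the regrouping.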
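A minimal sketch of the bitround idea (my own illustration; real codecs such as numcodecs' BitRound differ in detail): keep only the top `keep_bits` of the 52-bit float64 mantissa, rounding to nearest, so the zeroed tail bits compress away almost for free.

```python
import struct

def bitround(x: float, keep_bits: int = 40) -> float:
    """Keep the top keep_bits of the 52-bit mantissa, rounding to nearest.

    Illustrative sketch only, for finite positive/negative normals; not the
    exact codec used in the experiment above.
    """
    drop = 52 - keep_bits
    if drop <= 0:
        return x
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Add half of the dropped range, then zero the dropped bits.
    bits = (bits + (1 << (drop - 1))) & ~((1 << drop) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

v = 45.123456789
print(v, bitround(v))  # the rounded value differs only far below 1e-10
```

With 40 kept bits the worst-case error at this magnitude is about 2^-36 of an exponent step, which is why full resolution can survive for coordinate-like data.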
Time to read, from local SSD:

```
%timeit arr = ak.from_parquet("ns-water-water_line-wkb.parquet")
617 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this does not include parsing of the text
%timeit arr = ak.from_parquet("ns-water-water_line-interleaved.parquet")
1.02 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line.parquet")
871 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line_new.parq")
566 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit arr = ak.from_parquet("ns-water-water_line_new40.parq")
580 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
For remote storage, the difference in file sizes would matter even more.
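Back-of-envelope, assuming a hypothetical 100 MB/s link (my assumption, not a measurement): transfer time alone would separate the files far more than the decode times above.

```python
# Hypothetical link speed; the sizes (MB) are the ones listed above.
throughput_mb_s = 100
sizes_mb = {"wkb": 352, "original": 336, "new": 240, "new40": 185}
transfer_s = {name: mb / throughput_mb_s for name, mb in sizes_mb.items()}
print(transfer_s)  # roughly 3.5 s for wkb vs 1.9 s for new40
```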