Skip to content

Conversation

@emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Dec 19, 2025

Based on support matrix I believe this was meant to capture Parquet summary files ("_metadata") and not reading external data.

Related format PR: apache/parquet-format#542

api-external-column-data:
status: full
api-parquet-summary-file:
status: none
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@platypii I looked at hyparquet documentation quickly and didn't see this mentioned but please let me know if I missed it.

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield

Assuming apache/parquet-format#542 is correct I think this is a good change too.

I rendered it locally and it looks good to me

Image

note: "In parquet.thrift: ColumnChunk->file_path"
- id: api-parquet-summary-file
display_name: Parquet summary file
note: "Files named _metadata that consolidate footers from multiple parquet files."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reference to the file_path field will help keep this specific. So perhaps something like

Suggested change
note: "Files named _metadata that consolidate footers from multiple parquet files."
note: "Files named _metadata that consolidate footers from multiple parquet files, stored in ColumnChunk->file_path"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "parquet summary file" really the same as supporting file_path?

Maybe summary files are the only known use case of actually using file_path? But nothing in the spec says file_path is necessarily associated with "summary files". In fact is there anything in the spec about summary file or is that just an implementation choice some people made?

I'm not opposed to having a row for support of summary files, but that seems like a different question than file_path support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd err on not including that a reference to file_path here.

In fact is there anything in the spec about summary file or is that just an implementation choice some people made?

There is not apache/parquet-format#542 has some more details.

@emkornfield
Copy link
Contributor Author

going to merge. Thanks for the reviews.

@emkornfield emkornfield merged commit 7505bde into apache:production Jan 8, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants