-
Notifications
You must be signed in to change notification settings - Fork 45
Rephrase external column data. #145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rephrase external column data. #145
Conversation
| api-external-column-data: | ||
| status: full | ||
| api-parquet-summary-file: | ||
| status: none |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@platypii I looked at hyparquet documentation quickly and didn't see this mentioned but please let me know if I missed it.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @emkornfield
Assuming apache/parquet-format#542 is correct I think this is a good change too.
I rendered it locally and it looks good to me
| note: "In parquet.thrift: ColumnChunk->file_path" | ||
| - id: api-parquet-summary-file | ||
| display_name: Parquet summary file | ||
| note: "Files named _metadata that consolidate footers from multiple parquet files." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reference to the file_path field will help keep this specific. So perhaps something like
| note: "Files named _metadata that consolidate footers from multiple parquet files." | |
| note: "Files named _metadata that consolidate footers from multiple parquet files, stored in ColumnChunk->file_path" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "parquet summary file" really the same as supporting file_path?
Maybe summary files are the only known use case of actually using file_path? But nothing in the spec says file_path is necessarily associated with "summary files". In fact is there anything in the spec about summary file or is that just an implementation choice some people made?
I'm not opposed to having a row for support of summary files, but that seems like a different question than file_path support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd err on not including that a reference to file_path here.
In fact is there anything in the spec about summary file or is that just an implementation choice some people made?
There is not apache/parquet-format#542 has some more details.
|
going to merge. Thanks for the reviews. |
Based on support matrix I believe this was meant to capture Parquet summary files ("_metadata") and not reading external data.
Related format PR: apache/parquet-format#542