18 changes: 18 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -958,6 +958,24 @@ union ColumnCryptoMetaData {
struct ColumnChunk {
/** File where column data is stored. If not set, assumed to be same file as
* metadata. This path is relative to the current file.
*
* As of December 2025, the only known use-case for this field is writing summary
* parquet files (i.e. "_metadata" files). These files consolidate footers from
Contributor

Do we have a link that describes what a summary file is and which implementations support it?

This is what came back from a quick Google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file

But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code

Contributor Author

As far as I can tell, this was never officially part of the Parquet specification.

Contributor Author

I reworded this section.

* multiple parquet files to allow for efficient reading of footers to avoid file
* listing costs and prune out files that do not need to be read based on statistics.
* This is a legacy feature as modern table formats (e.g. Iceberg, Hudi and Delta Lake)
* are more scalable and serve effectively the same purpose.
Contributor

It seems to me that calling this "legacy" may be too opinionated. Maybe we could tone down the language with something like:

Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.

Contributor Author

This is an attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d

It seems like there are potential correctness issues.

Contributor Author

Added this to the text.

*
* There is no other known use-case for this field. Specifically, there are no known
* readers that will read externally stored column data if this field is populated
* within a standard parquet file.
*
* Writers should not populate this field except for in parquet summary files. Readers
* should ensure this field is empty.

These statements are effectively removing it from the spec, which I feel is too strong of a position for this clarification. I think it's fair to say that "readers should validate this field is empty if they do not support reading external column data", but prohibiting it is not a clarification.

Contributor Author

I just removed this text entirely. I think this is mostly covered by requiring any new use to go through a formal proposal.

Contributor Author

Not sure whether this also referred to the text below. I updated some language in the paragraph above this sentence to clarify that using this field for reading external columns is not considered part of the specification (which is reflected by the current implementation status).

*
* Any new use of this field must go through the normal Parquet feature
* addition process.
*
**/
1: optional string file_path
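
For illustration only, the reader-side check raised in the review thread (a reader that does not support external column data validating that `file_path` is unset) might look like the sketch below. The `ColumnChunk` and `RowGroup` classes are hypothetical stand-ins for the Thrift-generated structs, the function and exception names are invented for this example, and treating an empty string the same as an unset field is an assumption rather than something the specification states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for the Thrift struct; only the field under
    # discussion is modeled here.
    file_path: Optional[str] = None


@dataclass
class RowGroup:
    # Hypothetical stand-in holding the column chunks of one row group.
    columns: List[ColumnChunk]


class ExternalColumnDataError(ValueError):
    """Raised when a column chunk points at data outside the current file."""


def check_no_external_column_data(row_groups: List[RowGroup]) -> None:
    """Reject footers whose column chunks reference another file.

    Intended for a reader that does not support external column data.
    """
    for rg_index, row_group in enumerate(row_groups):
        for col_index, chunk in enumerate(row_group.columns):
            # Assumption: an empty string is treated like an unset field.
            if chunk.file_path:
                raise ExternalColumnDataError(
                    f"row group {rg_index}, column {col_index}: "
                    f"file_path is set to {chunk.file_path!r}, but this "
                    "reader does not support external column data"
                )
```

A reader would call `check_no_external_column_data` once after decoding the footer and before reading any pages, so that an unsupported file fails early with a clear message instead of being decoded from the wrong byte offsets.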
