GH-563: Make ColumnMetaData.path_in_schema optional #564
I hope to have a Java PoC available soon. |
|
Java PoC apache/parquet-java#3470 I've so far confirmed that parquet-cli cat from the Java PoC can read a file lacking |
|
I think it is a great idea -- though before merging this I think we should do a formal approval on the mailing list |
For sure! 👍 I just wanted to put up a concrete proposal to drive the discussion. Also, FWIW, I've started on an arrow-cpp PoC. We'll see how far I get 😅 |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
C++ PoC apache/arrow#49707 |
 * the schema, and redundantly storing it here can lead to unnecessary
 * bloat in the footer. Writers are encouraged to make the writing of
 * this field optional, but for maximal compatibility should default to
 * writing the field until at least Month 202X.
Based on "Forward incompatible features/changes should not be turned on by default until 2 years after the parquet-java implementation containing the feature is released." Let's maybe fill in the date as September 2028, assuming we get things merged by a September Java release?
I've gone ahead and put Sept 2028 in the text for now. We can update as needed later.
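For a rough sense of the footer bloat the proposed comment mentions, here is a minimal sketch that estimates how many bytes `path_in_schema` takes per row group. It assumes Thrift compact-protocol encoding (a one-byte field header, a one-byte list header for short lists, and a varint length before each UTF-8 string). The helper names are hypothetical and not part of any Parquet library, and the paths are the five leaf columns of the test file discussed in this thread.

```python
def varint_len(n: int) -> int:
    """Number of bytes an unsigned varint needs to encode n."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def path_field_bytes(path: list) -> int:
    """Approximate encoded size of one path_in_schema field (assumed layout)."""
    size = 1  # field header, short form
    # List header: one byte for fewer than 15 elements, else an extra varint.
    size += 1 if len(path) < 15 else 1 + varint_len(len(path))
    for part in path:
        encoded = part.encode("utf-8")
        size += varint_len(len(encoded)) + len(encoded)
    return size


# Leaf column paths of no_path_in_schema.zstd.parquet.
PATHS = [
    ["a", "key_value", "key"],
    ["a", "key_value", "value", "key_value", "key"],
    ["a", "key_value", "value", "key_value", "value"],
    ["b"],
    ["c"],
]

per_row_group = sum(path_field_bytes(p) for p in PATHS)
print(per_row_group)  # 96 bytes per row group under these assumptions
```

The cost repeats for every row group, so a file with many row groups and deeply nested columns pays it many times over; the real figure depends on the writer and the schema.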
# Which issue does this PR close?
none

# Rationale for this change
This is a proof of concept implementation for apache/parquet-format#563

# What changes are included in this PR?
Since version 57.0.0, this crate has been tolerant of a missing `path_in_schema`. This PR adds options to cease writing the field as well. The option defaults to continuing to write the field.

See related discussion on parquet mailing list: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt

# Are these changes tested?
Yes

# Are there any user-facing changes?
No, this only adds an optional behavior change that defaults to no change

# Related PRs
- apache/parquet-format#563
- apache/parquet-format#564
- apache/parquet-java#3470
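The behavior described above (a writer option that defaults to writing the field, plus a reader that tolerates its absence) might look roughly like the sketch below. This is not the arrow-rs API: `ColumnMeta`, `write_columns`, and `resolve_paths` are hypothetical names used only for illustration. The reader fallback relies on the format's requirement that column chunks in a row group appear in the same order as the leaf columns of the schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for ColumnMetaData, reduced to two fields."""
    type: str
    path_in_schema: Optional[List[str]] = None


def write_columns(leaf_paths, types, write_path_in_schema=True):
    """Writer side: the option defaults to writing the field."""
    return [
        ColumnMeta(t, list(p) if write_path_in_schema else None)
        for p, t in zip(leaf_paths, types)
    ]


def resolve_paths(columns, leaf_paths):
    """Reader side: fall back to schema order when the field is absent."""
    if len(columns) != len(leaf_paths):
        raise ValueError("column chunk count does not match schema leaves")
    return [
        c.path_in_schema if c.path_in_schema is not None else list(p)
        for c, p in zip(columns, leaf_paths)
    ]


leaves = [["a", "key_value", "key"], ["b"], ["c"]]
types = ["BINARY", "INT32", "DOUBLE"]

# Opting out of writing the field leaves the reader's view unchanged.
compact = write_columns(leaves, types, write_path_in_schema=False)
print(resolve_paths(compact, leaves) == leaves)  # True
```

Keeping the default at "write the field" means existing readers that require `path_in_schema` keep working unless a user opts in.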
|
@etseidl How do we want to move forward? |
|
FWIW we merged an option to not write path_in_schema (off by default) in the Rust implementation, which people can choose to use |
Don't we need to change the spec first? |
Not in my opinion -- see my rationale here: |
@emkornfield mentioned on the M/L that we should give a week for others to comment, but that was back in April (https://lists.apache.org/thread/900503q07v95vyh6fk3qfn7ynb4w6yn2). I think I need to loop back and make a file for parquet-testing, then test it with the 3 PoCs. Then I think we can bring this up for a vote. |
|
Submitted apache/parquet-testing#108
Verified the file is read by the arrow-cpp PoC
% python
>>> import pyarrow
>>> pyarrow.__version__
'24.0.0.dev298+g24f0f4c9a'
>>> from pyarrow import parquet as pq
>>> df = pq.read_table('src/parquet-testing/data/no_path_in_schema.zstd.parquet')
>>> df
pyarrow.Table
a: map<string, map<int32, bool ('value')> ('a')>
child 0, a: struct<key: string not null, value: map<int32, bool ('value')>> not null
child 0, key: string not null
child 1, value: map<int32, bool ('value')>
child 0, value: struct<key: int32 not null, value: bool not null> not null
child 0, key: int32 not null
child 1, value: bool not null
b: int32 not null
c: double not null
----
a: [[keys:["a"]values:[keys:[1,2]values:[true,false]],keys:["b"]values:[keys:[1]values:[true]],keys:["c"]values:[null],keys:["d"]values:[keys:[]values:[]],keys:["e"]values:[keys:[1]values:[true]],keys:["f"]values:[keys:[3,4,5]values:[true,false,true]]]]
b: [[1,1,1,1,1,1]]
c: [[1,1,1,1,1,1]]
Working on java...parquet-cli doesn't like non-string map keys:
% pqcli cat ~/src/parquet-testing/data/no_path_in_schema.zstd.parquet
Argument error: Map key type must be binary (UTF8): required int32 key |
|
I confirmed parquet-cli meta and pages work with the parquet-java PoC.
% parquet-cli pages no_path_in_schema.parquet
Column: a.key_value.key
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict Z _ 6 5.00 B 30 B
0-1 data Z R 6 4.33 B 26 B
Column: a.key_value.value.key_value.key
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict Z _ 5 4.00 B 20 B
0-1 data Z R 9 3.78 B 34 B
Column: a.key_value.value.key_value.value
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data Z _ 9 3.33 B 30 B
Column: b
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict Z _ 1 4.00 B 4 B
0-1 data Z R 6 1.83 B 11 B
Column: c
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict Z _ 1 8.00 B 8 B
0-1 data Z R 6 1.83 B 11 B
% parquet-cli meta no_path_in_schema.zstd.parquet
File path: no_path_in_schema.zstd.parquet
Created by: parquet-rs version 58.3.0
Properties:
ARROW:schema: /////wgCAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAMAAAB4AAAASAAAABQAAAAQABYAEAAAAA8ABAAAAAgAEAAAABgAAAAcAAAAAAAAAxgAAAAAAAYACAAGAAYAAAAAAAIAAAAAAAEAAABjAAAAxP7//xAAAAAYAAAAAAAAAhQAAAAU////IAAAAAAAAAEAAAAAAQAAAGIAAAC8////GAAAAAwAAAAAAAERSAEAAAEAAAAIAAAA5P7//xD///8cAAAADAAAAAAAAA0YAQAAAgAAAOgAAAAYAAAACP///xAAFAAQAA4ADwAEAAAACAAQAAAAGAAAAAwAAAAAAAERoAAAAAEAAAAIAAAAOP///2T///8cAAAADAAAAAAAAA1wAAAAAgAAADQAAAAIAAAAXP///4j///8UAAAADAAAAAAAAAYMAAAAAAAAAHj///8FAAAAdmFsdWUAAACw////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABrZXkACQAAAGtleV92YWx1ZQAAAAUAAAB2YWx1ZQAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAAAwAAAAAAAAFEAAAAAAAAAAEAAQABAAAAAMAAABrZXkACQAAAGtleV92YWx1ZQAAAAEAAABhAAAA
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"a","type":{"type":"map","keyType":"string","valueType":{"type":"map","keyType":"integer","valueType":"boolean","valueContainsNull":false},"valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"b","type":"integer","nullable":false,"metadata":{}},{"name":"c","type":"double","nullable":false,"metadata":{}}]}
Schema:
message arrow_schema {
optional group a (MAP) {
repeated group key_value {
required binary key (STRING);
optional group value (MAP) {
repeated group key_value {
required int32 key;
required boolean value;
}
}
}
}
required int32 b;
required double c;
}
Row group 0: count: 6 58.50 B records start: 4 total(compressed): 351 B total(uncompressed):270 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
a.key_value.key BINARY Z _ R 6 16.00 B
a.key_value.value.key_value.key INT32 Z _ R 9 10.44 B
a.key_value.value.key_value.value BOOLEAN Z _ 9 5.22 B
b INT32 Z _ R 6 9.17 B
c DOUBLE Z _ R 6 9.83 B |
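The column names in the row group listing above can be recovered from the schema alone, which is why the stored `path_in_schema` is redundant. A minimal sketch of that derivation follows, assuming a simplified schema representation of `(name, children)` pairs rather than real `SchemaElement` structs. `SCHEMA` and `leaf_paths` are illustrative names, and a group with no children is treated as a leaf here.

```python
# (name, children) pairs mirroring the schema printed by parquet-cli meta;
# an empty children list marks a leaf column.
SCHEMA = [
    ("a", [
        ("key_value", [
            ("key", []),
            ("value", [
                ("key_value", [("key", []), ("value", [])]),
            ]),
        ]),
    ]),
    ("b", []),
    ("c", []),
]


def leaf_paths(nodes, prefix=()):
    """Yield dotted leaf paths in depth-first schema order."""
    for name, children in nodes:
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield ".".join(path)


paths = list(leaf_paths(SCHEMA))
for p in paths:
    print(p)
# a.key_value.key
# a.key_value.value.key_value.key
# a.key_value.value.key_value.value
# b
# c
```

The order matches the column chunk order in the listing, so a reader can pair each chunk with its path by position.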

Rationale for this change
path_in_schema optional #563.
What changes are included in this PR?
Change path_in_schema to optional.
Do these changes have PoC implementations?
Yes.
path_in_schema optional arrow-rs#9678
path_in_schema optional parquet-java#3470
path_in_schema optional arrow#49707
Closes #563