GH-33823: [C++][IPC] Improve error messages when opening files that are the wrong format #49771
Changes from 5 commits
abd2b9c
db7dd63
d025ab0
1d9b536
407c134
d01cb5d
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -565,6 +565,17 @@ Status DecodeMessage(MessageDecoder* decoder, io::InputStream* file) { | |
| auto metadata_length = decoder->next_required_size(); | ||
| ARROW_ASSIGN_OR_RAISE(auto metadata, file->Read(metadata_length)); | ||
| if (metadata->size() != metadata_length) { | ||
| // The first sizeof(int32_t) bytes of the Arrow file magic ("ARRO") may have been | ||
| // misread as metadata_length. Check if the remaining bytes complete the magic. | ||
| const auto remaining_magic = internal::kArrowMagicBytes.substr(sizeof(int32_t)); | ||
| if (metadata->size() >= static_cast<int64_t>(remaining_magic.size()) && | ||
| std::string_view(reinterpret_cast<const char*>(metadata->data()), | ||
| remaining_magic.size()) == remaining_magic) { | ||
|
Member
This would only work on a little-endian machine. We need to be endianness-agnostic (see
||
| return Status::Invalid("Expected to read ", metadata_length, | ||
| " metadata bytes, but only read ", metadata->size(), | ||
| ". This appears to be an Arrow IPC file. " | ||
| "Try the IPC file reader instead of the IPC stream reader."); | ||
| } | ||
| return Status::Invalid("Expected to read ", metadata_length, " metadata bytes, but ", | ||
| "only read ", metadata->size()); | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2265,6 +2265,52 @@ TEST(TestRecordBatchStreamReader, MalformedInput) { | |
| ASSERT_RAISES(Invalid, RecordBatchStreamReader::Open(&garbage_reader)); | ||
| } | ||
|
|
||
| TEST(TestRecordBatchStreamReader, OpenFileFormatSuggestsFileReader) { | ||
| std::shared_ptr<RecordBatch> batch; | ||
| ASSERT_OK(MakeIntRecordBatch(&batch)); | ||
|
|
||
| FileWriterHelper helper; | ||
| ASSERT_OK(helper.Init(batch->schema(), IpcWriteOptions::Defaults())); | ||
| ASSERT_OK(helper.WriteBatch(batch)); | ||
| ASSERT_OK(helper.Finish()); | ||
|
|
||
| io::BufferReader reader(helper.buffer_); | ||
| // Check we mention using the file_reader when we detect file format | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr("file reader"), | ||
|
Member
Can we be a bit more specific and match a longer substring? For example "Try the IPC file reader".
Author
Addressed.
||
| RecordBatchStreamReader::Open(&reader)); | ||
| } | ||
|
|
||
| TEST(TestRecordBatchStreamReader, CorruptDataDoesNotSuggestFileReader) { | ||
| // Continuation marker + metadata_length = 100, then 8 bytes of non-magic data. | ||
| const std::string corrupt( | ||
| "\xff\xff\xff\xff" | ||
| "\x64\x00\x00\x00" | ||
| "ABABABAB", | ||
| 16); | ||
| auto buffer = std::make_shared<Buffer>(corrupt); | ||
| io::BufferReader reader(buffer); | ||
| // Validate that we don't suggest file reader when file is just corrupt | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT( | ||
| Invalid, ::testing::Not(::testing::HasSubstr("file reader")), | ||
| RecordBatchStreamReader::Open(&reader)); | ||
| } | ||
|
|
||
| TEST(TestRecordBatchFileReader, OpenStreamFormatSuggestsStreamReader) { | ||
| std::shared_ptr<RecordBatch> batch; | ||
| ASSERT_OK(MakeIntRecordBatch(&batch)); | ||
|
|
||
| StreamWriterHelper helper; | ||
| ASSERT_OK(helper.Init(batch->schema(), IpcWriteOptions::Defaults())); | ||
| ASSERT_OK(helper.WriteBatch(batch)); | ||
| ASSERT_OK(helper.Finish()); | ||
|
|
||
| auto buf_reader = std::make_shared<io::BufferReader>(helper.buffer_); | ||
| // Check we mention using the stream_reader when we detect stream format | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT( | ||
| Invalid, ::testing::HasSubstr("stream reader"), | ||
|
Member
Same here.
Author
Ditto.
||
| RecordBatchFileReader::Open(buf_reader.get(), helper.buffer_->size())); | ||
| } | ||
|
|
||
| class EndlessCollectListener : public CollectListener { | ||
| public: | ||
| EndlessCollectListener() : CollectListener(), decoder_(nullptr) {} | ||
|
|
||
How likely is it to have a metadata message exactly 1330795073 bytes long ("ARRO" decoded as a 32-bit little-endian integer)? Perhaps we can check up front instead of trying to read so much data?
cc @lidavidm @paleolimbot for opinions.
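The number quoted above can be checked by assembling the four bytes `A`, `R`, `R`, `O` in little-endian order. The following is only an illustrative sketch: `LoadLittleEndian32` is a hypothetical helper, not an Arrow function. Because it builds the value with shifts rather than reinterpreting memory, it gives the same answer on any host.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of Arrow): interpret the first four bytes of
// `bytes` as a little-endian unsigned 32-bit integer. Shifting each byte into
// place makes the result independent of the host's endianness.
constexpr uint32_t LoadLittleEndian32(std::string_view bytes) {
  return static_cast<uint32_t>(static_cast<unsigned char>(bytes[0])) |
         static_cast<uint32_t>(static_cast<unsigned char>(bytes[1])) << 8 |
         static_cast<uint32_t>(static_cast<unsigned char>(bytes[2])) << 16 |
         static_cast<uint32_t>(static_cast<unsigned char>(bytes[3])) << 24;
}

// 'A' = 0x41, 'R' = 0x52, 'R' = 0x52, 'O' = 0x4F, so the value is 0x4F525241.
static_assert(LoadLittleEndian32("ARRO") == 1330795073u,
              "\"ARRO\" read as little-endian int32");
```

A big-endian read of the same bytes would give 0x4152524F instead, which is why a check phrased in terms of the decoded integer depends on byte order.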
Yeah, a gigabyte of metadata raises flags anyways
In nanoarrow we check the first few bytes for the magic string and skip them (then attempt to read the rest of the input as an IPC stream). We've never run into a complaint about this not working but I'm not sure how widespread the usage is (we could add an option to turn it off or improve the error that occurs if we run into one). I think 1330795073 bytes of metadata would never reasonably occur on purpose.
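The "check for the magic and skip it" approach described above might look roughly like the sketch below. This is not nanoarrow's actual code: `StreamStartOffset` and the constants are hypothetical names, and the sketch assumes the IPC file layout in which the `ARROW1` magic is padded to an 8-byte boundary before the embedded stream begins.

```cpp
#include <cstddef>
#include <string_view>

// Assumed constants: the file magic and its size once padded to 8 bytes.
constexpr std::string_view kFileMagic = "ARROW1";
constexpr std::size_t kPaddedMagicSize = 8;

// Hypothetical helper: returns how many leading bytes to skip before trying to
// decode `data` as an IPC stream. Returns 0 when the input does not start with
// the file magic, so plain streams are left untouched.
constexpr std::size_t StreamStartOffset(std::string_view data) {
  if (data.size() >= kPaddedMagicSize &&
      data.substr(0, kFileMagic.size()) == kFileMagic) {
    return kPaddedMagicSize;
  }
  return 0;
}
```

A caller would advance the input by the returned offset and then hand the remainder to the stream decoder. Whether to skip silently or to report an error, as this PR does, is the design choice being discussed here.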
I moved the check to happen first, and avoid the large read if possible
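One way to check up front without depending on byte order is to compare the raw bytes of the length prefix against the start of the magic before decoding them as an integer. The sketch below only illustrates that idea and is not necessarily what the updated PR does; `LengthPrefixLooksLikeFileMagic` is a hypothetical name, and `kArrowMagic` is assumed to hold the same `ARROW1` bytes as `internal::kArrowMagicBytes`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Assumption: same bytes as internal::kArrowMagicBytes in the diff.
constexpr std::string_view kArrowMagic = "ARROW1";

// Hypothetical helper: true when the bytes that the stream decoder would read
// as an int32 length are actually the first four bytes of the file magic.
// Comparing bytes rather than the decoded integer works on any endianness,
// and it happens before any attempt to read ~1.3 GB of "metadata".
constexpr bool LengthPrefixLooksLikeFileMagic(std::string_view prefix) {
  return prefix.size() >= sizeof(int32_t) &&
         prefix.substr(0, sizeof(int32_t)) ==
             kArrowMagic.substr(0, sizeof(int32_t));
}
```

With a check like this, the stream reader could return the "Try the IPC file reader" error as soon as the prefix has been read, and the corrupt-data case (for example a length prefix of `\x64\x00\x00\x00`) would fall through to the generic message.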