11 changes: 11 additions & 0 deletions cpp/src/arrow/ipc/message.cc
@@ -565,6 +565,17 @@ Status DecodeMessage(MessageDecoder* decoder, io::InputStream* file) {
     auto metadata_length = decoder->next_required_size();
     ARROW_ASSIGN_OR_RAISE(auto metadata, file->Read(metadata_length));
     if (metadata->size() != metadata_length) {
+      // The first sizeof(int32_t) bytes of the Arrow file magic ("ARRO") may have been
+      // misread as metadata_length. Check if the remaining bytes complete the magic.
Member

How likely is it to have a metadata message exactly 1330795073 bytes long ("ARRO" decoded as a 32-bit little-endian integer)? Perhaps we can check up front instead of trying to read so much data?

cc @lidavidm @paleolimbot for opinions.
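
For reference, the figure quoted above can be reproduced with a few lines of C++. This is only a sketch: `DecodeLittleEndian32` is a hypothetical helper written for this illustration, not an Arrow function. It decodes with shifts so the result does not depend on the host byte order.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of Arrow): decode the first four bytes of
// `bytes` as a little-endian unsigned 32-bit integer. Shifts are used instead
// of a memcpy into an integer so the result is the same on any host.
// Precondition: bytes.size() >= 4.
constexpr std::uint32_t DecodeLittleEndian32(std::string_view bytes) {
  return static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[0])) |
         (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[1])) << 8) |
         (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[2])) << 16) |
         (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[3])) << 24);
}

// "ARRO" is the first sizeof(int32_t) bytes of the file magic "ARROW1".
// 'A' = 0x41, 'R' = 0x52, 'R' = 0x52, 'O' = 0x4F, so the value is 0x4F525241.
static_assert(DecodeLittleEndian32("ARRO") == 0x4F525241u);
static_assert(DecodeLittleEndian32("ARRO") == 1330795073u);
```

The value is a little over 1.2 GiB, which is why a stream reader pointed at an IPC file would try to read that much metadata.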

Member

Yeah, a gigabyte of metadata raises flags anyways

Member

In nanoarrow we check the first few bytes for the magic string and skip them (then attempt to read the rest of the input as an IPC stream). We've never run into a complaint about this not working but I'm not sure how widespread the usage is (we could add an option to turn it off or improve the error that occurs if we run into one). I think 1330795073 bytes of metadata would never reasonably occur on purpose.
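
The skip-the-magic approach described above could look roughly like the following. This is a minimal sketch and not nanoarrow's actual code: `kFileMagic`, `kPaddedMagicSize`, and `SkipFileMagic` are names invented for this illustration. It assumes the IPC file layout in which the leading "ARROW1" magic is padded to an 8-byte boundary before the stream-format payload begins.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names for this sketch; they are not Arrow or nanoarrow APIs.
constexpr std::string_view kFileMagic = "ARROW1";
// Assumption: the leading magic is padded to an 8-byte boundary in IPC files.
constexpr std::size_t kPaddedMagicSize = 8;

// If `input` starts with the file magic, return the view that follows the
// padded magic so it can be handed to a stream decoder. Otherwise return
// `input` unchanged.
inline std::string_view SkipFileMagic(std::string_view input) {
  if (input.size() >= kPaddedMagicSize &&
      input.substr(0, kFileMagic.size()) == kFileMagic) {
    return input.substr(kPaddedMagicSize);
  }
  return input;
}
```

Under that assumption, the bytes after the padded magic are an ordinary IPC stream, so a stream reader can consume them until the end-of-stream marker and never needs to look at the file footer.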

Author

I moved the check so that it happens first, which avoids the large read when possible.

+      const auto remaining_magic = internal::kArrowMagicBytes.substr(sizeof(int32_t));
+      if (metadata->size() >= static_cast<int64_t>(remaining_magic.size()) &&
+          std::string_view(reinterpret_cast<const char*>(metadata->data()),
+                           remaining_magic.size()) == remaining_magic) {
Member

This would only work on a little-endian machine. We need to be endianness-agnostic (see ConsumeInitialData).
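
One possible way to make such a check independent of host byte order, and cheap enough to run before any large read, is to compare bytes rather than integers. The sketch below is an illustration only and is not taken from `ConsumeInitialData`: `kMagic` and `LengthLooksLikeFileMagic` are hypothetical names. It assumes the length prefix has already been converted from its little-endian wire encoding to a host-order `int32_t`.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical constant for this sketch; Arrow's own constant is not used here.
constexpr std::string_view kMagic = "ARROW1";

// Returns true if `length`, re-encoded as four little-endian bytes, spells
// the first four bytes of the file magic ("ARRO"). Shifts operate on the
// integer's value, so the result is the same on little- and big-endian hosts.
// Assumption: `length` was decoded from a little-endian wire prefix.
inline bool LengthLooksLikeFileMagic(std::int32_t length) {
  const auto value = static_cast<std::uint32_t>(length);
  const char encoded[4] = {static_cast<char>(value & 0xFF),
                           static_cast<char>((value >> 8) & 0xFF),
                           static_cast<char>((value >> 16) & 0xFF),
                           static_cast<char>((value >> 24) & 0xFF)};
  return std::string_view(encoded, sizeof(encoded)) ==
         kMagic.substr(0, sizeof(encoded));
}
```

A caller could test the decoded length with this predicate and return the "this appears to be an Arrow IPC file" error immediately, without first attempting to read the metadata.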

+        return Status::Invalid("Expected to read ", metadata_length,
+                               " metadata bytes, but only read ", metadata->size(),
+                               ". This appears to be an Arrow IPC file. "
+                               "Try the IPC file reader instead of the IPC stream reader.");
+      }
       return Status::Invalid("Expected to read ", metadata_length, " metadata bytes, but ",
                              "only read ", metadata->size());
     }
4 changes: 3 additions & 1 deletion cpp/src/arrow/ipc/reader.cc
@@ -1890,7 +1890,9 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
           const auto magic_start = buffer->data() + sizeof(int32_t);
           if (std::string_view(reinterpret_cast<const char*>(magic_start), kMagicSize) !=
               kArrowMagicBytes) {
-            return Status::Invalid("Not an Arrow file");
+            return Status::Invalid(
+                "Not an Arrow file. If this is an Arrow IPC streaming format file, use "
+                "the IPC stream reader instead.");
           }

           int32_t footer_length = bit_util::FromLittleEndian(