Skip to content
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
87b438d
add json_to_variant
harshmotw-db Jun 25, 2025
a946ac6
comment fix
harshmotw-db Jun 25, 2025
bca3b81
Added another sample buffer managger
harshmotw-db Jun 25, 2025
339e880
minor refactoring
harshmotw-db Jun 25, 2025
fe798c3
Merge branch 'main' of https://github.com/harshmotw-db/arrow-rs into …
harshmotw-db Jun 25, 2025
882d3a7
Merge branch 'main' of https://github.com/harshmotw-db/arrow-rs into …
harshmotw-db Jun 25, 2025
3c18fdf
incorporated new changes
harshmotw-db Jun 25, 2025
67a83fe
minor changes
harshmotw-db Jun 26, 2025
c9aa519
Merge branch 'main' of https://github.com/harshmotw-db/arrow-rs into …
harshmotw-db Jun 26, 2025
dede88d
test fix based on recent commit
harshmotw-db Jun 26, 2025
fa3befc
constant fix
harshmotw-db Jun 26, 2025
38bac59
fix
harshmotw-db Jun 26, 2025
cd530ee
addressed comments
harshmotw-db Jun 26, 2025
57b3eb0
fix
harshmotw-db Jun 26, 2025
71b7d6f
fixed VariantBufferManager
harshmotw-db Jun 26, 2025
031c916
deduped a bit of code
harshmotw-db Jun 26, 2025
d4fc876
deduplicated code
harshmotw-db Jun 26, 2025
c41af4e
moved serde code out of variant.rs
harshmotw-db Jun 26, 2025
ecaf557
incorporated Ryan's latest comments
harshmotw-db Jun 26, 2025
4abc598
add more object tests
harshmotw-db Jun 26, 2025
0842ef8
fix
harshmotw-db Jun 26, 2025
94531af
doc fix
harshmotw-db Jun 26, 2025
28d0012
Merge branch 'main' into harsh-motwani_data/from_json
harshmotw-db Jun 27, 2025
e2788f5
Removed dependency on VariantBufferManager and leave that to later
harshmotw-db Jun 27, 2025
3178449
Merge branch 'harsh-motwani_data/from_json' of https://github.com/har…
harshmotw-db Jun 27, 2025
0455685
fix
harshmotw-db Jun 27, 2025
cc0b66e
fix
harshmotw-db Jun 30, 2025
d2a7516
resolved test comment
harshmotw-db Jun 30, 2025
a29b5c3
fixed more comments
harshmotw-db Jun 30, 2025
7f23cf5
fix decimal to string
harshmotw-db Jul 1, 2025
50f4b25
fmt and clippy
harshmotw-db Jul 1, 2025
560e430
fix
harshmotw-db Jul 1, 2025
3249d93
Merge remote-tracking branch 'apache/main' into harsh-motwani_data/fr…
alamb Jul 1, 2025
07d5688
clippy
alamb Jul 1, 2025
af937ac
Split tests into separate functions
alamb Jul 1, 2025
388f188
partially removed decimal dependency
harshmotw-db Jul 2, 2025
7407776
merge
harshmotw-db Jul 2, 2025
3b42d91
removed decimal type from variant json parsing
harshmotw-db Jul 2, 2025
43d6ea5
comment fix
harshmotw-db Jul 2, 2025
e9deda9
refined lifetimes
harshmotw-db Jul 3, 2025
eb11890
Fix clippy, remove unused dependency
alamb Jul 3, 2025
3531540
Merge remote-tracking branch 'apache/main' into harsh-motwani_data/fr…
alamb Jul 3, 2025
ea5b573
Update parquet-variant/src/from_json.rs
alamb Jul 3, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion parquet-variant/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,10 @@ rust-version = "1.83"
[dependencies]
arrow-schema = { workspace = true }
chrono = { workspace = true }
serde_json = "1.0"
# "arbitrary_precision" allows us to manually parse numbers. "preserve_order" does not automatically
# sort object keys
serde_json = { version = "1.0", features = ["arbitrary_precision", "preserve_order"] }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about the (undocumented) arbitrary_precision feature flag and (undocumented) Number::as_str method it unlocks 🤯

That could solve a lot of problems for variant decimals on the to_json path as well:
https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/to_json.rs#L147-L162

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question tho -- do we need to support both modes of serde_json, given that the user (not arrow-rs) probably decides whether to use that feature flag?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible some projects might prioritize performance and might not want the Decimal type at all. I would prefer doing that as a follow up though. If you agree, I'll go ahead and create an issue.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, for sure it's a follow-up. But we probably do need a story for how to handle that feature flag in a dependency (maybe part of the bigger story of how to handle a serde_json dependency in the first place).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rust_decimal = "1.37.2"
base64 = "0.21"

[lib]
71 changes: 71 additions & 0 deletions parquet-variant/examples/variant_from_json_examples.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! Example showing how to convert Variant values to JSON

use parquet_variant::{
json_to_variant, variant_to_json, variant_to_json_string, variant_to_json_value,
SampleBoxBasedVariantBufferManager, SampleVecBasedVariantBufferManager, VariantBufferManager,
};

fn from_json_example<T: VariantBufferManager>(
variant_buffer_manager: &mut T,
) -> Result<(), Box<dyn std::error::Error>> {
let person_string = "{\"name\":\"Alice\", \"age\":30, ".to_string()
+ "\"email\":\"alice@example.com\", \"is_active\": true, \"score\": 95.7,"
+ "\"additional_info\": null}";
let (metadata_size, value_size) = json_to_variant(&person_string, variant_buffer_manager)?;

let variant = parquet_variant::Variant::try_new(
&variant_buffer_manager.get_immutable_metadata_buffer()[..metadata_size],
&variant_buffer_manager.get_immutable_value_buffer()[..value_size],
)?;

let json_string = variant_to_json_string(&variant)?;
let json_value = variant_to_json_value(&variant)?;
let pretty_json = serde_json::to_string_pretty(&json_value)?;
println!("{}", pretty_json);

let mut buffer = Vec::new();
variant_to_json(&mut buffer, &variant)?;
let buffer_result = String::from_utf8(buffer)?;

// Verify all methods produce the same result
assert_eq!(json_string, buffer_result);
assert_eq!(json_string, serde_json::to_string(&json_value)?);

Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
// The caller must provide an object implementing the `VariantBufferManager` trait to the library.
// This allows the library to write the constructed variant to buffers provided by the caller.
// This way, the caller has direct control over the output buffers.
let mut box_based_buffer_manager = SampleBoxBasedVariantBufferManager {
value_buffer: vec![0u8; 1].into_boxed_slice(),
metadata_buffer: vec![0u8; 1].into_boxed_slice(),
};

let mut vec_based_buffer_manager = SampleVecBasedVariantBufferManager {
value_buffer: vec![0u8; 1],
metadata_buffer: vec![0u8; 1],
};

from_json_example(&mut box_based_buffer_manager)?;
from_json_example(&mut vec_based_buffer_manager)?;
Ok(())
}
5 changes: 5 additions & 0 deletions parquet-variant/src/decoder.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,11 @@ use std::num::TryFromIntError;
// Makes the code a bit more readable
pub(crate) const VARIANT_VALUE_HEADER_BYTES: usize = 1;

pub(crate) const MAX_UNSCALED_DECIMAL_4: i32 = 999999999;
pub(crate) const MAX_PRECISION_DECIMAL_4: u8 = 9;
pub(crate) const MAX_UNSCALED_DECIMAL_8: i64 = 999999999999999999i64;
pub(crate) const MAX_PRECISION_DECIMAL_8: u8 = 18;
Comment thread
harshmotw-db marked this conversation as resolved.
Outdated

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VariantBasicType {
Primitive = 0,
Expand Down
102 changes: 102 additions & 0 deletions parquet-variant/src/from_json.rs
Comment thread
alamb marked this conversation as resolved.
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
use crate::variant_buffer_manager::VariantBufferManager;
use crate::{ListBuilder, ObjectBuilder, Variant, VariantBuilder};
use arrow_schema::ArrowError;
use serde_json::{Map, Value};

/// Eventually, internal writes should also be performed using VariantBufferManager instead of
/// ValueBuffer and MetadataBuffer so the caller has control of the memory.
/// Returns a pair <value_size, metadata_size>
pub fn json_to_variant<T: VariantBufferManager>(
json: &str,
variant_buffer_manager: &mut T,
) -> Result<(usize, usize), ArrowError> {
let mut builder = VariantBuilder::new();
let json: Value = serde_json::from_str(json)
.map_err(|e| ArrowError::InvalidArgumentError(format!("JSON format error: {}", e)))?;

build_json(&json, &mut builder)?;
let (metadata, value) = builder.finish();
let value_size = value.len();
let metadata_size = metadata.len();

// Write to caller's buffers - Remove this when the library internally writes to the caller's
// buffers anyway
variant_buffer_manager.ensure_metadata_buffer_size(metadata_size)?;
variant_buffer_manager.ensure_value_buffer_size(value_size)?;

let caller_metadata_buffer = variant_buffer_manager.borrow_metadata_buffer();
caller_metadata_buffer[..metadata_size].copy_from_slice(metadata.as_slice());
let caller_value_buffer = variant_buffer_manager.borrow_value_buffer();
caller_value_buffer[..value_size].copy_from_slice(value.as_slice());
Ok((metadata_size, value_size))
}

fn build_json(json: &Value, builder: &mut VariantBuilder) -> Result<(), ArrowError> {
match json {
Value::Null => builder.append_value(Variant::Null),
Value::Bool(b) => builder.append_value(*b),
Value::Number(n) => {
let v: Variant = n.try_into()?;
builder.append_value(v)
}
Comment thread
harshmotw-db marked this conversation as resolved.
Value::String(s) => builder.append_value(s.as_str()),
Value::Array(arr) => {
let mut list_builder = builder.new_list();
build_list(arr, &mut list_builder)?;
Comment thread
harshmotw-db marked this conversation as resolved.
Outdated
list_builder.finish();
}
Value::Object(obj) => {
let mut obj_builder = builder.new_object();
build_object(obj, &mut obj_builder)?;
Comment thread
harshmotw-db marked this conversation as resolved.
Outdated
obj_builder.finish();
}
};
Ok(())
}

fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
for val in arr {
match val {
Value::Null => builder.append_value(Variant::Null),
Value::Bool(b) => builder.append_value(*b),
Value::Number(n) => builder.append_value(Variant::try_from(n)?),
Value::String(s) => builder.append_value(s.as_str()),
Value::Array(arr) => {
let mut list_builder = builder.new_list();
build_list(arr, &mut list_builder)?;
list_builder.finish()
}
Value::Object(obj) => {
let mut obj_builder = builder.new_object();
build_object(obj, &mut obj_builder)?;
obj_builder.finish();
}
}
}
Ok(())
}

fn build_object<'a, 'b>(
obj: &'b Map<String, Value>,
builder: &mut ObjectBuilder<'a, 'b>,
) -> Result<(), ArrowError> {
for (key, value) in obj.iter() {
match value {
Value::Null => builder.insert(key, Variant::Null),
Value::Bool(b) => builder.insert(key, *b),
Value::Number(n) => builder.insert(key, Variant::try_from(n)?),
Value::String(s) => builder.insert(key, s.as_str()),
Value::Array(arr) => {
let mut list_builder = builder.new_list(key);
build_list(arr, &mut list_builder)?;
list_builder.finish()
}
Value::Object(obj) => {
let mut obj_builder = builder.new_object(key);
build_object(obj, &mut obj_builder)?;
obj_builder.finish();
}
}
}
Ok(())
}
6 changes: 6 additions & 0 deletions parquet-variant/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,16 @@ mod decoder;
mod variant;
// TODO: dead code removal
mod builder;
mod from_json;
mod to_json;
#[allow(dead_code)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allow was bothering me , so here is a PR to clean it up:

mod utils;
mod variant_buffer_manager;

pub use builder::*;
pub use from_json::json_to_variant;
pub use to_json::{variant_to_json, variant_to_json_string, variant_to_json_value};
pub use variant::*;
pub use variant_buffer_manager::{
SampleBoxBasedVariantBufferManager, SampleVecBasedVariantBufferManager, VariantBufferManager,
};
52 changes: 52 additions & 0 deletions parquet-variant/src/variant.rs
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,8 @@ mod decimal;
mod list;
mod metadata;
mod object;
use rust_decimal::prelude::*;
use serde_json::Number;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be eventually be planning to avoid all use of serde_json for parsing / serializing.

It is fine to use serde_json for the initial version / getting things functionally working, but it is terribly inefficient compared to more optimized implementations like the tape decoder

Thus can we please try and keep anything serde_json specific out of Variant (like this From impl)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'll move much of this code from variant.rs to from_json.rs if that's what you meant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


const MAX_SHORT_STRING_BYTES: usize = 0x3F;

Expand Down Expand Up @@ -941,6 +943,56 @@ impl From<VariantDecimal16> for Variant<'_, '_> {
}
}

impl TryFrom<&Number> for Variant<'_, '_> {
type Error = ArrowError;

fn try_from(n: &Number) -> Result<Self, Self::Error> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to take a look at the decimal parsing code in my prototype?
https://github.com/apache/arrow-rs/pull/7403/files#diff-fce219df196b3dffd6c2f6bb4924a34f299817a317943a19482289ecb8f6af69R1065-R1079

(tho that code is not optimal, it might at least provide some ideas)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the code pointer! Do you think we could implement custom decimal parsing as a follow up? I think 2^96 is a good bound for now even though the Variant spec allows for a larger width (about 2^128).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up sounds like a good plan to me

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if let Some(i) = n.as_i64() {
// Find minimum Integer width to fit
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting approach. That double .. as .. as .. definitely caused a double take, but I guess it could be cheaper than TryFrom?

Suggested change
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
let value = i8::try_from(i)
.map(Variant::from)
.or_else(|_| i16::try_from(i).map(Variant::from))
.or_else(|_| i32::try_from(i).map(Variant::from))
.unwrap_or_else(|| Variant::from(i));
Ok(value)

} else {
// Try decimal
// TODO: Replace with custom decimal parsing as the rust_decimal library only supports
// a max unscaled value of 2^96.
match Decimal::from_str_exact(n.as_str()) {
Ok(dec) => {
let unscaled: i128 = dec.mantissa();
let scale = dec.scale() as u8;
if unscaled.abs() <= decoder::MAX_UNSCALED_DECIMAL_4 as i128
&& scale <= decoder::MAX_PRECISION_DECIMAL_4
{
(unscaled as i32, scale).try_into()
} else if unscaled.abs() <= decoder::MAX_UNSCALED_DECIMAL_8 as i128
&& scale <= decoder::MAX_PRECISION_DECIMAL_8
{
(unscaled as i64, scale).try_into()
} else {
(unscaled, scale).try_into()
}
}
Err(_) => {
// Try double
match n.as_f64() {
Some(f) => return Ok(f.into()),
None => Err(ArrowError::InvalidArgumentError(format!(
"Failed to parse {} as number",
n.as_str()
))),
}?
}
}
}
}
}

impl From<f32> for Variant<'_, '_> {
fn from(value: f32) -> Self {
Variant::Float(value)
Expand Down
Loading
Loading