Skip to content

Pluggable ingestion serializers with json_ingestion#1079

Open
jwils wants to merge 1 commit intomainfrom
joshauw/pluggable-ingestion-serializers
Open

Pluggable ingestion serializers with json_ingestion#1079
jwils wants to merge 1 commit intomainfrom
joshauw/pluggable-ingestion-serializers

Conversation

@jwils
Copy link
Copy Markdown
Collaborator

@jwils jwils commented Mar 21, 2026

Summary

  • Introduces the concept of pluggable ingestion serializers by extracting JSON Schema logic from core into a new elasticgraph-json_schema extension gem
  • Follows the established extension pattern (APIExtension → FactoryExtension → ResultsExtension + SchemaArtifactManagerExtension) used by elasticgraph-warehouse and elasticgraph-apollo
  • Auto-applied by default in API.initialize for full backward compatibility — no configuration changes needed for existing users
  • Paves the way for a future elasticgraph-protobuf gem implementing the same interface

Motivation

Per discussion #1059, rather than adding protobuf support as a pure extension while keeping JSON Schema hardcoded in core, we should refactor toward pluggable ingestion serializers. This PR is the first step: extracting JSON Schema into its own extension gem with a clear interface that a protobuf serializer could also implement.

What moved where

From To
API#json_schema_version, API#json_schema_strictness JsonSchema::SchemaDefinition::APIExtension
Results#json_schemas_for, #available_json_schema_versions, #latest_json_schema_version, #current_public_json_schema, #merge_field_metadata_into_json_schema, etc. JsonSchema::SchemaDefinition::ResultsExtension
SchemaArtifactManager JSON schema artifact generation, version bump checking, versioned schema building, merge error reporting JsonSchema::SchemaDefinition::SchemaArtifactManagerExtension

What stays in core

  • State fields (json_schema_version, etc.) — just struct fields the extension writes to
  • Constants (JSON_SCHEMA_VERSION_KEY, etc.) — runtime deps of elasticgraph-indexer
  • FromDisk, RecordPreparer, Operation::Factory — runtime indexer code, unchanged
  • FieldType::*#to_json_schema, Indexing::Field#json_schema — called by the extension
  • EventEnvelope, JSONSchemaWithMetadata, JSONSchemaPruner — used by the extension

Test plan

  • All 1584 elasticgraph-schema_definition unit tests pass
  • All 17 elasticgraph-schema_definition integration tests pass (including the "bundle with only schema_definition gem" test)
  • All 184 elasticgraph-indexer unit tests pass
  • All 294 elasticgraph-support tests pass
  • All 33 elasticgraph-warehouse tests pass
  • bundle exec rake schema_artifacts:check confirms all artifacts are identical/up-to-date

🤖 Generated with Claude Code

@jwils jwils force-pushed the joshauw/pluggable-ingestion-serializers branch 3 times, most recently from d29bd5d to a3cfa5c Compare March 22, 2026 16:47
@jwils jwils force-pushed the joshauw/pluggable-ingestion-serializers branch 2 times, most recently from 9e6f39a to c408a50 Compare March 23, 2026 04:10
Copy link
Copy Markdown
Collaborator

@myronmarston myronmarston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start but I think there's a lot more to do here. If the JSON schema ingestion support is no longer part of elasticgraph-schema_definition (now that it lives in elasticgraph-json_schema) then elasticgraph-json_schema needs to be solely responsible for all aspects of the JSON schema artifact.

Here's a non-exhaustive list of things that are still in elasticgraph-schema_definition that should be moved:

I'm sure there's a lot more than that but that'll get you started. the way I'm thinking about this: when someone uses ElasticGraph with only protobuf ingestion in the future, and doesn't even install the elasticgraph-json_schema gem, their collection of ElasticGraph gems should have no JSON-schema related logic, in a similar fashion to how we're not going to add protobuf directly to elasticgraph-schema_definition--it'll be provided by an extesnion.

A good way to approach this is to grep for json_schema in elasticgraph-schema_definition. Any mention of json_schema in elasticgraph-schema_definition should probably be moved. (Maybe some mentions of json without -schema, too, although there are some legit mentions of JSON that will remain like JSONSafeLong).

@jwils jwils force-pushed the joshauw/pluggable-ingestion-serializers branch from c408a50 to 5717334 Compare March 26, 2026 14:32
@jwils jwils force-pushed the joshauw/pluggable-ingestion-serializers branch 11 times, most recently from 23c3ead to 0607029 Compare April 4, 2026 21:34
@jwils jwils changed the title Prototype pluggable ingestion serializers Pluggable ingestion serializers with json_ingestion Apr 4, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jwils jwils force-pushed the joshauw/pluggable-ingestion-serializers branch from 0607029 to 40815e1 Compare April 4, 2026 21:53
Copy link
Copy Markdown
Collaborator

@myronmarston myronmarston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed review--this is massive! (But also exciting to see...). On top of my inline feedback below, I've got some meta feedback:

  • The fact that this reworks both the tests and the implementation makes it hard to have confidence that it preserves JSON schema behavior (and also makes it a very large PR!). I'd like to see this split:
    • First, extract the implementation into elasticgraph-json_ingestion while leaving elasticgraph-schema_definition/spec changed as little as possible. Leaving it unchanged gives us confidence that the extraction produces equivalent behavior. You'll probably have to update elasticgraph-schema_definition/spec/spec_helper.rb to have it (for now) automatically use elasticgraph-json_ingestion in all its tests so that the existing tests can continue to pass.
    • Secondly, move the tests from elasticgraph-schema_definition into elasticgraph-json_ingestion.
    • Even with that split into 2 steps, the PRs will still be massive. If you see some further opportunity to split into smaller PRs, please take the opportunity to do that--I'm happy to review a whole stack of PRs.
  • There's still a ton of json_schema references in elasticgraph_schema_definition. Can you audit these? It would be good to confirm that every reference to json_schema needs to remain. I'll plan to do an audit in a future review as well (once you've let me know that you've finished your review).
    • ...but obviously the audit can be done at the end on top of the final PR in your (eventual) stack.

module SchemaDefinition
module FieldType
# Wraps an enum indexing field type to add JSON schema serialization.
class Enum < ::SimpleDelegator
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is using a delegator pattern when all our other extensions use extend a module onto the object?

Unless there's a concrete reason why delegation is better here, it would be nice to stick with module extension for consistency.

OTOH, if you want to stick with delegation, I think it's best to use DelegateClass(::ElasticGraph::SchemaDefinition::Indexing::FieldType::Enum). SimpleDelegator is designed to be dynamic, supporting delegating to any wrapped object, on the fly. DelegateClass is designed for use when you're delgating to an instance of a specific class.

module FieldType
# Wraps an enum indexing field type to add JSON schema serialization.
class Enum < ::SimpleDelegator
# @return [Hash] empty hash, as enum types have no subfields
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# @return [Hash] empty hash, as enum types have no subfields
# @return [Hash<String, ::Object>] additional ElasticGraph metadata to put in the JSON schema for this enum type.¬

(Might as well match the original doc on this).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is the equivalent of https://github.com/block/elasticgraph/blob/main/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/field_type/enum.rb but here it lacks the indexing part of the file name and class name.

It's useful to keep that so that it's clear that this isn't being applied to ElasticGraph::SchemaDefinition::SchemaElements::EnumType.

# @param customizations [Hash<String, Object>] the customizations to format
# @return [Hash<String, Object>] the filtered customizations
def format_field_json_schema_customizations(customizations)
customizations.slice("enum")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lengthy, useful comment you've lost here:

# Since an enum type already restricts the values to a small set of allowed values, we do not need to keep
# other customizations (such as the `maxLength` field customization EG automatically applies to fields
# indexed as a `keyword`--we don't allow enum values to exceed that length, anyway).
#
# It's desirable to restrict what customizations are applied because when a publisher uses the JSON schema
# to generate code using a library such as https://github.com/pwall567/json-kotlin-schema-codegen, we found
# that the presence of extra field customizations inhibits the library's ability to generate code in the way
# we want (it causes the type of the enum to change since the JSON schema changes from a direct `$ref` to
# being wrapped in an `allOf`).
#
# However, we still want to apply `enum` customizations--this allows a user to "narrow" the set of allowed
# values for a field. For example, a `Currency` enum could contain every currency, and a user may want to
# restrict a specific `currency` field to a subset of currencies (e.g. to just USD, CAD, and EUR).
customizations.slice("enum")

# @return [void]
# @api private
def self.extended(api)
api.instance_variable_get(:@state).ingestion_serializer_state.tap do |state|
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't have to do instance_variable_get(:@state). The state is available via a public attr_reader:

# @return [State] object which holds all state for the schema definition
attr_reader :state

expect(described_class.name).to eq "ElasticGraph::JSONIngestion"
end
end
end
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem useful.


RSpec.configure do |config|
config.when_first_matching_example_defined(:json_ingestion_schema) do
require "support/json_ingestion_schema_support"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just move the logic from support/json_ingestion_schema_support into here, with no need to deal with when_first_matching_example_defined.

when_first_matching_example_defined is useful in https://github.com/block/elasticgraph/blob/main/spec_support/spec_helper.rb because that spec file is always loaded, no matter what spec gets run, and to avoid bloating the RSpec boot time by loading support things that wind up not being used during that test run, when_first_matching_example_defined allows us to defer that loading until we know there's at least one spec that needs it.

This file is different--it's only loaded if we're running specs in elasticgraph-json_ingestion. So it already applies more narrowly. Check out the other */spec/spec_helper.rb for examples of how we do it for other gems.

spec.email = ["joshuaw@squareup.com"]
spec.homepage = "https://block.github.io/elasticgraph/"
spec.license = "MIT"
spec.summary = "Pluggable JSON Schema ingestion serializer for ElasticGraph."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
spec.summary = "Pluggable JSON Schema ingestion serializer for ElasticGraph."
spec.summary = "JSON Schema ingestion support for ElasticGraph."
  • EG's ingestion support is becoming pluggable; this is an example of a library that can "plug in" but doesn't itself proivde the pluggability.
  • We haven't talked about this in depth, but I'm hoping that this library grows to encompass the full support needed for JSON ingestion, not just the serialization side of things.

# end
#
# @dynamic schema_definition_ingestion_serializer_extension_modules, schema_definition_ingestion_serializer_extension_modules=
attr_accessor :schema_definition_ingestion_serializer_extension_modules
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this as separate from schema_definition_extension_modules? It's be nice to stick with a single uniform way to specify schema def extension modules.

self.type_name_overrides = {}
self.enum_value_overrides_by_type = {}
self.schema_definition_extension_modules = []
self.schema_definition_ingestion_serializer_extension_modules = SchemaDefinition::ExtensionModuleSupport.default_ingestion_serializer_extension_modules
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we need a default. (I'd prefer we not have one). IMO, our goal should be that EG doesn't have a "privileged" ingestion serializers--all the options should be treated equally.

If we go that direction, EG users will have to configure elasticgraph-json_ingestion during upgrade, but I'm fine with that--as per our versioning policy, it would be a minor version bump.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants