Pluggable ingestion serializers with json_ingestion#1079
Conversation
d29bd5d to
a3cfa5c
Compare
9e6f39a to
c408a50
Compare
myronmarston
left a comment
There was a problem hiding this comment.
This is a great start but I think there's a lot more to do here. If the JSON schema ingestion support is no longer part of elasticgraph-schema_definition (now that it lives in elasticgraph-json_schema) then elasticgraph-json_schema needs to be solely responsible for all aspects of the JSON schema artifact.
Here's a non-exhaustive list of things that are still in elasticgraph-schema_definition that should be moved:
- https://github.com/block/elasticgraph/blob/main/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/json_schema_pruner.rb
- https://github.com/block/elasticgraph/blob/main/elasticgraph-schema_definition/spec/unit/elastic_graph/schema_definition/json_schema_spec.rb
- The
to_json_schemapolymorphic API onindexing/field_typeclasses.
I'm sure there's a lot more than that but that'll get you started. the way I'm thinking about this: when someone uses ElasticGraph with only protobuf ingestion in the future, and doesn't even install the elasticgraph-json_schema gem, their collection of ElasticGraph gems should have no JSON-schema related logic, in a similar fashion to how we're not going to add protobuf directly to elasticgraph-schema_definition--it'll be provided by an extesnion.
A good way to approach this is to grep for json_schema in elasticgraph-schema_definition. Any mention of json_schema in elasticgraph-schema_definition should probably be moved. (Maybe some mentions of json without -schema, too, although there are some legit mentions of JSON that will remain like JSONSafeLong).
...ingestion/spec/unit/elastic_graph/json_ingestion/schema_definition/results_extension_spec.rb
Outdated
Show resolved
Hide resolved
c408a50 to
5717334
Compare
23c3ead to
0607029
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0607029 to
40815e1
Compare
myronmarston
left a comment
There was a problem hiding this comment.
Sorry for the delayed review--this is massive! (But also exciting to see...). On top of my inline feedback below, I've got some meta feedback:
- The fact that this reworks both the tests and the implementation makes it hard to have confidence that it preserves JSON schema behavior (and also makes it a very large PR!). I'd like to see this split:
- First, extract the implementation into
elasticgraph-json_ingestionwhile leavingelasticgraph-schema_definition/specchanged as little as possible. Leaving it unchanged gives us confidence that the extraction produces equivalent behavior. You'll probably have to updateelasticgraph-schema_definition/spec/spec_helper.rbto have it (for now) automatically useelasticgraph-json_ingestionin all its tests so that the existing tests can continue to pass. - Secondly, move the tests from
elasticgraph-schema_definitionintoelasticgraph-json_ingestion. - Even with that split into 2 steps, the PRs will still be massive. If you see some further opportunity to split into smaller PRs, please take the opportunity to do that--I'm happy to review a whole stack of PRs.
- First, extract the implementation into
- There's still a ton of
json_schemareferences inelasticgraph_schema_definition. Can you audit these? It would be good to confirm that every reference tojson_schemaneeds to remain. I'll plan to do an audit in a future review as well (once you've let me know that you've finished your review).- ...but obviously the audit can be done at the end on top of the final PR in your (eventual) stack.
| module SchemaDefinition | ||
| module FieldType | ||
| # Wraps an enum indexing field type to add JSON schema serialization. | ||
| class Enum < ::SimpleDelegator |
There was a problem hiding this comment.
Curious why this is using a delegator pattern when all our other extensions use extend a module onto the object?
Unless there's a concrete reason why delegation is better here, it would be nice to stick with module extension for consistency.
OTOH, if you want to stick with delegation, I think it's best to use DelegateClass(::ElasticGraph::SchemaDefinition::Indexing::FieldType::Enum). SimpleDelegator is designed to be dynamic, supporting delegating to any wrapped object, on the fly. DelegateClass is designed for use when you're delgating to an instance of a specific class.
| module FieldType | ||
| # Wraps an enum indexing field type to add JSON schema serialization. | ||
| class Enum < ::SimpleDelegator | ||
| # @return [Hash] empty hash, as enum types have no subfields |
There was a problem hiding this comment.
| # @return [Hash] empty hash, as enum types have no subfields | |
| # @return [Hash<String, ::Object>] additional ElasticGraph metadata to put in the JSON schema for this enum type.¬ |
(Might as well match the original doc on this).
There was a problem hiding this comment.
It looks like this is the equivalent of https://github.com/block/elasticgraph/blob/main/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/field_type/enum.rb but here it lacks the indexing part of the file name and class name.
It's useful to keep that so that it's clear that this isn't being applied to ElasticGraph::SchemaDefinition::SchemaElements::EnumType.
| # @param customizations [Hash<String, Object>] the customizations to format | ||
| # @return [Hash<String, Object>] the filtered customizations | ||
| def format_field_json_schema_customizations(customizations) | ||
| customizations.slice("enum") |
There was a problem hiding this comment.
There's a lengthy, useful comment you've lost here:
| # @return [void] | ||
| # @api private | ||
| def self.extended(api) | ||
| api.instance_variable_get(:@state).ingestion_serializer_state.tap do |state| |
There was a problem hiding this comment.
You shouldn't have to do instance_variable_get(:@state). The state is available via a public attr_reader:
| expect(described_class.name).to eq "ElasticGraph::JSONIngestion" | ||
| end | ||
| end | ||
| end |
There was a problem hiding this comment.
This doesn't seem useful.
|
|
||
| RSpec.configure do |config| | ||
| config.when_first_matching_example_defined(:json_ingestion_schema) do | ||
| require "support/json_ingestion_schema_support" |
There was a problem hiding this comment.
You can just move the logic from support/json_ingestion_schema_support into here, with no need to deal with when_first_matching_example_defined.
when_first_matching_example_defined is useful in https://github.com/block/elasticgraph/blob/main/spec_support/spec_helper.rb because that spec file is always loaded, no matter what spec gets run, and to avoid bloating the RSpec boot time by loading support things that wind up not being used during that test run, when_first_matching_example_defined allows us to defer that loading until we know there's at least one spec that needs it.
This file is different--it's only loaded if we're running specs in elasticgraph-json_ingestion. So it already applies more narrowly. Check out the other */spec/spec_helper.rb for examples of how we do it for other gems.
| spec.email = ["joshuaw@squareup.com"] | ||
| spec.homepage = "https://block.github.io/elasticgraph/" | ||
| spec.license = "MIT" | ||
| spec.summary = "Pluggable JSON Schema ingestion serializer for ElasticGraph." |
There was a problem hiding this comment.
| spec.summary = "Pluggable JSON Schema ingestion serializer for ElasticGraph." | |
| spec.summary = "JSON Schema ingestion support for ElasticGraph." |
- EG's ingestion support is becoming pluggable; this is an example of a library that can "plug in" but doesn't itself proivde the pluggability.
- We haven't talked about this in depth, but I'm hoping that this library grows to encompass the full support needed for JSON ingestion, not just the serialization side of things.
| # end | ||
| # | ||
| # @dynamic schema_definition_ingestion_serializer_extension_modules, schema_definition_ingestion_serializer_extension_modules= | ||
| attr_accessor :schema_definition_ingestion_serializer_extension_modules |
There was a problem hiding this comment.
Why do we need this as separate from schema_definition_extension_modules? It's be nice to stick with a single uniform way to specify schema def extension modules.
| self.type_name_overrides = {} | ||
| self.enum_value_overrides_by_type = {} | ||
| self.schema_definition_extension_modules = [] | ||
| self.schema_definition_ingestion_serializer_extension_modules = SchemaDefinition::ExtensionModuleSupport.default_ingestion_serializer_extension_modules |
There was a problem hiding this comment.
I'm not sure that we need a default. (I'd prefer we not have one). IMO, our goal should be that EG doesn't have a "privileged" ingestion serializers--all the options should be treated equally.
If we go that direction, EG users will have to configure elasticgraph-json_ingestion during upgrade, but I'm fine with that--as per our versioning policy, it would be a minor version bump.
Summary
elasticgraph-json_schemaextension gemelasticgraph-warehouseandelasticgraph-apolloAPI.initializefor full backward compatibility — no configuration changes needed for existing userselasticgraph-protobufgem implementing the same interfaceMotivation
Per discussion #1059, rather than adding protobuf support as a pure extension while keeping JSON Schema hardcoded in core, we should refactor toward pluggable ingestion serializers. This PR is the first step: extracting JSON Schema into its own extension gem with a clear interface that a protobuf serializer could also implement.
What moved where
API#json_schema_version,API#json_schema_strictnessJsonSchema::SchemaDefinition::APIExtensionResults#json_schemas_for,#available_json_schema_versions,#latest_json_schema_version,#current_public_json_schema,#merge_field_metadata_into_json_schema, etc.JsonSchema::SchemaDefinition::ResultsExtensionSchemaArtifactManagerJSON schema artifact generation, version bump checking, versioned schema building, merge error reportingJsonSchema::SchemaDefinition::SchemaArtifactManagerExtensionWhat stays in core
Statefields (json_schema_version, etc.) — just struct fields the extension writes toJSON_SCHEMA_VERSION_KEY, etc.) — runtime deps ofelasticgraph-indexerFromDisk,RecordPreparer,Operation::Factory— runtime indexer code, unchangedFieldType::*#to_json_schema,Indexing::Field#json_schema— called by the extensionEventEnvelope,JSONSchemaWithMetadata,JSONSchemaPruner— used by the extensionTest plan
elasticgraph-schema_definitionunit tests passelasticgraph-schema_definitionintegration tests pass (including the "bundle with only schema_definition gem" test)elasticgraph-indexerunit tests passelasticgraph-supporttests passelasticgraph-warehousetests passbundle exec rake schema_artifacts:checkconfirms all artifacts are identical/up-to-date🤖 Generated with Claude Code