Feature/schema evolution #687
Conversation
ClickHouse docs PR: ClickHouse/clickhouse-docs#5824

Good day, @guilleov! Thank you for the contribution!
    }
    return "Map(" + keyType + ", " + valType + ")";
case STRUCT:
    throw new RuntimeException(
Here we can add one more flag to allow creating JSON for structs.
Also, we need to handle unions, because that is a common case.

How should we handle unions?

guilleov@a9586e5 - not sure what you think of this way of handling it, @chernser, let me know.
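For context, a minimal sketch of one way a detected union could be mapped to a ClickHouse Variant type (the method name and the uniqueness/ordering choices here are assumptions, not the actual code in guilleov@a9586e5):

```java
import java.util.List;
import java.util.TreeSet;

public class UnionTypeSketch {
    // Map the ClickHouse types of a union's members to a single column type.
    // Variant member types must be distinct; sorting keeps the DDL deterministic.
    static String unionToClickHouseType(List<String> memberTypes) {
        TreeSet<String> unique = new TreeSet<>(memberTypes);
        if (unique.size() == 1) {
            return unique.first(); // a "union" with one effective type
        }
        return "Variant(" + String.join(", ", unique) + ")";
    }

    public static void main(String[] args) {
        // e.g. an Avro union ["string", "long"] detected in the Connect schema
        System.out.println(unionToClickHouseType(List.of("String", "Int64")));
        // prints: Variant(Int64, String)
    }
}
```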
private Table doInsertWithSchemaEvolution(List<Record> records, Table table, QueryIdentifier queryId) throws IOException, ExecutionException, InterruptedException {
    // Split records into sub-batches at schema boundaries (like the JDBC BufferedRecords.add pattern).
    // When the schema changes mid-batch, the current sub-batch is flushed, the table is evolved, and the insertion continues.
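For illustration, a minimal sketch of the sub-batching pattern this comment describes (a generic helper, not the actual method; the thread below later simplifies this away):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

public class SubBatchSketch {
    // Split records into runs of identical schema; the caller flushes each run,
    // evolves the table at the boundary, then continues with the next run.
    static <R> List<List<R>> splitAtSchemaBoundaries(List<R> records, Function<R, Object> schemaOf) {
        List<List<R>> subBatches = new ArrayList<>();
        List<R> current = new ArrayList<>();
        Object currentSchema = null;
        for (R record : records) {
            Object schema = schemaOf.apply(record);
            if (!current.isEmpty() && !Objects.equals(currentSchema, schema)) {
                subBatches.add(current);      // flush at the schema boundary
                current = new ArrayList<>();
            }
            current.add(record);
            currentSchema = schema;
        }
        if (!current.isEmpty()) {
            subBatches.add(current);
        }
        return subBatches;
    }
}
```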
What problem does it solve?

I was thinking that if the schema changed mid-batch the insert would fail, because some records wouldn't have the correct schema, but I think that's not the case, right?
I simplified this section in commit guilleov@022f88e.
@guilleov
Thank you for the update! I will review today.

I haven't finished all the changes from the comments yet, @chernser; still working on some of them. Thanks for the comments!
if (csc.isAutoEvolve()) {
    // New columns are Nullable, so older records without the new fields insert with NULL.
    Record last = records.get(records.size() - 1);
How can we guarantee that the last record will have all fields?

- If Kafka topics are ordered, the last messages should have the newest schema.
- If ignorePartitionsWhenBatching=true, that might not be the case.
- If someone evolves the schema to V2 but another producer keeps sending V1 messages (since they are backward compatible), we could end up with a V1 -> V2 -> V1 insert order within a batch and fail to evolve.

We might need to check all records and look for the new fields. Since we don't support deletion, we could obtain a superset of all new fields. Imagine the schema changes from V1 -> V2 -> V3 -> V4 in one batch: we should keep the new fields from V2/V3/V4 and generate the DDL based on that, as if it were a single migration from V1 to a V4 with all fields.
This implementation is overkill; it iterates over all records again.
I would add a disclaimer about this situation in the docs, or implement it some other way.
Another way might be to let the insert fail when the schema changes (schemas shouldn't change often) and then retry; on the retry we would check all records to see what needs to be modified. (Letting a query fail just to retry it is bad design, I think, but it might work.)
Another option is moving the detection of schema changes up to ProxySinkTask.java, where we already iterate over all records.
Which option do you think is better for this new feature? @chernser @mzitnik
Yes, this is what I was thinking @guilleov; let's roll back to checking only the last record for now.
Add this as a limitation, and we will come back to it later.
Ok, I reverted it and noted the limitation both here and in the clickhouse-docs PR. Thanks @mzitnik
        assertEquals(event.getTime2().atDate(LocalDate.of(1970, 1, 1)).format(localFormatter), row.get("time2"));
    }
}
Can we expand our tests to insert 3 batches:
Batch 1 → Schema V1
Batch 2 → Schema V2
Batch 3 → Schema V3
Also, add one batch with 10 records:
Records 1–5 with Schema V1 (3 fields)
Records 6–7 with Schema V2 (8 fields)
Records 8–10 with Schema V3 (5 fields)
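A sketch of how the requested mixed-schema batch could be built with the Kafka Connect API (the recordsFor helper and the exact V1/V2/V3 field sets are illustrative, not the fixtures in SchemaTestData.java):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.sink.SinkRecord;

public class MixedSchemaBatchSketch {
    static final Schema V1 = SchemaBuilder.struct()
            .field("id", Schema.INT64_SCHEMA).build();
    static final Schema V2 = SchemaBuilder.struct()
            .field("id", Schema.INT64_SCHEMA)
            .field("name", Schema.OPTIONAL_STRING_SCHEMA).build();
    static final Schema V3 = SchemaBuilder.struct()
            .field("id", Schema.INT64_SCHEMA)
            .field("name", Schema.OPTIONAL_STRING_SCHEMA)
            .field("score", Schema.OPTIONAL_FLOAT64_SCHEMA).build();

    // Build `count` records starting at `offset`, filling only the fields
    // that exist in the given schema version.
    static List<SinkRecord> recordsFor(String topic, Schema schema, long offset, int count) {
        List<SinkRecord> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Struct value = new Struct(schema).put("id", offset + i);
            if (schema.field("name") != null) value.put("name", "name-" + (offset + i));
            if (schema.field("score") != null) value.put("score", (double) i);
            out.add(new SinkRecord(topic, 0, null, null, schema, value, offset + i));
        }
        return out;
    }

    // One 10-record batch spanning three schema versions:
    static List<SinkRecord> mixedBatch(String topic) {
        List<SinkRecord> batch = new ArrayList<>();
        batch.addAll(recordsFor(topic, V1, 0, 5)); // records 1-5, V1
        batch.addAll(recordsFor(topic, V2, 5, 2)); // records 6-7, V2
        batch.addAll(recordsFor(topic, V3, 7, 3)); // records 8-10, V3
        return batch;
    }
}
```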
Pull request overview
Implements auto.evolve schema evolution for the ClickHouse Kafka Connect sink connector, allowing the connector to automatically add new ClickHouse columns when upstream Kafka record schemas introduce new fields.
Changes:
- Added `auto.evolve` / DDL-refresh / struct-to-JSON configuration and wiring to trigger `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`.
- Added Connect-schema → ClickHouse-type inference (including Confluent Avro/Protobuf union detection and Variant mapping) plus a new `SchemaTypeInferenceException`.
- Fixed RowBinary serialization for `Map(K, Nullable(V))` values and added extensive unit/integration-style tests for schema evolution scenarios.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| src/main/java/com/clickhouse/kafka/connect/sink/db/ClickHouseWriter.java | Adds schema evolution flow (detect missing columns, run ALTER, refresh mapping) and fixes Map(…, Nullable(…)) RowBinary marker writing. |
| src/main/java/com/clickhouse/kafka/connect/sink/db/helper/ClickHouseHelperClient.java | Adds helper to issue multi-column ALTER TABLE ADD COLUMN IF NOT EXISTS with alter_sync=1. |
| src/main/java/com/clickhouse/kafka/connect/sink/db/mapping/Column.java | Introduces Connect schema type inference, union detection, and Variant type resolution for auto-evolve DDL. |
| src/main/java/com/clickhouse/kafka/connect/sink/db/mapping/Table.java | Adds getMissingColumns(...) to compute schema diffs against incoming fields. |
| src/main/java/com/clickhouse/kafka/connect/sink/db/mapping/SchemaTypeInferenceException.java | New exception type for unsupported schema-to-ClickHouse mappings during auto-evolve. |
| src/main/java/com/clickhouse/kafka/connect/sink/ClickHouseSinkConfig.java | Adds and documents new connector configs: auto.evolve, auto.evolve.ddl.refresh.retries, auto.evolve.struct.to.json. |
| src/test/java/com/clickhouse/kafka/connect/sink/ClickHouseSinkTaskWithSchemaTest.java | Adds end-to-end tests covering auto-evolve behavior across many schema evolution scenarios. |
| src/test/java/com/clickhouse/kafka/connect/sink/db/mapping/ColumnTest.java | Adds focused unit tests for union detection and union→Variant/String mapping. |
| src/testFixtures/java/com/clickhouse/kafka/connect/sink/helper/SchemaTestData.java | Adds schema-versioned test data builders used by the new auto-evolve tests. |
| CHANGELOG.md | Documents the new feature and the Map nullable serialization bugfix. |
| VERSION | Bumps version to v1.3.7. |
import java.util.Collections;
import java.util.List;
Unused import java.util.Collections will fail compilation (unused imports are treated as errors in this project's build). Remove it or reference it in the new tests.
Suggested change:
- import java.util.Collections;
  import java.util.List;
}
String chType = Column.connectTypeToClickHouseType(fieldSchema, csc.isAutoEvolveStructToJson());
columnDefs.add(String.format("`%s` %s", fieldName, chType));
fieldName is interpolated directly into backticked DDL. Elsewhere (e.g., Utils.escapeName) backticks are stripped to prevent malformed SQL. Please sanitize/escape column names here as well (at minimum remove backticks) before building columnDefs.
Suggested change:
- columnDefs.add(String.format("`%s` %s", fieldName, chType));
+ String escapedFieldName = Utils.escapeName(fieldName);
+ columnDefs.add(String.format("`%s` %s", escapedFieldName, chType));
if (csc.isAutoEvolve()) {
    // Since auto-evolve only adds Nullable columns (never deletes), the superset is ok.
    Map<String, Schema> allFields = new LinkedHashMap<>();
    for (Record r : records) {
        if (r.getFields() != null) {
            for (Field f : r.getFields()) {
                allFields.putIfAbsent(f.name(), f.schema());
            }
        }
    }
    table = evolveTableSchema(table, allFields);
}
The comment says auto-evolve "only adds Nullable columns", but Column.connectTypeToClickHouseType explicitly returns non-nullable types for Array/Map/Variant (a ClickHouse restriction). After evolving such a column, older records that don't contain the new field will hit the fieldExists=false path in doWriteCol(...) and currently throw (and for Map may NPE if the field exists but is null). Consider adding DEFAULT expressions for newly added Array/Map/Variant columns (so RowBinaryWithDefaults can emit defaults), or explicitly treating missing fields for these types as empty values / a Variant NULL discriminator during serialization.
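For example, the DDL shape this suggestion implies would look roughly like this (table and column names are hypothetical):

```sql
-- Newly evolved non-nullable container columns with server-side defaults,
-- so RowBinaryWithDefaults can fall back to them for older records:
ALTER TABLE events
    ADD COLUMN IF NOT EXISTS tags  Array(String)       DEFAULT [],
    ADD COLUMN IF NOT EXISTS attrs Map(String, String) DEFAULT map();
```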
…ing for ClickHouse integration
…ap for deduplication
Any updates? Is there something else left to do? @chernser @mzitnik @kurnoolsaketh

@guilleov I have approved; you can just fix the conflicts.

@mzitnik conflicts resolved!
…ouse-kafka-connect into feature/schema-evolution
@guilleov is this passing for you locally? It seems like the tests are still failing.

@guilleov are you in sync with main? There are still errors in the tests.

Let me check, @mzitnik.
After merging main, the new auto-evolve tests called createProps() and createClient(props) which don't exist on the test class. Replace with the existing helpers getBaseProps() and ClickHouseTestHelpers.createClient(props) to fix the compileTestJava failures.
The auto-evolve detection was simplified in c44c2f4 to inspect only the last record's schema in a batch. Update the four mixed-schema tests that still asserted the previous full-batch-scan behavior:
- autoEvolveMixedSchemasTenRecordsInOneBatch: keep V3 last and assert only V3 columns are added; explicitly assert V2-only columns are NOT.
- autoEvolveMultiVersionUnionSemantics: assert only V4's unique field is added; V2/V3 unique fields are NOT.
- autoEvolveMixedBatchLastRecordOlderSchema: invert to document that no ALTER is issued when the last record has the older schema.
- autoEvolveCrossPartitionSchemaDrift: invert similarly for the cross-partition case.
@mzitnik sorry, some tests were outdated after my last change to check only the last record (and they also used a missing method that is not available on main). Could you re-run the CI tests?

I ran it locally now and it worked. Tests summary: 265 tests, 262 succeeded, 0 failed, 3 skipped.
Implement auto.evolve schema evolution (closes #277)

Summary

Adds automatic table schema evolution to the ClickHouse Kafka Connect sink connector. When `auto.evolve=true`, the connector detects new fields in incoming Kafka records and issues `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements against ClickHouse; no manual DDL is required when upstream schemas change. This mirrors the auto.evolve feature in the Confluent JDBC Sink Connector, adapted for ClickHouse.
Resolves: #277 - Support automatic table schema evolution
Why this feature
When a Kafka topic's schema changes (e.g. a new field is added to an Avro/Protobuf/JSON Schema definition), the connector currently either silently drops the new field (`input_format_skip_unknown_fields=1`) or fails. Users must manually issue `ALTER TABLE` statements to add columns before the connector can ingest the new fields. This is inconvenient and error-prone, especially in organizations with frequent domain model changes (original issue context).

Design decisions
Idempotent DDL
Issue #277 raised a concern about race conditions when multiple connector tasks run concurrently. ClickHouse has `ADD COLUMN IF NOT EXISTS`, which makes the DDL idempotent: if two tasks race to add the same column, both will succeed.
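For example, two racing tasks can both safely issue a statement like this (identifiers hypothetical):

```sql
-- The second task's statement becomes a no-op instead of an error:
ALTER TABLE events ADD COLUMN IF NOT EXISTS city Nullable(String);
```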
Async DDL propagation on replicated tables

For replicated tables, ALTER queries add instructions to ZooKeeper/Keeper and are applied asynchronously on replicas. Two mitigations (sketched after the list):
1. `alter_sync=1` on DDL statements: waits for the local replica to apply the change before returning. This eliminates most staleness without the risk of `alter_sync=2` (which blocks until ALL replicas apply and can hang if a replica is down).
2. `refreshTableAfterDDL()` verifies the expected new columns are visible. If not (e.g., when reading from a different replica), it retries up to `auto.evolve.ddl.refresh.retries` times (default 3) with a 200 ms backoff.
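A rough sketch of the retry behavior described in point 2 (a generic helper mirroring the description, not the actual refreshTableAfterDDL implementation):

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class DdlRefreshSketch {
    // Re-read table metadata until the newly added columns are visible,
    // retrying up to `maxRetries` times (default 3) with a 200 ms backoff.
    static <T> T refreshAfterDdl(Supplier<T> describeTable, Predicate<T> hasExpectedColumns,
                                 int maxRetries) throws InterruptedException {
        T table = describeTable.get();
        for (int attempt = 0; attempt < maxRetries && !hasExpectedColumns.test(table); attempt++) {
            Thread.sleep(200);          // fixed backoff before re-reading metadata
            table = describeTable.get();
        }
        if (!hasExpectedColumns.test(table)) {
            throw new IllegalStateException("New columns not visible after DDL");
        }
        return table;
    }
}
```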
Type mapping

- New scalar columns are added as `Nullable(...)` so older records missing the field get NULL.
- Array and Map columns (which cannot be Nullable in ClickHouse) get a `DEFAULT` expression (`DEFAULT []` for Array, `DEFAULT map()` for Map). This allows `RowBinaryWithDefaults` to emit the "use server default" marker for older records that lack the field.
- Struct fields can be created as `JSON` columns via `auto.evolve.struct.to.json=true`.

Configuration
| Config | Default |
|---|---|
| `auto.evolve` | `false` |
| `auto.evolve.ddl.refresh.retries` | `3` |
| `auto.evolve.struct.to.json` | `false` |
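A sketch of a sink configuration enabling the feature (the connection settings shown are illustrative):

```properties
connector.class=com.clickhouse.kafka.connect.ClickHouseSinkConnector
topics=events
hostname=clickhouse.example.com
database=default
# schema evolution settings added by this PR
auto.evolve=true
auto.evolve.ddl.refresh.retries=3
auto.evolve.struct.to.json=false
```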
Backward compatibility

`auto.evolve` defaults to `false`; existing deployments are completely unaffected.

Tests
30 new auto-evolve tests in `ClickHouseSinkTaskWithSchemaTest.java`, covering auto-evolve behavior across many schema evolution scenarios.

Full test suite: 264 tests, 261 succeeded, 0 failed, 3 skipped.
Checklist

- `./gradlew test` (0 failures)