diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
new file mode 100644
index 0000000000000..b4c099ddfd032
--- /dev/null
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
@@ -0,0 +1,146 @@
+---
+{
+    "title": "IS_VALID_UTF8",
+    "language": "en",
+    "description": "The IS_VALID_UTF8 function checks whether a string is valid UTF-8 encoded data. Returns true if the string is valid UTF-8, false otherwise."
+}
+---
+
+## Description
+
+The IS_VALID_UTF8 function checks whether a string is valid UTF-8 encoded data. It validates every byte sequence in the input and returns `true` if all sequences conform to the UTF-8 encoding standard, or `false` if any invalid byte sequence is found.
+
+This is useful when dealing with data imported from external sources (files, network streams, etc.) that may contain binary or incorrectly encoded content, and you need to verify data integrity before performing string operations.
+
+## Alias
+
+- `ISVALIDUTF8()`
+
+## Syntax
+
+```sql
+IS_VALID_UTF8()
+```
+
+## Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `` | The string to validate. Type: VARCHAR or STRING |
+
+## Return Value
+
+Returns BOOLEAN type.
+
+- Returns `true` if the string is valid UTF-8 encoded data.
+- Returns `false` if the string contains any invalid UTF-8 byte sequence.
+
+Special cases:
+- If the parameter is NULL, returns NULL.
+- An empty string is considered valid UTF-8, returns `true`.
+
+## Examples
+
+1. Valid ASCII strings
+
+```sql
+SELECT IS_VALID_UTF8('hello');
+```
+
+```text
++------------------------+
+| is_valid_utf8('hello') |
++------------------------+
+|                      1 |
++------------------------+
+```
+
+2. Valid multi-byte UTF-8 characters (Chinese)
+
+```sql
+SELECT IS_VALID_UTF8('Hello, 世界');
+```
+
+```text
++-----------------------------+
+| is_valid_utf8('Hello, 世界') |
++-----------------------------+
+|                           1 |
++-----------------------------+
+```
+
+3. Empty string
+
+```sql
+SELECT IS_VALID_UTF8('');
+```
+
+```text
++--------------------+
+| is_valid_utf8('')  |
++--------------------+
+|                  1 |
++--------------------+
+```
+
+4. Invalid UTF-8 bytes (constructed via UNHEX)
+
+```sql
+SELECT IS_VALID_UTF8(UNHEX('C0AF'));
+```
+
+```text
++------------------------------+
+| is_valid_utf8(unhex('C0AF')) |
++------------------------------+
+|                            0 |
++------------------------------+
+```
+
+5. NULL value handling
+
+```sql
+SELECT IS_VALID_UTF8(NULL);
+```
+
+```text
++---------------------+
+| is_valid_utf8(NULL) |
++---------------------+
+|                NULL |
++---------------------+
+```
+
+6. Using with table data
+
+```sql
+CREATE TABLE test_utf8 (
+    id INT,
+    val VARCHAR(200)
+) DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_utf8 VALUES
+(1, 'hello'),
+(2, ''),
+(3, 'Hello, 世界'),
+(4, NULL);
+
+INSERT INTO test_utf8 VALUES (5, UNHEX('C0AF'));
+INSERT INTO test_utf8 VALUES (6, UNHEX('FF'));
+
+SELECT id, IS_VALID_UTF8(val) FROM test_utf8 ORDER BY id;
+```
+
+```text
++------+--------------------+
+| id   | is_valid_utf8(val) |
++------+--------------------+
+|    1 |                  1 |
+|    2 |                  1 |
+|    3 |                  1 |
+|    4 |               NULL |
+|    5 |                  0 |
+|    6 |                  0 |
++------+--------------------+
+```
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
new file mode 100644
index 0000000000000..e6b4c3f513f90
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
@@ -0,0 +1,146 @@
+---
+{
+    "title": "IS_VALID_UTF8",
+    "language": "zh-CN",
+    "description": "IS_VALID_UTF8 函数用于检查字符串是否为合法的 UTF-8 编码数据。如果字符串是合法 UTF-8 则返回 true,否则返回 false。"
+}
+---
+
+## 描述
+
+IS_VALID_UTF8 函数用于检查字符串是否为合法的 UTF-8 编码数据。它会验证输入中的每个字节序列,如果所有序列都符合 UTF-8 编码标准则返回 `true`,如果发现任何非法字节序列则返回 `false`。
+
+该函数在处理从外部数据源(文件、网络流等)导入的数据时非常有用,这些数据可能包含二进制或编码错误的内容,您可以在执行字符串操作之前验证数据的完整性。
+
+## 别名
+
+- `ISVALIDUTF8()`
+
+## 语法
+
+```sql
+IS_VALID_UTF8()
+```
+
+## 参数
+
+| 参数 | 说明 |
+|------|------|
+| `` | 需要验证的字符串。类型:VARCHAR 或 STRING |
+
+## 返回值
+
+返回 BOOLEAN 类型。
+
+- 如果字符串是合法的 UTF-8 编码数据,返回 `true`。
+- 如果字符串包含任何非法的 UTF-8 字节序列,返回 `false`。
+
+特殊情况:
+- 如果参数为 NULL,返回 NULL。
+- 空字符串被视为合法的 UTF-8,返回 `true`。
+
+## 示例
+
+1. 合法的 ASCII 字符串
+
+```sql
+SELECT IS_VALID_UTF8('hello');
+```
+
+```text
++------------------------+
+| is_valid_utf8('hello') |
++------------------------+
+|                      1 |
++------------------------+
+```
+
+2. 合法的多字节 UTF-8 字符(中文)
+
+```sql
+SELECT IS_VALID_UTF8('Hello, 世界');
+```
+
+```text
++-----------------------------+
+| is_valid_utf8('Hello, 世界') |
++-----------------------------+
+|                           1 |
++-----------------------------+
+```
+
+3. 空字符串
+
+```sql
+SELECT IS_VALID_UTF8('');
+```
+
+```text
++--------------------+
+| is_valid_utf8('')  |
++--------------------+
+|                  1 |
++--------------------+
+```
+
+4. 非法的 UTF-8 字节(通过 UNHEX 构造)
+
+```sql
+SELECT IS_VALID_UTF8(UNHEX('C0AF'));
+```
+
+```text
++------------------------------+
+| is_valid_utf8(unhex('C0AF')) |
++------------------------------+
+|                            0 |
++------------------------------+
+```
+
+5. NULL 值处理
+
+```sql
+SELECT IS_VALID_UTF8(NULL);
+```
+
+```text
++---------------------+
+| is_valid_utf8(NULL) |
++---------------------+
+|                NULL |
++---------------------+
+```
+
+6. 配合表数据使用
+
+```sql
+CREATE TABLE test_utf8 (
+    id INT,
+    val VARCHAR(200)
+) DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_utf8 VALUES
+(1, 'hello'),
+(2, ''),
+(3, 'Hello, 世界'),
+(4, NULL);
+
+INSERT INTO test_utf8 VALUES (5, UNHEX('C0AF'));
+INSERT INTO test_utf8 VALUES (6, UNHEX('FF'));
+
+SELECT id, IS_VALID_UTF8(val) FROM test_utf8 ORDER BY id;
+```
+
+```text
++------+--------------------+
+| id   | is_valid_utf8(val) |
++------+--------------------+
+|    1 |                  1 |
+|    2 |                  1 |
+|    3 |                  1 |
+|    4 |               NULL |
+|    5 |                  0 |
+|    6 |                  0 |
++------+--------------------+
+```
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
new file mode 100644
index 0000000000000..e6b4c3f513f90
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
@@ -0,0 +1,146 @@
+---
+{
+    "title": "IS_VALID_UTF8",
+    "language": "zh-CN",
+    "description": "IS_VALID_UTF8 函数用于检查字符串是否为合法的 UTF-8 编码数据。如果字符串是合法 UTF-8 则返回 true,否则返回 false。"
+}
+---
+
+## 描述
+
+IS_VALID_UTF8 函数用于检查字符串是否为合法的 UTF-8 编码数据。它会验证输入中的每个字节序列,如果所有序列都符合 UTF-8 编码标准则返回 `true`,如果发现任何非法字节序列则返回 `false`。
+
+该函数在处理从外部数据源(文件、网络流等)导入的数据时非常有用,这些数据可能包含二进制或编码错误的内容,您可以在执行字符串操作之前验证数据的完整性。
+
+## 别名
+
+- `ISVALIDUTF8()`
+
+## 语法
+
+```sql
+IS_VALID_UTF8()
+```
+
+## 参数
+
+| 参数 | 说明 |
+|------|------|
+| `` | 需要验证的字符串。类型:VARCHAR 或 STRING |
+
+## 返回值
+
+返回 BOOLEAN 类型。
+
+- 如果字符串是合法的 UTF-8 编码数据,返回 `true`。
+- 如果字符串包含任何非法的 UTF-8 字节序列,返回 `false`。
+
+特殊情况:
+- 如果参数为 NULL,返回 NULL。
+- 空字符串被视为合法的 UTF-8,返回 `true`。
+
+## 示例
+
+1. 合法的 ASCII 字符串
+
+```sql
+SELECT IS_VALID_UTF8('hello');
+```
+
+```text
++------------------------+
+| is_valid_utf8('hello') |
++------------------------+
+|                      1 |
++------------------------+
+```
+
+2. 合法的多字节 UTF-8 字符(中文)
+
+```sql
+SELECT IS_VALID_UTF8('Hello, 世界');
+```
+
+```text
++-----------------------------+
+| is_valid_utf8('Hello, 世界') |
++-----------------------------+
+|                           1 |
++-----------------------------+
+```
+
+3. 空字符串
+
+```sql
+SELECT IS_VALID_UTF8('');
+```
+
+```text
++--------------------+
+| is_valid_utf8('')  |
++--------------------+
+|                  1 |
++--------------------+
+```
+
+4. 非法的 UTF-8 字节(通过 UNHEX 构造)
+
+```sql
+SELECT IS_VALID_UTF8(UNHEX('C0AF'));
+```
+
+```text
++------------------------------+
+| is_valid_utf8(unhex('C0AF')) |
++------------------------------+
+|                            0 |
++------------------------------+
+```
+
+5. NULL 值处理
+
+```sql
+SELECT IS_VALID_UTF8(NULL);
+```
+
+```text
++---------------------+
+| is_valid_utf8(NULL) |
++---------------------+
+|                NULL |
++---------------------+
+```
+
+6. 配合表数据使用
+
+```sql
+CREATE TABLE test_utf8 (
+    id INT,
+    val VARCHAR(200)
+) DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_utf8 VALUES
+(1, 'hello'),
+(2, ''),
+(3, 'Hello, 世界'),
+(4, NULL);
+
+INSERT INTO test_utf8 VALUES (5, UNHEX('C0AF'));
+INSERT INTO test_utf8 VALUES (6, UNHEX('FF'));
+
+SELECT id, IS_VALID_UTF8(val) FROM test_utf8 ORDER BY id;
+```
+
+```text
++------+--------------------+
+| id   | is_valid_utf8(val) |
++------+--------------------+
+|    1 |                  1 |
+|    2 |                  1 |
+|    3 |                  1 |
+|    4 |               NULL |
+|    5 |                  0 |
+|    6 |                  0 |
++------+--------------------+
+```
diff --git a/sidebars.ts b/sidebars.ts
index 4d2a5e3613077..6e31f51d4243f 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -1450,6 +1450,7 @@ const sidebars: SidebarsConfig = {
     'sql-manual/sql-functions/scalar-functions/string-functions/instr',
     'sql-manual/sql-functions/scalar-functions/string-functions/int-to-uuid',
     'sql-manual/sql-functions/scalar-functions/string-functions/is-uuid',
+    'sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8',
     'sql-manual/sql-functions/scalar-functions/string-functions/lcase',
     'sql-manual/sql-functions/scalar-functions/string-functions/length',
     'sql-manual/sql-functions/scalar-functions/string-functions/locate',
diff --git a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
new file mode 100644
index 0000000000000..b4c099ddfd032
--- /dev/null
+++ b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8.md
@@ -0,0 +1,146 @@
+---
+{
+    "title": "IS_VALID_UTF8",
+    "language": "en",
+    "description": "The IS_VALID_UTF8 function checks whether a string is valid UTF-8 encoded data. Returns true if the string is valid UTF-8, false otherwise."
+}
+---
+
+## Description
+
+The IS_VALID_UTF8 function checks whether a string is valid UTF-8 encoded data. It validates every byte sequence in the input and returns `true` if all sequences conform to the UTF-8 encoding standard, or `false` if any invalid byte sequence is found.
+
+This is useful when dealing with data imported from external sources (files, network streams, etc.) that may contain binary or incorrectly encoded content, and you need to verify data integrity before performing string operations.
+
+## Alias
+
+- `ISVALIDUTF8()`
+
+## Syntax
+
+```sql
+IS_VALID_UTF8()
+```
+
+## Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `` | The string to validate. Type: VARCHAR or STRING |
+
+## Return Value
+
+Returns BOOLEAN type.
+
+- Returns `true` if the string is valid UTF-8 encoded data.
+- Returns `false` if the string contains any invalid UTF-8 byte sequence.
+
+Special cases:
+- If the parameter is NULL, returns NULL.
+- An empty string is considered valid UTF-8, returns `true`.
+
+## Examples
+
+1. Valid ASCII strings
+
+```sql
+SELECT IS_VALID_UTF8('hello');
+```
+
+```text
++------------------------+
+| is_valid_utf8('hello') |
++------------------------+
+|                      1 |
++------------------------+
+```
+
+2. Valid multi-byte UTF-8 characters (Chinese)
+
+```sql
+SELECT IS_VALID_UTF8('Hello, 世界');
+```
+
+```text
++-----------------------------+
+| is_valid_utf8('Hello, 世界') |
++-----------------------------+
+|                           1 |
++-----------------------------+
+```
+
+3. Empty string
+
+```sql
+SELECT IS_VALID_UTF8('');
+```
+
+```text
++--------------------+
+| is_valid_utf8('')  |
++--------------------+
+|                  1 |
++--------------------+
+```
+
+4. Invalid UTF-8 bytes (constructed via UNHEX)
+
+```sql
+SELECT IS_VALID_UTF8(UNHEX('C0AF'));
+```
+
+```text
++------------------------------+
+| is_valid_utf8(unhex('C0AF')) |
++------------------------------+
+|                            0 |
++------------------------------+
+```
+
+5. NULL value handling
+
+```sql
+SELECT IS_VALID_UTF8(NULL);
+```
+
+```text
++---------------------+
+| is_valid_utf8(NULL) |
++---------------------+
+|                NULL |
++---------------------+
+```
+
+6. Using with table data
+
+```sql
+CREATE TABLE test_utf8 (
+    id INT,
+    val VARCHAR(200)
+) DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_utf8 VALUES
+(1, 'hello'),
+(2, ''),
+(3, 'Hello, 世界'),
+(4, NULL);
+
+INSERT INTO test_utf8 VALUES (5, UNHEX('C0AF'));
+INSERT INTO test_utf8 VALUES (6, UNHEX('FF'));
+
+SELECT id, IS_VALID_UTF8(val) FROM test_utf8 ORDER BY id;
+```
+
+```text
++------+--------------------+
+| id   | is_valid_utf8(val) |
++------+--------------------+
+|    1 |                  1 |
+|    2 |                  1 |
+|    3 |                  1 |
+|    4 |               NULL |
+|    5 |                  0 |
+|    6 |                  0 |
++------+--------------------+
+```
diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json
index 67a6e701858db..702c64cd22952 100644
--- a/versioned_sidebars/version-4.x-sidebars.json
+++ b/versioned_sidebars/version-4.x-sidebars.json
@@ -1477,6 +1477,7 @@
     "sql-manual/sql-functions/scalar-functions/string-functions/instr",
     "sql-manual/sql-functions/scalar-functions/string-functions/int-to-uuid",
     "sql-manual/sql-functions/scalar-functions/string-functions/is-uuid",
+    "sql-manual/sql-functions/scalar-functions/string-functions/is-valid-utf8",
     "sql-manual/sql-functions/scalar-functions/string-functions/lcase",
     "sql-manual/sql-functions/scalar-functions/string-functions/length",
     "sql-manual/sql-functions/scalar-functions/string-functions/levenshtein",
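
Note on the example inputs: the expected results in the docs above (`UNHEX('C0AF')` and `UNHEX('FF')` returning `0`) can be cross-checked outside Doris. The following is a minimal sketch that assumes Python's strict UTF-8 decoder as a stand-in for the validation rules. It is not Doris's implementation, and `is_valid_utf8` here is a hypothetical helper name chosen for illustration.

```python
def is_valid_utf8(data):
    """Hypothetical helper mirroring the documented NULL / true / false behavior."""
    if data is None:
        # NULL input yields NULL, as stated in the "Special cases" list.
        return None
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte sequence.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Same six inputs as the test_utf8 table in the docs, keyed by id.
samples = {
    1: b"hello",
    2: b"",
    3: "Hello, 世界".encode("utf-8"),
    4: None,
    5: bytes.fromhex("C0AF"),  # overlong two-byte encoding of "/", never valid
    6: bytes.fromhex("FF"),    # the byte 0xFF cannot appear anywhere in UTF-8
}

results = {row_id: is_valid_utf8(val) for row_id, val in samples.items()}

for row_id, ok in results.items():
    print(row_id, ok)
```

Under these assumptions, ids 1 to 3 come out valid, id 4 stays `None`, and ids 5 and 6 are rejected, which matches the result table in example 6.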