diff --git a/cpp-ch/local-engine/Parser/aggregate_function_parser/BloomFilterAggParser.cpp b/cpp-ch/local-engine/Parser/aggregate_function_parser/BloomFilterAggParser.cpp index 6c85bb374886..f0e7bb6c71e7 100644 --- a/cpp-ch/local-engine/Parser/aggregate_function_parser/BloomFilterAggParser.cpp +++ b/cpp-ch/local-engine/Parser/aggregate_function_parser/BloomFilterAggParser.cpp @@ -56,7 +56,7 @@ DB::Array AggregateFunctionParserBloomFilterAgg::parseFunctionParameters( { if (func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE || func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_RESULT) { - auto get_parameter_field = [](const DB::ActionsDAG::Node * node, size_t /*paramter_index*/) -> DB::Field + auto get_parameter_field = [](const DB::ActionsDAG::Node * node, size_t /*parameter_index*/) -> DB::Field { Field ret; node->column->get(0, ret); diff --git a/docs/developers/HowTo.md b/docs/developers/HowTo.md index 2ac2d9f44cba..147a64bc426d 100644 --- a/docs/developers/HowTo.md +++ b/docs/developers/HowTo.md @@ -156,7 +156,7 @@ gdb ${GLUTEN_HOME}/cpp/build/releases/libgluten.so 'core-Executor task l-2000883 Currently, we have no dedicated memory allocator implemented by jemalloc. User can set environment variable `LD_PRELOAD` for lib jemalloc to let it override the corresponding C standard functions entirely. It may help alleviate OOM issues. -`spark.executorEnv.LD_PREALOD=/path/to/libjemalloc.so` +`spark.executorEnv.LD_PRELOAD=/path/to/libjemalloc.so` # How to run TPC-H on Velox backend diff --git a/docs/developers/UsingGperftoolsInCH.md b/docs/developers/UsingGperftoolsInCH.md index 5a4bbea3fbbc..3923c2b6c307 100644 --- a/docs/developers/UsingGperftoolsInCH.md +++ b/docs/developers/UsingGperftoolsInCH.md @@ -11,7 +11,7 @@ We need using gpertools to find the memory or CPU issue. That's what this docume Install gperftools as described in https://github.com/gperftools/gperftools. We get the library and the command line tools. 
-## Compiler libch.so +## Compile libch.so Disable jemalloc `-DENABLE_JEMALLOC=OFF` in cpp-ch/CMakeLists.txt, and recompile libch.so. ## Run Gluten with gperftools diff --git a/docs/developers/UsingJemallocWithCH.md b/docs/developers/UsingJemallocWithCH.md index 365a35dd39fe..e38cfa24b449 100644 --- a/docs/developers/UsingJemallocWithCH.md +++ b/docs/developers/UsingJemallocWithCH.md @@ -28,7 +28,7 @@ cd $Clickhouse_SOURCE_PATH/contrib/jemalloc && ./autogen.sh && ./configure.sh && ``` Then we get jeprof in the directory `$Clickhouse_SOURCE_PATH/contrib/jemalloc/bin/jeprof`. -## Compiler libch.so +## Compile libch.so Ensure to enable jemalloc `-DENABLE_JEMALLOC=ON` in cpp-ch/CMakeLists.txt, and compile libch.so. ## Run Gluten with jemalloc heap tools diff --git a/docs/get-started/ClickHouse.md b/docs/get-started/ClickHouse.md index 15c06abc0266..c0dd4002fc38 100644 --- a/docs/get-started/ClickHouse.md +++ b/docs/get-started/ClickHouse.md @@ -89,10 +89,10 @@ git submodule update --init --recursive ##### build There are several ways to build the backend library. -1. Build it direclty +1. Build it directly -If you have setup all requirements, you can use following command to build it direclty. +If you have set up all requirements, you can use the following command to build it directly. ```bash cd $gluten_root @@ -340,7 +340,7 @@ You need to add these additional configs to spark: --config spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY --config spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY ``` -where S3_ENDPOINT must follow the format of `https://s3.region-code.amazonaws.com`, e.g. `https://s3.us-east-1.amazonaws.com` (or `http://hostname:39090 for MINIO) +where S3_ENDPOINT must follow the format of `https://s3.region-code.amazonaws.com`, e.g. `https://s3.us-east-1.amazonaws.com` (or `http://hostname:39090` for MINIO) When you query the parquet files in S3, you need to add the prefix `s3a://` to the path, e.g. `s3a://your_bucket_name/path_to_your_parquet`. 
diff --git a/docs/get-started/VeloxGCS.md b/docs/get-started/VeloxGCS.md index 09e0a927cab4..77fe309a4646 100644 --- a/docs/get-started/VeloxGCS.md +++ b/docs/get-started/VeloxGCS.md @@ -10,7 +10,7 @@ Object stores offered by CSPs such as GCS are important for users of Gluten to s ## Installing the gcloud CLI -To access GCS Objects using Gluten and Velox, first you have to [download an install the gcloud CLI] (https://cloud.google.com/sdk/docs/install). +To access GCS Objects using Gluten and Velox, first you have to [download and install the gcloud CLI](https://cloud.google.com/sdk/docs/install). ## Configuring GCS using a user account @@ -22,7 +22,7 @@ After these steps, no specific configuration is required for Gluten, since the a ## Configuring GCS using a credential file For workloads that need to be fully automated, manually authorizing can be problematic. For such cases it is better to use a json file with the credentials. -This is described in the [instructions to configure a service account]https://cloud.google.com/sdk/docs/authorizing#service-account. +This is described in the [instructions to configure a service account](https://cloud.google.com/sdk/docs/authorizing#service-account). Such json file with the credentials can be passed to Gluten: diff --git a/docs/velox-configuration.md b/docs/velox-configuration.md index 2202fed3d5bc..767875bb167e 100644 --- a/docs/velox-configuration.md +++ b/docs/velox-configuration.md @@ -15,7 +15,7 @@ nav_order: 16 | spark.gluten.sql.columnar.backend.velox.SplitPreloadPerDriver | 2 | The split preload per task | | spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinPct | 90 | If partial aggregation aggregationPct greater than this value, partial aggregation may be early abandoned. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. 
| | spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinRows | 100000 | If partial aggregation input rows number greater than this value, partial aggregation may be early abandoned. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. | -| spark.gluten.sql.columnar.backend.velox.asyncTimeoutOnTaskStopping | 30000ms | Timeout for asynchronous execution when task is being stopped in Velox backend. It's recommended to set to a number larger than network connection timeout that the possible aysnc tasks are relying on. | +| spark.gluten.sql.columnar.backend.velox.asyncTimeoutOnTaskStopping | 30000ms | Timeout for asynchronous execution when task is being stopped in Velox backend. It's recommended to set to a number larger than network connection timeout that the possible async tasks are relying on. | | spark.gluten.sql.columnar.backend.velox.cacheEnabled | false | Enable Velox cache, default off. It's recommended to enablesoft-affinity as well when enable velox cache. | | spark.gluten.sql.columnar.backend.velox.cachePrefetchMinPct | 0 | Set prefetch cache min pct for velox file scan | | spark.gluten.sql.columnar.backend.velox.checkUsageLeak | true | Enable check memory usage leak. | @@ -24,7 +24,7 @@ nav_order: 16 | spark.gluten.sql.columnar.backend.velox.cudf.enableValidation | true | Heuristics you can apply to validate a cuDF/GPU plan and only offload when the entire stage can be fully and profitably executed on GPU | | spark.gluten.sql.columnar.backend.velox.cudf.memoryPercent | 50 | The initial percent of GPU memory to allocate for memory resource for one thread. | | spark.gluten.sql.columnar.backend.velox.cudf.memoryResource | async | GPU RMM memory resource. | -| spark.gluten.sql.columnar.backend.velox.cudf.shuffleMaxPrefetchBytes | 1028MB | Maximum bytes to prefetch in CPU memory during GPU shuffle read while waitingfor GPU available. 
| +| spark.gluten.sql.columnar.backend.velox.cudf.shuffleMaxPrefetchBytes | 1028MB | Maximum bytes to prefetch in CPU memory during GPU shuffle read while waiting for the GPU to become available. | | spark.gluten.sql.columnar.backend.velox.directorySizeGuess | 32KB | Deprecated, rename to spark.gluten.sql.columnar.backend.velox.footerEstimatedSize | | spark.gluten.sql.columnar.backend.velox.enableTimestampNtzValidation | true | Enable validation fallback for TimestampNTZ type. When true (default), any plan containing TimestampNTZ will fall back to Spark execution. Set to false during development/testing of TimestampNTZ support to allow native execution. | | spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled | false | Disables caching if false. File handle cache should be disabled if files are mutable, i.e. file content may change while file path stays the same. | @@ -78,7 +78,7 @@ nav_order: 16 | spark.gluten.sql.enable.enhancedFeatures | true | Enable some features including iceberg native write and other features. | | spark.gluten.sql.rewrite.castArrayToString | true | When true, rewrite `cast(array as String)` to `concat('[', array_join(array, ', ', null), ']')` to allow offloading to Velox. | | spark.gluten.velox.broadcast.build.targetBytesPerThread | 32MB | It is used to calculate the number of hash table build threads. Based on our testing across various thresholds (1MB to 128MB), we recommend a value of 32MB or 64MB, as these consistently provided the most significant performance gains. | -| spark.gluten.velox.castFromVarcharAddTrimNode | false | If true, will add a trim node which has the same sementic as vanilla Spark to CAST-from-varchar.Otherwise, do nothing. | +| spark.gluten.velox.castFromVarcharAddTrimNode | false | If true, will add a trim node which has the same semantic as vanilla Spark to CAST-from-varchar. Otherwise, do nothing. 
| ## Gluten Velox backend *experimental* configurations diff --git a/docs/velox-spark-configuration.md b/docs/velox-spark-configuration.md index 6543ffd8ffe0..d1fe199b3b84 100644 --- a/docs/velox-spark-configuration.md +++ b/docs/velox-spark-configuration.md @@ -2,7 +2,7 @@ layout: page title: Spark configurations status in Gluten Velox Backend nav_order: 17 -The file lists the if Spark configurations are hornored by Gluten velox backend or not. Table is from Spark4.0 configuration page. The status are: +This file lists whether Spark configurations are honored by the Gluten Velox backend. The table is from the Spark 4.0 configuration page. The statuses are: - ✅ Supported
- ❌ Not Supported
- ⚠️ Partial Support