Merged
@@ -56,7 +56,7 @@ DB::Array AggregateFunctionParserBloomFilterAgg::parseFunctionParameters(
{
if (func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE || func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_RESULT)
{
- auto get_parameter_field = [](const DB::ActionsDAG::Node * node, size_t /*paramter_index*/) -> DB::Field
+ auto get_parameter_field = [](const DB::ActionsDAG::Node * node, size_t /*parameter_index*/) -> DB::Field
{
Field ret;
node->column->get(0, ret);
2 changes: 1 addition & 1 deletion docs/developers/HowTo.md
@@ -156,7 +156,7 @@ gdb ${GLUTEN_HOME}/cpp/build/releases/libgluten.so 'core-Executor task l-2000883
Currently, we have no dedicated memory allocator implemented by jemalloc. User can set environment variable `LD_PRELOAD` for lib jemalloc
to let it override the corresponding C standard functions entirely. It may help alleviate OOM issues.

- `spark.executorEnv.LD_PREALOD=/path/to/libjemalloc.so`
+ `spark.executorEnv.LD_PRELOAD=/path/to/libjemalloc.so`
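
For reference, the corrected property can also be set persistently in `spark-defaults.conf`; the library path below is a placeholder for wherever jemalloc is actually installed:

```
# spark-defaults.conf (library path is a placeholder)
spark.executorEnv.LD_PRELOAD  /path/to/libjemalloc.so
```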

# How to run TPC-H on Velox backend

2 changes: 1 addition & 1 deletion docs/developers/UsingGperftoolsInCH.md
@@ -11,7 +11,7 @@ We need using gpertools to find the memory or CPU issue. That's what this docume
Install gperftools as described in https://github.com/gperftools/gperftools.
We get the library and the command line tools.

- ## Compiler libch.so
+ ## Compile libch.so
Disable jemalloc `-DENABLE_JEMALLOC=OFF` in cpp-ch/CMakeLists.txt, and recompile libch.so.

## Run Gluten with gperftools
2 changes: 1 addition & 1 deletion docs/developers/UsingJemallocWithCH.md
@@ -28,7 +28,7 @@ cd $Clickhouse_SOURCE_PATH/contrib/jemalloc && ./autogen.sh && ./configure.sh &&
```
Then we get jeprof in the directory `$Clickhouse_SOURCE_PATH/contrib/jemalloc/bin/jeprof`.

- ## Compiler libch.so
+ ## Compile libch.so
Ensure to enable jemalloc `-DENABLE_JEMALLOC=ON` in cpp-ch/CMakeLists.txt, and compile libch.so.

## Run Gluten with jemalloc heap tools
6 changes: 3 additions & 3 deletions docs/get-started/ClickHouse.md
@@ -89,10 +89,10 @@ git submodule update --init --recursive
##### build

There are several ways to build the backend library.
- 1. Build it direclty
+ 1. Build it directly


- If you have setup all requirements, you can use following command to build it direclty.
+ If you have setup all requirements, you can use following command to build it directly.

```bash
cd $gluten_root
@@ -340,7 +340,7 @@ You need to add these additional configs to spark:
--config spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
--config spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```
- where S3_ENDPOINT must follow the format of `https://s3.region-code.amazonaws.com`, e.g. `https://s3.us-east-1.amazonaws.com` (or `http://hostname:39090 for MINIO)
+ where S3_ENDPOINT must follow the format of `https://s3.region-code.amazonaws.com`, e.g. `https://s3.us-east-1.amazonaws.com` (or `http://hostname:39090` for MINIO)

When you query the parquet files in S3, you need to add the prefix `s3a://` to the path, e.g. `s3a://your_bucket_name/path_to_your_parquet`.
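
The prefix rule above can be sketched in plain Python; the bucket and key names are hypothetical placeholders, not values from the docs:

```python
# Minimal sketch: building the s3a:// form of an S3 object path
# that Gluten expects for Parquet reads. Names are placeholders.
def to_s3a(bucket: str, key: str) -> str:
    # Strip any leading slash on the key so we don't emit s3a://bucket//key.
    return f"s3a://{bucket}/{key.lstrip('/')}"

print(to_s3a("your_bucket_name", "path_to_your_parquet"))
```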

4 changes: 2 additions & 2 deletions docs/get-started/VeloxGCS.md
@@ -10,7 +10,7 @@ Object stores offered by CSPs such as GCS are important for users of Gluten to s

## Installing the gcloud CLI

- To access GCS Objects using Gluten and Velox, first you have to [download an install the gcloud CLI] (https://cloud.google.com/sdk/docs/install).
+ To access GCS Objects using Gluten and Velox, first you have to [download and install the gcloud CLI](https://cloud.google.com/sdk/docs/install).


## Configuring GCS using a user account
@@ -22,7 +22,7 @@ After these steps, no specific configuration is required for Gluten, since the a
## Configuring GCS using a credential file

For workloads that need to be fully automated, manually authorizing can be problematic. For such cases it is better to use a json file with the credentials.
- This is described in the [instructions to configure a service account]https://cloud.google.com/sdk/docs/authorizing#service-account.
+ This is described in the [instructions to configure a service account](https://cloud.google.com/sdk/docs/authorizing#service-account).

Such json file with the credentials can be passed to Gluten:

6 changes: 3 additions & 3 deletions docs/velox-configuration.md
@@ -15,7 +15,7 @@ nav_order: 16
| spark.gluten.sql.columnar.backend.velox.SplitPreloadPerDriver | 2 | The split preload per task |
| spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinPct | 90 | If partial aggregation aggregationPct greater than this value, partial aggregation may be early abandoned. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinRows | 100000 | If partial aggregation input rows number greater than this value, partial aggregation may be early abandoned. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
- | spark.gluten.sql.columnar.backend.velox.asyncTimeoutOnTaskStopping | 30000ms | Timeout for asynchronous execution when task is being stopped in Velox backend. It's recommended to set to a number larger than network connection timeout that the possible aysnc tasks are relying on. |
+ | spark.gluten.sql.columnar.backend.velox.asyncTimeoutOnTaskStopping | 30000ms | Timeout for asynchronous execution when task is being stopped in Velox backend. It's recommended to set to a number larger than network connection timeout that the possible async tasks are relying on. |
| spark.gluten.sql.columnar.backend.velox.cacheEnabled | false | Enable Velox cache, default off. It's recommended to enablesoft-affinity as well when enable velox cache. |
| spark.gluten.sql.columnar.backend.velox.cachePrefetchMinPct | 0 | Set prefetch cache min pct for velox file scan |
| spark.gluten.sql.columnar.backend.velox.checkUsageLeak | true | Enable check memory usage leak. |
@@ -24,7 +24,7 @@ nav_order: 16
| spark.gluten.sql.columnar.backend.velox.cudf.enableValidation | true | Heuristics you can apply to validate a cuDF/GPU plan and only offload when the entire stage can be fully and profitably executed on GPU |
| spark.gluten.sql.columnar.backend.velox.cudf.memoryPercent | 50 | The initial percent of GPU memory to allocate for memory resource for one thread. |
| spark.gluten.sql.columnar.backend.velox.cudf.memoryResource | async | GPU RMM memory resource. |
- | spark.gluten.sql.columnar.backend.velox.cudf.shuffleMaxPrefetchBytes | 1028MB | Maximum bytes to prefetch in CPU memory during GPU shuffle read while waitingfor GPU available. |
+ | spark.gluten.sql.columnar.backend.velox.cudf.shuffleMaxPrefetchBytes | 1028MB | Maximum bytes to prefetch in CPU memory during GPU shuffle read while waiting for GPU available. |
| spark.gluten.sql.columnar.backend.velox.directorySizeGuess | 32KB | Deprecated, rename to spark.gluten.sql.columnar.backend.velox.footerEstimatedSize |
| spark.gluten.sql.columnar.backend.velox.enableTimestampNtzValidation | true | Enable validation fallback for TimestampNTZ type. When true (default), any plan containing TimestampNTZ will fall back to Spark execution. Set to false during development/testing of TimestampNTZ support to allow native execution. |
| spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled | false | Disables caching if false. File handle cache should be disabled if files are mutable, i.e. file content may change while file path stays the same. |
@@ -78,7 +78,7 @@ nav_order: 16
| spark.gluten.sql.enable.enhancedFeatures | true | Enable some features including iceberg native write and other features. |
| spark.gluten.sql.rewrite.castArrayToString | true | When true, rewrite `cast(array as String)` to `concat('[', array_join(array, ', ', null), ']')` to allow offloading to Velox. |
| spark.gluten.velox.broadcast.build.targetBytesPerThread | 32MB | It is used to calculate the number of hash table build threads. Based on our testing across various thresholds (1MB to 128MB), we recommend a value of 32MB or 64MB, as these consistently provided the most significant performance gains. |
- | spark.gluten.velox.castFromVarcharAddTrimNode | false | If true, will add a trim node which has the same sementic as vanilla Spark to CAST-from-varchar.Otherwise, do nothing. |
+ | spark.gluten.velox.castFromVarcharAddTrimNode | false | If true, will add a trim node which has the same semantic as vanilla Spark to CAST-from-varchar.Otherwise, do nothing. |

## Gluten Velox backend *experimental* configurations

2 changes: 1 addition & 1 deletion docs/velox-spark-configuration.md
@@ -2,7 +2,7 @@ layout: page
title: Spark configurations status in Gluten Velox Backend
nav_order: 17

- The file lists the if Spark configurations are hornored by Gluten velox backend or not. Table is from Spark4.0 configuration page. The status are:
+ The file lists the if Spark configurations are honored by Gluten velox backend or not. Table is from Spark4.0 configuration page. The status are:
- ✅ Supported<br>
- ❌ Not Supported<br>
- ⚠️ Partial Support<br>