Collate Documentation

Spark Engine External Configuration

To configure your profiler pipeline to use Spark Engine, you need to add the processingEngine configuration to your existing YAML file.

Before configuring, ensure you have completed the Spark Engine Prerequisites and understand the Partitioning Requirements.

In your existing profiler YAML, add the processingEngine section under sourceConfig.config:
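A minimal sketch of what this addition can look like is shown below. The Spark Connect address and the exact nesting of fields are illustrative assumptions; verify the schema supported by your version against the Spark Engine Prerequisites.

```yaml
sourceConfig:
  config:
    type: Profiler
    # Illustrative sketch -- field names and the Spark Connect
    # address below are assumptions; check your version's schema.
    processingEngine:
      type: Spark
      remote: sc://your-spark-connect-host:15002
```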

In the processor.config.tableConfig section, add the sparkTableProfilerConfig:
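For example, a table entry with Spark-specific profiler settings might look like the sketch below. The fully qualified name and the fields under sparkTableProfilerConfig are illustrative; confirm the exact field names for your version.

```yaml
processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: service.database.schema.table
        # Illustrative sketch -- field names are assumptions;
        # the partition column drives Spark's parallel reads.
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: id
```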

  1. Add processingEngine to sourceConfig.config
  2. Add sparkTableProfilerConfig to your table configuration
  3. Specify a partition column for Spark processing (otherwise the profiler falls back to the primary key, if one exists, or skips the table entirely)

Use the same command as before:
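For example, assuming your configuration is saved as profiler.yaml, the standard profiler CLI invocation is:

```shell
metadata profile -c profiler.yaml
```

No command-line changes are needed; the engine switch is driven entirely by the YAML configuration.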

The pipeline will now use Spark Engine instead of the Native engine for processing.

Common issues:

  1. Missing Partition Column: Ensure you've specified a suitable partition column
  2. Network Connectivity: Verify Spark Connect and database connectivity
  3. Driver Issues: Check that the appropriate database drivers are installed in the Spark cluster
  4. Configuration Errors: Validate the YAML syntax and required fields
Debugging steps:

  1. Check Logs: Review the profiler logs for specific error messages
  2. Test Connectivity: Verify all network connections are working
  3. Validate Configuration: Ensure all required fields are properly set
  4. Test with Small Dataset: Start with a small table to verify the setup
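For the connectivity step, a quick TCP reachability check can rule out network problems before digging into Spark itself. The sketch below is a generic helper, not part of the Collate tooling; the hostnames and ports are placeholders for your own Spark Connect endpoint and database.

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

if __name__ == "__main__":
    # Placeholder endpoints -- substitute your Spark Connect host/port
    # and your database host/port.
    for name, host, port in [
        ("Spark Connect", "spark-connect-host", 15002),
        ("Database", "db-host", 5432),
    ]:
        status = "reachable" if check_port(host, port) else "unreachable"
        print(f"{name} ({host}:{port}): {status}")
```

If either endpoint reports unreachable, fix the network path (firewalls, security groups, DNS) before re-running the profiler.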