The configuration file is one of the most important components in DataStage.
The configuration file tells DataStage the following:
1. The degree of data partitioning used to scale processing
2. System resources such as "Temp Storage" and "Scratch Disk"
3. Resources for database and buffer storage
Tips for Job Optimization:
1. As a rule of thumb, set the number of nodes equal to the number of CPUs available
2. Use multiple configuration files
   a. For low-volume data, use a single-node configuration
   b. For large-volume data, use a multi-node configuration
3. Spread processing across multiple machines by adding nodes from different machines
   a. Avoid re-partitioning data in such scenarios, because re-partitioning across the network is a costly operation
4. To maximize I/O throughput, give each node multiple, distinct resource disks for its datasets
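Tip 2 is easier to follow if the single-node and multi-node files are generated from one template. The following is a minimal sketch, not a DataStage utility: the function name build_config, the fastname default and the /data paths are all hypothetical and must be replaced with values from your own environment. A job picks up a configuration file through the APT_CONFIG_FILE environment variable, so each generated file can be saved under its own name and selected per job.

```python
def build_config(node_count, fastname="machine1", base="/data"):
    """Return the text of a parallel configuration file with node_count nodes.

    fastname and base are placeholder values (assumptions), not real hosts or paths.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'  node "node{i}" {{',
            f'    fastname "{fastname}"',
            '    pools ""',  # every node joins the default pool
            f'    resource disk "{base}/disk{i}" {{pools ""}}',
            f'    resource scratchdisk "{base}/scratch{i}" {{pools ""}}',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


# One file for low-volume jobs, one for large-volume jobs.
single_node = build_config(1)
multi_node = build_config(4)
```

Each node here gets its own disk and scratch directory, which follows tip 4 only if those directories sit on different file systems.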
Sample Configuration file (Config File):
{
  node "node1" {
    fastname "machine1"
    pools "" "oraclewrite"
    resource disk "/Local_1/mypath" {pools ""}
    resource disk "/Local_2/mypath" {pools ""}
    resource scratchdisk "/Local_3_1/mypath" {pools ""}
    resource scratchdisk "/Local_3_2/mypath" {pools ""}
  }
  node "node2" {
    fastname "machine2"
    pools "" "oraclewrite"
    resource disk "/Local_4/mypath" {pools ""}
    resource scratchdisk "/Local_5_1/mypath" {pools ""}
    resource scratchdisk "/Local_5_2/mypath" {pools ""}
  }
  node "node3" {
    fastname "machine2"
    pools "oraclewrite"
    resource disk "/Local_4/mypath" {pools ""}
    resource scratchdisk "/Local_5_1/mypath" {pools ""}
    resource scratchdisk "/Local_5_2/mypath" {pools ""}
  }
}
The sample configuration file is explained below:
1. The job runs on two different machines, "machine1" and "machine2"
2. Multiple "resource disk" and "resource scratchdisk" entries give each node more storage space and I/O bandwidth for processing (these entries are disk locations, not memory)
3. The "oraclewrite" pool is assigned to all three nodes, so stages that read from or write to the Oracle DB can run in parallel on all of them
4. Other stages run on only two nodes, because the default pool ("") is defined only on node1 and node2
5. For the Netezza Connector, enable "Partitioned Read"; the number of reads then equals the number of data partitions in the configuration file
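The pool membership behind points 3 and 4 can be checked with a short script. This is a rough sketch under stated assumptions: nodes_by_pool is a hypothetical helper, not part of DataStage, and it only understands the simple layout of the sample file (one statement per line, straight double quotes). SAMPLE is a trimmed copy of the sample configuration with the scratchdisk lines left out.

```python
import re
from collections import defaultdict

# Trimmed copy of the sample configuration (scratchdisk lines omitted).
SAMPLE = '''
{
  node "node1" {
    fastname "machine1"
    pools "" "oraclewrite"
    resource disk "/Local_1/mypath" {pools ""}
  }
  node "node2" {
    fastname "machine2"
    pools "" "oraclewrite"
    resource disk "/Local_4/mypath" {pools ""}
  }
  node "node3" {
    fastname "machine2"
    pools "oraclewrite"
    resource disk "/Local_4/mypath" {pools ""}
  }
}
'''


def nodes_by_pool(config_text):
    """Map each node pool name to the list of nodes that belong to it."""
    pools = defaultdict(list)
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        match = re.match(r'node\s+"([^"]+)"', line)
        if match:
            current = match.group(1)
            continue
        # Only node-level "pools" lines count; resource lines start with "resource".
        if current and line.startswith("pools"):
            for name in re.findall(r'"([^"]*)"', line):
                pools[name].append(current)
    return dict(pools)


membership = nodes_by_pool(SAMPLE)
for pool, nodes in membership.items():
    label = pool if pool else "(default)"
    print(f"{label}: {len(nodes)} node(s): {', '.join(nodes)}")
```

For the sample file, the default pool contains node1 and node2, while "oraclewrite" contains all three nodes, which matches the explanation above.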