Binary Journal: Datastage

Showing posts with label Datastage. Show all posts

Thursday, December 12, 2013

Creating Datastage Routines for hash files

I created some datastage routines to help manipulate hash files more easily. Instead of trying to work with the Admin console, I created reusable routines that will:

Count rows in a hash file

Remove rows from a hash file

Select rows from a hash file

I'm not exactly an expert, so my code is fairly simple and straightforward.

To create a routine, go up to File, and select New and Server Routine.

Give the routine a name and description, and choose the type you want it to be.

Most of my routines are 'Transform Function' type, because I use them within Transformers. They take data as arguments, and return results.

Next, you add what arguments your function will be using (if any). Then your code. Compile it. Test it. Save it. Done.

Below is the code I use for my different hash file routines:

1. Count rows in a hashed file

I use one argument, the name of the hashed file I want to count.

The Code:

**This creates a pointer called myhash that points to your hashfile (Arg1). I hard code the hashfile location.
Command1 = "SETFILE C:\IBM\InformationServer\Server\Projects\<YourProjectName>\<YourHashDir>\":Arg1:" myhash"
Call DSExecute('TCL', Command1, Output, ReturnCode)

ErrorCode = ReturnCode ;* set this to non-zero to stop the stage/job

**Next you simply execute the COUNT command on your hashfile pointer. The result gets returned in Ans.
Command2 = "COUNT myhash"
Call DSExecute('TCL', Command2, Output, ReturnCode)
** The Ans output has crazy formatting because it returns as '*38454 records counted.*' And I want it to return 38454.
Ans = (Output [' ',1,1]) [2,10]

ErrorCode = ReturnCode ;* set this to non-zero to stop the stage/job

** Remove Pointer when done **
Command3 = "DELETE VOC myhash"
Call DSExecute('TCL', Command3, Output, ReturnCode)

ErrorCode = ReturnCode ;* set this to non-zero to stop the stage/job

*******

Here is an example of how I then use the routine in a Transform.

The routine is called HSUCountHashRows(). I use a job parameter (vHASH) to capture the hashed file name to be counted. The name gets passed to the routine, as the argument. I add the "+ 0" to convert the answer to an integer. The transform then returns the file name and the resulting count.

2. Delete Hashed row by Primary key

This routine takes 3 arguments. The hashed file name, the column name that holds the primary key, and the primary key itself.

The Code

** Set file Pointer **
**This creates a pointer called myhash that points to your hashfile (Arg1). I hard code the hashfile location.
Command1 = "SETFILE C:\IBM\InformationServer\Server\Projects\<YourProjectName>\<YourHashDir>\":Arg1:" myhash"Call DSExecute('TCL', Command1, Output, ReturnCode)

Command2 = "DELETE FROM myhash WHERE ":Arg2:" = '":Arg3:"'"
Call DSExecute('TCL', Command2, Output, ReturnCode)
Ans = Output

ErrorCode = ReturnCode ;* set this to non-zero to stop the stage/job

*******

3. Delete Hashed row by Date

This routine deletes rows within a certain date range. I normally use this when I need to redo a load for the day. I delete from the hashed file those records with a load date (CREATE_EW_DTTM) of today.

The Code

**This creates a pointer called myhash that points to your hashfile (Arg1). I hard code the hashfile location.
Command1 = "SETFILE C:\IBM\InformationServer\Server\Projects\<YourProjectName>\<YourHashDir>\":Arg1:" myhash"Call DSExecute('TCL', Command1, Output, ReturnCode)

Command2 = "DELETE FROM myhash WHERE CREATE_EW_DTTM > '":Arg2:"' AND CREATE_EW_DTTM < '":Arg3:"'"
Call DSExecute('TCL', Command2, Output, ReturnCode)
Ans = Output

** Remove Pointer when done **
Command3 = "DELETE VOC myhash"
Call DSExecute('TCL', Command3, Output, ReturnCode)

4. Select a Row with a given Primary Key

My hashed files all use the same column for the primary key, so I didn't need to pull in the column name as an argument.

The Code

**This creates a pointer called myhash that points to your hashfile (Arg1). I hard code the hashfile location.
Command1 = "SETFILE C:\IBM\InformationServer\Server\Projects\<YourProjectName>\<YourHashDir>\":Arg1:" myhash"

Call DSExecute('TCL', Command1, Output, ReturnCode)

Command2 = "SELECT * FROM myhash WHERE PRIM_KEY = '":Arg2:"'"
Call DSExecute('TCL', Command2, Output, ReturnCode)
Ans = Output

** Remove Pointer when done **
Command3 = "DELETE VOC myhash"
Call DSExecute('TCL', Command3, Output, ReturnCode)

ErrorCode = ReturnCode ;* set this to non-zero to stop the stage/job

***

Hope this has been helpful. Happy coding!

Tuesday, January 8, 2013

Datastage - Loading MySQL Text into Oracle Varchar

I had to create a simple job to load data from a MySQL database into an Oracle database.
Everything was working fine until I had to move MySQL data that was in a Text format, into an equivalent Oracle field.

I grab the data in my first stage, and move it into a field of type Char, with the necessary byte size. The data then gets loaded into CLOB fields in Oracle, and in Datastage I label them as type 'Char' with the necessary byte size.
So far, this seems to be working well.

---------------------------
You could also try using a User-created SQL statement, that casts the MySQL text data into characters.

Example: cast(description as char(2000)) as description

Then in DataStage, I have the data coming in as Varchar, and loading into Varchar2(2000) fields in Oracle. Keep in mind that I know these fields aren't very big, even though they are Text fields.
Perhaps not the most elegant, but it got the job done.

It also worked to have the data coming in as Char, and loading as Char. Make sure the field size is large enough.

Sunday, December 16, 2012

Reading Datastage Logs

When using IBM InfoSphere DataStage and QualityStage, it's important to know how to use the logs. They can answer a lot of questions, and help isolate problems.

This is a simple introduction to reading the logs in Director.

1. Change the Refresh
You'll learn pretty quick that the default refresh is way too short. The page will refresh every 5 seconds, making it practically impossible to read anything.
First thing you should do, is lengthen the refresh time, or change it to manuel refresh only.
Go to Tools -> Options... -> General Tab -> change the refresh time to 60 seconds or disable it.

2. Filter/Search for Warnings/Abort
Open your log file in Director. Go to View -> Filter Entries...
This will open up a box where you can filter for particular errors (ex. Warnings). You can also use Ctrl-F to search for a specific word or phrase within your log. You can also have it show more or less of the logs.

3. Check Related Log(s)

When you are checking a log for a sequence, that has several jobs, you can right-click a "Run Job" line, to see the log for that specific job. Click on Related Log.

4. Check the Details
When jobs abort, check the details. This is obvious, but some people gloss over these. Often there is more information about what caused the job to abort. Bad queries may display an error of bad syntax. A record with bad data may cause the job to abort, and often the record's data will be displayed in the error, so you can find the record and fix it, or delete it.

This is very basic information, but this is the best place to start when troubleshooting. For any just beginning in Datastage, hopefully this was helpful and Good Luck!