Locally Maven-ize Pentaho Kettle and develop a Data Integration webapp with Eclipse (also integrated with container managed datasource)


Install Kettle’s JAR into the local repository

Run the following install:install-file commands from the command line:

  • mvn install:install-file -DgroupId=pentaho.kettle -DartifactId=kettle-core -Dversion=4.0.0 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\lib\kettle-core.jar -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho.kettle -DartifactId=kettle-db -Dversion=4.0.0 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\lib\kettle-db.jar  -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho.kettle -DartifactId=kettle-engine -Dversion=4.0.0 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\lib\kettle-engine.jar  -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho.kettle -DartifactId=kettle-ui-swt -Dversion=4.0.0 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\libext\pentaho\kettle-ui-swt.jar  -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho.kettle -DartifactId=kettle-vfs -Dversion=4.0.0 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\libext\pentaho\kettle-vfs-20091118.jar  -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho -DartifactId=pentaho-libbase -Dversion=1.1.6 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\libext\pentaho\libbase-1.1.6.jar  -DgeneratePom=true
  • mvn install:install-file -DgroupId=pentaho -DartifactId=pentaho-libformula -Dversion=1.1.7 -Dpackaging=jar -Dfile=C:\Pdi-ce-4.0.0-stable\data-integration\libext\pentaho\libformula-1.1.7.jar  -DgeneratePom=true
  • Other JARs may be needed, depending on the libraries or transformation steps you use
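
The commands above differ only in their coordinates and file path, so they can be generated rather than typed by hand. The following is a minimal sketch: the class and method names are hypothetical, and it only prints the command lines without running Maven.

```java
final class InstallCommands {

    /** Builds one install:install-file command line for the given coordinates. */
    static String installCommand(String groupId, String artifactId, String version, String file) {
        return "mvn install:install-file"
                + " -DgroupId=" + groupId
                + " -DartifactId=" + artifactId
                + " -Dversion=" + version
                + " -Dpackaging=jar"
                + " -Dfile=" + file
                + " -DgeneratePom=true";
    }

    public static void main(String[] args) {
        // Directory taken from the commands above; adjust it to your installation.
        String lib = "C:\\Pdi-ce-4.0.0-stable\\data-integration\\lib\\";
        for (String name : new String[] {"kettle-core", "kettle-db", "kettle-engine"}) {
            System.out.println(installCommand("pentaho.kettle", name, "4.0.0", lib + name + ".jar"));
        }
    }
}
```

The printed lines can be pasted into a batch file, which also makes it easy to add further JARs later.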

Modify the Kettle dependencies POM

Edit the %MAVEN_REPOSITORY%\pentaho\kettle\kettle-core\4.0.0\kettle-core-4.0.0.pom file and add a dependencies element:

<?xml version="1.0" encoding="utf-8"?>

<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"







<description>POM was created from install:install-file</description>









































































Other dependencies may be needed, depending on the libraries or transformation steps you use. You can discover them by executing the Transformations or the Jobs at the end of this guide.

Note: the JDBC-Stdext exclusion will prevent the “Unable to resolve artifact: required artifacts missing: javax.sql:jdbc-stdext:jar:2.0” error according to http://www.osjava.org/issues/browse/SJN-74.html.

Use JNDI reference within Spoon

To simulate the JNDI datasource availability in Spoon, as it happens in an application server, we need to write a proper jdbc.properties file within the pdi-ce-4.0.0-stable\data-integration\simple-jndi directory:






After that you can use a JNDI database connection with JNDI name java:/comp/env/jdbc/NAME.

Note: remember to put the JAR with database drivers into the directory pdi-ce-4.0.0-stable\data-integration\libext\JDBC.

To set the jdbc.properties path explicitly, you can add the following VM arguments to the Spoon start command in spoon.bat or spoon.sh:

-Djava.naming.factory.initial="org.osjava.sj.SimpleContextFactory" -Dorg.osjava.sj.root="C:/directory/simple-jndi" -Dorg.osjava.sj.delimiter="/"

I suggest testing that connection within the Spoon interface.

Create a Data Integration webapp project with Eclipse

Launch Eclipse and create a Maven webapp project.

When you run Tomcat or another JEE container, you need to define the KETTLE_PLUGIN_BASE_FOLDERS variable in the “VM arguments” text area of the Run Configuration dialog box:

-DKETTLE_PLUGIN_BASE_FOLDERS="C:/pdi-ce-4.0.0-stable/data-integration/plugins"

If you don’t, you will hit a problem with the plugin loader: the KettleEnvironment initialization scans the Eclipse “plugins” directory instead of the Kettle “plugins” directory when searching for JARs, which wastes time because of the very large number of JARs in the Eclipse directory.

More plugin directories are allowed using comma separated values: “C:/pdi-ce-4.0.0-stable/data-integration/plugins, C:/dir/plugins”.

The argument -Dorg.osjava.sj.root="C:/directory/simple-jndi" is not necessary if you call KettleEnvironment.init(false) in the initialization phase (see below).

Import Kettle libraries

Importing the Kettle JARs into your webapp project is now easy: just add the “kettle-core” dependency. If “kettle-core” does not appear in your Maven artifact list, run Reindex Local Repository, available in the Maven Preferences dialog box.

Define datasource in context.xml (for Tomcat)

Create the META-INF\context.xml as for any other standard webapp project:

<?xml version="1.0" encoding="UTF-8"?>


<Resource name="jdbc/NAME" auth="Container" type="javax.sql.DataSource"

maxActive="100" maxIdle="30" maxWait="10000"

username="username" password="password"




Note: remember to add the database drivers to the classpath or to declare them as a Maven dependency.

Invoke the Kettle transformation from a JSP

You can test the success of this procedure using a JSP with this scriptlet:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<%@page import="org.pentaho.di.core.KettleEnvironment"%>
<%@page import="org.pentaho.di.core.util.EnvUtil"%>
<%@page import="org.pentaho.di.trans.TransMeta"%>
<%@page import="org.pentaho.di.trans.Trans"%>
<%@page import="org.pentaho.di.core.Result"%>
<%@page import="java.util.List"%>
<%@page import="org.pentaho.di.core.RowMetaAndData"%>
<%@page import="org.pentaho.di.core.exception.KettleException"%>
<html>
<body>
<%
try {
    // Initialize Kettle; "false" skips the simple-jndi setup because the container provides JNDI
    KettleEnvironment.init(false);
    EnvUtil.environmentInit();

    TransMeta transMeta = new TransMeta("C:\\PentahoTestIntegration\\test.ktr");
    Trans trans = new Trans(transMeta);

    trans.execute(null); // You can pass arguments instead of null.
    trans.waitUntilFinished();

    Result r = trans.getResult();
    List<RowMetaAndData> rowsResult = r.getRows();

    if (trans.getErrors() > 0) {
        throw new RuntimeException("There were errors during transformation execution.");
    }
} catch (KettleException e) {
    // Log or report the failure as appropriate for your application
    e.printStackTrace();
}
%>
</body>
</html>

If you get a ClassNotFoundException, modify the Kettle dependencies POM to include the missing JAR. If it is not available in a remote Maven repository, use the Maven install command to add it to your local repository.

If the Transformation succeeds, you can move the KettleEnvironment.init(false) and EnvUtil.environmentInit() calls into a ContextListener to initialize the Kettle components once at startup.

JBoss 5 and AOP: how-to guide to add AOP aspects to your application


A quick example of how to add an interceptor to your application, for instance to log or profile method invocations:

  • Add “-javaagent:pluggable-instrumentor.jar” into JAVA_OPTS variable of file run.bat (or run.sh)
  • Edit \conf\bootstrap\aop.xml: set the enableLoadtimeWeaving element to “true” and add the package you want to instrument to the include element
  • Create the jboss-aop.xml in the META-INF directory of your package like this:
<?xml version="1.0" encoding="UTF-8"?>
<aop xmlns="urn:jboss:aop-beans:1.0">
  <interceptor name="Int" class="com.xxx.LogInterceptor"/>
  <bind pointcut="execution(* com.*->*(..))">
    <interceptor-ref name="Int"/>
  </bind>
</aop>
  • Create the Interceptor class:
import org.jboss.aop.advice.Interceptor;
import org.jboss.aop.joinpoint.Invocation;
import org.jboss.aop.joinpoint.MethodInvocation;

public class LogInterceptor implements Interceptor {

  public String getName() {
    return "LogInterceptor";
  }

  public Object invoke(Invocation invocation) throws Throwable {
    long time = System.currentTimeMillis();
    try {
      // Continue with the next interceptor or the intercepted method
      return invocation.invokeNext();
    } finally {
      if (invocation instanceof MethodInvocation) {
        MethodInvocation mi = (MethodInvocation) invocation;
        long elapsed = System.currentTimeMillis() - time;
        String clazz = "";
        try {
          clazz = mi.getMethod().getDeclaringClass().getName();
        } catch (Throwable e) {
          // Keep the empty class name if the method cannot be resolved
        }
        // Here you can use a logger to log time and method name
      }
    }
  }
}
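
To try the same timing pattern without a JBoss server, here is a minimal sketch that swaps JBoss AOP for a plain JDK dynamic proxy (java.lang.reflect.Proxy). It is not the JBoss mechanism: it only intercepts calls made through an interface, and the Greeter interface and class names are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

final class TimingProxyDemo {

    /** Hypothetical service interface used only for this illustration. */
    interface Greeter {
        String greet(String name);
    }

    /** Wraps target so that every call through iface is timed and appended to log. */
    static <T> T timed(T target, Class<T> iface, StringBuilder log) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.currentTimeMillis();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the exception raised by the target itself
            } finally {
                long elapsed = System.currentTimeMillis() - start;
                log.append(method.getDeclaringClass().getSimpleName())
                   .append('.').append(method.getName())
                   .append(" took ").append(elapsed).append(" ms\n");
            }
        };
        Object instance = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(instance);
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        Greeter real = name -> "Hello, " + name;
        Greeter greeter = timed(real, Greeter.class, log);
        System.out.println(greeter.greet("JBoss"));
        System.out.print(log);
    }
}
```

As in the interceptor, the measurement sits in a finally block, so the elapsed time is recorded even when the intercepted method throws.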


Agile practices: User Story


A user story describes functionality that will be valuable to a user of a system. It is composed of three aspects:

  • Card: a written description of the story used for planning and as a reminder
  • Conversation: verbal conversations about the story that serve to flesh out the details of the story
  • Confirmation: tests that convey and document details and that can be used to determine when a story is complete


  • Independent: as much as possible (to prevent planning and estimation problems due to story dependencies);
  • Negotiable: stories are reminders for the team-customer conversation, and details are written as notes (one or two sentences). The challenge is learning to include just enough detail; further discussion becomes tests written on the back of the card (“if I don’t have an account, the application invites me to subscribe to the service”);
  • Valuable: gives value to the user, with no technical details;
  • Estimable: it’s important that the team is able to estimate. Common problems that block estimation: lack of domain knowledge (solution: talk to the product owner!), lack of technical knowledge (solution: build a proof of concept), the story is too big (solution: find smaller constituent stories);
  • Small: stories that are too big or too small are not suitable for planning (the burndown chart loses value, which prevents day-by-day evaluation of the project status);
  • Testable: successfully passing its tests proves that a story has been successfully developed and is Done!

Classic story statement template:

As a <role> I want to <what> So that <why>

Use Case versus User Story:

Use Case: Describes how an Actor interacts with the system to achieve a Goal
Focus is on user and validation – Tells a “complete story” with main flow and alternative flow (in case of error or different user behaviors).
User Story: A bite-size bit of functionality that has business value and can be developed in a few days.
Focus is on developer and production – Part of a “complete story”.


  • Try to write independent stories to avoid story dependencies in priority evaluation;
  • Split stories when developers say “too large”, when a story cannot fit in one iteration, or when it involves too much work;
  • A user story represents a team activity, so it can be of different kinds: standard (a standard user story), constraint (an abstract non-functional story such as “each password should be encrypted in the database”), bug (if you count bug fixing in your development activity), technical (other time-consuming tasks, for example “install a new profiling tool”);
  • Cards are small to reduce verbosity; use the back to capture stories in the form of acceptance tests (“try with alphanumeric input”, “try with the name field missing”, …);
  • Check that the story doesn’t contain technical jargon;
  • If you need to release an excluded story in the current iteration, you can change priorities, split a big story and exclude its lower-priority part, or reduce the scope of other stories.

Non-functional requirements:

We can treat some of these as standalone stories (password expiration, …), but others have to be written in a Non-Functional Requirements Guideline document (password encryption in the database, general validation rules, …) shared with the team. However, we must try to remember them and express some of them as acceptance tests during the discussion.

Documentation tasks:

The effort to write or update documents related to a specific user story (e.g. a functional document) must be included in the story estimation.

The effort for cross-story documents (installation guides, …) must be handled in a separate story (e.g. “As Operations I want an installation guide so that I can install the product”) with its own estimation.

Agile practices: Sprint review meeting (Demo!)


Led by the team members and attended by the Product Owner; anyone can join (usually the customer sends the invitations).


  • Make it a business-level demo; if the software has no UI, encourage the team to create one;
  • Avoid long PowerPoint presentations and let the audience try the software;
  • Use standard, shared and known tools (wiki, team room, …);
  • Begin with a clear presentation of goal;
  • Be informative (e.g. show graph to explain performance improvements, etc.);
  • Bringing food (such as biscuits) to a meeting is a good way to relax people and make the meeting friendlier, to break up a long meeting, or to encourage people to arrive on time;
  • If possible, use informal or funny elements (pictures, examples, …) to create an entertaining atmosphere.

Finalize the demo:

  • On the last day of the iteration, finalize the demo (everyone has to play a part!):
  • Clarify which stories are complete and ready to demo;
  • Decide on a running order for presenting the stories;
  • Agree who will be presenting which stories;
  • Schedule the activities to finalize and organize a run-through to rehearse the demo (don’t undervalue preparation!);
  • Reconnaissance of the demo room: network connectivity, proxy, browser version, projector, whiteboard, …;
  • Technical check of the demo test environment: integration status, software configuration, build version,… .

Demo Agenda:

  • Introduction for the customer (overview of goal and user stories chosen) using standard team’s tools (wiki, …);
  • Demonstrate the stories and ensure that positive and negative feedback is captured (use cards or other sheets; a whiteboard can distract the audience);
  • Review the main points with the group to check that nothing has been missed; these can be used for the next iterations;
  • Celebrate!

Other references: It’s all about marketing.

Agile do it better?


Agile is not a silver bullet that lowers costs and does things better with the same team you always use for RUP or Waterfall projects, or worse, with a new inexperienced team, just because “Agile do it better!”.
I like a metaphor that can be our starting point for the discussion:
“Waterfall is like determining where your child will go to college while he’s still in diapers: you hope for Harvard, you save money, lecture the kid on value of a good education but…”
“Agile is like reading age appropriate stories to your child, reading just ahead on childhood development, testing and adapting, rather than expecting and looking for expectation.”
Before you go on, I want to make it clear that if your project has to solve a well-known problem that is not likely to change and has a well-defined scope (maintenance is a common example), please stop reading… In this scenario, selling or buying an Agile project for cost saving isn’t always a good idea. I’ll explain why: Agile is about lower overall cost, but lower overall cost isn’t the same as doing things cheaply!

  • Agile addresses risk in a concrete way when the client really doesn’t know what they want or need until they’re in it (a classic statement is “If I can’t see it, I can’t decide what I want!”). Otherwise, if the client knows what they want and is able to express and communicate the requirements appropriately (I have yet to meet such a client!), you can do it the traditional way!
  • Agile doesn’t aim to be a low-cost alternative at all. In Agile the client pays for the risk reduction that an Agile team provides and, considering only the overall cost component of the project, the unit cost of an Agile project could be higher. But within that kind of project the client gets as much ROI as the budget will allow over well-defined periods: shorter time-to-market, increased adaptability and greater ROI over the lifetime of the product. Agile offers earlier opportunities to start getting ROI, through the ability to sell a critical subset of the features to a subset of your potential users, providing good value to part of your market even if the product lacks some features. Even if a project ends prematurely, the client has something of value.
    The benefits are about revenue, not costs. One more time: if the overall cost is our client’s only concern, Agile is not our silver bullet!
  • If a company doesn’t care about quality and defect costs, it ignores a cost component of its project that it will pay sooner or later. In this scenario, once again, Agile (which puts more effort into quality, testing and evolutionary design) will look more expensive on our cost spreadsheet.
    If you want code without automated tests, you can do your project in half the time… do it the traditional way! But what will it cost later? (I hope you’ll never curse the lack of regression tests!)
    Even more, from a business perspective: I can plan and design everything now, but what will be the cost of business/process unknowns and market changes later, when I can’t handle them?
  • Sooner or later some projects are doomed to fail, no matter what method is used to develop them. The earlier you can measure your project’s success, the earlier you can react appropriately and either turn things around or, in the worst case, redirect the remaining resources into something else that will hopefully provide ROI. The way Agile works tends to produce feedback early, so you find out early if something just isn’t going to work no matter what (technical problems or improvements, market changes, business model corrections).
  • An inexperienced team will be an inexperienced team no matter the project methodology. Agile, with smaller iterations and frequent process reviews, allows an inexperienced team to learn and improve faster if well coached.

Before you start to embrace Agile for your project, take some time to evaluate other factors: customer relationship, contractual constraints, distributed teams, …
No single methodology is automatically best suited to every project, product, customer and team. Find your own way!

Agile Practices: Estimation


Agile time management depends on two variables:

  • Team Velocity, i.e. the number of story points per iteration
  • Story Size, i.e. the size associated with a story

There are different ways to evaluate story size; here are some suggestions.

Approach 1 – Ideal time:

During the planning meeting, estimate stories in ideal days of work, i.e. days without any interruptions whatsoever (no telephone, no meetings, …).
Velocity comes from the historical series of iteration velocities, but for the first iteration we can define Initial Velocity as:

(Number of team members/Load factor) * Sprint days

Load factor is a number that adjusts the ideal day for the impact of distractions on the team’s performance.
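
As a worked example of the formula above, here is a minimal sketch. The team size, load factor and sprint length are made-up values, and the class name is hypothetical.

```java
final class IdealTimeVelocity {

    /** Initial velocity in ideal days: (team members / load factor) * sprint days. */
    static double initialVelocity(int teamMembers, double loadFactor, int sprintDays) {
        if (loadFactor <= 0) {
            throw new IllegalArgumentException("loadFactor must be positive");
        }
        return (teamMembers / loadFactor) * sprintDays;
    }

    public static void main(String[] args) {
        // Assumed values: 4 team members, load factor 2, 10-day sprint.
        System.out.println(initialVelocity(4, 2.0, 10)); // prints 20.0
    }
}
```

With a load factor of 2, each calendar day yields half an ideal day per person, so this team can plan about 20 ideal days of stories for its first sprint.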


  • Easier to explain outside the team
  • Easier to estimate at first
  • Makes the team more comfortable

Approach 2 – Story Point:

If you don’t have previous estimated user stories, ask the team members to select one story which they think is about average effort to implement.
Give the value of 5 story points to this story and estimate all the other stories relative to the selected story.
Break down the selected story to estimate its parts in hours and define the value of initial story point in hours.
Estimated initial velocity is:

(Sprint hours/hours per Story Point)*Focus factor

with the focus factor expressed as a percentage.
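
The calculation can be sketched as follows. All numbers are assumptions chosen for illustration: a 5-point reference story broken down into 40 hours, 400 sprint hours for the whole team, and a 75% focus factor. The class name is hypothetical.

```java
final class StoryPointVelocity {

    /** Hours per story point, derived from a reference story broken down into hours. */
    static double hoursPerStoryPoint(double referenceStoryHours, int referenceStoryPoints) {
        return referenceStoryHours / referenceStoryPoints;
    }

    /** (sprint hours / hours per story point) * focus factor, focus factor given as a percentage. */
    static double initialVelocity(double sprintHours, double hoursPerPoint, int focusFactorPercent) {
        return (sprintHours / hoursPerPoint) * (focusFactorPercent / 100.0);
    }

    public static void main(String[] args) {
        double hoursPerPoint = hoursPerStoryPoint(40, 5);            // 8 hours per story point
        System.out.println(initialVelocity(400, hoursPerPoint, 75)); // prints 37.5
    }
}
```

Once a few iterations are complete, the measured velocity should replace this estimate.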


  • Faster
  • Does not decay
  • Pure measure of size

A good starting compromise (to define initial velocity):

For your first iteration, defining “1 story point = 1 ideal day” helps the team get started. Then gradually move the team to thinking in unit-less story points (“this story is like the story …”) and stop talking about how long a story will take.

JConsole problem: process not visible even in JPS!


I set the JMX argument in JBoss and Tomcat (-Dcom.sun.management.jmxremote) and I got a
“– process information unavailable” message in JPS and “The management agent is not enabled on this process” in JConsole.
I solved it by setting the TMP environment variable to a directory other than C:\WINDOWS\Temp: SET TMP=c:\temp.
Maybe it is due to directory permissions.
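
To check the permissions hypothesis, a small diagnostic like the following can help. It is only a sketch: it assumes the HotSpot convention of publishing per-process data in an hsperfdata_<user> folder under the temp directory, and the class name is hypothetical.

```java
import java.io.File;

final class TempDirCheck {

    /** Describes whether dir exists as a directory and is writable by the current user. */
    static String describe(File dir) {
        return dir.getAbsolutePath()
                + " exists=" + dir.isDirectory()
                + " writable=" + dir.canWrite();
    }

    public static void main(String[] args) {
        // Temp directory as seen by this JVM (on Windows it follows the TMP variable).
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(describe(tmp));

        // Assumed location of the performance data that jps and JConsole look for.
        File perfData = new File(tmp, "hsperfdata_" + System.getProperty("user.name"));
        System.out.println(describe(perfData));
    }
}
```

Run it with the same user and environment as the server process; a missing or non-writable directory would be consistent with the symptoms described above.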

Code snippets


Some code snippets useful every day:

Read a file line by line:

InputStream in=this.getClass().getClassLoader().getResourceAsStream("aaa.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String line;
while ((line = br.readLine()) != null) { .... }
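
A self-contained variant of the same loop, as a minimal sketch: it reads from an in-memory string instead of a classpath resource so that it runs anywhere, and it uses try-with-resources to close the reader. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LineReaderDemo {

    /** Reads every line from source and closes it when done. */
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            // readLine() returns null only at the end of the stream
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines(new StringReader("first\nsecond\nthird"))); // prints [first, second, third]
    }
}
```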

Integer to hexadecimal string:

String hex=Integer.toHexString(number);
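
A few cases worth knowing, shown as a minimal sketch (the class and helper names are hypothetical): toHexString does not pad with zeros, and negative numbers come out in two's-complement form.

```java
final class HexDemo {

    /** Zero-padded, 8-digit lowercase hexadecimal form of n. */
    static String toPaddedHex(int n) {
        return String.format("%08x", n);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(255));   // prints ff
        System.out.println(Integer.toHexString(-1));    // prints ffffffff
        System.out.println(toPaddedHex(255));           // prints 000000ff
        System.out.println(Integer.parseInt("ff", 16)); // prints 255
    }
}
```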

Pretty Print XML Document:

org.w3c.dom.Document document=...;
try {
  javax.xml.transform.TransformerFactory tfactory =
                               javax.xml.transform.TransformerFactory.newInstance();
  javax.xml.transform.Transformer xform = tfactory.newTransformer();
  xform.setOutputProperty(javax.xml.transform.OutputKeys.INDENT, "yes");
  xform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
  java.io.StringWriter writer = new java.io.StringWriter();
  javax.xml.transform.Result result =
                               new javax.xml.transform.stream.StreamResult(writer);
  xform.transform(new javax.xml.transform.dom.DOMSource(document), result);
  String prettyXml = writer.toString(); // the indented XML
} catch (Exception e) {
  e.printStackTrace();
}

Marshal a JAXB object to a file:

JAXBContext jaxbC;
try {
      jaxbC = JAXBContext.newInstance(jaxbClass.class);
      Marshaller marshaller = jaxbC.createMarshaller();
      marshaller.setProperty( Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE );
      marshaller.marshal(jaxbObject, new File("c:/aaa.txt"));
} catch (Exception e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
}

Print InputStream:

InputStream input=...;
byte[] buffer = new byte[1024];
int len = input.read(buffer);
while (len != -1) {
    System.out.write(buffer, 0, len);
    len = input.read(buffer);
}
System.out.flush();

Let’s start Performance Test analysis with Grinder Analyzer


To generate performance analysis reports I found Grinder Analyzer. Some steps to get started:

  1. Download and unzip Grinder Analyzer
  2. Download jython_installer-2.2.1.jar (at least version 2.2.1)
  3. Install jython: java -jar jython_installer-2.2.1.jar (I suggest the “All” installation type)
  4. Create the run.bat batch:
    set CLASSPATH=…\grinderAnalyzer.V2.b10\lib\commons-collections-3.2.jar
    …\Jython-2.2.1\jython.bat run.py %1 %2 %3 %4 %5 %6 %7 %8 %9
  5. To run: run “data_agent-10.log” out_agent-10.log [# of agents]

The number of agents is an optional multiplier you can apply to the bandwidth and transactions per second graphs. Other options are in …\grinderAnalyzer.V2.b10\conf\analyzer.properties. The generated report will be in the …\grinderAnalyzer.V2.b10\grinderReport directory.

Let’s start Performance Test with Grinder in 10 steps


A very quick tutorial about how to start with Grinder 3:

  1. Download and unpack Grinder 3.2
  2. Create a mygrinder directory (wherever you want)
  3. Copy …\grinder-3.2\examples\grinder.properties to mygrinder
  4. Create in mygrinder the batch console.bat: java -cp …\grinder-3.2\lib\grinder.jar net.grinder.Console
  5. Create in mygrinder the batch agent.bat: java -cp …\grinder-3.2\lib\grinder.jar net.grinder.Grinder mygrinder\grinder.properties
  6. Create in mygrinder the batch tcpProxy.bat: java -cp …\grinder-3.2\lib\grinder.jar  net.grinder.TCPProxy -console -http [-httpproxy {host} {port}] > script_name.py (-httpproxy parameter only if required)
  7. To record a test script, set your browser proxy settings to localhost and port 8001, disable the cache and execute tcpProxy.bat: while you use the browser, the TCPProxy records the request-response flow. If you use a proxy server, you have to uncomment the script line “connectionDefaults.setProxyServer(…)”.
  8. Edit grinder.properties and add the line: grinder.jvm.arguments = -Dpython.cachedir="mygrinder" (for example "C:\\tools\\mygrinder")
  9. The line grinder.script contains the name of the script which will be executed during the performance test: put a reference to the recorded script_name.py.
  10. Execute console.bat, agent.bat and finally Start the Test!

The Console process collects the performance data, and the Agents execute the tests and send the test data to the Console. By default the Console and the Agents communicate on port 6372 of the local machine. You can change these values with grinder.consoleHost and grinder.consolePort in grinder.properties.
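
To illustrate how these two settings fall back to their defaults, here is a minimal sketch. It is not Grinder's own code: it simply reads the two property names with java.util.Properties, and the class name and the sample host are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ConsoleEndpoint {

    static final String DEFAULT_HOST = "localhost";
    static final int DEFAULT_PORT = 6372;

    /** Parses grinder.properties-style text and returns "host:port", applying the defaults. */
    static String resolve(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String host = props.getProperty("grinder.consoleHost", DEFAULT_HOST);
        int port = Integer.parseInt(
                props.getProperty("grinder.consolePort", String.valueOf(DEFAULT_PORT)));
        return host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(resolve("")); // prints localhost:6372
        System.out.println(resolve(
                "grinder.consoleHost = perf-console\ngrinder.consolePort = 7000")); // prints perf-console:7000
    }
}
```

Setting both values is what lets Agents on other machines report to a single Console.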