This is a tutorial in using RapidMiner 5.3 ETL features to merge several Excel worksheets from the same file into one single output which can then be saved into a separate Excel file, a CSV file or database table.
RapidMiner (http://rapid-i.com) is the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. It has an awesome array of ETL tools that lets you work with any data quickly and efficiently. The Community Edition is free of charge which makes it perfect for use in this task.
Introduction
RapidMiner uses a vast collection of customizable data operators to design a process flow for your data. I’ve used RapidMiner before to merge several CSV files into one using its Loop Files parameter, but in this case the data was all in one excel file separated into several tabs, so I needed to go a little deeper. This tutorial assumes some knowledge of RapidMiner although it’s really easy to learn.
There is another operator called Loop Parameters, which you can find by typing it in the search box. It allows you to loop through the parameters of any operator.
The little blue icon on the bottom right indicates that this is a container, inside of which we need to place the actual action operator that does the work. So double click on the Loop Parameters operator to go inside it.
Now we need to add our action operator, in our case it’s Read Excel. Search for it and drop it into the palette. If you’ve set RapidMiner to auto-wire inputs, it will automatically link the output from the operator to the output wall.
There are two things we need to customize:
First let’s customize the Read Excel operator’s properties. We will need to specify the file that we’re reading from. I’ve created a dummy file (which you can get here) for the purposes of this tutorial; the data in it doesn’t mean anything. The file contains 3 sheets with sales data.
Sheet1
|
Sheet2
|
Sheet3
|
Our goal is to merge the data from all three sheets into one single sheet.
Note: In order for this to work properly, ALL the sheets need to have the same exact headings. If they don’t have the same exact headings, you have to take care of that first, otherwise the merge will not work.
Click on the Read Excel operator and then on the Import Configuration Wizard button on the properties panel on the right hand side.
The wizard will guide through configuring the input file. In Step 1 you choose the directory where you saved the file, click on the file and then click Next.
In Step 2, RapidMiner will load the Excel file so you can see all the Sheets and go through them to ensure they all have the same headings. Ensure you can see Sheet1, Sheet2 and Sheet3. Click Next.
In Step 3, you can annotate a column with special properties, in our case the column headings. RapidMiner does this for you, so no action needed. Click Next to proceed to the final step.
In Step 4, you can choose the data types for your columns. RapidMiner will try to guess the values for you and it usually does a good job, but if you want to change them, just click on the down arrows for each column and pick the data type you like. When done click Finish.
Next, we need to specify the settings for the Loop Parameters operator. To do this, click out of the Read Excel operator (by clicking anywhere else on the palette) and the properties box should change as shown: At this point we are till inside the Loop Parameters operator
Click on the Edit Parameter Settings to continue. We get the following dialog box.
There are several things going on here. Since this is a generic loop container, we need to specify the operator(s), the parameters to loop through and the range of values. In our case, we only have one operator (Read Excel) Click on it and you’re presented with the parameters that we can choose to loop. We want the sheet_number parameter. Click on it and then click on the right green arrow (==>) to add it to the Selected Parameters box. Your screen should look like this:
Next we need to specify the range of values we want the parameter to take. We only have 3 sheets so choose Min = 1; Max = 3; Steps = 3; Scale = Linear. When done, click Ok. There are more options here for advanced looping, but this is outside the scope of this tutorial.
We are now ready to test our work! The nice thing about RapidMiner is that you can build your process in steps and check your work at any step by running it and seeing your results. To do this we need to first exit the loop container by clicking on the blue Up arrow to get to the main screen.
Once in the main screen, connect the Loop Parameter‘s res (ResultSet) port to the res port on the output wall. Then click on the blue triangle on top to run the process.
If everything was set up correctly, RapidMiner will execute the process and present you with the results as shown here:
We can see in the Results Overview an IOObjectCollection tab with all three sheets as separate ExampleSets (RapidMiner calls its outputs ExampleSet. What we now have is a collection of the data from the three sheets but it’s still separate, so we need one more operator to do the actual merge.
Switch back to the Design perspective and add the Append operator. If you’ve set RapidMiner to auto-wire the operators and if you hold the Append operator right over the purple output line, it will drop it in place and be automatically wire it. If not, you have to manually wire it.
Run the process again and now we get the results we want! The ExampleSet from the (Append) operator shows the data from the three sheets merged into one. Nice!
At this point we can add an export operator to save it somewhere. RapidMiner supports a variety of export options: there’s separate Write operators for Excel, CSV, database tables, etc.
We are done. If you have any questions, please put it in the comments.