How to merge several Excel worksheets using RapidMiner

This is a tutorial in using RapidMiner 5.3 ETL features to merge several Excel worksheets from the same file into one single output which can then be saved into a separate Excel file, a CSV file or database table.

RapidMiner (http://rapid-i.com) is the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. It has an awesome array of ETL tools that lets you work with any data quickly and efficiently. The Community Edition is free of charge which makes it perfect for use in this task.

Introduction

RapidMiner uses a vast collection of customizable data operators to design a process flow for your data. I’ve used RapidMiner before to merge several CSV files into one using its Loop Files parameter, but in this case the data was all in one excel file separated into several tabs, so I needed to go a little deeper. This tutorial assumes some knowledge of RapidMiner although it’s really easy to learn.

There is another operator called Loop Parameters, which you can find by typing it in the search box. It allows you to loop through the parameters of any operator.

Untitled picture

The little blue icon on the bottom right indicates that this is a container, inside of which we need to  place the actual action operator that does the work. So double click on the Loop Parameters operator to go inside it.

Now we need to add our action operator, in our case it’s Read Excel. Search for it and drop it into the palette. If you’ve set RapidMiner to auto-wire inputs, it will automatically link the output from the operator to the output wall.

2013-09-29_2120

There are two things we need to customize:

First let’s customize the Read Excel operator’s properties. We will need to specify the file that we’re reading from. I’ve created a dummy file (which you can get here) for the purposes of this tutorial; the data in it doesn’t mean anything. The file contains 3 sheets with sales data.

Sheet1

Division Sales
1 $3,400
2 $3,500
3 $3,600
4 $3,700
Sheet2

Division Sales
4 $4,100
5 $4,200
6 $4,300
7 $4,400
Sheet3

Division Sales
9 $4,600
10 $4,700
11 $4,800
12 $4,900

Our goal is to merge the data from all three sheets into one single sheet.

Note: In order for this to work properly, ALL the sheets need to have the same exact headings. If they don’t have the same exact headings, you have to take care of that first, otherwise the merge will not work.

Click on the Read Excel operator and then on the Import Configuration Wizard button on the properties panel on the right hand side.

2013-09-29_2138

The wizard will guide through configuring the input file. In Step 1 you choose the directory where you saved the file, click on the file and then click Next.

2013-09-29_2140

In Step 2, RapidMiner will load the Excel file so you can see all the Sheets and go through them to ensure they all have the same headings. Ensure you can see  Sheet1, Sheet2 and Sheet3. Click Next.

2013-09-29_2142

In Step 3, you can annotate a column with special properties, in our case the column headings. RapidMiner does this for you, so no action needed. Click Next to proceed to the final step.

2013-09-29_2144

In Step 4, you can choose the data types for your columns. RapidMiner will try to guess the values for you and it usually does a good job, but if you want to change them, just click on the down arrows for each column and pick the data type you like. When done click Finish.

2013-09-29_2146

Next, we need to specify the settings for the Loop Parameters operator. To do this, click out of the Read Excel operator (by clicking anywhere else on the palette) and the properties box should change as shown: At this point we are till inside the Loop Parameters operator

2013-09-29_2150

Click on the Edit Parameter Settings to continue. We get the following dialog box.

2013-09-29_2153

There are several things going on here. Since this is a generic loop container, we need to specify the operator(s), the parameters to loop through and the range of values. In our case, we only have one operator (Read Excel) Click on it and you’re presented with the parameters that we can choose to loop. We want the sheet_number parameter. Click on it and then click on the right green arrow (==>) to add it to the Selected Parameters box. Your screen should look like this:

2013-09-29_2157

Next we need to specify the range of values we want the parameter to take. We only have 3 sheets so choose Min = 1; Max = 3; Steps = 3; Scale = Linear. When done, click Ok. There are more options here for advanced looping, but this is outside the scope of this tutorial.

We are now ready to test our work! The nice thing about RapidMiner is that you can build your process in steps and check your work at any step by running it and seeing your results. To do this we need to first exit the loop container by clicking on the blue Up arrow to get to the main screen.

Untitled picture 1

Once in the main screen, connect the Loop Parameter‘s res (ResultSet) port to the res port on the output wall. Then click on the blue triangle on top to run the process.

2013-09-29_2210_002

If everything was set up correctly, RapidMiner will execute the process and present you with the results as shown here:

2013-09-29_2213

We can see in the Results Overview an IOObjectCollection tab with all three sheets as separate ExampleSets (RapidMiner calls its outputs ExampleSet. What we now have is a collection of the data from the three sheets but it’s still separate, so we need one more operator to do the actual merge.

Switch back to the Design perspective and add the Append operator. If you’ve set RapidMiner to auto-wire the operators and if you hold the Append operator right over the purple output line, it will drop it in place and be automatically wire it. If not, you have to manually wire it.

2013-09-29_2217

Run the process again and now we get the results we want! The ExampleSet from the (Append) operator shows the data from the three sheets merged into one. Nice!

2013-09-29_2221

At this point we can add an export operator to save it somewhere. RapidMiner supports a variety of export options: there’s separate Write operators for Excel, CSV, database tables, etc.

2013-09-29_2226

We are done. If you have any questions, please put it in the comments.

8 thoughts on “How to merge several Excel worksheets using RapidMiner

  1. Pingback: How to merge several Excel worksheets using Rap...

  2. Hi,
    does it work the same way with CSV files? I have 2 example files (customer and sales) which i would like to join. I would really appreciate any kind of help. Thank you!

    • Yes it should work exactly the same way. In my example I was merging multiple files from the same folder while showcasing RapidMiner’s ability to loop through different parameters. Keep in mind that in order for this to work, the files need to have the same exact headers. In your case of joining customers with sales, you might want to use the join operator instead. Do it for just two files, then add the appropriate loop operator.

  3. How would one re-code this to work out how many worksheets exist in the Excel file and loop automatically through all of them? (i.e. a situation where your input file could have any number of tabs but you do not want to open it to calculate, nor have to change the loop parameters every time)

    • Well I think RapidMiner automatically loops through all the sheets inside the file without having to know how many there are. The Loop Parameters node takes care of it.

  4. I have try to use the same concept to write excel file into different spreadsheet using macro but I cant seem to achieve the result I wanted. Instead, it writes on the first sheet and replace the new set of examples. Is there a way to write excel on different sheet?

  5. Is there any way to solve if the sequence of the headers is different in Excel sheets using RapidMiner? I used loop and append to combined all the Excel sheets.
    Example
    Sheet 1 Header : Name, Date, Contact
    Sheet 2 Header : Name, Contact, Date

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s