KNIME tutorial: How to filter rows and remove duplicates easily

Introduction

Hi, I am Akira, your guide at Data Without Code. As a DX manager who spent years dealing with messy Excel files, I know firsthand how tedious data cleaning can be.

Think about your weekly reporting routine. How often do you open a massive spreadsheet, apply an auto-filter to remove irrelevant rows, and then click “Remove Duplicates” just to get a clean list of customers or transactions? Doing this manually every single week is not just boring; it also leaves room for human error.

In our previous tutorial, we learned how to do a VLOOKUP using the Joiner node. Now that your data is combined, it is time to clean it up. In this KNIME tutorial, I will show you how to automate filtering and deduplication using two simple nodes: Row Filter and Duplicate Row Filter.

1. How to Filter Rows in KNIME (Row Filter Node)

When you want to keep only specific rows of data (for example, only sales from the “US” region, or transactions over $1,000), you use the Row Filter node.

Step-by-Step Configuration

  1. Search for the Row Filter node in the Node Repository (bottom-left panel) and drag it onto your canvas.
  2. Connect the output of your previous node (like an Excel Reader or a Joiner node) to the input of the Row Filter node.
  3. Double-click the Row Filter node to open its configuration window.
  4. In the top section, select the Column to test. For example, choose “Region”.
  5. In the middle section, define your matching criteria: select “Include rows by attribute value” and type “US” as the value to match. (Alternatively, choose the Exclude option if you want to remove the matching rows instead.)

Click OK, right-click the node, and select Execute (or press F7). The output table now contains only the rows that match your rule, and the best part is that the rule is saved as part of your workflow, so you never have to set it up again.
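
You do not need any code to use the Row Filter node, but if you are curious about the logic behind it, here is a minimal Python sketch. It is only an illustration, not how KNIME works internally: the sample rows and the function name `filter_rows` are made up for this example.

```python
# Hypothetical sample rows standing in for the table that reaches the Row Filter node.
rows = [
    {"Customer ID": "C001", "Region": "US", "Amount": 1200},
    {"Customer ID": "C002", "Region": "JP", "Amount": 800},
    {"Customer ID": "C003", "Region": "US", "Amount": 450},
]


def filter_rows(table, column, value, include=True):
    """Keep rows whose `column` equals `value` (include=True),
    or drop those rows and keep the rest (include=False)."""
    return [row for row in table if (row[column] == value) == include]


# "Include rows by attribute value": keep only the US rows.
us_only = filter_rows(rows, "Region", "US")

# The Exclude variant: remove the US rows and keep everything else.
non_us = filter_rows(rows, "Region", "US", include=False)

print([row["Customer ID"] for row in us_only])  # ['C001', 'C003']
print([row["Customer ID"] for row in non_us])   # ['C002']
```

The single `include` switch mirrors the Include/Exclude choice in the configuration window: the test on the column is the same, and only the decision to keep or drop the matching rows changes.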

2. How to Remove Duplicates (Duplicate Row Filter Node)

Getting rid of duplicate entries is a critical step in data preparation. In KNIME, we handle this using the Duplicate Row Filter node.

Step-by-Step Configuration

  1. Find the Duplicate Row Filter node and connect it to your workflow.
  2. Double-click to open the configuration menu.
  3. You will see two side-by-side lists of columns. In the “Include” box on the right, select the column(s) that should be evaluated for duplicates. For example, if you want a unique list of customers, move “Customer ID” to the Include box.
  4. Go to the Advanced tab. Here, you can tell KNIME which row to keep when it finds a duplicate. You can choose to keep the First row, the Last row, or the row with the lowest or highest value in another column (for example, the most recent transaction date).

Execute the node, and the output table keeps exactly one row for each unique value (or combination of values) in the columns you selected.
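
For readers who want to see the idea spelled out, the sketch below imitates the "which row to keep" choice in plain Python. It is a simplified illustration under my own assumptions, not KNIME's implementation: the sample transactions, the function `remove_duplicates`, and its `keep` and `max_column` parameters are all hypothetical names.

```python
from datetime import date

# Hypothetical transactions; customer C001 appears three times.
transactions = [
    {"Customer ID": "C001", "Date": date(2024, 1, 5), "Amount": 100},
    {"Customer ID": "C002", "Date": date(2024, 1, 7), "Amount": 250},
    {"Customer ID": "C001", "Date": date(2024, 2, 1), "Amount": 300},
    {"Customer ID": "C001", "Date": date(2024, 1, 20), "Amount": 50},
]


def remove_duplicates(table, key_columns, keep="first", max_column=None):
    """Return one row per unique combination of `key_columns`.

    keep="first": keep the first row seen for each key
    keep="last":  keep the last row seen for each key
    keep="max":   keep the row with the highest value in `max_column`
    Simplification: output rows are ordered by the first appearance of each key.
    """
    chosen = {}
    for row in table:
        key = tuple(row[column] for column in key_columns)
        if key not in chosen:
            chosen[key] = row
        elif keep == "last":
            chosen[key] = row
        elif keep == "max" and row[max_column] > chosen[key][max_column]:
            chosen[key] = row
    return list(chosen.values())


first = remove_duplicates(transactions, ["Customer ID"])
last = remove_duplicates(transactions, ["Customer ID"], keep="last")
latest = remove_duplicates(transactions, ["Customer ID"], keep="max", max_column="Date")

print([row["Amount"] for row in first])   # [100, 250]
print([row["Amount"] for row in last])    # [50, 250]
print([row["Amount"] for row in latest])  # [300, 250]
```

Notice that all three results contain one row per customer. The only difference is which of C001's three transactions survives, which is exactly the decision you make on the Advanced tab.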

Why Automating This is a Game Changer

As a non-programmer moving into business automation, this is where you start seeing the real magic of KNIME.

In Excel, you have to perform these clicks manually every time you receive a new file. But in KNIME, once you configure the Row Filter and Duplicate Row Filter nodes, your workflow remembers the rules. Next week, all you have to do is drop in your new raw data, click “Execute All,” and your clean dataset is ready in seconds.
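
The same idea can be sketched in a few lines of Python: the rules are written once, and every new file passes through them unchanged. This is only an analogy for what a saved workflow does. The function `clean`, the column names, and the two tiny "weekly files" below are invented for the example.

```python
import csv
import io


def clean(raw_csv_text):
    """Apply the same two saved rules to any new raw file:
    1. keep only rows where Region is "US"
    2. keep only the first row for each Customer ID
    """
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    seen_customers = set()
    cleaned = []
    for row in reader:
        if row["Region"] != "US":
            continue  # rule 1: filtered out
        if row["Customer ID"] in seen_customers:
            continue  # rule 2: duplicate customer
        seen_customers.add(row["Customer ID"])
        cleaned.append(row)
    return cleaned


# Two hypothetical weekly exports with different contents.
week_1 = "Customer ID,Region\nC001,US\nC002,JP\nC001,US\n"
week_2 = "Customer ID,Region\nC003,US\nC003,US\nC004,US\n"

for label, raw in [("week 1", week_1), ("week 2", week_2)]:
    print(label, [row["Customer ID"] for row in clean(raw)])
```

Nothing inside `clean` changes between week 1 and week 2; only the input does. That is the benefit you get from a configured KNIME workflow, without having to write or maintain any code yourself.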

Conclusion: Your Next Steps

Congratulations! You now know how to merge data, filter out what you don’t need, and ensure every record is unique. You are already miles ahead of standard manual data processing.

However, what happens if your data comes in looking messy? Maybe some text is uppercase, some is lowercase, or dates are formatted incorrectly. Data cleaning is an essential skill for any DX professional.

Ready to tackle messy data? Join me in our next tutorial where I explain how to change data types and clean messy strings in KNIME without writing any code!
