Manage a clean project
- Generate results
- Clean data
- Process cleaned data
- Revert changes
- Regenerate results
- Edit a clean project
- Archive a clean project
After creating a clean project, you can begin the cleaning process to make your data ready to use. In this article, you will learn how to load data for cleaning, perform data cleaning operations, and publish the cleaned data back to CluedIn.
The following diagram illustrates the main actions available within the clean project. Each of these actions is described in detail in the following sections of this article. In addition to these actions, you can always edit or archive the clean project if necessary.
Generate results
Generating results retrieves the values that match the criteria from the clean project filter and loads them into the clean application.
To generate the results
In the upper-right corner of the clean project, select Generate Results, and then confirm your choice.
You can track the result generation status in the progress bar. When the results are generated, the status of the clean project becomes Ready for clean, which means that you can start to clean the data.
Clean project and real-time data
When you generate the results for the clean project, CluedIn captures a snapshot of data for you to clean. It’s important to note that you’re working with this snapshot, not real-time data.
When new data with the same issues appears in CluedIn, you’ll need to clean it as well. You can regenerate the results in the same clean project or rely on rules generated by CluedIn based on your actions in the clean project. You’ll find more details about generating rules from the clean projects in Process cleaned data.
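The snapshot behavior described above can be illustrated with a minimal Python sketch. It is not CluedIn code: the `golden_records` list, the `generate_results` function, and the `has_city` filter are hypothetical stand-ins, used only to show that a snapshot does not see records that arrive after it was taken.

```python
import copy

# Hypothetical in-memory stand-in for golden records; not a CluedIn API.
golden_records = [
    {"id": 1, "city": "new  york"},
    {"id": 2, "city": "NEW YORK"},
]


def has_city(record):
    """Hypothetical clean project filter: keep records that have a city."""
    return "city" in record


def generate_results(records, matches_filter):
    """Copy the records that match the filter at this moment (the snapshot)."""
    return copy.deepcopy([r for r in records if matches_filter(r)])


snapshot = generate_results(golden_records, has_city)

# A record that arrives later is not part of the existing snapshot.
golden_records.append({"id": 3, "city": "New york"})
snapshot_ids = [r["id"] for r in snapshot]

# Regenerating takes a fresh snapshot that includes the new record.
regenerated_ids = [r["id"] for r in generate_results(golden_records, has_city)]
```

In this sketch, the first snapshot still contains only the two original records, while the regenerated one also contains the record that was added later.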
Clean data
After generating the results, you can open the clean application to start fixing data issues.
To open the clean application
In the upper-right corner of the clean project, select Clean.
The clean application containing records that need to be cleaned opens in a new tab of your browser.
To work with the clean application
Transform the contents of the cells in a column:
In the column heading, expand the menu, and then select Edit cells > Common transforms. Then, choose the transformation option (for example, collapse consecutive whitespace or transform the text to title case).
Edit the contents of a column using a text facet:
In the column heading, expand the menu, and then select Facet > Text facet.
A text facet collects the total contents of cells in a column and matches them up. To edit an entry in the facet display, hover over the facet and select Edit. You can then enter a new value that will be applied to all identical cells in the column, allowing for efficient mass editing. This is a great option for fixing typos, whitespace, and other issues.
Cluster and edit the contents of a column:
In the column heading, expand the menu, and then select Edit cells > Cluster and edit.
This feature helps you find groups of different cell values that might be alternative representations of the same thing. You can explore different methods and key functions to see how many clusters can be found in your data. The following screenshot shows that all values in a cluster will be changed to the new value.
Edit one cell at a time:
Hover over that cell and select Edit. A pop-up window appears where you can edit the contents of the cell. You can apply your changes to all identical cells in the same column.
Use this option sparingly. The most efficient way to clean your data is through automated and bulk operations.
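To make the bulk operations above more concrete, the following Python sketch imitates three of them on a small list of values: a common transform, a text facet, and clustering. It is an illustration only, not the clean application's implementation. The sample values and all function names are hypothetical, and the `fingerprint` key function is an assumption chosen for the example, because this article does not say which methods and key functions the clean application offers.

```python
import re
from collections import Counter, defaultdict

# Hypothetical column values; real data comes from the clean project.
cities = ["New York", "new  york", "New York ", "NEW YORK", "Boston"]


def collapse_whitespace(value):
    """Common transform: trim and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value).strip()


def text_facet(values):
    """Count identical cell values, similar to how a text facet groups them."""
    return Counter(values)


def edit_facet_entry(values, old, new):
    """Apply one edit to every cell whose value equals `old`."""
    return [new if v == old else v for v in values]


def fingerprint(value):
    """Assumed key function: lowercase, strip punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group distinct values whose keys collide; keep groups of two or more."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Collapse whitespace, then convert to title case.
transformed = [collapse_whitespace(c).title() for c in cities]
facet = text_facet(transformed)
clusters = cluster(cities)
```

Here the four spellings of New York end up in one cluster, and after the transforms the facet counts them as a single value. This is why bulk operations usually need far fewer edits than fixing one cell at a time.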
Process cleaned data
When you have cleaned the data, send it back to CluedIn. Processing cleaned data replaces incorrect values in golden records with the corrected ones.
To process cleaned data
In the upper-right corner of the clean project in CluedIn, select Process.
If you want to automatically fix the same data issues when they appear in the future, leave the Enable rules auto generation checkbox selected.
Confirm that you want to process the data.
You can track the processing status in the progress bar. When the cleaned data is processed, the status of the clean project becomes Processed, which means that the corrected values have been submitted to the golden records in CluedIn.
Automatic rules generation
Once you fix a data issue, it’s likely you’ll need to fix the same issue again in the future. To save time and ensure data accuracy, you can generate rules based on your actions in the clean project. There are two ways of generating rules in the clean project:
By selecting the Enable rules auto generation checkbox in the processing confirmation dialog.
By selecting Generate rules on the Rules tab of the clean project.
In both cases, rules are generated only if you made significant changes in the clean application (for example, changed all values in the column to upper case). If you only edited individual cells, rules are not generated. You can find the generated rules on the Rules tab of the clean project.
The rules from clean projects are also stored in Management > Rule Builder > Data Part Rules. A generated rule is initially inactive. To apply the rule to records that will be loaded into CluedIn in the future, activate the rule. This way, the values that correspond to the configuration parameters from the rule will be automatically fixed when the records are processed.
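The effect of activating a generated rule can be sketched in Python as follows. This is a conceptual model only: the `Rule` class, its fields, and the `apply_rules` function are hypothetical and do not reflect how CluedIn stores or evaluates data part rules. The only behavior taken from this article is that a rule starts inactive and fixes incoming values once it is activated.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical model of a generated rule; not CluedIn's rule schema."""
    property_name: str
    transform: Callable[[str], str]
    active: bool = False  # Generated rules start inactive.


def apply_rules(record, rules):
    """Return a copy of the record with every active rule applied."""
    fixed = dict(record)
    for rule in rules:
        if rule.active and rule.property_name in fixed:
            fixed[rule.property_name] = rule.transform(fixed[rule.property_name])
    return fixed


# Hypothetical rule derived from changing a whole column to upper case.
upper_country = Rule("country", str.upper)
incoming = {"id": 7, "country": "dk"}

before_activation = apply_rules(incoming, [upper_country])
upper_country.active = True
after_activation = apply_rules(incoming, [upper_country])
```

In this sketch, the incoming record passes through unchanged while the rule is inactive, and its country value is converted to upper case after the rule is activated.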
Clean project and streams
If the data from the clean project is streamed to the export target, the stream will automatically receive the cleaned data after processing. This ensures that the data in the export target is automatically updated with the cleaned values.
Revert changes
You can undo changes made to values in your golden records after processing the cleaned data. Reverting these changes restores the values to their pre-cleaning state.
To revert changes
In the upper-right corner of the clean project, select Revert Changes, and then confirm your choice.
As a result, the changed values are removed, returning the values to their pre-cleaning state. The status of the clean project becomes Ready to process, which means that your previous changes are saved in the clean application.
Note: Reverting is not tracked anywhere.
After reverting changes in the clean project, you can do one of the following:
If you want to continue working with the same set of data, go to the clean application and make the needed changes.
If you want to get the latest data, regenerate the results, and start the data cleaning process. Keep in mind that regenerating the results in this case removes your progress in the clean application.
Regenerate results
Generating results for the clean project creates a snapshot of the data. If new data is loaded into CluedIn, it won’t be automatically added to the clean project. To add the latest data to the clean project, you should regenerate the results.
Note: Regenerating the results will cause you to lose all previous progress in the clean project. So, if you have already cleaned the data and want to publish your changes, process the cleaned data first, and then regenerate the results.
To regenerate the results
In the upper-right corner of the clean project, select Regenerate, and then confirm your choice.
As a result, the data that matches the filter from the clean project is loaded to the clean application. Next, you can start the data cleaning process.
Edit a clean project
You can edit a clean project to change its name, description, filters, or the properties that you need to clean.
Note: If you change filters or properties, all previous cleaning progress will be lost and the project will be regenerated.
To edit a clean project
In the upper-right corner of the clean project, select Edit.
Make the needed changes.
Select Save. If you changed filters or properties, confirm that you want to save changes.
If you changed the name or description, the status of the clean project remains the same. Otherwise, it becomes Ready for clean.
Archive a clean project
You can archive a clean project if you no longer need it or if you created it by mistake. You can also consider archiving a clean project if you’re confident that you won’t need to run it again in the future or if you’ve already generated rules to address the same data quality issues.
Archiving does not affect the cleaned data that has been submitted to CluedIn. After the clean project is archived, it cannot be unarchived.
Archived projects remain on the Clean projects page and they are available for viewing.
To archive a clean project
In the upper-right corner of the clean project, select Edit > Archive. Then, confirm that you want to archive the project.
The status of the clean project becomes Archived, and you can no longer work with the project.
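As a recap, the status changes mentioned in this article can be summarized in a small Python lookup. This is a reading aid under stated assumptions, not a description of CluedIn internals: the action names and the `next_status` function are hypothetical, the sketch does not check which actions are allowed from which status, and it leaves out regenerating results because this article does not state the resulting status.

```python
# Status after each action, as stated in this article.
# The action names are hypothetical labels, not CluedIn identifiers.
STATUS_AFTER = {
    "generate_results": "Ready for clean",
    "process": "Processed",
    "revert_changes": "Ready to process",
    "edit_filters_or_properties": "Ready for clean",
    "archive": "Archived",
}


def next_status(current, action):
    """Return the status after an action; archived projects accept none."""
    if current == "Archived":
        raise ValueError("An archived clean project cannot be changed.")
    if action == "edit_name_or_description":
        return current  # Name or description edits keep the status.
    return STATUS_AFTER[action]


# Walk through one possible sequence of actions.
status = "New"  # Hypothetical starting label; the article does not name it.
history = []
for action in ["generate_results", "process", "revert_changes", "archive"]:
    status = next_status(status, action)
    history.append(status)
```

After this sequence, `history` lists the statuses in the order in which the project went through them, ending with Archived.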