Create, Teach, and Tune ML Transform

1. Create FindMatches ML Transform

1a) Open the AWS Glue console. In the left navigation pane, go to ETL -> Jobs -> ML Transforms

1b) Click on Add Transform

1c) Specify c360-ml-transform as Transform name

1d) Select the IAM Role containing the tag “GlueServiceRoleLab”

1e) Expand Task Run Properties section

1f) Select Worker Type as G.2X (Recommended)

1g) Enter Number of Workers as 20

1h) Select Glue Version as Spark 2.4 (Glue Version 2.0)

1i) Keep other values as default and click on Next

1j) Select merged_auto_property as a Data Source and click Next

1k) Select id as a primary key in the next page

1l) In the Tune Transform step, you can adjust the performance and cost tradeoffs available for the ML Transform. We will keep the default tradeoffs, which give balanced results.

If needed, you can change these values later by selecting the transform and using the Tune menu.

1m) Review the values and click Finish

The ML transform is created with Status Needs training
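
For readers who prefer scripting, the console choices in steps 1c) to 1l) can be sketched as a single boto3 request. This is a minimal sketch, not part of the lab: the role ARN and database name are hypothetical placeholders that you would replace with your own, and the two tradeoff values of 0.5 are an assumption about what the console's balanced defaults correspond to.

```python
def build_find_matches_request(role_arn, database_name):
    """Return keyword arguments for the Glue create_ml_transform call.

    role_arn and database_name are placeholders supplied by the caller;
    the lab does not state them, so they are not filled in here.
    """
    return {
        "Name": "c360-ml-transform",            # step 1c
        "Role": role_arn,                       # step 1d
        "WorkerType": "G.2X",                   # step 1f
        "NumberOfWorkers": 20,                  # step 1g
        "GlueVersion": "2.0",                   # step 1h
        "InputRecordTables": [                  # step 1j
            {"DatabaseName": database_name, "TableName": "merged_auto_property"}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": "id",   # step 1k
                # Assumption: 0.5 expresses "no preference", i.e. balanced.
                "PrecisionRecallTradeoff": 0.5,
                "AccuracyCostTradeoff": 0.5,
            },
        },
    }


def create_transform(role_arn, database_name):
    """Create the transform in your AWS account and return its TransformId."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue")
    response = glue.create_ml_transform(
        **build_find_matches_request(role_arn, database_name)
    )
    return response["TransformId"]
```

Calling create_transform with your own role ARN and Glue database name should leave the transform in the same untrained state as the console flow above.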

2. Teach transform to identify the duplicates

In this step we will teach the transform by providing labeled examples of matching and non-matching records. You can create the labeling set yourself or let AWS Glue generate it based on heuristics. AWS Glue extracts records from your source data and suggests potential matching records. The generated labeling file contains approximately 100 data samples for you to work with.

Note: We recommend using the “Generate the Labeling file” feature to create the initial training set for teaching your transform in your own fuzzy matching use cases. Refer to the APPENDIX at the end of this lab for how to generate labels using the ML Transform and prepare them for teaching. In this lab, we will use pre-created labeled files in the interest of time.

2a) Select c360-ml-transform Transform

2b) Select Action -> Teach transform

2c) Select I have labels and click Upload labeling file from S3

Note: Two labeled files have been pre-created for this lab. We will upload both of them to teach the ML transform.

2d) Navigate to the label folder in your S3 bucket, select the labeled file (Label-1-iteration.csv), and click Upload

The labeled file starts uploading automatically, and the upload status is reported on the page.

2e) Next, you are taken to the Estimate quality metrics (optional) page. This step lets you estimate quality metrics after each iteration of uploading labeled files. Click Finish to skip this step

2f) Upload the second labeled file (label-2-iteration.csv) by repeating step 2c) and the steps that follow it

2g) Verify that the ML Transform Status is Ready for use. Label Count is 200 because we uploaded two labeled files to teach the transform. The transform can now be used in a Glue ETL job for fuzzy matching of the full dataset.
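
The upload-and-verify loop in steps 2c) to 2g) can also be sketched with the Glue API. This is a hedged sketch rather than the lab's method: it assumes a boto3 Glue client is passed in as glue, that the API reports the console's "Ready for use" state as READY, and that the S3 paths you pass point at your own copies of the two labeled files.

```python
import time

# Task run states after which polling should stop.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT")


def import_labels(glue, transform_id, s3_path, poll_seconds=15):
    """Import one labeled file and wait until the import task finishes."""
    run = glue.start_import_labels_task_run(
        TransformId=transform_id,
        InputS3Path=s3_path,
        Replace=False,  # keep labels uploaded in earlier iterations
    )
    while True:
        task = glue.get_ml_task_run(
            TransformId=transform_id, TaskRunId=run["TaskRunId"]
        )
        if task["Status"] in TERMINAL_STATES:
            return task["Status"]
        time.sleep(poll_seconds)


def teach(glue, transform_id, label_paths):
    """Import every labeled file in order, then report status and label count."""
    for path in label_paths:
        status = import_labels(glue, transform_id, path)
        if status != "SUCCEEDED":
            raise RuntimeError(f"label import for {path} ended as {status}")
    transform = glue.get_ml_transform(TransformId=transform_id)
    return transform["Status"], transform["LabelCount"]
```

With glue = boto3.client("glue") and the two labeled files from this lab, teach would be expected to report a label count of 200, matching what the console shows in step 2g).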

3. Tune the Transform (Optional)

This completes the preparation of the ML Transform. Go to Create and Run ETL Job with ML Transform.

APPENDIX - Steps to Generate a label file and prepare it for ML Transform

A. Select I do not have labels and click on Generate labeling file

B. Select the S3 location down to the label folder and append “/download” to it. This is where the generated labeling file will be kept. Then click Generate.

C. It takes a few minutes for AWS Glue to generate the labeling file. Once the Download labeling file option is enabled, click it.

D. To look at a labeling file similar to the one that gets generated, navigate to Amazon S3 Console -> «S3Bucket»/label/ and download the “Label-1-Iteration.csv” file.

E. The generated labeling file has an empty label column.

Notice that two additional columns have been added: labeling_set_id and label. You need to populate the label column yourself, giving the same value to records that are a real match. Each labeling set should contain both positive and negative match examples. Once the label column is populated, the file is ready for teaching the ML transform. Let’s go a little deeper into its structure, so that you know how to prepare and label data for your own matching projects.

The entire training dataset is divided into labeling sets. Each labeling set displays a labeling_set_id value. This identification simplifies the labeling process, enabling you to focus on the match relationship of records within the same labeling set, rather than having to scan the entire file. You would assign labels according to which records should match based on the attribute values.

If you specify the same label value for two or more records within a labeling set, you teach FindMatches to consider these records a match. On the other hand, when two or more records have different labels within the same labeling set, FindMatches learns that these records aren’t considered a match. FindMatches evaluates record relationships only between records within the same labeling set, not across labeling sets.

Plan to label a few hundred records to achieve modest match quality, and a few thousand records to achieve high match quality. The ML transform learns and gets better over time as additional labels are uploaded to capture new matching and non-matching cases or updates.
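
To make the labeling rules concrete, here is a small sketch that reads a labeled file and lists the record pairs it declares as matches. The sample rows and the first_name and last_name columns are invented for illustration; only labeling_set_id, label, and the id primary key come from this lab. Note how the label value A is reused in both labeling sets without linking records across them.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical labeled file: two labeling sets, each with a match and a non-match.
SAMPLE = """labeling_set_id,label,id,first_name,last_name
set-1,A,1,John,Smith
set-1,A,2,Jon,Smith
set-1,B,3,Jane,Smyth
set-2,A,4,Maria,Garcia
set-2,A,5,Mariah,Garcia
set-2,B,6,Mario,Garcia
"""


def match_pairs(csv_text):
    """Return id pairs that share a label within the same labeling set."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["label"]:
            # An empty label means the file is not yet ready for teaching.
            raise ValueError(f"record {row['id']} has no label")
        # Keying on (labeling set, label) keeps sets independent of each other.
        groups[(row["labeling_set_id"], row["label"])].append(row["id"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs


print(match_pairs(SAMPLE))  # [('1', '2'), ('4', '5')]
```

Records 3 and 6 appear in no pair, so they act as the negative examples for their labeling sets.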