You can easily observe data patterns and potential duplicates by querying data in Data Lake using Amazon Athena.
1. Perform following steps to verify your Athena configuration and downloading CSV.
1a) Login to your Event Engine account and Navigate to Athena console
1b) In Athena Console, select database: c360_workshop_db (or whatever you named it)
1c) Under Tables, hover over to the right of source_auto_customer, click on … control and select Preview table
1d. Query will run automatically and display results (10 rows by default)
1e) Remove the limit clause in the select statement, re-run the query, and click on the Download the results icon in the Results section to download the result set as a CSV
NOTE: After landing on Athena console, if you get an error or query doesn’t run, click on the Set up a query result location in Amazon S3 link and enter the value as s3://«S3BucketName»/athena-results/
Note the trailing slash in the above path!
2. Examine potential duplicates in merged_auto_property dataset that we prepared in the previous lab.
2a) Run the query and download the dataset as CSV from Athena Console.
select * from merged_auto_property
2b) Open the downloaded file in excel, sort by ssn and observe the duplicates
As highlighted, these are couple of examples of exact match and fuzzy match. There are other fuzzy matches also. Can you spot them?!
Note The fuzzy matches are easy to spot by a human expert - data steward with domain knowledge of how this data was generated, cleansed and processed in the various source Systems. While human expert can identify those duplicates on a small dataset, it becomes an intractable task when dealing with thousands of records. Glue ML transform builds on this intuition and provides an easy to use ML-based algorithm to automatically apply this approach to very large volume of records efficiently!
Next go to Create ML Transform