Creating a Glue Studio Job

Prerequisites: This lab requires the Prerequisites and Lab 1 to be completed before you can continue. If you have already completed them, continue below.

PART A: Create a Glue Studio job that merges the Auto and Property customer tables.

1. Navigate to the AWS Glue Console. Step - 1

2. Under ETL, click on AWS Glue Studio and select Jobs. Step - 2

3. Choose the Source as “AWS Glue Data Catalog”. Step - 3 Step - 4

4. Click Create.

5. You will see the Glue Studio workspace shown below. Step - 5

6. Select the “Data Source - Data Catalog” node (highlighted in red in the picture) and click on “Data source properties” (highlighted in green). Select “c360_workshop_db” as the Database and “source_auto_customer” as the table. Step - 6

Click “Node Properties” and change the “Name” to “Source-Auto-DC”. Step - 6

7. Select the “Transform - ApplyMapping” node, click on “Transform”, and change the mappings as listed below (a PySpark sketch of the equivalent script follows this step):

  • Source key “policyid” -> Target key “id” and Data type -> String
  • Source key “fname” -> Target key “first_name”
  • Source key “lname” -> Target key “last_name”
  • Source key “work” -> Target key “company”
  • Source key “birthdate” check drop
  • Source key “dob” -> Target key “date_of_birth”
  • Source key “phone” -> Target key “home_phone”
  • Source key “priority” check drop
  • Source key “policysince” check drop
  • Source key “createddate” check drop

Step - 7

Click on “Node Properties” and change the “Name” to “ApplyMapping-Auto”.

Step - 7
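
For reference, the two nodes configured in steps 6 and 7 correspond roughly to the following PySpark in the script that Glue Studio generates. This is a minimal sketch rather than the exact auto-generated code: the variable names are illustrative, and the source column types in the mappings are assumed, so check them against the actual table schema. “ssn” is not remapped in the console, but it has to be listed here for ApplyMapping to keep it.

  from awsglue.context import GlueContext
  from awsglue.transforms import ApplyMapping
  from pyspark.context import SparkContext

  glueContext = GlueContext(SparkContext.getOrCreate())

  # Data source node "Source-Auto-DC": read the Auto customer table from the Data Catalog
  source_auto_dc = glueContext.create_dynamic_frame.from_catalog(
      database="c360_workshop_db",
      table_name="source_auto_customer",
      transformation_ctx="Source_Auto_DC",
  )

  # ApplyMapping node "ApplyMapping-Auto": keep, rename, and cast the listed columns;
  # anything not listed (birthdate, priority, policysince, createddate, ...) is dropped.
  # The source types ("long"/"string") are assumptions - adjust them to the real schema.
  applymapping_auto = ApplyMapping.apply(
      frame=source_auto_dc,
      mappings=[
          ("policyid", "long", "id", "string"),
          ("fname", "string", "first_name", "string"),
          ("lname", "string", "last_name", "string"),
          ("work", "string", "company", "string"),
          ("ssn", "string", "ssn", "string"),
          ("dob", "string", "date_of_birth", "string"),
          ("phone", "string", "home_phone", "string"),
      ],
      transformation_ctx="ApplyMapping_Auto",
  )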

8. Click on Source and add another source node. Step - 8

9. Select “AWS Glue Data Catalog” and, under “Data source properties”, select “c360_workshop_db” as the Database and “source_property_customer” as the table. Step - 9

Click “Node Properties” and change the “Name” to “Source-Property-DC”. Step - 9

10. With “Source-Property-DC” selected, click on “Transform” and add an ApplyMapping node. Step - 10

11. In the “Transform - ApplyMapping” node, click “Transform” and make the changes below:

  • Source key “customer_id” -> Target key “id” and Data type -> String
  • Source key “job” check drop
  • Source key “social” -> Target key “ssn”
  • Source key “email” check drop
  • Source key “industry” check drop
  • Source key “city” check drop
  • Source key “state” check drop
  • Source key “zipcode” check drop
  • Source key “netnew” check drop
  • Source key “sales_rounded” check drop
  • Source key “sales_decimal” check drop
  • Source key “priority” check drop
  • Source key “industry2” check drop

Step - 11

Click on “Node Properties” and change the “Name” to “ApplyMapping-Property”.

Step - 11

Click the “Data Target - S3 Bucket” node and click Remove. Step - 11

12. Select the “ApplyMapping-Auto” node and add a “Transform” node of type “SQL”.

Note: The “Transform” name changed from “Spark SQL” to “SQL” after the screenshot below was captured.

Step - 11

In the code block, add the query below (a sketch of what this node does in the generated script follows this step):

select concat(id,'-auto') as id,first_name,last_name,company,ssn,date_of_birth,home_phone from myDataSource

Step - 12

Click on “Node Properties” and change the “Name” to “Spark SQL-Auto”.

Step - 7
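
For reference, continuing the PySpark sketch from step 7: a SQL transform node exposes its parent’s output as a temporary view named after the node’s alias (here the default, “myDataSource”) and runs the query with Spark SQL. The variable names below are illustrative, not the exact generated ones.

  from awsglue.dynamicframe import DynamicFrame

  spark = glueContext.spark_session

  # Register the ApplyMapping-Auto output under the node's default alias "myDataSource"
  applymapping_auto.toDF().createOrReplaceTempView("myDataSource")

  # "Spark SQL-Auto" node: tag every Auto customer id with an "-auto" suffix
  sql_auto_df = spark.sql("""
      select concat(id, '-auto') as id,
             first_name, last_name, company, ssn, date_of_birth, home_phone
      from myDataSource
  """)

  # Convert back to a DynamicFrame if downstream Glue transforms need one
  sql_auto = DynamicFrame.fromDF(sql_auto_df, glueContext, "Spark_SQL_Auto")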

13. Select the “ApplyMapping-Property” node, add a “Transform”, and select “Spark SQL”. Step - 11

In the code block, add the query below:

select concat(id,'-property') as id,first_name,last_name,company,ssn,date_of_birth,home_phone from myDataSource

Step - 11

Click on “Node Properties” and change the “Name” to “Spark SQL-Property”.

Step - 7

14. Select the “Spark SQL-Property” node, add a “Transform”, and select “Spark SQL”. Step - 13

On the new “Spark SQL” node, click on “Node Properties”, go to “Node parents”, and check both “Spark SQL-Auto” and “Spark SQL-Property”. Step - 13

Next, click on “Transform”:

  • In “Input sources”, set the Spark SQL alias of “Spark SQL-Property” to “myDataSourceProperty”.
  • In “Input sources”, set the Spark SQL alias of “Spark SQL-Auto” to “myDataSourceAuto”. Step - 16

15. Next, in the code block, paste the query below (a sketch of the two-parent SQL node follows the query):

select id,first_name,last_name,company,ssn,date_of_birth,home_phone from myDataSourceProperty
union all
select id,first_name,last_name,company,ssn,date_of_birth,home_phone from myDataSourceAuto
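
For reference, continuing the sketch: because this node has two parents, each parent’s output is registered under the alias configured in “Input sources”, and the union query reads from both views. Here sql_property_df stands for the output of the “Spark SQL-Property” node, built the same way as sql_auto_df in step 12.

  # Register each parent's output under the alias configured in "Input sources"
  sql_auto_df.createOrReplaceTempView("myDataSourceAuto")
  sql_property_df.createOrReplaceTempView("myDataSourceProperty")

  # Final SQL node: stack the Property and Auto customer sets into one result
  merged_df = spark.sql("""
      select id, first_name, last_name, company, ssn, date_of_birth, home_phone from myDataSourceProperty
      union all
      select id, first_name, last_name, company, ssn, date_of_birth, home_phone from myDataSourceAuto
  """)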

16. Select the last “Spark SQL” node and add a “Target” node of type “Amazon S3”. Step - 18

In “Data target properties”, select “Parquet” as the Format and “Snappy” as the Compression Type. For “S3 Target Location”, browse to and choose the mod-xxxx-s3bucketstack-xxxx-s3bucket-xxxx bucket.

Append “merged_auto_property” to the “S3 Target Location”, for example:

  • s3://mod-xxxx-s3bucketstack-xxxx-s3bucket-xxxx/merged_auto_property/

In “Data Catalog update options” select

  • “Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions”

For Database, select c360_workshop_db; for Table name, enter “merged_auto_property” (a sketch of the equivalent sink code follows this step). Step - 18
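
For reference, continuing the sketch: an S3 target with the “Create a table in the Data Catalog …” option selected typically translates into a getSink call with enableUpdateCatalog. This is only a sketch; the S3 path uses the placeholder bucket name from the example above, and the variable names are illustrative.

  from awsglue.dynamicframe import DynamicFrame

  merged_dyf = DynamicFrame.fromDF(merged_df, glueContext, "merged_auto_property")

  # Data target node: write snappy-compressed Parquet to S3 and create/update
  # the merged_auto_property table in the c360_workshop_db database
  s3_sink = glueContext.getSink(
      connection_type="s3",
      path="s3://mod-xxxx-s3bucketstack-xxxx-s3bucket-xxxx/merged_auto_property/",  # use your bucket
      compression="snappy",
      enableUpdateCatalog=True,
      updateBehavior="UPDATE_IN_DATABASE",
      partitionKeys=[],
      transformation_ctx="DataTarget_S3",
  )
  s3_sink.setCatalogInfo(catalogDatabase="c360_workshop_db", catalogTableName="merged_auto_property")
  s3_sink.setFormat("glueparquet")
  s3_sink.writeFrame(merged_dyf)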

17. Add the job name Glue_merge_auto_property. Step - 19

18. Click on Job details, choose “GlueServiceRoleLab” as the IAM Role, and click Save. Step - 20

19. Click Run job at the top.

20. Click on the “Runs” tab to see the job execution status and wait for the “Run status” to change to Succeeded. Step - 21

21. To verify, go back to the AWS Glue Console Step - 1 and click on Tables. Step - 22

22. You can also verify the “merged_auto_property” table in Athena and preview it.

Now you can go to Lab 3b: Glue ML Transformation.