Using the Presto Query Log to decide whether to shut down EMR Presto
*
Background
Before trying to stop an EMR cluster that runs Presto, check the status of query execution on Presto.
*
Source code
https://github.com/aws-samples/emr-presto-query-event-listener
*
NB
Presto cannot be run if you are not using an AWS global region.
So far, the commands below have run successfully in an AWS global region (us-west-2 in this test; other AWS global regions have not been tried, but should be OK).
*
The QueryEventListener.jar in the root directory of the aforementioned repo is safe to use. You can either download the whole repo from GitHub and extract the file from it, or download just this file from GitHub.
*
Replace the string "replace-with-your-bucket" in the bash script below with the name of an S3 bucket in your account that you use for this test. Here I use "awsemrprestoqlog".
*
#!/bin/bash
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
  if grep isMaster /mnt/var/lib/info/instance.json | grep true;
  then
    IS_MASTER=true
  else
    IS_MASTER=false
  fi
fi
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
if [ "$IS_MASTER" = true ]; then
  sudo mkdir -p /usr/lib/presto/etc
  sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
*
Save the above file with filename "eventlistenerbootstrap.sh"
*
Upload the QueryEventListener.jar downloaded above and eventlistenerbootstrap.sh to "s3://replace-with-your-bucket/", with the bucket name replaced by your own.
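The upload step can be scripted with the AWS CLI. The following is only a minimal sketch: the function name `upload_artifacts` is hypothetical (it does not appear in the original post), and it assumes both files sit in the current directory and that the AWS CLI is already configured with credentials that can write to the bucket.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original post: copy the listener jar
# and the bootstrap script to the bucket that the bootstrap action reads from.
upload_artifacts() {
  local bucket=$1   # your own bucket name, e.g. awsemrprestoqlog
  local f
  for f in QueryEventListener.jar eventlistenerbootstrap.sh; do
    # Stop at the first failed upload so a half-populated bucket is noticed.
    aws s3 cp "$f" "s3://${bucket}/" || return 1
  done
}
```

For example, `upload_artifacts awsemrprestoqlog` uploads both files to the bucket used in this walkthrough.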
*
(Optional) Launch an EC2 instance whose instance profile grants EMR access. Here I used Amazon Linux.
*
Run the command:
aws emr create-cluster \
  --name "ClusterWithPrestoLogging" \
  --release-label emr-5.10.0 \
  --applications Name=Hive Name=Presto \
  --use-default-roles \
  --instance-count 2 \
  --instance-type m4.large \
  --ec2-attributes KeyName=keyName,SubnetId=subnet-ID \
  --log-uri s3://aws-logs-ACCOUNTID-REGIONCODE/ \
  --bootstrap-actions Path=s3://awsemrprestoqlog/eventlistenerbootstrap.sh,Name=BootstrapActionPrestoLogging,Args=[] \
  --region REGIONCODE
Note
REGIONCODE looks like us-west-2
Replace awsemrprestoqlog with your own bucket name
Replace subnet-ID in the command above
Replace keyName in the command above
*
Wait until the status of the EMR cluster changes to "Waiting Cluster ready".
SSH into the EMR master node.
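The wait for the cluster to become ready can also be done from the command line instead of watching the console. The sketch below polls `aws emr describe-cluster` until the cluster state is WAITING, which is the state the console displays as "Waiting". The function names, the polling interval, and the return codes are illustrative assumptions, not something taken from the original post.

```shell
#!/bin/bash
# Print the current cluster state (STARTING, BOOTSTRAPPING, RUNNING, WAITING, ...).
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" --region "$2" \
    --query 'Cluster.Status.State' --output text
}

# Poll until the cluster reaches WAITING.
# Returns 0 when ready, 1 if the cluster is terminating or terminated,
# 2 if the AWS CLI call itself fails.
wait_until_waiting() {
  local cluster_id=$1 region=$2 interval=${3:-30} state
  while true; do
    state=$(cluster_state "$cluster_id" "$region") || return 2
    case "$state" in
      WAITING) return 0 ;;
      TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
    esac
    sleep "$interval"
  done
}
```

A call such as `wait_until_waiting "$CLUSTER_ID" us-west-2` blocks until the cluster is ready, where `CLUSTER_ID` is a placeholder for the cluster ID returned by `aws emr create-cluster`.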
*
[root@ip-172-31-26-41 ~]# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
*
The command below runs successfully only on AWS global EC2 (us-west-2 in this test).
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
OK
Time taken: 9.789 seconds
*
The command below causes an error on AWS China EC2:
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 5A6639E4FD62B96B), S3 Extended Request ID: C+O/d/LN/nwD4JZT8g0mZG8w47H3c+986TLCi0ni6+ZAzz/wZo3MoZwHsAQZIObGGMcdPbx9Sd4=)
*
Prefixing the S3 bucket name with the region name causes an error on AWS global EC2 (for some AWS global regions, e.g. us-west-2). (The command below has NOT been proven to let Presto run successfully.)
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://us-west-2.support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: FFCD315D56476F4F), S3 Extended Request ID: noP22XoFM/vl9N9+1WqsYSJzgnS8ZuGqMy0QEtJNLj5nok33jYlTwETTPbdGMcnfRLbOqA5Tcfg=)
*
For the cn-north-1 AWS China Beijing region: (The command below has NOT been proven to let Presto run successfully.)
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://cn-north-1.support.elasticmapreduce/training/datasets/wikistats/';
*
For the cn-northwest-1 China Ningxia region: (The command below has NOT been proven to let Presto run successfully.)
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://cn-northwest-1.support.elasticmapreduce/training/datasets/wikistats/';
OK
Time taken: 10.131 seconds
*
The S3 bucket names seem to follow a convention used by some AWS internal team. So far, I have not found a place where AWS explicitly documents the naming convention for buckets used for this purpose, as in the blog post.
For eu-central-1 region, adjust the bucket name to s3://eu-central-1.support.elasticmapreduce/
https://github.com/aws-samples/emr-bootstrap-actions/tree/master/spark
*
hive> exit;
*
[root@ip-172-31-19-227 presto]# presto-cli --server localhost:8889 --catalog hive --schema default
*
https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/
presto:default> SELECT * from wikistats LIMIT 10; # This query was added afterwards, so the log content at the bottom does not reflect it.
 language | page_title
----------+----------------------------------------------------------------------------------------------------------
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BA%D1%83%D1%85%D0%BD%D1%8F
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%82%D
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%8B
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D1%80%D0%BE%D0%B2%D1%8B%D0%B5
 ru       | %D0%9E%D1%81%D0%B5%D1%86%D0%BA%D0%B0_%D0%90%D0%B3%D0%BD%D0%B5%D1%88%D0%BA%D0%B0
 ru       | %D0%9E%D1%81%D0%B8%D0%BD%D0%BD%D0%B8%D0%BA%D0%BE%D0%B2%D1%81%D0%BA%D0%B8%D0%B9_%D1%82%D1%80%D0%B0%D0%BC%D
 ru       | %D0%9E%D1%81%D0%B8%D0%BF
 ru       | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8%D1%87%D0%B8
 ru       | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2,_%D0%9F%D1%91%D1%82%D1%80_%D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8
 ru       | %D0%9E%D1%81%D0%B8%D0%BF_(%D0%AE%D0%BB%D0%B8%D0%B0%D0%BD)_%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2%D0%B8%D1%8
(10 rows)
Query 20181003_130713_00009_mzvgs, FINISHED, 1 node
Splits: 117 total, 21 done (17.95%)
0:05 [7.08K rows, 303KB] [1.3K rows/s, 55.6KB/s]
*
https://aws.amazon.com/blogs/big-data/analyze-data-with-presto-and-airpal-on-amazon-emr/
*
SELECT language, page_title, SUM(hits) AS hits
FROM default.wikistats
WHERE language = 'en'
AND page_title LIKE '%Amazon%'
GROUP BY language, page_title
ORDER BY hits DESC
LIMIT 10;
*
 language |        page_title        | hits
----------+--------------------------+------
 en       | Amazon.com               | 4136
 en       | Amazon_Kindle            | 2968
 en       | Amazon_River             | 2666
 en       | Amazon_Rainforest        | 2010
 en       | Amazons                  | 1536
 en       | Amazon                   |  840
 en       | Amazon_rainforest        |  817
 en       | Amazon_S3                |  576
 en       | Amazon_Women_in_the_Mood |  503
 en       | Amazon_Basin             |  491
(10 rows)
Query 20180928_074659_00008_qhrkg, FINISHED, 1 node
Splits: 247 total, 247 done (100.00%)
1:13 [128M rows, 5.64GB] [1.77M rows/s, 79.7MB/s]
*
presto:default> exit;
*
[root@ip-172-31-26-41 ~]# cd /var/log/presto
*
[root@ip-172-31-26-41 presto]# ll
total 2448
-rw-r--r-- 1 presto presto 148878 Sep 28 15:04 http-request.log
-rw-r--r-- 1 presto presto 2123248 Sep 28 15:03 launcher.log
-rw-r--r-- 1 presto presto 2843 Sep 28 15:02 queries-2018-09-28T14:54:13.0.log
-rw-r--r-- 1 presto presto 0 Sep 28 14:54 queries-2018-09-28T14:54:13.0.log.lck
-rw-r--r-- 1 presto presto 114389 Sep 28 15:02 server.log
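The query log file name in the listing above carries a timestamp, so a script that inspects the log first has to find the right file. The sketch below picks the most recently modified `queries-*.log` file in a directory. The function name `latest_query_log` is hypothetical, and the idea that several such files may accumulate over time is an assumption; only one appears in this test.

```shell
#!/bin/bash
# Hypothetical helper: print the most recently modified queries-*.log file.
# The glob requires the name to end in .log, so the .log.lck lock file is skipped.
latest_query_log() {
  local dir=${1:-/var/log/presto} newest= f
  for f in "$dir"/queries-*.log; do
    [ -e "$f" ] || continue          # glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  # Non-zero exit status when no query log was found.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

On the master node, `cat "$(latest_query_log)"` would then show the newest query log without typing the timestamped name.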
*
The corresponding log:
[root@ip-172-31-26-41 presto]# cat queries-2018-09-28T14:54:13.0.log
*
Sep 28, 2018 2:57:14 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145714_00000_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:57:14.815Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145720_00001_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:57:20.670Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145714_00000_h54yn
Create Time: 2018-09-28T14:57:14.815Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener splitCompleted
INFO: ---------------Split Completed----------------------------
Query ID: 20180928_145720_00001_h54yn
Stage ID: 20180928_145720_00001_h54yn.1
Task ID: 0
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145720_00001_h54yn
Create Time: 2018-09-28T14:57:20.670Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:58:37 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145837_00002_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:58:37.153Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 3:02:45 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145837_00002_h54yn
Create Time: 2018-09-28T14:58:37.153Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
*
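In the log above, every "Query Created" entry is eventually followed by a "Query Completed" entry with the same Query ID. One way to use this, offered only as a sketch and not as the author's procedure, is to treat any Query ID that has a created entry but no completed entry as still running. The function names `list_unfinished_queries` and `safe_to_shut_down` are hypothetical. The sketch relies solely on the marker text and the "Query ID:" field visible in the log above, and it assumes that no relevant entries have been rotated out of the files it is given.

```shell
#!/bin/bash
# Print the Query IDs that were created but never completed, one per line.
# grep -o extracts only the event markers and Query ID fields, in order,
# so the exact line layout of the log does not matter.
list_unfinished_queries() {
  grep -ohE -e '(Query Created|Query Completed|Split Completed)-+|Query ID: [0-9A-Za-z_]+' "$@" |
  awk '
    /^Query Created/   { ev = "created";   next }
    /^Query Completed/ { ev = "completed"; next }
    /^Split Completed/ { ev = "split";     next }
    /^Query ID: / {
      # Attribute the ID to the event marker that immediately preceded it.
      if (ev == "created")   created[$3] = 1
      if (ev == "completed") completed[$3] = 1
      ev = ""
    }
    END { for (id in created) if (!(id in completed)) print id }
  ' | sort
}

# Return 0 if no query is pending, 1 if some are, 2 if a log cannot be read.
safe_to_shut_down() {
  local f pending
  [ $# -gt 0 ] || { echo "usage: safe_to_shut_down LOGFILE..." >&2; return 2; }
  for f in "$@"; do
    [ -r "$f" ] || { echo "cannot read $f" >&2; return 2; }
  done
  pending=$(list_unfinished_queries "$@")
  if [ -n "$pending" ]; then
    echo "queries still running: $pending" >&2
    return 1
  fi
  return 0
}
```

On the master node this could be called as `safe_to_shut_down /var/log/presto/queries-*.log`. A real policy would probably also require an idle period after the last completed query before terminating the cluster; that part is left out here.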
The shutdown decision logic based on the Presto query logs could follow the process below:
*