Using Presto Query Log to decide whether to shutdown EMR Presto
*
Background
Before tyring to stop an EMR cluster that runs Presto, check the status of the query execution on the Presto.
*
Source code
https://github.com/aws-samples/emr-presto-query-event-listener
*
NB
如果不使用AWS global,Presto无法运行
Till now, below commands have been runs successfully under AWS global (us-west-2 in this test, other AWS global have not been tried, but should be OK).
*
The QueryEventListener.jar under the root directory of the aforementioned repo is safe to be used. You can either download the whole repo from GitHub and then extract from it, or you can download this specific file from GitHub.
*
Replace the string "replace-with-your-bucket" in the below bash script, with the S3 bucket name under your account for this test purpose. Here I use "awsemrprestoqlog" to replace it.
*
#!/bin/bash
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
IS_MASTER=true
else
IS_MASTER=false
fi
fi
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
if [ "$IS_MASTER" = true ]; then
sudo mkdir -p /usr/lib/presto/etc
sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
*Save the above file with filename "eventlistenerbootstrap.sh"
*
将上文下载的QueryEventListener.jar和eventlistenerbootstrap.sh上传到被替换了bucket名的位置"s3://replace-with-your-bucket/"下
*
(Optional) Launch EC2 with EMR access as EC2 instance profile. Here I used an Amazon Linux OS.
*
运行命令:
aws emr create-cluster --name "ClusterWithPrestoLogging" --release-label emr-5.10.0 --applications Name=Hive Name=Presto --use-default-roles --instance-count 2 --instance-type m4.large --ec2-attributes KeyName=keyName,SubnetId=subnet-ID --log-uri s3://aws-logs-ACCOUNTID-REGIONCODE/ --bootstrap-actions Path=s3://awsemrprestoqlog/eventlistenerbootstrap.sh,Name=BootstrapActionPrestoLogging,Args=[] --region REGIONCODE
Note
REGIONCODE形如us-west-2
替换掉awsemrprestoqlog
替换掉上文中的subnet-ID
替换掉上文中的keyName
*
Wait until the status of the EMR cluster changed to "Waiting Cluster ready"
SSH to login to the EMR master node
*
[root@ip-172-31-26-41 ~]# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
*
Below command can be successfully run only on AWS global EC2 (us-west-2 in this test)
CREATE EXTERNAL TABLE wikistats ( language STRING, page_title STRING, hits BIGINT, retrived_size BIGINT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';OK
Time taken: 9.789 seconds
*
Below command will cause error on AWS China EC2
hive> CREATE EXTERNAL TABLE wikistats (
> language STRING,
> page_title STRING,
> hits BIGINT,
> retrived_size BIGINT
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n'
> LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 5A6639E4FD62B96B), S3 Extended Request ID: C+O/d/LN/nwD4JZT8g0mZG8w47H3c+986TLCi0ni6+ZAzz/wZo3MoZwHsAQZIObGGMcdPbx9Sd4=)
*Prefixing the S3 bucket name with the region name would cause error on AWS global EC2 (for some AWS global region, e.g. us-west-2): (Below has NOT been proved to make Presto successfully run)
hive> CREATE EXTERNAL TABLE wikistats (
> language STRING,
> page_title STRING,
> hits BIGINT,
> retrived_size BIGINT
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n'
> LOCATION 's3://us-west-2.support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: FFCD315D56476F4F), S3 Extended Request ID: noP22XoFM/vl9N9+1WqsYSJzgnS8ZuGqMy0QEtJNLj5nok33jYlTwETTPbdGMcnfRLbOqA5Tcfg=)
*For cn-north-1 AWS China Beijing region: (Below has NOT been proved to make Presto successfully run)
CREATE EXTERNAL TABLE wikistats ( language STRING, page_title STRING, hits BIGINT, retrived_size BIGINT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 's3://cn-north-1.support.elasticmapreduce/training/datasets/wikistats/';*
For cn-northwest-1 China Ningxia region: (Below has NOT been proved to make Presto successfully run)
hive> CREATE EXTERNAL TABLE wikistats (
> language STRING,
> page_title STRING,
> hits BIGINT,
> retrived_size BIGINT
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n'
> LOCATION 's3://cn-northwest-1.support.elasticmapreduce/training/datasets/wikistats/';
OK Time taken: 10.131 seconds*
The S3 bucket name seems to have some convention within certain AWS internal team. Till now, I did not find a place where AWS explicitly told their naming convention for the bucket name for the purpose as in the blog post.
For eu-central-1 region, adjust the bucket name to s3://eu-central-1.support.elasticmapreduce/
https://github.com/aws-samples/emr-bootstrap-actions/tree/master/spark
*
hive> exit;
*
[root@ip-172-31-19-227 presto]# presto-cli --server localhost:8889 --catalog hive --schema default
*
https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/
presto:default> SELECT * from wikistats LIMIT 10; # This query is added afterwards, and therefore the bottom log content does not reflect this query.
language | page_title ----------+---------------------------------------------------------------------------------------------------------- ru | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BA%D1%83%D1%85%D0%BD%D1%8F ru | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%82%D ru | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%8B ru | %D0%9E%D1%81%D0%B5%D1%82%D1%80%D0%BE%D0%B2%D1%8B%D0%B5 ru | %D0%9E%D1%81%D0%B5%D1%86%D0%BA%D0%B0_%D0%90%D0%B3%D0%BD%D0%B5%D1%88%D0%BA%D0%B0 ru | %D0%9E%D1%81%D0%B8%D0%BD%D0%BD%D0%B8%D0%BA%D0%BE%D0%B2%D1%81%D0%BA%D0%B8%D0%B9_%D1%82%D1%80%D0%B0%D0%BC%D ru | %D0%9E%D1%81%D0%B8%D0%BF ru | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8%D1%87%D0%B8 ru | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2,_%D0%9F%D1%91%D1%82%D1%80_%D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8 ru | %D0%9E%D1%81%D0%B8%D0%BF_(%D0%AE%D0%BB%D0%B8%D0%B0%D0%BD)_%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2%D0%B8%D1%8 (10 rows) Query 20181003_130713_00009_mzvgs, FINISHED, 1 node Splits: 117 total, 21 done (17.95%) 0:05 [7.08K rows, 303KB] [1.3K rows/s, 55.6KB/s]*
https://aws.amazon.com/blogs/big-data/analyze-data-with-presto-and-airpal-on-amazon-emr/
*
SELECT language,
page_title,
SUM(hits) AS hits
FROM default.wikistats
WHERE language = 'en'
AND page_title LIKE '%Amazon%'
GROUP BY language,
page_title
ORDER BY hits DESC
LIMIT 10;
*language | page_title | hits ----------+--------------------------+------ en | Amazon.com | 4136 en | Amazon_Kindle | 2968 en | Amazon_River | 2666 en | Amazon_Rainforest | 2010 en | Amazons | 1536 en | Amazon | 840 en | Amazon_rainforest | 817 en | Amazon_S3 | 576 en | Amazon_Women_in_the_Mood | 503 en | Amazon_Basin | 491 (10 rows) Query 20180928_074659_00008_qhrkg, FINISHED, 1 node Splits: 247 total, 247 done (100.00%) 1:13 [128M rows, 5.64GB] [1.77M rows/s, 79.7MB/s]*
presto:default> exit;
*
[root@ip-172-31-26-41 ~]# cd /var/log/presto
*
[root@ip-172-31-26-41 presto]# ll
total 2448
-rw-r--r-- 1 presto presto 148878 Sep 28 15:04 http-request.log
-rw-r--r-- 1 presto presto 2123248 Sep 28 15:03 launcher.log
-rw-r--r-- 1 presto presto 2843 Sep 28 15:02 queries-2018-09-28T14:54:13.0.log
-rw-r--r-- 1 presto presto 0 Sep 28 14:54 queries-2018-09-28T14:54:13.0.log.lck
-rw-r--r-- 1 presto presto 114389 Sep 28 15:02 server.log
*
对应的日志
[root@ip-172-31-26-41 presto]# cat queries-2018-09-28T14:54:13.0.log
*
Sep 28, 2018 2:57:14 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145714_00000_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:57:14.815Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145720_00001_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:57:20.670Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145714_00000_h54yn
Create Time: 2018-09-28T14:57:14.815Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener splitCompleted
INFO: ---------------Split Completed----------------------------
Query ID: 20180928_145720_00001_h54yn
Stage ID: 20180928_145720_00001_h54yn.1
Task ID: 0
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145720_00001_h54yn
Create Time: 2018-09-28T14:57:20.670Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:58:37 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
Query ID: 20180928_145837_00002_h54yn
Query State: QUEUED
User: root
Create Time: 2018-09-28T14:58:37.153Z
Principal: Optional.empty
Remote Client Address: Optional[127.0.0.1]
Source: Optional[presto-cli]
User Agent: Optional[StatementClient/0.187]
Catalog: Optional[hive]
Schema: Optional[default]
Server Address: 172.31.26.41
Sep 28, 2018 3:02:45 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
Query ID: 20180928_145837_00002_h54yn
Create Time: 2018-09-28T14:58:37.153Z
User: root
Complete: true
Remote Client Address: Optional[127.0.0.1]
*The shutdown decision logic based on Presto Query logs could follow below process:
*