Using Presto Query Log to decide whether to shutdown EMR Presto

2018年09月28日

*
Background
Before tyring to stop an EMR cluster that runs Presto, check the status of the query execution on the Presto.
*
Source code
https://github.com/aws-samples/emr-presto-query-event-listener
*
NB
如果不使用AWS global,Presto无法运行

Till now, below commands have been runs successfully under AWS global (us-west-2 in this test, other AWS global have not been tried, but should be OK).
*
The QueryEventListener.jar under the root directory of the aforementioned repo is safe to be used. You can either download the whole repo from GitHub and then extract from it, or you can download this specific file from GitHub.
*
Replace the string "replace-with-your-bucket" in the below bash script, with the S3 bucket name under your account for this test purpose. Here I use "awsemrprestoqlog" to replace it.
*

#!/bin/bash
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
        if grep isMaster /mnt/var/lib/info/instance.json | grep true;
        then
        IS_MASTER=true
        else
        IS_MASTER=false
        fi
fi
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
if [ "$IS_MASTER" = true ]; then
sudo mkdir -p /usr/lib/presto/etc
sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
*
Save the above file with filename "eventlistenerbootstrap.sh"
*
将上文下载的QueryEventListener.jar和eventlistenerbootstrap.sh上传到被替换了bucket名的位置"s3://replace-with-your-bucket/"下
*
(Optional) Launch EC2 with EMR access as EC2 instance profile. Here I used an Amazon Linux OS.
*
运行命令:
aws emr create-cluster --name "ClusterWithPrestoLogging" --release-label emr-5.10.0 --applications Name=Hive Name=Presto --use-default-roles --instance-count 2 --instance-type m4.large --ec2-attributes KeyName=keyName,SubnetId=subnet-ID --log-uri s3://aws-logs-ACCOUNTID-REGIONCODE/ --bootstrap-actions Path=s3://awsemrprestoqlog/eventlistenerbootstrap.sh,Name=BootstrapActionPrestoLogging,Args=[] --region REGIONCODE
Note
REGIONCODE形如us-west-2
替换掉awsemrprestoqlog
替换掉上文中的subnet-ID
替换掉上文中的keyName
*
Wait until the status of the EMR cluster changed to "Waiting Cluster ready"
SSH to login to the EMR master node
*
[root@ip-172-31-26-41 ~]# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
*
Below command can be successfully run only on AWS global EC2 (us-west-2 in this test)
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
OK
Time taken: 9.789 seconds
*
Below command will cause error on AWS China EC2
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 5A6639E4FD62B96B), S3 Extended Request ID: C+O/d/LN/nwD4JZT8g0mZG8w47H3c+986TLCi0ni6+ZAzz/wZo3MoZwHsAQZIObGGMcdPbx9Sd4=)
*
Prefixing the S3 bucket name with the region name would cause error on AWS global EC2 (for some AWS global region, e.g. us-west-2): (Below has NOT been proved to make Presto successfully run)
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://us-west-2.support.elasticmapreduce/training/datasets/wikistats/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: FFCD315D56476F4F), S3 Extended Request ID: noP22XoFM/vl9N9+1WqsYSJzgnS8ZuGqMy0QEtJNLj5nok33jYlTwETTPbdGMcnfRLbOqA5Tcfg=)
*
For cn-north-1 AWS China Beijing region: (Below has NOT been proved to make Presto successfully run)
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://cn-north-1.support.elasticmapreduce/training/datasets/wikistats/';
*
For cn-northwest-1 China Ningxia region: (Below has NOT been proved to make Presto successfully run)
hive> CREATE EXTERNAL TABLE wikistats (
    > language STRING,
    > page_title STRING,
    > hits BIGINT,
    > retrived_size BIGINT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > LINES TERMINATED BY '\n'
    > LOCATION 's3://cn-northwest-1.support.elasticmapreduce/training/datasets/wikistats/';
OK
Time taken: 10.131 seconds
*
The S3 bucket name seems to have some convention within certain AWS internal team. Till now, I did not find a place where AWS explicitly told their naming convention for the bucket name for the purpose as in the blog post.
For eu-central-1 region, adjust the bucket name to s3://eu-central-1.support.elasticmapreduce/
https://github.com/aws-samples/emr-bootstrap-actions/tree/master/spark
*
hive> exit;
*
[root@ip-172-31-19-227 presto]# presto-cli --server localhost:8889 --catalog hive --schema default
*
https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/

presto:default> SELECT * from wikistats LIMIT 10;  # This query is added afterwards, and therefore the bottom log content does not reflect this query.
 language |                                                        page_title
----------+----------------------------------------------------------------------------------------------------------
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BA%D1%83%D1%85%D0%BD%D1%8F
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%82%D
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D0%B8%D0%BD%D1%8B
 ru       | %D0%9E%D1%81%D0%B5%D1%82%D1%80%D0%BE%D0%B2%D1%8B%D0%B5
 ru       | %D0%9E%D1%81%D0%B5%D1%86%D0%BA%D0%B0_%D0%90%D0%B3%D0%BD%D0%B5%D1%88%D0%BA%D0%B0
 ru       | %D0%9E%D1%81%D0%B8%D0%BD%D0%BD%D0%B8%D0%BA%D0%BE%D0%B2%D1%81%D0%BA%D0%B8%D0%B9_%D1%82%D1%80%D0%B0%D0%BC%D
 ru       | %D0%9E%D1%81%D0%B8%D0%BF
 ru       | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8%D1%87%D0%B8
 ru       | %D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2,_%D0%9F%D1%91%D1%82%D1%80_%D0%9E%D1%81%D0%B8%D0%BF%D0%BE%D0%B2%D0%B8
 ru       | %D0%9E%D1%81%D0%B8%D0%BF_(%D0%AE%D0%BB%D0%B8%D0%B0%D0%BD)_%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2%D0%B8%D1%8
(10 rows)

Query 20181003_130713_00009_mzvgs, FINISHED, 1 node
Splits: 117 total, 21 done (17.95%)
0:05 [7.08K rows, 303KB] [1.3K rows/s, 55.6KB/s]
*
https://aws.amazon.com/blogs/big-data/analyze-data-with-presto-and-airpal-on-amazon-emr/
*
SELECT language, 
    page_title,
    SUM(hits) AS hits 
FROM default.wikistats 
WHERE language = 'en'
AND page_title LIKE '%Amazon%'
GROUP BY language,
    page_title
ORDER BY hits DESC
LIMIT 10;
*
 language |        page_title        | hits
----------+--------------------------+------
 en       | Amazon.com               | 4136
 en       | Amazon_Kindle            | 2968
 en       | Amazon_River             | 2666
 en       | Amazon_Rainforest        | 2010
 en       | Amazons                  | 1536
 en       | Amazon                   |  840
 en       | Amazon_rainforest        |  817
 en       | Amazon_S3                |  576
 en       | Amazon_Women_in_the_Mood |  503
 en       | Amazon_Basin             |  491
(10 rows)

Query 20180928_074659_00008_qhrkg, FINISHED, 1 node
Splits: 247 total, 247 done (100.00%)
1:13 [128M rows, 5.64GB] [1.77M rows/s, 79.7MB/s]
*
presto:default> exit;
*
[root@ip-172-31-26-41 ~]# cd /var/log/presto
*
[root@ip-172-31-26-41 presto]# ll
total 2448
-rw-r--r-- 1 presto presto 148878 Sep 28 15:04 http-request.log
-rw-r--r-- 1 presto presto 2123248 Sep 28 15:03 launcher.log
-rw-r--r-- 1 presto presto 2843 Sep 28 15:02 queries-2018-09-28T14:54:13.0.log
-rw-r--r-- 1 presto presto 0 Sep 28 14:54 queries-2018-09-28T14:54:13.0.log.lck
-rw-r--r-- 1 presto presto 114389 Sep 28 15:02 server.log
*
对应的日志
[root@ip-172-31-26-41 presto]# cat queries-2018-09-28T14:54:13.0.log
*
Sep 28, 2018 2:57:14 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
     Query ID: 20180928_145714_00000_h54yn
     Query State: QUEUED
     User: root
     Create Time: 2018-09-28T14:57:14.815Z
     Principal: Optional.empty
     Remote Client Address: Optional[127.0.0.1]
     Source: Optional[presto-cli]
     User Agent: Optional[StatementClient/0.187]
     Catalog: Optional[hive]
     Schema: Optional[default]
     Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
     Query ID: 20180928_145720_00001_h54yn
     Query State: QUEUED
     User: root
     Create Time: 2018-09-28T14:57:20.670Z
     Principal: Optional.empty
     Remote Client Address: Optional[127.0.0.1]
     Source: Optional[presto-cli]
     User Agent: Optional[StatementClient/0.187]
     Catalog: Optional[hive]
     Schema: Optional[default]
     Server Address: 172.31.26.41
Sep 28, 2018 2:57:20 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
     Query ID: 20180928_145714_00000_h54yn
     Create Time: 2018-09-28T14:57:14.815Z
     User: root
     Complete: true
     Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener splitCompleted
INFO: ---------------Split Completed----------------------------
     Query ID: 20180928_145720_00001_h54yn
     Stage ID: 20180928_145720_00001_h54yn.1
     Task ID: 0
Sep 28, 2018 2:57:22 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
     Query ID: 20180928_145720_00001_h54yn
     Create Time: 2018-09-28T14:57:20.670Z
     User: root
     Complete: true
     Remote Client Address: Optional[127.0.0.1]
Sep 28, 2018 2:58:37 PM com.amazonaws.QueryEventListener.QueryEventListener queryCreated
INFO: ---------------Query Created----------------------------
     Query ID: 20180928_145837_00002_h54yn
     Query State: QUEUED
     User: root
     Create Time: 2018-09-28T14:58:37.153Z
     Principal: Optional.empty
     Remote Client Address: Optional[127.0.0.1]
     Source: Optional[presto-cli]
     User Agent: Optional[StatementClient/0.187]
     Catalog: Optional[hive]
     Schema: Optional[default]
     Server Address: 172.31.26.41
Sep 28, 2018 3:02:45 PM com.amazonaws.QueryEventListener.QueryEventListener queryCompleted
INFO: ---------------Query Completed----------------------------
     Query ID: 20180928_145837_00002_h54yn
     Create Time: 2018-09-28T14:58:37.153Z
     User: root
     Complete: true
     Remote Client Address: Optional[127.0.0.1]
*
The shutdown decision logic based on Presto Query logs could follow below process:


*

Category: big_data Tags: public

Upvote


Downvote