Analyzing CloudFront Access Logs Using Athena Partitioned Tables

Vinayak Pandey
Published in AWS Tip · 2 min read · Feb 16, 2025

If you are still using the legacy S3 settings for CloudFront access logs, it’s advisable to migrate to the updated S3 configuration, since it lets you partition your log data and analyze it more efficiently.

In this post, we’ll see how to use Athena partitioned tables to query CloudFront access logs.

Step 1: Select your CloudFront distribution->Logging->Add->Amazon S3 and specify the bucket name. For partitioning, specify AWSLogs/{account-id}/CloudFront/{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/ as the pattern, enable the Hive-compatible file name format, and set the output format to Parquet.
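If you prefer to script this step: the updated CloudFront access logging is delivered through CloudWatch Logs vended log delivery, so the console clicks above roughly correspond to the following AWS CLI calls. This is a minimal sketch, assuming us-east-1 (where CloudFront deliveries are created); the names cf-logs-source and cf-logs-destination and the account ID 111122223333 are placeholders.

# Register the distribution as a log source (hypothetical names throughout).
aws logs create-delivery-source \
  --region us-east-1 \
  --name cf-logs-source \
  --log-type ACCESS_LOGS \
  --resource-arn arn:aws:cloudfront::111122223333:distribution/E3EJA0CG95HSWP

# Point a destination at the S3 bucket, with Parquet output.
aws logs create-delivery-destination \
  --region us-east-1 \
  --name cf-logs-destination \
  --output-format parquet \
  --delivery-destination-configuration destinationResourceArn=arn:aws:s3:::vinayak-cf-access-logs

# Connect source to destination; the suffix path and Hive-compatible flag
# reproduce the partitioning pattern chosen in the console above.
aws logs create-delivery \
  --region us-east-1 \
  --delivery-source-name cf-logs-source \
  --delivery-destination-arn arn:aws:logs:us-east-1:111122223333:delivery-destination:cf-logs-destination \
  --s3-delivery-configuration 'suffixPath=AWSLogs/{account-id}/CloudFront/{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/,enableHiveCompatiblePath=true'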

Step 2: Make some requests to your distribution so that access logs are generated. The log file paths will look like this:

s3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP/year=2025/month=02/day=16/hour=10/
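Log files may take a few minutes to show up after the requests are made. A quick way to generate some traffic (d111111abcdef8.cloudfront.net is a placeholder; use your distribution’s domain name):

for i in $(seq 1 50); do
  curl -s -o /dev/null "https://d111111abcdef8.cloudfront.net/?req=$i"
done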

Step 3: Create an Athena table with the following DDL statement, replacing the S3 URI with yours.

CREATE EXTERNAL TABLE `cf_access_logs10`(
  `date` string,
  `time` string,
  `x_edge_location` string,
  `sc_bytes` string,
  `c_ip` string,
  `cs_method` string,
  `cs_host` string,
  `cs_uri_stem` string,
  `sc_status` string,
  `cs_referrer` string,
  `cs_user_agent` string,
  `cs_uri_query` string,
  `cs_cookie` string,
  `x_edge_result_type` string,
  `x_edge_request_id` string,
  `x_host_header` string,
  `cs_protocol` string,
  `cs_bytes` string,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `x_edge_response_result_type` string,
  `cs_protocol_version` string,
  `fle_status` string,
  `fle_encrypted_fields` string,
  `c_port` string,
  `time_to_first_byte` string,
  `x_edge_detailed_result_type` string,
  `sc_content_type` string,
  `sc_content_len` string,
  `sc_range_start` string,
  `sc_range_end` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<AWS_ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP'
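As defined above, the table needs its partitions loaded whenever a new hour of logs arrives (see Step 4). Optionally, you could instead append Athena partition projection properties to the CREATE statement, so Athena derives partitions directly from the Hive-compatible paths and no loading is ever needed. A sketch, assuming the year range below fits your data:

TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.year.type'='integer',
  'projection.year.range'='2025,2030',
  'projection.month.type'='integer',
  'projection.month.range'='1,12',
  'projection.month.digits'='2',
  'projection.day.type'='integer',
  'projection.day.range'='1,31',
  'projection.day.digits'='2',
  'projection.hour.type'='integer',
  'projection.hour.range'='0,23',
  'projection.hour.digits'='2',
  'storage.location.template'='s3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<AWS_ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP/year=${year}/month=${month}/day=${day}/hour=${hour}/')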

Step 4: Now select your table in the Athena console and load its partitions.
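In the console this is the “Load partitions” action on the table, which runs:

MSCK REPAIR TABLE cf_access_logs10;

With the plain partitioned table above, this has to be re-run as new hourly partitions land (unless you opted for partition projection).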

Step 5: Now run queries that use the partition columns in the WHERE clause:

SELECT * FROM "default"."cf_access_logs10" where hour=10
SELECT * FROM "default"."cf_access_logs10" where hour=11

and you’ll see the difference in the amount of data scanned, because the queries now leverage partitions to read only the necessary data.
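To make the saving visible, compare a full scan against a query pinned to a single hour (the “Data scanned” figure is shown under each query’s results in Athena). The partition values below match the example path from Step 2:

-- Scans all Parquet objects under the table LOCATION
SELECT count(*) FROM "default"."cf_access_logs10";

-- Scans only one hour's partition
SELECT sc_status, count(*) AS requests
FROM "default"."cf_access_logs10"
WHERE year = 2025 AND month = 2 AND day = 16 AND hour = 10
GROUP BY sc_status
ORDER BY requests DESC;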
