Analyzing CloudFront Access Logs Using Athena Partitioned Tables

Vinayak Pandey
Published in AWS Tip · 2 min read · Feb 16, 2025

If you are still using the legacy S3 settings for CloudFront access logs, it’s advisable to migrate to the updated S3 configuration, since it lets you partition your log data and analyze it more efficiently.

In this post, we’ll see how to use Athena partitioned tables to query CloudFront access logs.

Step 1: Select your CloudFront distribution->Logging->Add->Amazon S3 and specify the bucket name. For partitioning, specify AWSLogs/{account-id}/CloudFront/{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/ as the pattern, enable the Hive-compatible file name format, and set the output format to Parquet.
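If you prefer to script this step: the updated CloudFront access logging is delivered through CloudWatch Logs vended log delivery, so the console clicks above roughly correspond to the following AWS CLI calls. This is a minimal sketch, assuming us-east-1 (where CloudFront deliveries are created); the names cf-logs-source and cf-logs-destination and the account ID 111122223333 are placeholders.

# Register the distribution as a log source (hypothetical names throughout).
aws logs create-delivery-source \
  --region us-east-1 \
  --name cf-logs-source \
  --log-type ACCESS_LOGS \
  --resource-arn arn:aws:cloudfront::111122223333:distribution/E3EJA0CG95HSWP

# Point a destination at the S3 bucket, with Parquet output.
aws logs create-delivery-destination \
  --region us-east-1 \
  --name cf-logs-destination \
  --output-format parquet \
  --delivery-destination-configuration destinationResourceArn=arn:aws:s3:::vinayak-cf-access-logs

# Connect source to destination; the suffix path and Hive-compatible flag
# reproduce the partitioning pattern chosen in the console above.
aws logs create-delivery \
  --region us-east-1 \
  --delivery-source-name cf-logs-source \
  --delivery-destination-arn arn:aws:logs:us-east-1:111122223333:delivery-destination:cf-logs-destination \
  --s3-delivery-configuration 'suffixPath=AWSLogs/{account-id}/CloudFront/{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/,enableHiveCompatiblePath=true'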

Step 2: Make some requests to your distribution so that access logs are generated. The log file paths will look like this:

s3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP/year=2025/month=02/day=16/hour=10/
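Log files may take a few minutes to show up after the requests are made. A quick way to generate some traffic (d111111abcdef8.cloudfront.net is a placeholder; use your distribution’s domain name):

for i in $(seq 1 50); do
  curl -s -o /dev/null "https://d111111abcdef8.cloudfront.net/?req=$i"
done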

Step 3: Create an Athena table with the following DDL statement, replacing the S3 URI with yours.

CREATE EXTERNAL TABLE `cf_access_logs10`(
  `date` string,
  `time` string,
  `x_edge_location` string,
  `sc_bytes` string,
  `c_ip` string,
  `cs_method` string,
  `cs_host` string,
  `cs_uri_stem` string,
  `sc_status` string,
  `cs_referrer` string,
  `cs_user_agent` string,
  `cs_uri_query` string,
  `cs_cookie` string,
  `x_edge_result_type` string,
  `x_edge_request_id` string,
  `x_host_header` string,
  `cs_protocol` string,
  `cs_bytes` string,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `x_edge_response_result_type` string,
  `cs_protocol_version` string,
  `fle_status` string,
  `fle_encrypted_fields` string,
  `c_port` string,
  `time_to_first_byte` string,
  `x_edge_detailed_result_type` string,
  `sc_content_type` string,
  `sc_content_len` string,
  `sc_range_start` string,
  `sc_range_end` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<AWS_ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP'
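As defined above, the table needs its partitions loaded whenever a new hour of logs arrives (see Step 4). Optionally, you could instead append Athena partition projection properties to the CREATE statement, so Athena derives partitions directly from the Hive-compatible paths and no loading is ever needed. A sketch, assuming the year range below fits your data:

TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.year.type'='integer',
  'projection.year.range'='2025,2030',
  'projection.month.type'='integer',
  'projection.month.range'='1,12',
  'projection.month.digits'='2',
  'projection.day.type'='integer',
  'projection.day.range'='1,31',
  'projection.day.digits'='2',
  'projection.hour.type'='integer',
  'projection.hour.range'='0,23',
  'projection.hour.digits'='2',
  'storage.location.template'='s3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<AWS_ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP/year=${year}/month=${month}/day=${day}/hour=${hour}/')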

Step 4: Now select your table in the Athena console and load its partitions.
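In the console this is the “Load partitions” action on the table, which runs:

MSCK REPAIR TABLE cf_access_logs10;

With the plain partitioned table above, this has to be re-run as new hourly partitions land (unless you opted for partition projection).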

Step 5: Now run queries that use the partition columns in the WHERE clause:

SELECT * FROM "default"."cf_access_logs10" where hour=10
SELECT * FROM "default"."cf_access_logs10" where hour=11

and you’ll see the difference in the amount of data scanned, because the queries now leverage partitions to read only the necessary data.
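To make the saving visible, compare a full scan against a query pinned to a single hour (the “Data scanned” figure is shown under each query’s results in Athena). The partition values below match the example path from Step 2:

-- Scans all Parquet objects under the table LOCATION
SELECT count(*) FROM "default"."cf_access_logs10";

-- Scans only one hour's partition
SELECT sc_status, count(*) AS requests
FROM "default"."cf_access_logs10"
WHERE year = 2025 AND month = 2 AND day = 16 AND hour = 10
GROUP BY sc_status
ORDER BY requests DESC;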
