Analyzing CloudFront Access Logs Using Athena Partitioned Tables
If you are still using the legacy S3 settings for CloudFront access logs, it’s worth migrating to the updated S3 logging configuration: it lets you partition your log data and analyze it far more efficiently.
In this post, we’ll see how to use Athena partitioned tables to query CloudFront access logs.
Step 1: Select your CloudFront distribution -> Logging -> Add -> Amazon S3 and specify a bucket name. For partitioning, specify AWSLogs/{account-id}/CloudFront/{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/ as the path pattern. Enable the Hive-compatible file name format and set the output format to Parquet.

Step 2: Make some requests to your distribution so that access logs are generated. With the settings above, the log files land under a Hive-compatible path like this:
s3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP/year=2025/month=02/day=16/hour=10/
Step 3: Create an Athena table with the following DDL statement, replacing the S3 URI (bucket, account ID, and distribution ID) with yours.
CREATE EXTERNAL TABLE `cf_access_logs10`(
  `date` string,
  `time` string,
  `x_edge_location` string,
  `sc_bytes` string,
  `c_ip` string,
  `cs_method` string,
  `cs_host` string,
  `cs_uri_stem` string,
  `sc_status` string,
  `cs_referrer` string,
  `cs_user_agent` string,
  `cs_uri_query` string,
  `cs_cookie` string,
  `x_edge_result_type` string,
  `x_edge_request_id` string,
  `x_host_header` string,
  `cs_protocol` string,
  `cs_bytes` string,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `x_edge_response_result_type` string,
  `cs_protocol_version` string,
  `fle_status` string,
  `fle_encrypted_fields` string,
  `c_port` string,
  `time_to_first_byte` string,
  `x_edge_detailed_result_type` string,
  `sc_content_type` string,
  `sc_content_len` string,
  `sc_range_start` string,
  `sc_range_end` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://vinayak-cf-access-logs/AWSLogs/aws-account-id=<AWS_ACCOUNT_ID>/CloudFront/DistributionId=E3EJA0CG95HSWP'
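Note that at this point the table still looks empty: a partitioned table returns no rows until its partitions are registered, which is what the next step does.

-- Returns zero rows until the partitions are loaded in Step 4.
SELECT * FROM "default"."cf_access_logs10" LIMIT 10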
Step 4: Now select your table in the Athena console and load its partitions.
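Because the log paths are Hive-compatible, the console’s Load partitions action boils down to a repair statement you can also run yourself. A minimal sketch, using the table name from Step 3:

-- Scan the table's S3 location and register any new
-- year=/month=/day=/hour= partitions.
MSCK REPAIR TABLE cf_access_logs10

-- Verify which partitions Athena now knows about.
SHOW PARTITIONS cf_access_logs10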

Step 5: Now run queries that filter on a partition column in the WHERE clause:
SELECT * FROM "default"."cf_access_logs10" WHERE hour = 10
SELECT * FROM "default"."cf_access_logs10" WHERE hour = 11

and you’ll see the difference in the amount of data scanned: each query now uses the hour partition to read only the matching objects instead of the entire table.
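The same pruning applies to aggregate queries. As a sketch (the year/month/day values are taken from the example path in Step 2), this counts requests per HTTP status for a single day and only reads that day’s objects:

-- Requests per status code for 2025-02-16; partition pruning means
-- Athena reads only objects under the year=2025/month=02/day=16 prefix.
SELECT sc_status, COUNT(*) AS requests
FROM "default"."cf_access_logs10"
WHERE year = 2025 AND month = 2 AND day = 16
GROUP BY sc_status
ORDER BY requests DESC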
