Managing large amounts of data on AWS S3 can be challenging, especially when you need to delete a massive number of files quickly. In my case, I had 13 TB of files to delete and needed to speed up the process. Python, together with the asyncio library, provided an efficient solution. Here’s a guide on how to leverage Python and asyncio to delete files from S3 quickly.
What is Python?
Python is a high-level, interpreted programming language known for its readability and versatility. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python’s extensive standard and community-driven libraries make it popular for various applications, from web development to data science and automation.
What is Asyncio?
asyncio is a Python library that provides asynchronous programming support, allowing you to write concurrent code using the async/await syntax. Asynchronous programming is ideal for I/O-bound tasks, such as network requests, where waiting for a response can be done without blocking the execution of other tasks. This can lead to significant performance improvements in scenarios involving many I/O operations, like our use case of deleting files from S3.
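To make the concurrency model concrete, here is a minimal, self-contained sketch (unrelated to AWS) showing how asyncio.gather runs two I/O-bound tasks at the same time:

import asyncio

async def fetch(name, delay):
    # asyncio.sleep stands in for a blocking network call; while this
    # coroutine waits, the event loop is free to run other tasks.
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # Both coroutines wait concurrently, so this takes ~2s rather than 3s.
    print(await asyncio.gather(fetch("a", 1), fetch("b", 2)))

asyncio.run(main())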
Setting Up Your Environment
To get started, ensure you have Python installed. You can download it from the official website. Additionally, you’ll need to install the aioboto3 library, an asynchronous version of the boto3 library for AWS services.
You can install aioboto3 using pip:
pip install aioboto3
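aioboto3 mirrors the familiar boto3 interface, but clients are created as async context managers and every API call is awaited. Here is a minimal sketch of the pattern (assuming credentials are already configured in your environment, e.g. via environment variables):

import asyncio
import aioboto3

async def main():
    session = aioboto3.Session()
    # Clients are async context managers; every API call is awaited.
    async with session.client("s3") as s3:
        resp = await s3.list_buckets()
        print([b["Name"] for b in resp["Buckets"]])

asyncio.run(main())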
The Code
Here’s the complete Python script to delete files from S3 using asyncio:
import asyncio

import aioboto3
from botocore.exceptions import ClientError

# Buckets to purge; edit this list to match your environment.
bucket_names = [
    "my-bucket-1",
    "my-bucket-2",
]

async def list_objects(s3_client, bucket_name, prefix):
    """Collect the keys of all objects under the given bucket and prefix."""
    objects = []
    try:
        paginator = s3_client.get_paginator('list_objects_v2')
        async for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get('Contents', []):
                objects.append(obj['Key'])
    except ClientError as e:
        print(f"{bucket_name} {prefix}: error listing objects: {e}")
    return objects

async def delete_objects(s3_client, bucket_name, objects):
    """Delete a batch of keys with a single DeleteObjects request."""
    try:
        delete_requests = [{'Key': obj} for obj in objects]
        await s3_client.delete_objects(
            Bucket=bucket_name,
            Delete={'Objects': delete_requests}
        )
        print(f"{bucket_name}: deleted {len(objects)} objects")
    except ClientError as e:
        print(f"{bucket_name}: error deleting objects: {e}")

async def batch_delete(bucket_name, prefix, aws_access_key_id, aws_secret_access_key, aws_region):
    session = aioboto3.Session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=aws_region
    )
    async with session.client('s3') as s3_client:
        while True:
            objects = await list_objects(s3_client, bucket_name, prefix)
            if not objects:
                print(f"{bucket_name} {prefix}: No more objects to delete.")
                break
            # Split the listing into batches of 10 keys and delete them concurrently.
            tasks = []
            for i in range(0, len(objects), 10):
                batch = objects[i:i + 10]
                tasks.append(delete_objects(s3_client, bucket_name, batch))
            await asyncio.gather(*tasks)
            await asyncio.sleep(1)  # optional delay between rounds

if __name__ == "__main__":
    aws_region = input("Enter the AWS region: ")
    aws_access_key_id = input("Enter your AWS Access Key ID: ")
    aws_secret_access_key = input("Enter your AWS Secret Access Key: ")
    prefix = ""
    for bucket_name in bucket_names:
        asyncio.run(batch_delete(bucket_name, prefix, aws_access_key_id, aws_secret_access_key, aws_region))
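One tuning note: the script deletes 10 keys per request so that many small requests stay in flight, but the S3 DeleteObjects API accepts up to 1,000 keys per call. If you would rather trade concurrency for fewer round trips, a hypothetical chunk helper like the one below could replace the inline batching loop in batch_delete:

def chunk(keys, size=1000):
    # DeleteObjects accepts at most 1,000 keys per request,
    # so `size` should stay at or below that limit.
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

Smaller batches give asyncio more tasks to interleave; larger batches reduce request overhead. The sweet spot depends on your object count and rate limits, so it is worth experimenting.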
How to Run the Script
- Save the code to a file, e.g., s3_delete.py.
- Install the required library using pip:
pip3 install aioboto3
- Run the script:
python3 s3_delete.py
The script will prompt you to enter your AWS region, access key ID, and secret access key. Ensure you have the necessary permissions to list and delete objects in the specified S3 buckets.
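A mistyped key pair only surfaces once the first request fails, so before pointing a mass delete at terabytes of data it can be worth confirming that the credentials resolve to the identity you expect. Here is a minimal sanity check using the STS GetCallerIdentity call (boto3 normally comes along as a dependency of aioboto3):

import boto3

# GetCallerIdentity needs no special IAM permissions; it simply returns
# the account and ARN that the supplied credentials belong to.
sts = boto3.client(
    "sts",
    aws_access_key_id=input("Enter your AWS Access Key ID: "),
    aws_secret_access_key=input("Enter your AWS Secret Access Key: "),
    region_name="us-east-1",  # any region works for STS
)
print(sts.get_caller_identity()["Arn"])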
Explanation
- Listing Objects: The list_objects function uses aioboto3 to asynchronously paginate through the objects under the specified S3 bucket and prefix, collecting their keys.
- Deleting Objects: The delete_objects function sends a batch delete request to S3, removing up to 10 objects in a single call.
- Batch Deletion: The batch_delete function manages the asynchronous listing and deletion process. It creates a session with your AWS credentials, lists the remaining objects, splits them into batches of 10, and deletes the batches concurrently.
By using asyncio and aioboto3, this script efficiently handles the deletion of large numbers of files from S3, significantly speeding up the process compared to a synchronous approach; a sketch of the synchronous equivalent follows below for contrast.
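For contrast, here is roughly what the synchronous equivalent looks like with plain boto3: each DeleteObjects call blocks until S3 responds, so batches are processed one after another rather than in flight together. This is a minimal sketch, not the original script:

import boto3

def sync_batch_delete(bucket_name, prefix):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    # Pages hold at most 1,000 keys, which DeleteObjects also accepts.
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each call blocks; nothing else proceeds while we wait on S3.
            s3.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})

sync_batch_delete("my-bucket-1", "")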
Wrapping Up
Python and asyncio provide powerful tools for managing and automating large-scale data operations. This script demonstrates how to leverage them to delete files from AWS S3 quickly and efficiently. Whether you are dealing with terabytes of data or smaller datasets, this approach can help you save time and resources.
Feel free to adapt and expand this script to suit your specific needs. Happy coding!