I’m a data engineer, and I’ve been spending a lot of time with Google’s Python-based cloud command-line tools like gcloud and gsutil. I started a project that required summarizing files in a large cloud bucket (hundreds of terabytes), trying to answer questions like: What kinds of files are there? Which sub-directories are largest? Are there any duplicate files? I quickly became frustrated with the options Google offers for completing this task on the command line, especially given that tools like aws s3 ls include a --summarize option.
I started with the most obvious command, gsutil ls, to capture information about this massive cloud bucket: total storage size, sub-directory sizes, file extensions, duplicate files, and so on. While this command provides nearly all of this information for each individual blob, capturing it for summarization required a lot of grep, sed, and xargs commands piped together. On top of that, gsutil ls can take a few hours to run on a large bucket.
I quickly turned to Python and used the existing google-cloud-storage library to return this information in a more workable form. But I still wanted the option of a cloud bucket summary on the command line, so I built the gsummarize CLI to handle this very case.
The tool has two main functions: summarizing all of the files and sub-directories in a given Google Cloud Storage bucket by their storage size (and optionally by file extension), and comparing the CRC32C hash values of all files in a bucket to identify duplicates. I also added features that let authenticated users write summary CSVs to Google Cloud Storage URIs and summarize or deduplicate a sub-directory of a bucket rather than the whole thing.
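To make those two functions concrete, here is a minimal Python sketch of the general idea. It is not the gsummarize source code. The BlobInfo record and the helper names (list_bucket, size_by_directory, size_by_extension, find_duplicates) are made up for illustration, and grouping duplicates by size as well as CRC32C is my own assumption. The only google-cloud-storage pieces it relies on are Client.list_blobs and the name, size, and crc32c properties of each blob.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional


class BlobInfo(NamedTuple):
    """Hypothetical record holding the three blob fields the summaries need."""
    name: str
    size: int
    crc32c: str


def list_bucket(bucket_name: str, prefix: Optional[str] = None) -> Iterator[BlobInfo]:
    """Yield one BlobInfo per object in the bucket (optionally under a prefix)."""
    # Imported lazily so the pure functions below work without the library.
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        yield BlobInfo(blob.name, blob.size or 0, blob.crc32c or "")


def size_by_directory(blobs: Iterable[BlobInfo], depth: int = 1) -> Dict[str, int]:
    """Total bytes per sub-directory, truncated to `depth` path components."""
    totals: Dict[str, int] = defaultdict(int)
    for blob in blobs:
        dirs = blob.name.split("/")[:-1]       # drop the file name itself
        key = "/".join(dirs[:depth]) or "."    # "." stands for the bucket root
        totals[key] += blob.size
    return dict(totals)


def size_by_extension(blobs: Iterable[BlobInfo]) -> Dict[str, int]:
    """Total bytes per lower-cased file extension."""
    totals: Dict[str, int] = defaultdict(int)
    for blob in blobs:
        ext = posixpath.splitext(blob.name)[1].lower() or "(none)"
        totals[ext] += blob.size
    return dict(totals)


def find_duplicates(blobs: Iterable[BlobInfo]) -> List[List[str]]:
    """Group object names that share a CRC32C value and a size."""
    groups: Dict[tuple, List[str]] = defaultdict(list)
    for blob in blobs:
        if blob.crc32c:                        # skip objects with no checksum
            groups[(blob.crc32c, blob.size)].append(blob.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

With credentials configured, something like find_duplicates(list_bucket("my-bucket")) would return lists of object names that look identical ("my-bucket" is a placeholder). Since list_bucket returns a one-shot generator, wrap it in list() first if you want to run more than one summary over the same listing. Keep in mind that a matching CRC32C value means two objects are very likely duplicates, not provably so, which is why the sketch also compares sizes.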
Now, my team and I use gsummarize for a quick command-line summary of Google Cloud Storage buckets. The source code is available on GitHub here: https://github.com/nashbio/gsummarize. Of course, this tool is a work in progress and an open source project, so all suggestions and feedback are welcome. I hope you find it useful!