Wiki

Clone wiki

squeegee / Home

⚠ Atlassian is no longer developing this tooling.: Recent enhancements to the AWS Cost Usage Report and many developments with Cost Explorer has been able to replace many of the uses of this tooling. Feel free to fork and maintain your own copy of this code if it has been useful too you.

Squeegee

squeegee-spot@2x.png

This project is aimed at teams who would like to perform analytics on their AWS billing data. The base stack will create shared resources such as IAM Roles for use by AWS Lambda, etc. These resources will then be used by nested stacks that configure individual pipelines for loading data into a ideal format for use inside AWS Athena.

I have done my best to refer to things as follows:

  • Cost and Usage Report S3 Bucket: S3 Bucket that AWS drop your cost and usage report
  • Analytics Bucket: The S3 Bucket Squeegee will write files into

Prerequisites

Cost and Usage Report

The Cost and Usage reports need to be configured inside your master payer account, this can be done via the AWS Billing Console by following the AWS documentation. For this pipeline you need to configure your reports as:

  • Time unit: hourly
  • Include: Resource IDs
  • Enable support for: Redshift Manifest, QuickSight Manifest
  • Report path prefix: (recommended something like cost_reports)

It is best that the Cost and Usage Report S3 Bucket is located in one of the supported regions.

Glue Database

In order for your table to be created you need to configure an AWS Datacatalog Database. If you do not have an existing database you would like to use then access the AWS Glue Console and create a new database. We recommend creating a new database called "squeegee".

Output S3 Bucket

Create a new S3 bucket (Analytics Bucket) in your account that the transformed cost and usage report files will be delivered into. We recommend this is in the same region as you plan to deploy Squeegee.

Base Resources

Roles

Squeegee-LambdaExecutionRole - role used by sub-stack lambda functions. The base stack creates this role and attaches an initial inline policy. Some nested stacks will add more permissions to this role if they need.

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "logs:CreateLogGroup",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchInstanceRole -

Managed Policy: arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "logs:CreateLogGroup",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchJobRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<cur-bucket>",
                "arn:aws:s3:::<cur-bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<analytics-bucket>",
                "arn:aws:s3:::<analytics-bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueManagementRole-*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueCrawlerRole-*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchServiceRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

Squeegee-BatchSpotFleetIamRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole

Squeegee-GlueCrawlerRole -

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<analytics-bucket>",
                "arn:aws:s3:::<analytics-bucket>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

Squeegee-GlueManagementRole -

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "glue.amazonaws.com"
                }
            },
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueCrawlerRole-*",
            "Effect": "Allow"
        }
    ]
}

DynamoDB Table

Some of the Squeegee stacks use this DynamoDB table to implement a Mutex Lock using the dyndb-mutex Library. This table has a small provisioned throughput of read 2/write 2 which should operate well inside your free tier.

Batch Environment

Compute Environment

Squeegee will create a managed Batch Compute Environment which is configured to use SPOT Fleet with a Desired vCPU count of 0. This will not bring up any EC2 instances until there is jobs in the queue. The Max vCPUs is 64 which should do most companies Squeegee tasks fine. If you are a very large enterprise you may need to increase this vCPU Max to allow more EC2 resources to be started to get your processing done in a reasonable time.

Job Queue

Squeegee uses one job queue for all stacks connected to the compute environment defined above

Job Definitions

Squeegee uses a single Batch Job Definition SqueegeeRunner. This one job definition will run the docker container defined in the CloudFormation template which can handle all of the possible Squeegee tasks.

Nested Stacks

  1. VPC Stack
  2. Cost and Usage Report Transform
  3. Pricing API Transform

Redash

We recommend you checkout Redash as a visualisation solution. In order to assist you in testing it out we have generated a Cloudformation Stack. https://bitbucket.org/atlassian/squeegee/wiki/redash

Alternate Solutions

I have been made aware of another solution out there: https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/

While this looks very similar in how it operates it has some major differences.

  • Squeegee Creates parquet files *much more efficent, cost effective and faster than CSV
  • Squeegee Can work on small companies CUR files right up to very large enterprises
  • Squeegee Uses AWS Glue for table creation
  • Squeegee Keeps all columns of the CUR file - unless you specify to drop selected columns
  • Squeegee Can handle cross region and cross account deployments
  • Squeegee Can re-process any months data so its possible to load in all of your CUR files

Updated