Squeegee

This project is aimed at teams who would like to perform analytics on their AWS billing data. The base stack creates shared resources, such as IAM roles for use by AWS Lambda. These resources are then used by nested stacks that configure individual pipelines for loading data into an ideal format for use inside AWS Athena.

Quickstart

Supported Regions

Region Quickstart Link
us-east-1 Launch Stack
us-east-2 Launch Stack
us-west-2 Launch Stack
ap-southeast-2 Launch Stack
eu-central-1 Launch Stack
eu-west-1 Launch Stack

In its default configuration Squeegee will deploy the base stack and the Cost and Usage Report Transform nested stack. For the quickstart it is recommended that you deploy this inside your master payer account in one of the supported regions above. It is possible to deploy this across accounts and/or regions, but that is a more complex topic. If, for testing, you would like to deploy this into an isolated account, you will need to set up some method of copying the Cost and Usage Reports into the isolated account's S3 bucket (or do this manually for testing).
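
For the isolated-account testing case, the copy could be as simple as the sketch below. It is only an illustration, not part of Squeegee: the bucket names and prefix are placeholders, and it assumes credentials that can read the payer account's Cost and Usage Report S3 Bucket and write to the test account's bucket.

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "payer-cur-bucket"        # Cost and Usage Report S3 Bucket (placeholder)
DEST_BUCKET = "isolated-test-cur-bucket"  # bucket in the isolated account (placeholder)
PREFIX = "cost_reports/"                  # your CUR report path prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Managed server-side copy (handles multipart for large CUR files).
        s3.copy(
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
        )
        print("copied", obj["Key"])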

I have done my best to refer to things as follows:

  • Cost and Usage Report S3 Bucket: the S3 bucket that AWS delivers your Cost and Usage Reports into
  • Analytics Bucket: the S3 bucket Squeegee will write its transformed files into

Prerequisites

Cost and Usage Report

The Cost and Usage Reports need to be configured inside your master payer account; this can be done via the AWS Billing Console by following the AWS documentation. For this pipeline you need to configure your report as:

  • Time unit: hourly
  • Include: Resource IDs
  • Enable support for: Redshift Manifest, QuickSight Manifest
  • Report path prefix: (recommended something like cost_reports)

It is best that the Cost and Usage Report S3 Bucket is located in one of the supported regions.
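
If you prefer to script the report configuration instead of using the Billing Console, the Cost and Usage Report API exposes the same settings. The sketch below is only an illustration under assumptions: the report name, bucket, and region are placeholders, and the Cost and Usage Report S3 Bucket must already have a policy allowing billingreports.amazonaws.com to write into it (the console normally sets this up for you).

import boto3

# The Cost and Usage Report API is only served out of us-east-1.
cur = boto3.client("cur", region_name="us-east-1")

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "squeegee",                          # placeholder report name
        "TimeUnit": "HOURLY",                              # Time unit: hourly
        "Format": "textORcsv",
        "Compression": "GZIP",
        "AdditionalSchemaElements": ["RESOURCES"],         # Include: Resource IDs
        "S3Bucket": "my-cur-bucket",                       # Cost and Usage Report S3 Bucket (placeholder)
        "S3Prefix": "cost_reports",                        # Report path prefix
        "S3Region": "us-east-1",                           # keep this in a supported region
        "AdditionalArtifacts": ["REDSHIFT", "QUICKSIGHT"], # Redshift and QuickSight manifests
    }
)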

Glue Database

In order for your table to be created you need to configure an AWS Glue Data Catalog database. If you do not have an existing database you would like to use, open the AWS Glue Console and create a new one. We recommend creating a new database called "squeegee".
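
The same database can be created through the Glue API; a minimal sketch (the region is an assumption and should match the GlueRegion parameter you plan to use):

import boto3

# Create the Glue Data Catalog database that Squeegee will register its table in.
glue = boto3.client("glue", region_name="us-east-1")
glue.create_database(DatabaseInput={"Name": "squeegee"})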

Output S3 Bucket

Create a new S3 bucket (the Analytics Bucket) in your account that the transformed Cost and Usage Report files will be delivered into. We recommend creating it in the same region you plan to deploy Squeegee into.
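
A scripted equivalent, if that suits your workflow better (bucket name and region are placeholders; note that for us-east-1 the CreateBucketConfiguration argument must be omitted):

import boto3

s3 = boto3.client("s3")

# Create the Analytics Bucket in the region you will deploy Squeegee into.
s3.create_bucket(
    Bucket="my-analytics-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)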

Stack Deployment

  • Log in to your Master Payer AWS Account
  • Open one of the Quickstart URLs: https://bitbucket.org/atlassian/squeegee/wiki/Home#markdown-header-quickstart
  • Fill out the following parameters:
    • Stack name - The name of the new CloudFormation Stack (Default: Atlassian-Squeegee)
    • AnalyticsBucket - The name of your Analytics Bucket created in the prerequisites. (eg. my-analytics-bucket).
    • AnalyticsBucketRegion - The region your Analytics Bucket was created in.
    • MutexTableName - Name of the DynamoDB table to create for Squeegee to use for Locking.
    • DockerImageName - Name of the Docker image to use in Squeegee Batch jobs (Default: use the upstream image from Docker Hub, recommended).
    • CreateBatchLogGroup - If you would like the Squeegee stack to delete the BatchLogGroup when the stack is deleted, set this to True (False recommended).
    • VPCId - A VPCId which Squeegee will deploy compute resources into (Default: "create" will create a VPC)
    • Subnets - Two subnet IDs separated by a comma, or empty if VPCId is set to "create".
    • GlueDataCatalogName - The Datacatalog database name created in the prerequisites.
    • GlueRegion - The region you want glue configured in.
    • GlueCrawlerNamePrefix - A prefix to add to Glue Crawlers created by Squeegee.
    • GlueCrawlerTablePrefix - A prefix to add to tables created by Squeegee.
    • GlueManagementRoleARN - ARN of the role to use to manage Glue (Default: "create" will create a role for you)
    • GlueCrawlerRoleARN - ARN of the role used by Glue Crawlers to read S3 (Default: "create" will create a role for you).
    • DeployPricingDataStack - Should Squeegee Deploy the Pricing API data transform Stack (Default: False).
    • DeployCURDataStack - Should Squeegee Deploy the CUR data transform Stack (Default: True).
    • CURBucket - Name of the S3 bucket configured to receive the Cost and Usage Reports (the Cost and Usage Report S3 Bucket).
    • CURBucketRegion - Region the Cost and Usage Report S3 Bucket is located in.
    • CURDataKeyPrefix - Key prefix used when loading the transformed data into the output Analytics Bucket (eg. cost_usage_report).
    • CURDropColumns - Column names that Squeegee should drop from the CUR. Useful to avoid duplicate tag names.
    • BatchJSONLogging - Should Squeegee use JSON logging. Useful when you ingest logs into a service like Splunk.
    • PricingDataKeyPrefix - Key prefix for the output pricing data transform files to be put into S3.
    • SqueegeeS3BucketName - The bucket hosting the Squeegee Templates (Leave this as default).
    • SqueegeeS3KeyPrefix - The key prefix to the templates in s3 (Leave this as default).
    • SqueegeeS3Region - The region of the bucket containing the Squeegee Templates (Leave this as default).
  • Tick the acknowledgement about IAM resource creation
  • Select Create
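
The console wizard above can also be driven through the CloudFormation API. The sketch below is only illustrative: it uses the us-east-1 template URL from the Performing Updates section further down, every parameter value is a placeholder for the prerequisites you created, and any parameter not listed falls back to its template default.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="Atlassian-Squeegee",
    TemplateURL=(
        "https://dbr-pp-deployments-us-east-1.s3.amazonaws.com/"
        "aws_billing_analytics_deployment/latest/squeegee.yaml"
    ),
    Parameters=[
        {"ParameterKey": "AnalyticsBucket", "ParameterValue": "my-analytics-bucket"},
        {"ParameterKey": "AnalyticsBucketRegion", "ParameterValue": "us-east-1"},
        {"ParameterKey": "GlueDataCatalogName", "ParameterValue": "squeegee"},
        {"ParameterKey": "CURBucket", "ParameterValue": "my-cur-bucket"},
        {"ParameterKey": "CURBucketRegion", "ParameterValue": "us-east-1"},
        {"ParameterKey": "CURDataKeyPrefix", "ParameterValue": "cost_usage_report"},
    ],
    # Equivalent to ticking the IAM acknowledgement in the console.
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)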

Post Stack Deployment Steps

The stack has now deployed the following: a Lambda function, an SNS topic, a trigger from SNS to Lambda, a DynamoDB table, a VPC, subnets, route tables, a Batch Compute Environment, Batch Job Definitions, and the IAM roles they need. What is left to do is connect your Cost and Usage Report S3 bucket to the SNS topic. We do not force this connection because we do not want to interrupt any existing configuration on your Cost and Usage Report S3 bucket. If you do not have any S3 Put events configured on that bucket, you can safely configure one as follows:

  • Open the S3 Console
  • Select your Cost and Usage report S3 bucket (where AWS drop the CUR files)
  • Go to Properties
  • Under Advanced settings select Events
  • Select Add notification
    • Name: CostReportDrop
    • Events: ObjectCreate (All)
    • Prefix: The prefix you gave your Cost and Usage report configuration (eg. cost_reports)
    • Suffix: .json
    • Send to: SNS Topic
    • SNS: CostUsageReportNotifications

If you have an existing event configured on your Cost and Usage Report S3 Bucket, you will need to determine the best way to reconfigure the bucket so that the Lambda function is still triggered.
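
For the no-existing-events case, the console steps above collapse into a single API call. Be aware that put_bucket_notification_configuration replaces the bucket's entire notification configuration, which is exactly why we do not do this for you; the bucket name, account ID, region, and topic ARN below are placeholders.

import boto3

s3 = boto3.client("s3")

# WARNING: this overwrites any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket="my-cur-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "CostReportDrop",
                "TopicArn": "arn:aws:sns:us-east-1:111122223333:CostUsageReportNotifications",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "cost_reports/"},
                            {"Name": "suffix", "Value": ".json"},
                        ]
                    }
                },
            }
        ]
    },
)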

All that is left to do now is to wait for the next Cost and Usage Report delivery (usually every 12 hours). However, you can trigger a run immediately by downloading the latest manifest from your Cost and Usage Report S3 bucket, found under (cur_prefix)/(cur_report_name)/YYYYMMDD-YYYYMMDD/(cur_report_name)-Manifest.json, and uploading it again.
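
The download-and-upload trick can be scripted as well; a small sketch, with the bucket and key as placeholders that you should replace with your own prefix, report name, and billing period:

import boto3

s3 = boto3.client("s3")

bucket = "my-cur-bucket"
key = "cost_reports/squeegee/20240101-20240201/squeegee-Manifest.json"

# Re-upload the latest manifest so the S3 event fires again.
manifest = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
s3.put_object(Bucket=bucket, Key=key, Body=manifest)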

If all goes well you will find Parquet files inside your Analytics Bucket, an AWS Glue Crawler configured in your account, and a table added to your AWS Glue Data Catalog database. See Cost and Usage Report Transform for more details on what you can use this data for.
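
If you would rather verify the result from code than from the Athena console, something like the following will do. The database, table name, and results location are assumptions: the actual table name depends on your GlueCrawlerTablePrefix and CURDataKeyPrefix parameters.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Fire a simple sanity-check query against the crawled table.
athena.start_query_execution(
    QueryString="SELECT count(*) FROM cost_usage_report",
    QueryExecutionContext={"Database": "squeegee"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)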

Quickstart help

If you got stuck at any point, check here for tips on how to resolve the issue.

No SNS Notification when setting up the Cost and Usage Report S3 Bucket Event

Your Cost and Usage Report S3 Bucket is in a different region to that of the CloudFormation stack. You will need to manually configure an SNS topic in that region and configure it to trigger the Lambda function in the other region. See Documentation.
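
A hedged sketch of that manual wiring is below. All regions, names, and ARNs are placeholders; substitute your Cost and Usage Report S3 Bucket's region and the ARN of the Lambda function the Squeegee stack created, and remember the new topic also needs a policy that allows s3.amazonaws.com to publish to it before you point the bucket event at it.

import boto3

CUR_BUCKET_REGION = "eu-west-1"  # region of the CUR bucket (placeholder)
STACK_REGION = "us-east-1"       # region the Squeegee stack was deployed in (placeholder)
LAMBDA_ARN = "arn:aws:lambda:us-east-1:111122223333:function:your-squeegee-function"  # placeholder

sns = boto3.client("sns", region_name=CUR_BUCKET_REGION)
lam = boto3.client("lambda", region_name=STACK_REGION)

# Topic in the same region as the CUR bucket.
topic_arn = sns.create_topic(Name="CostUsageReportNotifications")["TopicArn"]

# Allow that topic to invoke the function in the stack's region...
lam.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId="AllowCURRegionSNS",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn=topic_arn,
)

# ...and subscribe the function to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=LAMBDA_ARN)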

Internal Server Error from Glue Crawler

If you specify a database that lives inside Athena and you have not upgraded Athena to the AWS Glue Data Catalog, the crawler will fail. You need to upgrade your Athena Catalog.

Unable to find database in Athena console

If you have not upgraded Athena to the new AWS Glue Data Catalog, you will not be able to see its databases in the Athena console. You need to upgrade your Athena Catalog.

Performing Updates

Steps:

  • Open the AWS CloudFormation Console
  • Select the correct region
  • Select the base Squeegee stack (without the NESTED label)
  • Check under the stack Parameters for the current value of the "SqueegeeBuild" parameter and note it down in case you want to roll back.
  • Select Stack
  • Go to Actions -> Update Stack
  • Select Update
  • Specify an Amazon S3 template URL
Region Link
us-east-1 https://dbr-pp-deployments-us-east-1.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
us-east-2 https://dbr-pp-deployments-us-east-2.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
us-west-2 https://dbr-pp-deployments-us-west-2.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
ap-southeast-2 https://dbr-pp-deployments-ap-southeast-2.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
eu-central-1 https://dbr-pp-deployments-eu-central-1.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
eu-west-1 https://dbr-pp-deployments-eu-west-1.s3.amazonaws.com/aws_billing_analytics_deployment/latest/squeegee.yaml
  • Enter the template URL for your region from the table above
  • Review the stack Parameters; we try to make the defaults sane for new parameters, but you should review them for your environment
  • Select Next until you hit the final wizard page where you need to confirm your IAM Capabilities
  • Select Finish

Confirm all the Squeegee stacks complete their update.

Performing Update Rollback

In order to roll back Squeegee, we update the CloudFormation stack and provide an earlier version's template. Follow the update process above, but when asked for the template URL provide a URL in the format https://dbr-pp-deployments.s3.amazonaws.com/aws_billing_analytics_deployment/BUILD_NUMBER/squeegee.yaml, replacing BUILD_NUMBER with the build you want to roll back to.
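
The rollback (or any update) can also be done through the CloudFormation API; a sketch with the stack name, region, and build number as placeholders. List each of your stack's existing parameters with UsePreviousValue so nothing silently reverts to a template default.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

BUILD_NUMBER = "123"  # the SqueegeeBuild value you noted down earlier

cfn.update_stack(
    StackName="Atlassian-Squeegee",
    TemplateURL=(
        "https://dbr-pp-deployments.s3.amazonaws.com/"
        f"aws_billing_analytics_deployment/{BUILD_NUMBER}/squeegee.yaml"
    ),
    Parameters=[
        {"ParameterKey": "AnalyticsBucket", "UsePreviousValue": True},
        # ...one entry like this for each of your stack's parameters...
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)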

Base Resources

Roles

Squeegee-LambdaExecutionRole - the role used by sub-stack Lambda functions. The base stack creates this role and attaches an initial inline policy; some nested stacks will add more permissions to this role if they need to.

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "logs:CreateLogGroup",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchInstanceRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "logs:CreateLogGroup",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchJobRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<cur-bucket>",
                "arn:aws:s3:::<cur-bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<analytics-bucket>",
                "arn:aws:s3:::<analytics-bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueManagementRole-*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueCrawlerRole-*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobs",
                "batch:SubmitJob",
                "batch:ListJobs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/SqueegeeMutexTable",
            "Effect": "Allow"
        }
    ]
}

Squeegee-BatchServiceRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

Squeegee-BatchSpotFleetIamRole -

Managed policy: arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole

Squeegee-GlueCrawlerRole -

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<analytics-bucket>",
                "arn:aws:s3:::<analytics-bucket>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

Squeegee-GlueManagementRole -

Initial policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:*",
                "datacatalog:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "glue.amazonaws.com"
                }
            },
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/Squeegee-GlueCrawlerRole-*",
            "Effect": "Allow"
        }
    ]
}

DynamoDB Table

Some of the Squeegee stacks use this DynamoDB table to implement a mutex lock via the dyndb-mutex library. The table has a small provisioned throughput of 2 read / 2 write capacity units, which should operate well inside your free tier.
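
For illustration, the locking pattern such a table supports looks roughly like the sketch below. The key and attribute names are assumptions for the example, not the exact item schema the dyndb-mutex library uses.

import time

import boto3

dynamodb = boto3.client("dynamodb")

TABLE = "SqueegeeMutexTable"  # the MutexTableName parameter


def acquire_lock(lock_name, holder, ttl_seconds=900):
    """Try to take the lock; returns True on success, False if someone else holds it."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "lockname": {"S": lock_name},
                "holder": {"S": holder},
                "expires": {"N": str(now + ttl_seconds)},
            },
            # Only succeed if nobody holds an unexpired lock of this name.
            ConditionExpression="attribute_not_exists(#ln) OR #exp < :now",
            ExpressionAttributeNames={"#ln": "lockname", "#exp": "expires"},
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False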

Batch Environment

Compute Environment

Squeegee will create a managed Batch Compute Environment configured to use Spot Fleet with a desired vCPU count of 0, so no EC2 instances are launched until there are jobs in the queue. The maximum is 64 vCPUs, which should handle most companies' Squeegee tasks fine. If you are a very large enterprise, you may need to increase this maximum to allow more EC2 resources to be started so your processing finishes in a reasonable time.
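
If you do hit that ceiling, the limit can be raised through the Batch API (or, better, in the CloudFormation template so the stack does not drift). A sketch with a placeholder compute environment name, which you can look up with describe_compute_environments():

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Raise the vCPU ceiling on the managed compute environment.
# Note: changing it here makes the live environment drift from the template.
batch.update_compute_environment(
    computeEnvironment="your-squeegee-compute-environment",  # placeholder name
    computeResources={"maxvCpus": 256},
)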

Job Queue

Squeegee uses one job queue for all stacks, connected to the compute environment defined above.

Job Definitions

Squeegee uses a single Batch Job Definition, SqueegeeRunner. This one job definition runs the Docker container defined in the CloudFormation template, which can handle all of the possible Squeegee tasks.

Nested Stacks

  1. VPC Stack
  2. Cost and Usage Report Transform
  3. Pricing API Transform

Redash

We recommend you check out Redash as a visualisation solution. To assist you in testing it out we have generated a CloudFormation stack: https://bitbucket.org/atlassian/squeegee/wiki/redash

Alternate Solutions

I have been made aware of another solution out there: https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/

While this looks very similar in how it operates, it has some major differences:

  • Squeegee creates Parquet files, which are much more efficient, cost effective, and faster to query than CSV
  • Squeegee works on small companies' CUR files right up to those of very large enterprises
  • Squeegee uses AWS Glue for table creation
  • Squeegee keeps all columns of the CUR file, unless you specify columns to drop
  • Squeegee can handle cross-region and cross-account deployments
  • Squeegee can re-process any month's data, so it is possible to load in all of your CUR files
