Reducing data transfer costs with a Docker registry cache

Reducing costs has been a focus area for us ever since the COVID pandemic started, and we recently published a blog post on this topic. One of the prominent contributors to our AWS bill was data transfer for our internal Docker registry. Because of the cost difference between AWS regions, we run our development environment in the Oregon region and our production environment in the Singapore region. This multi-region setup led to high data transfer costs. In this post, we will talk about how we managed to reduce them.

Our setup

We run the Docker registry on EC2 instances, with AWS S3 as the storage backend for images and AWS CloudFront as a CDN for caching them. The Docker registry container sits behind an Nginx proxy that handles authentication, following the approach recommended in the registry documentation.
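For reference, the storage side of such a registry configuration looks roughly like this (the bucket name, CloudFront distribution URL, and key paths below are illustrative, not our actual values):

```yaml
version: 0.1
storage:
  s3:
    region: ap-southeast-1          # images live in a Singapore bucket
    bucket: my-registry-images      # hypothetical bucket name
  cache:
    blobdescriptor: inmemory
middleware:
  storage:
    - name: cloudfront
      options:
        baseurl: https://d1234abcd.cloudfront.net/     # hypothetical distribution
        privatekey: /etc/docker/cloudfront/private.pem
        keypairid: KEYPAIRID12345
http:
  addr: :5000
```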

Removing CloudFront middleware

To save on data transfer costs for traffic leaving the VPC, we removed the CloudFront middleware from the configuration. This slowed down Docker pulls in the Oregon region, but pull times in the Singapore region were unaffected, since the S3 bucket storing all the Docker images is in the Singapore region as well.

We were expecting the data transfer cost to go down, but the change had no significant impact: the traffic between our VPC and S3 was still going over the internet. We decided to set up an S3 VPC endpoint for our VPCs to make the data transfer between our VPC and S3 private. After this change, our data transfer costs for the Singapore region reduced quite a bit. This was our first win!
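Gateway endpoints for S3 carry no hourly or data processing charges and keep the traffic on the AWS network. As a sketch, provisioning one in CloudFormation looks like this (the VPC and route table IDs are placeholders):

```yaml
Resources:
  S3GatewayEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Gateway
      ServiceName: com.amazonaws.ap-southeast-1.s3   # S3 in the Singapore region
      VpcId: vpc-0123456789abcdef0                   # placeholder VPC ID
      RouteTableIds:
        - rtb-0123456789abcdef0                      # placeholder route table ID
```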

We were still getting charged for data transfer between S3 and the Oregon region, because the S3 bucket storing the Docker images was in the Singapore region, and VPC endpoints only connect VPCs to AWS services within the same region.

Pro-tip: If you want to serve images through CloudFront for some regions but directly from S3 for others, you can use the ipfilteredby and awsregion options of the CloudFront middleware to bypass CloudFront and pull from S3 directly. You can check the documentation here for more information.
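A sketch of that middleware configuration (distribution URL and key details are again illustrative):

```yaml
middleware:
  storage:
    - name: cloudfront
      options:
        baseurl: https://d1234abcd.cloudfront.net/
        privatekey: /etc/docker/cloudfront/private.pem
        keypairid: KEYPAIRID12345
        # Requests originating from the listed AWS regions bypass
        # CloudFront and are redirected to S3 directly.
        ipfilteredby: awsregion
        awsregion: ap-southeast-1
```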

Setting up a Docker registry as a pull-through cache

After going through the Docker registry’s documentation, we came across how to set up a Docker registry as a pull-through cache. This was a simple configuration change: if an image was previously available at my-docker-registry.com/my-image, we could also pull the same image through my-docker-registry-cache.com/my-image.

Docker introduced rate limiting on pulls from its registry in November 2020. We haven’t tried this ourselves, but a private Docker registry set up as a pull-through cache can also be used to avoid getting rate limited, as well as to improve image pull times. You need not be concerned about pulling stale images: the cache checks for staleness by making a HEAD request upstream, and Docker does not count HEAD requests towards rate limiting. For this, set https://registry-1.docker.io as the remoteurl in the configuration.

We used a configuration along these lines (a minimal sketch; the upstream URL, region, and bucket name below are illustrative):

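```yaml
version: 0.1
proxy:
  # Upstream registry to mirror. To mirror Docker Hub instead,
  # this would be https://registry-1.docker.io
  remoteurl: https://my-docker-registry.com
storage:
  s3:
    region: us-west-2            # the cache's bucket lives with the cache, in Oregon
    bucket: my-registry-cache    # hypothetical bucket name
  cache:
    blobdescriptor: inmemory
http:
  addr: :5000
```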

Once we had the pull-through cache running, the next challenge was getting applications to use it. Changing every codebase to handle different image paths for the production and development environments would have been difficult. We explored two approaches:

  1. Geolocation based DNS records
  2. Mutating webhook in Kubernetes

Geolocation based DNS records

With this approach, the DNS server returns different records depending on where the query comes from: clients in Oregon would resolve the registry hostname to the cache, while everyone else would resolve it to the primary registry. This seemed great but had two issues:

  1. Not every DNS provider supports geolocation-based DNS records (e.g. AWS Route 53 does, but Cloudflare, our primary DNS provider, does not)
  2. We could pull images from the Oregon region, but pushes failed, since a registry running as a pull-through cache is read-only

We had workflows that built images in the Oregon region and pushed them to our private registry, so we couldn’t go ahead with this approach without breaking them.
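For completeness, this is roughly what the approach would look like with Route 53 geolocation records (the IPs are documentation placeholders, not real endpoints):

```yaml
Resources:
  # North American clients (e.g. the Oregon cluster) resolve to the cache.
  RegistryNorthAmerica:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: my-docker-registry.com.
      Name: my-docker-registry.com.
      Type: A
      TTL: "60"
      SetIdentifier: north-america
      GeoLocation:
        ContinentCode: NA
      ResourceRecords:
        - 203.0.113.10        # placeholder: cache registry in Oregon
  # Everyone else resolves to the primary registry in Singapore.
  RegistryDefault:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: my-docker-registry.com.
      Name: my-docker-registry.com.
      Type: A
      TTL: "60"
      SetIdentifier: default
      GeoLocation:
        CountryCode: "*"
      ResourceRecords:
        - 203.0.113.20        # placeholder: primary registry in Singapore
```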

Mutating webhook in Kubernetes

The majority of the pulls in the Oregon region came from our Kubernetes cluster. So we deployed a mutating admission webhook, which replaces my-docker-registry.com with my-docker-registry-cache.com in Docker image URLs. This can be easily achieved using Kyverno; you can use this policy as an example to mutate a container’s image URL.
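A sketch based on Kyverno’s sample image-registry mutation policy (this covers only spec.containers; initContainers would need a similar rule):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: use-registry-cache
spec:
  background: false
  rules:
    - name: rewrite-registry-host
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        foreach:
          - list: "request.object.spec.containers"
            patchStrategicMerge:
              spec:
                containers:
                  - name: "{{ element.name }}"
                    # Swap the registry host, leaving the rest of the image ref intact
                    image: "{{ regex_replace_all_literal('^my-docker-registry.com/', '{{element.image}}', 'my-docker-registry-cache.com/') }}"
```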

Conclusion

We deployed this change around May 20 (the earlier drop in our data transfer graph was due to image caching on Kubernetes nodes). This setup improved our Docker image pull times and reduced the data transfer costs of our Docker setup by more than 50%. We had explored migrating to AWS ECR, but it didn’t support cross-region replication last year when we did this work. Now that AWS ECR supports cross-region replication, it might be worth exploring as well.

