Giving the power of data into the hands of your data analysts

Design by Nitesh Jemni

Over the last three years, our data analytics team has grown from one central team of 4 to over 100 members, spread across our data team, business functions and data science teams. Needless to say, as both the data and the team grew, our ways of working had to change to meet the demand for easy, collaborative data access without the bottlenecks of a central control room.

In this post, we’ll take a look at the components required for building such a Jupyter notebook-centric data platform along with all the open-source tools involved. We will also understand how the components tie together and reflect on some learnings from setting up such a platform here at Grofers.

Motivation

If, like us, you started your data journey in the last 5–6 years, things have definitely been easier. With tools like Redshift & S3 (or their counterparts in other cloud offerings), half of your data platform is up and running with a few UI clicks. You don’t have to worry about creating and managing Hadoop infrastructure, or about scaling things from day one. The other half of the problem, giving your data analysts access to that infrastructure, you still have to figure out on your own.

In the initial days, when we had a small team, our data analysts would spin up Jupyter Notebook servers locally and work on them. Collaboration was mostly limited to the good old-fashioned “Look at this code on my laptop” or, in some cases, GitHub. We could not expect our analysts and data scientists to be proficient with GitHub, nor could they experiment effectively, since notebooks were mostly treated as disposable boilerplate. Moving everyone to a single EC2-based JupyterLab server also didn’t work for long; it added even more problems, which we talk about later in this blog.

As the teams grew, we understood that we needed to make our platform reliable, secure and scalable. We also wanted to give our data analysts a consistent experience. We started exploring how other companies that are much further ahead in their data journey solved this problem. There are many ways to skin a cat, and we picked the one that addressed our problems best.

Problems

Our data analysts faced a lot of issues on a single EC2-based JupyterLab server:

  • Our data analysts couldn’t safely install their dependencies without worrying about breaking things for other people
  • Resource sharing became very difficult. People started getting OOM errors and dying kernels because someone else was running resource-heavy notebooks
  • At times, notebooks were accidentally altered by others, which added further confusion. Recovering deleted work was even more difficult
  • They had to keep credentials and API keys in notebooks as plain text, which was unsafe, and had to update them frequently as passwords were rotated
  • Scheduling notebooks required setting up crontab and using nbinteract to convert them into Python code that could then be scheduled. This was a cumbersome process and often required intervention from data engineers
  • Updating these scheduled notebooks was equally difficult

We knew it was time to update our Jupyter infrastructure as we reached a natural tipping point and the problems we needed to solve were clear to us.

Notebook revolution

Our notebook revolution was achieved through two main components:

  1. We gave every user an isolated, sandboxed environment to work with their notebooks, using JupyterHub
  2. We packaged all the dependencies and commonly used utilities into a custom package that we called pencilbox

JupyterHub

We decided to go with JupyterHub, a multi-user version of JupyterLab designed for larger teams. Marry that with Kubernetes and it solves half of the problems discussed above. Thanks to the open-source community, there is even a complete solution out there, Zero to JupyterHub, which provides a Helm chart that lets you easily deploy JupyterHub in your Kubernetes environment.

Out of the box, the Helm chart spawns an isolated server for every user and creates a home directory for each of them on a network file system for persistence. Running on Kubernetes also helps you avoid dependency hell, since the environments that serve different use cases are packaged and pre-built as container images. We could now manage server resources along with the environments, so that depending on the use case, one could choose different CPU, memory and even GPU units as necessary (see the example profiles below).

We also customised the workflow for our analysts and provided them with options to choose from by overriding the KubeSpawner in the hub config.

These are some examples of the different profiles a user can select based on their use case. These profiles are added to the JupyterHub profile list and are essentially Docker images built on top of jupyter/datascience-notebook with our utilities added.

singleuser:
  profileList:
    - display_name: "Pencilbox environment"
      description: "To have every bell and whistle, along with a pencilbox for your notebooks"
      # Configuration unique to this profile
      kubespawner_override:
        image: "<registry_url>/pencilbox:version"
      # Defines the default profile - only use for one profile
      default: true
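
As an illustration of the resource selection mentioned above, a heavier profile can override CPU, memory and even GPU limits through the same kubespawner_override mechanism. This is a minimal sketch, not our exact configuration; the profile name, image and resource values are placeholders:

    # An additional (hypothetical) profile entry under singleuser.profileList
    - display_name: "Pencilbox GPU environment"
      description: "For heavier workloads that need more CPU, memory or a GPU"
      kubespawner_override:
        image: "<registry_url>/pencilbox-gpu:version"
        cpu_limit: 4
        mem_limit: "16G"
        extra_resource_limits:
          nvidia.com/gpu: "1"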

We also added a handful of starter notebooks, made available in every user’s home directory through post-start hooks, to help them set up their GitHub SSH & GPG keys easily. All these changes made the entire workflow a breeze for our data analysts.
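
For reference, here is a minimal sketch of how such starter notebooks can be copied in through the chart’s singleuser lifecycle hooks. The /srv/starter-notebooks path is a hypothetical location baked into the image, not our exact setup:

singleuser:
  lifecycleHooks:
    postStart:
      exec:
        # Copy starter notebooks into the user's home on server start,
        # without overwriting copies the user has already modified
        command:
          - "sh"
          - "-c"
          - "cp -n /srv/starter-notebooks/*.ipynb /home/jovyan/ || true"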

This setup is now used across Grofers, not only by the data science and analytics teams but also by anyone who wants to tinker around with data.

Here is the architecture diagram of our complete setup, covering our JupyterHub setup (as well as our Airflow setup, which we will talk about in Part II of this blog post).

Pencilbox

We also built the library that we mentioned above and called it Pencilbox, a Swiss Army knife of commonly used utilities. It helped us solve many of the issues listed earlier.

Fixing access to database credentials

In the previous setup, connecting to a database looked like this:

import pandas as pd
import sqlalchemy as sqla
# Credentials hard-coded in plain text, edited by hand every time passwords rotated
connection = sqla.create_engine("db://username:password@host:port/database")
df = pd.read_sql("select * from test", connection)

With Pencilbox, the same thing becomes:

import pandas as pd
import pencilbox as pb
connection = pb.get_connection("[Warehouse] Redshift")
df = pd.read_sql("select * from test", connection)

Our utility uses HashiCorp Vault as a backend and provides a secure way to connect. We have also built data governance on top of Vault by creating authentication tokens on the fly based on user emails, which means only people covered by certain policies get access to tables (we will cover our data governance in a separate blog post in the future).
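
For the curious, a helper like pb.get_connection can be built with the hvac Vault client and SQLAlchemy. The sketch below is an assumed implementation, not the actual Pencilbox code; the secret path, field names and environment variables are hypothetical:

import os

import hvac
import sqlalchemy as sqla


def get_connection(source_name):
    # Authenticate to Vault with a per-user token injected into the notebook server
    client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])

    # Fetch the credentials for the requested source from a KV v2 secret
    secret = client.secrets.kv.v2.read_secret_version(path=f"warehouses/{source_name}")
    creds = secret["data"]["data"]

    # Build a SQLAlchemy engine without ever exposing the password in the notebook
    url = "{dialect}://{user}:{password}@{host}:{port}/{database}".format(**creds)
    return sqla.create_engine(url)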

Simplified workflows for Redshift & our Data catalog

People could easily load data back into Redshift using Pencilbox. They could pass in a pandas DataFrame or an S3 path and update data without worrying about the underlying mechanics of Redshift unload & copy. Specifying the table & column descriptions also meant that these tables became easily searchable in our data catalog.

kwargs = {
    "schema_name": schema_name,          # Redshift schema name
    "table_name": table_name,            # Redshift table name
    "column_dtypes": [
        {
            "name": "date",
            "type": "timestamp",
            "description": "About date column"
        }
    ],
    "primary_key": primary_key,          # list
    "sortkey": sortkey,                  # list
    "incremental_key": incremental_key,  # string
    "load_type": load_type,              # append, rebuild, truncate or upsert
    "table_description": table_description
}
pb.to_redshift(df, **kwargs)
# or
pb.to_redshift(filename, **kwargs)
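
Under the hood, loading a DataFrame into Redshift typically means staging it in S3 and issuing a COPY. The sketch below shows the general idea with assumed names (the function, bucket, key and IAM role are placeholders); the real pb.to_redshift additionally handles load types, sort keys and catalog metadata:

import io

import boto3
import sqlalchemy as sqla


def to_redshift_sketch(df, schema_name, table_name, s3_bucket, s3_key, engine, iam_role):
    # 1. Stage the DataFrame in S3 as a CSV file
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    boto3.client("s3").put_object(Bucket=s3_bucket, Key=s3_key, Body=buf.getvalue())

    # 2. COPY the staged file into the target Redshift table
    copy_sql = f"""
        COPY {schema_name}.{table_name}
        FROM 's3://{s3_bucket}/{s3_key}'
        IAM_ROLE '{iam_role}'
        CSV;
    """
    with engine.begin() as conn:
        conn.execute(sqla.text(copy_sql))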

Simplifying day-to-day use cases for reporting

Most notebooks, as their output, either save data back to Redshift/S3/Google Sheets or send emails. We were seeing the same boilerplate code written over and over, so we decided to add more utilities to reduce the effort for our data analysts as well as our data engineers.

# Send an email
pb.send_email(from_email, to_emails, subject, html_content, files)
# Read data from Google Sheets
df = pb.from_sheets(sheet_id, sheet_name)
# Write data back to Google Sheets
pb.to_sheets(df, sheet_id, sheet_name)
# Send a message to Slack
pb.send_slack_message(channel, text)
# Read/write files from S3
pb.to_s3(filename, s3_bucket, s3_key)
pb.from_s3(s3_bucket, s3_key, filename)
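
As an illustration, several of these helpers are just thin wrappers over standard clients. Here is a minimal sketch, assuming boto3 and slack_sdk are installed and a bot token lives in a hypothetical SLACK_BOT_TOKEN environment variable (not the actual Pencilbox code):

import os

import boto3
from slack_sdk import WebClient

_s3 = boto3.client("s3")
_slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def to_s3(filename, s3_bucket, s3_key):
    # Upload a local file to s3://<bucket>/<key>
    _s3.upload_file(filename, s3_bucket, s3_key)


def from_s3(s3_bucket, s3_key, filename):
    # Download s3://<bucket>/<key> to a local file
    _s3.download_file(s3_bucket, s3_key, filename)


def send_slack_message(channel, text):
    # Post a plain-text message to a Slack channel
    _slack.chat_postMessage(channel=channel, text=text)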

Pencilbox now powers all of our notebooks at Grofers and has helped drive adoption of our data tools. Our data analysts come from various backgrounds, and having these utilities has helped them get onboarded to the platform without worrying about the underlying infrastructure.

In this post, we talked about how we solved ad-hoc use cases for our data analysts by building infrastructure that scales with the needs of individuals. Pencilbox has immensely helped in simplifying our workflows and letting people focus on the things that matter. In part two of this blog, we will talk about another utility we built that has further reduced the friction people encountered while scheduling their workflows on Airflow.

This project was bootstrapped by our Bangalore data team and is now maintained by these awesome data engineering folks: Akshay, Deepu, Ishank, Pragun, Sangarshanan, Shivashis, Shubham, Sudeepam and Sunil.

Satyam Krishna is Engineering Manager for Data Engineering at Grofers. Follow him on Twitter.

Sangarshanan Veera is a Software Engineer in the Data Engineering team at Grofers. Follow him on Twitter.

We are hiring across various roles! If you are interested in exploring working at Grofers, we’d love to hear from you.

