A while back I was working at a very fast-growing startup with really complex operations. We were hiring new data analysts every month, our operations team kept needing more and more dashboards, and the whole thing was becoming difficult to manage. The typical response is to ramp up hiring to increase the team’s capacity.
While that works early in the team’s life, it soon becomes more beneficial to make each analyst more efficient than to simply add headcount. I saw this firsthand when we worked with a leading data catalog provider to set their tool up for our system.
What is a data catalog?
Data catalogs are a relatively new category of tools, but they’re essential to the growth of your team. A typical set of features includes:
- Metadata for tables and columns;
- Assigning ownership of data assets to team members;
- Surfacing the most used tables and columns;
- Flagging data assets as verified or unverified;
- Tracking the lineage of each asset (i.e., which assets it was derived from).
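To make that list a bit more concrete, here is a rough Python sketch of what a single catalog entry might hold and how lineage and usage could be looked up. The class, field, and table names (`CatalogEntry`, `upstream`, `daily_revenue`, and so on) are made up for illustration and are not taken from any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset as a catalog might describe it (hypothetical schema)."""
    name: str
    description: str = ""          # metadata
    owner: str = ""                # who to ask about this asset
    verified: bool = False         # verified vs. unverified flag
    query_count: int = 0           # rough usage signal
    upstream: list[str] = field(default_factory=list)  # direct parents


def full_lineage(catalog: dict[str, CatalogEntry], name: str) -> set[str]:
    """Return every asset that `name` derives from, directly or indirectly."""
    seen: set[str] = set()
    stack = list(catalog[name].upstream)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in catalog:
            stack.extend(catalog[parent].upstream)
    return seen


def most_used(catalog: dict[str, CatalogEntry], top_n: int = 3) -> list[str]:
    """Return the names of the most queried assets, busiest first."""
    ranked = sorted(catalog.values(), key=lambda e: e.query_count, reverse=True)
    return [entry.name for entry in ranked[:top_n]]


# Illustrative data only.
catalog = {
    "raw_orders": CatalogEntry("raw_orders", owner="ana", verified=True, query_count=40),
    "clean_orders": CatalogEntry(
        "clean_orders", owner="ben", verified=True, query_count=120,
        upstream=["raw_orders"],
    ),
    "daily_revenue": CatalogEntry(
        "daily_revenue", owner="ana", query_count=75, upstream=["clean_orders"],
    ),
}

print(sorted(full_lineage(catalog, "daily_revenue")))
print(most_used(catalog, top_n=2))
```

A real catalog collects most of this automatically from your warehouse and query logs, but the shape of the information is roughly the same.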
The benefits of data catalogs
Before we had a Data Catalog, we spent tons of time on Slack posting questions like “does anyone know if column X in table Y is reliable?”, “does anyone know who owns table Z?”, and so on.
Eliminating the need to spend hours searching for answers to questions like these is exactly why it’s important that you invest in good documentation for your team. Not only does it save you time, it saves you from making lots of errors resulting from using faulty data. It takes an upfront effort to set up a system like this, but in the long run, it always pays off.
Tools to use to build a data catalog
Fortunately for us, this category of tools is growing very quickly. If you’re on a tight budget but have engineering resources available, an open-source solution like Amundsen, developed at Lyft, is a great option. If you’re willing to pay a bit more in exchange for much less work, Stemma is now offering a cloud-hosted version of Amundsen too, with additional features that they’re continuing to build on. Other players include Alation, data.world, DataGalaxy, and many more.
But hey, you don’t even necessarily need extra tools to get something basic but incredibly useful in place. To start, you can simply open up a spreadsheet, create one sheet for each important table, one row for each column of that table, and add important information about them (like who the owner is), and use colors to designate verified vs. unverified columns. Even if you end up getting a Data Catalog later, this work will save you time in setting it up.
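If you’d rather generate that starter spreadsheet than type it by hand, something like the following Python sketch could do it. It produces one CSV per table with one row per column. The table names, column names, owners, and header layout are all placeholders, and since CSV can’t carry colors, a plain yes/no `verified` column stands in for them.

```python
import csv
import io

# Hypothetical header layout; adjust to whatever your team wants to track.
HEADER = ["column", "description", "owner", "verified"]


def sheet_for_table(columns):
    """Build CSV text with one row per column of a table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for col in columns:
        writer.writerow([
            col["name"],
            col.get("description", ""),
            col.get("owner", ""),
            "yes" if col.get("verified") else "no",
        ])
    return buf.getvalue()


# Illustrative data only: one entry per important table.
tables = {
    "orders": [
        {"name": "order_id", "description": "Primary key", "owner": "ana", "verified": True},
        {"name": "discount", "description": "Needs review", "owner": "ben"},
    ],
}

# One "sheet" (CSV text) per table, keyed by a suggested file name.
sheets = {f"{name}.csv": sheet_for_table(cols) for name, cols in tables.items()}

for filename, text in sheets.items():
    print(filename)
    print(text)
```

Each CSV can be imported as its own sheet in whatever spreadsheet tool you use, and you can add the color coding there.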
Ultimately, the point is this: you need to get started on documentation now because it’s guaranteed to pay off for you.