Menu Close

So about those Joins

Jul 6, 2016 by Sameer Al-Sakran

One of the most common feature requests we get is to “support Joins.” We’ve had a long and semi-sordid relationship with joins in Metabase since the project’s inception. We’ve often alluded to how we think relational data should be used in Metabase in GitHub issues, our discussion forum, and on Twitter. Given how often the topic comes up, we figured now is as good a time as any to pull those conversations together.

So what’s a join?

Most right-thinking people have no idea what a join is, aside from maybe that thing a welder did to a couple pipes when they were setting up your parents’ hot tub back in that story from the ‘70s they keep telling you. If you have no idea what a join is, go back to using Metabase or looking at some puppies online.

How we think about joins in Metabase

If you’re still with us, one of the core notions of how Metabase handles joins is that we’re unlikely to ever expose a single “join” operator. Joins are a tool in relational databases/SQL that happen to be used for a great many different applications. They let you represent objects that hold other objects (e.g. playlists of songs), subtract one time series from another, denormalize a set of tables that all describe a single entity, get information from a connected object (e.g. a blog post’s author’s location), etc.

While we still have a lot of work to do, we’re trying to create primitives that more closely reflect what a user is trying to do with a given bit of connected data, rather than exposing a single idiom (with four flavors).

Our goal with Metabase has always been to provide a way for non-technical users to answer their own questions in a self-serve manner. While joins are a great tool that a skilled analyst or programmer might reach for, we will be trying to add features that expose a highly specific, easily understood operation that someone who isn’t SQL fluent would understand.

In the meantime

Okay, so that’s a beautiful vision of the future. But given that we’re understaffed and in need of help, if you need a join to fix a problem RIGHT NOW, what are you supposed to do?

So first off, SQL queries can always be used as a fallback. If there’s a specific question that can’t be answered using our GUI query builder, you can always just write some SQL that includes joins. Once our support for SQL parameters ships in our next version, your users can then use a parameterized SQL query. This should be helpful in getting their recurring questions taken care of, so long as they are relatively uniform.

But what if your data model necessitates joins to ask just about anything remotely useful? Maybe your DBA took a Coursera class on normalization and all of a sudden you can’t do anything without a join? Or you’re hitting a hyper optimized transactional schema of your application directly, rather than a data warehouse with a schema more appropriate for analytical queries?

Pain on write vs Pain on read

Now’s a good time to take a bit of a detour to talk about pain in analytics that is brought about by your data model.

For most applications or companies, there are a large number of perfectly reasonable ways to store your data. Some of them make it really easy to ask analytical questions, because they map more closely with how a semi-quantitative person thinks of the underlying things your data model describes. These tend to be problematic in developing or scaling the application, and can be glibly characterized as “Pain on Write.”

Alternatively, you can chose a data model that makes it easy for your application developers to develop the application, but really hard on anyone trying to use your database for analytics. Glibly, “Pain on Read.” This is by far the most common situation. When people comment on GitHub and say something to the effect of “WE REALLY NEED JOINS,” what they are implicitly saying is that they chose a data model that makes it really hard on anyone using their database for analytics. The easiest release valve for this is to just slap three joins in every query. Once you go down this road, well then, yes, Join Support is a must-have in any analytics tool you use. However, by making queries that an end user would naturally like to ask require joins, your data model is implicitly saying you don’t want end users to ask their own self-serve questions. We think this is quite limiting.

This issue usually comes up as companies are moving from a time where getting the application working and out-the-door is slowly becoming less of a major risk, and the bigger risk is “how are we supposed to run this as a business?“ As more people are involved that are not directly doing product development, analytics becomes more and more important. This is in general a sign that you should be moving from a Pain-on-read worldview to a Pain-on-write one. This is a natural consequence of the fact that there will be significantly more things that read your database rather than write to it.

As an aside — in really large companies, there is a natural tension between data models that are optimized for application development, and data models optimized for analytical use. Typically, this tension is resolved by maintaining two (or more) effective data models. One in the “transactional database,” and one in the “analytics warehouse.” These two are kept in sync by spending hundreds of thousands of dollars (or more) on tool and engineering time. In a real sense, that’s the “Right” solution to this fundamental tension.

The Right Now

So, at this point you’re probably stewing a bit and are still in the “I HAVE A PROBLEM RIGHT NOW AND I NEED JOINS” state. Well, we’ve ranted long enough, so let’s talk solutions.

We’re assuming you’re still in the stage of your company lifecycle where you don’t have a couple of full-time engineers who you can assign to working on ETL, and that you would actually like to let anyone in your company answer their own ad hoc questions.

So rather than spending time constructing complicated queries with a bunch of joins, create SQL Views instead. For most data models, there should be 5–10 denormalized tables that should express 90% of the typical ad hoc queries your coworkers will ask. Take the time you’re spending writing ad hoc queries for them, and throw together a few views that make those queries trivial to ask in Metabase’s GUI query builder. While it will take longer than just whipping up a query, it will most likely save you time in the future.

Won’t it be slow?

Yes it will. Sadly, that’s a function of your data model. If a SQL View-based approach is slow, it’s highly unlikely to be fast when the joins are explicitly done in the end user queries either. To make it fast, either materialize your view, or step your ETL game up.

Won’t it be fragile to changes in the tables?

Yes it will. But, again, that’s a fundamental shortcoming of your data model. It won’t be any less fragile if those same tables are being used in tens or hundreds of queries you don’t control; it just means when they break, other people suffer first.

Isn’t this just a ghetto form of ETL?

Yes, it is. But that’s okay, because it’s probably the least worst options. Bad option #1 is keep using a Pain on Read data model and keep anyone who isn’t an analyst from asking questions. Bad option #2 is use a Pain on Write model that makes app development a pain in the ass and more bug prone. So … yeah, ghetto ETL it is.

And Now for a Shameless Plug

Metabase is the fastest, easiest way to get everyone in your company off your back and asking their own questions. Download it at http://www.metabase.com/start/ and get it up and running in five minutes for everyone on your team!