After helping build out a few startups’ analytics stacks, one starts to see some patterns. Some of these patterns are happy ones. Just about everyone loves the moment when you go from having no idea what’s going on to a foggy idea of what happened last week.
Others, not so awesome. We’ve written previously about ways that organizations misuse data. But for working analysts, or anyone charged with setting up analytics at a company, there is a much more fertile source of PTSD: data models and schemas! We’ll share a few common problems that still cause some of us to flinch.
These are, strictly speaking, mistakes or anti-patterns at small, resource-constrained companies. It’s important to note that many of these patterns were great ideas at later stages, or in different companies, but at an early stage (pre, or just barely post, product-market fit) they were less than awesome.
Without further ado, here are the top 5 sources of pain in early-stage analytics.
Polluting your database with test or fake data
Whether it’s test accounts, staff accounts, different data programs, or orders that come in through feline telepathy, many, many companies have some class of events or transactions that must be ignored in the bulk of common queries.
If this is the case, you’ve introduced a tax on all analytics (and internal tool building) query writing. Often this is balanced against transactional efficiency or (application) developer productivity. Sometimes this tax is worth it, sometimes it isn’t. For very large companies, transactional efficiency is an important enough goal that you can afford to spend a couple of engineers’ or analysts’ time cleaning up the results.
If you’re small, chances are you can’t and you should probably place the trade-off somewhere else.
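A minimal sketch of that tax, using a hypothetical orders table with an `is_test` flag: once test rows live alongside real ones, every query forever after must remember to exclude them, and any query that forgets quietly reports wrong numbers.

```python
import sqlite3

# Hypothetical schema for illustration: real orders mixed with a staff test order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, is_test INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, 0), (2, 250.0, 0), (3, 9999.0, 1)],  # row 3 is a test order
)

# The naive revenue query silently includes the test order ...
naive = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# ... so every analytics query needs the extra predicate, forever.
correct = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE is_test = 0"
).fetchone()[0]

print(naive, correct)  # 10349.0 350.0
```

One forgotten `WHERE is_test = 0` is all it takes for two dashboards to disagree.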
Reconstructing sessions post hoc
A sizeable fraction of valuable questions around user behavior, satisfaction, and value revolve around session metrics. Whether they are called “sessions”, “conversations”, “support contacts”, or otherwise, they refer to a number of discrete events related to a user that should be grouped together and treated as a single concept. However, it is frighteningly common for this basic concept in the vocabulary of a business to not get captured in the data model.
It is often reconstructed post hoc, which typically results in a lot of fragility and pain. The exact definition of what comprises a session typically changes as the app itself changes. Additionally, there is typically a lot of context around a user’s session on the client or the server processing their requests. It is far, far easier to assign a session, support ticket, or conversation id in the app and use that than to try to reconstruct them after the fact.
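Here is a sketch of what post-hoc reconstruction usually looks like: group a user’s events into sessions whenever the gap between consecutive events exceeds some threshold. The 30-minute cutoff below is an arbitrary convention, and changing it rewrites history, which is exactly the fragility an app-assigned session id avoids.

```python
THIRTY_MINUTES = 30 * 60  # arbitrary cutoff; every team picks a different one

def sessionize(timestamps, gap=THIRTY_MINUTES):
    """Assign a session number to each event timestamp (seconds), post hoc."""
    sessions = []
    current = 0
    prev = None
    for ts in sorted(timestamps):
        # A long silence starts a new session -- by convention only.
        if prev is not None and ts - prev > gap:
            current += 1
        sessions.append(current)
        prev = ts
    return sessions

# Events at t=0s, t=60s, then two hours later: reconstructed as two sessions.
print(sessionize([0, 60, 7200]))  # [0, 0, 1]
```

Note that the answer depends entirely on the `gap` parameter; with `gap=10000` the same events collapse into a single session. An id assigned by the app at request time has no such ambiguity.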
Overusing soft deletes
At scale, deleting rows in a database under significant load is a Bad Thing. Soft deletes are a common schema tool that alleviates the performance hit of deletion and the subsequent compaction (or vacuuming). Additionally, they make it easy to un-delete rows or, in theory, recover deleted data.
On the flip side, they require every single read query to filter out deleted records. If you consider just the application’s data calls, this might not seem so bad. However, multiplied across all the analytics queries that will be run, it quickly becomes a serious drag, and yet another place where different users make different assumptions and produce inconsistent numbers that need to be debugged.
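The pattern can be sketched in a few lines, assuming a hypothetical users table with a `deleted_at` column: deletion becomes a cheap `UPDATE`, but the `IS NULL` predicate must then appear in every read, application and analytics alike.

```python
import sqlite3

# Hypothetical soft-delete schema: deleted rows are stamped, never removed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, NULL)",
    [(1, "ana"), (2, "bo"), (3, "cy")],
)

# "Deleting" a user is just an UPDATE -- the cheap part.
conn.execute("UPDATE users SET deleted_at = '2024-01-01' WHERE id = 2")

# A count that forgets the filter overcounts ...
naive = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
# ... so this predicate becomes the tax on every read query.
live = conn.execute(
    "SELECT COUNT(*) FROM users WHERE deleted_at IS NULL"
).fetchone()[0]

print(naive, live)  # 3 2
```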
Misusing semi-structured data
Semi-structured data (e.g., fields encoded as JSON) can be useful in situations where a number of different structures are present over time. As databases get larger, it can also be very helpful in avoiding the hassles of migrating large tables under heavy read or write load.
However, it can also lead to a lot of heartburn when trying to get data out of a database. Typically, semi-structured data has a schema that is enforced only by convention, may change unpredictably, may be wrong due to transient bugs, and in general requires a lot of post-hoc cleaning to be useful.
Additionally, semi-structured fields are sometimes an excuse not to think about the structure until after you’ve written a feature. In that case, you actually have structured data; it’s just unenforced, prone to bugs, and generally a pain to use. A simple test: if a JSON field always has the same 4 keys, it’s not semi-structured, you’re just lazy.
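That simple test can be sketched mechanically: sample the JSON payloads and check whether every record carries the same key set. The payloads below are made up for illustration.

```python
import json

def is_really_structured(json_rows):
    """True if every JSON object in the sample shares an identical key set --
    i.e., the field is structured data hiding in a JSON column."""
    key_sets = {frozenset(json.loads(row).keys()) for row in json_rows}
    return len(key_sets) == 1

rows = [
    '{"plan": "pro", "seats": 5, "trial": false, "region": "eu"}',
    '{"plan": "free", "seats": 1, "trial": true, "region": "us"}',
]
print(is_really_structured(rows))  # True: promote these to real columns
```

If the check comes back true over a decent sample, those keys probably deserve to be real, typed, enforceable columns.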
The “right database for the job” syndrome
The presence of a multitude of different databases used in a company’s technology stack usually indicates one of three scenarios. One is a truly huge, complex company that has a wide variety of needs and subdivisions. The second is a highly functional, top performing engineering and ops team working on very difficult problems that needs to highly optimize all aspects of the app. The third, and by far the most common, is a small team lacking discipline or leadership that is constantly putting out fires in technologies they only shallowly understand.
Aside from being an operational nightmare, each additional database is another set of data that needs to be moved into an analytics database, with slightly different semantics, data types and natural data models. Resist the urge to make a small feature easier to implement at the cost of making operations and analytics 1.5x harder across the board.
A useful heuristic
When thinking through whether the data model has taken analytical and transactional needs into account, it’s useful to do the following:
- Collect the top 10 most commonly run update queries from the application
- Collect the top 10 most common read patterns from the application
- Collect the top 10 important metrics the business operations folks care about
In general, you’re looking for a data model that minimizes pain across all of these distinct queries. You’ll usually maximize overall productivity by accepting a bit of complexity in 1 and 2 if it makes 3 easier. This is because at most companies there are far fewer workhorse queries on the application side than there are common analytics or business intelligence questions.
Also, the queries in 1 and 2 usually live in source control, are wrapped in automated tests, and are generally much more hardened. Queries for 3 are usually scattered, written by many people, and far less controlled. Anything you can do to make the queries in 3 idiot-proof pays off big in the long run.