Menu Close Log in Get started

Data and Business Intelligence Glossary Terms

A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
V
W
X

A

Aggregation

The act of summarizing data with a mathematical function, such as a averaging the values in a column, or counting the number of rows in a table.

What is an aggregation? An aggregation is the act of summarizing data with a mathematical function, such as averaging the values in a column, or counting the number of rows in a table. The resulting number is often called a metric, which is distinct from metrics in Metabase. Individual values in a field may not hold much meaning on their own, but when we combine these values in some specific way, we can paint a more comprehensive picture of our data. Aggregation is the process of collapsing those values into a single result, and is usually performed in conjunction with grouping — that is, combining multiple rows based on a certain value, like grouping by a dimension (e.g., a product category or country). Aggregations can be calculated on the fly, but you may also want to create summary tables with your results and save the results of those aggregate functions for future use. Summary tables can be especially useful when working with large datasets; since summary tables are precomputed, queries that rely on them can run much faster. Common aggregate functions in SQL Different databases have different sets of functions, but here are some of the most common aggregate functions that you’ll encounter: COUNT() - Counts the number of rows in a table. AVG() – Computes the average of values in a field. MIN() – Identifies the minimum value in a field. MAX() – Identifies the maximum value in a field. SUM() – Returns the sum of values in a field. STDEV() – Calculates the standard deviation of values in a field. Example aggregation Using Metabase’s Sample Database, let’s say we wanted to know the average price of our products, grouped by product category. In this case, we’ll use the Products table. Our SQL query would look like this: SELECT category, avg(price) FROM products GROUP BY category Just like we wanted, we’ve calculated the average of values in the Price column of our Products table, and grouped those averages according to their value in the Category field. If we wanted to execute that same aggregation in Metabase’s query builder, we’d Summarize by Average of Price, and then group by Category, like in the image below: Fig. 1. Performing an aggregation in the query builder: average price of products grouped by product category.

Read More
Area chart

A type of line chart where the space between plotted values and the x-axis is filled in with a solid color.

What is an area chart? An area chart is a type of line chart where the space between plotted values and the x-axis is filled in with a solid color. You can think of an area chart as a cross between a standard line chart and a bar chart. In an area chart, you still have a line that connects data points (usually plotting change over time), but by filling in the area between that line and the x-axis, you’re highlighting both the data’s relationship to zero and the relationship of any subcategories to a total. When to use an area chart Area charts are a great choice when: You’re plotting multiple subsets of data that, together, make up some total value. You need to represent volume and changes in volume over time. Your data contains too many x-axis data points for a stacked bar chart. Consider a different visualization, like a bar or line chart, if: You’re only measuring a single series over time. Your data is highly volatile, subject to frequent and/or significant swings. You’re measuring a large number of subcategories within one chart. Example area chart Similar to a stacked bar chart, stacking data on an area chart works best if you have a small number of categories to track — too many and your area chart will start to get chaotic or unreadable. Figure 1 shows Orders broken by out by Product → Category, visualized as a stacked area chart. Using this visualization, we can understand both the rise and fall of individual product categories and how orders have performed as a whole: Fig. 1. An example area chart.

Read More
Attribute

An attribute is a property that describes or identifies some entity. In some Metabase plans, user attributes are used to restrict which data people can access.

What is an attribute? An attribute is a property that describes or identifies some entity. People in the data world use “attribute” in a few different contexts, so we’ll do our best to disambiguate here. Basically, an attribute is a characteristic of something. That something might be a table, but attribute could also refer to a characteristic of a specific record, like user attributes in Metabase. Attributes in relational databases In relational databases, people often use attribute synonymously with column or field, like how a product’s Category is an attribute of (or describes) that product. This usage of attribute comes up a lot within the context of data modeling and when designing entity relationship diagrams. Example attribute Here’s a look at the People table in Metabase’s Sample Database, which includes fields like ID, Name, Address, City, State, and so on: Fig. 1. A look at the People table. Each of these fields is an attribute — the values in those fields describe something about the record they’re associated with, in this case a “person” in the People table. User attributes in Metabase ⚠️ Sync user attributes This feature is only available on Pro and Enterprise plans (both self-hosted and on Metabase Cloud). Attribute can also refer to a distinct variable value that gets associated with a certain user, like a User_ID. That structure is known as a key-value pair, sometimes referred to as an attribute-value pair. In Metabase, some plans allow you to set user attributes yourself (or pass them to Metabase via SSO). You can use these user attributes to set up custom destinations on a dashboard, for example by using a user ID to parameterize a URL when that user clicks on a chart. User attributes are also an important part of data sandboxes, which give you granular control over the data that people using your Metabase instance can access. Since data sandboxes are associated with individual users, setting distinct user attributes lets Metabase know exactly how to filter a table depending on who views it.

Read More

B

BI tool

An app designed for people to look at data without relying on code.

What is a BI tool? A BI tool is an app designed for people to look at data without relying on code. These apps allow people to visualize and share data as tables, charts, and dashboards. BI tools plug into existing data sources at your organization, such as your data warehouse, CRM, or event analytics service. Common BI tools Spreadsheet applications A classic example of a BI tool is a spreadsheet app like Microsoft Excel or Google Sheets, where you can visualize data in tables, pivot tables, or charts, and share the results as individual files (or links to those files in the cloud). BI platforms A BI tool is often thought of as an app that is only used to visualize data and make reports. A BI platform like Metabase is type of BI tool that can handle additional tasks adjacent to reporting, such as data modeling, data cataloging, version control, and permissions management. How do BI tools fit into a data stack? Fig. 1. BI tools fit under the analytics component of a modern data stack. BI tools are one of many analytics tools that can be set up at the user-facing end of a data stack. BI platforms in particular can handle some of the same tasks as other parts of your stack. Here’s how you can expect the pieces to interact: BI tools vs. databases BI tools aren’t data sources — they don’t replace production databases or data warehouses for storing data owned by your organization. BI tools pull information from databases by running queries and displaying the results. BI tools vs. ETLs BI tools don’t replace ETLs (or ELTs) for ingesting or transforming large amounts of data on a schedule. However, like ETLs, some BI platforms can handle data modeling and data stitching (joining data from different databases) by running queries on the fly. BI tools vs. event analytics services Event or web analytics services like Google Analytics, Segment, or Amplitude collect usage data from your product. Although these services come with their own interface to visualize and share that data, they aren’t considered BI tools. You can think of them as mini data stacks that can be used standalone. Event analytics services can be integrated into your core data stack by combining them with a central BI tool. You can download event data from the service and move it into the data warehouse connected to your BI tool, or if supported, wire up the service to connect directly to your BI tool. In Metabase, you can add Google Analytics as a data source. BI tools vs. open source coding tools Open source tools like Jupyter Notebook and RShiny use programming languages like Python and R to work with data. They can be used to build reports and dashboards for analytics, but they aren’t considered BI tools because they rely on code rather than a visual interface.

Read More
Bar chart

A data visualization that uses rectangles that are proportional in size to the values they measure.

What is a bar chart? A bar chart is data visualization that uses rectangles (or bars) along axes that are proportional in size to the values being measured. These bars are scaled to the values associated with discrete, categorical data. Bar charts are useful when you want to compare fixed values across categories. One axis will contain the categories being measured, while the other will show their corresponding numerical values. Fig. 1. A bar chart showing the average rating of products by category. Figure 1 shows a standard column chart where the bars are displayed vertically, but you can also visualize your data with horizontal bars, known as a row chart. Figure 2 displays that same data — average product rating by category — in a row chart. Fig. 2. The same data visualized with horizontal bars, called a row chart. You aren’t limited to a single dimension with bar charts; they can be great for multi-series charting. For example, maybe you want to compare how orders performed by quarter based on user source. In this scenario, you could use a standard multi-series bar chart or a stacked bar chart, where user sources are stacked proportional to their values within a single bar, each bar displaying data from one quarter. And if you’d rather see those as relative percentages of within their stacks, consider a 100% stacked bar chart. Bar charts are widely adopted and easy to interpret, making them a good option for quick-reference comparisons.

Read More
Bin

A single range of continuous values used to group values in a chart.

What is a bin? A bin is a single range of continuous values used to group values in a chart. Binning data helps simplify data visualizations, so people can get a sense of their data’s distribution and easily spot outliers. You most often see bins used with histograms, but they aren’t exclusive to histograms and can be useful with other visualizations like line charts or pie charts. If a measure in your dataset contains a lot of unique values, plotting each individual data point on a chart can look cluttered and may not be the best representation of your data. When you bin that data, those values get grouped into equal-sized intervals (like 1–10, 11–20, 21–30, and so on), and your resulting chart will show a count of values within each bin. Data binning example Figure 1 shows the price of products in Metabase’s Sample Database, displayed as a histogram. Fig. 1. The prices of products in our Sample Database, shown as a histogram. Metabase automatically generates bins based on how the data is distributed. The bins here are the price ranges; we can see that we have more products in the $37.50–50.00 price range than any other. Metabase auto-binned these values, but we can also select the number of bins we wanted (either 10, 50, or 100) to tweak this chart further. If your bin size is too small, you’ll have too many and likely end up with a visualization that’s hard to interpret. However, too few bins will give you an incomplete or overly-compressed picture of your data’s distribution, so play around and figure out what works best for your data.

Read More

C

Card

A component of a dashboard that displays data or text.

What is a card? A card is a component of a dashboard that displays data or text. Metabase dashboards are made up of cards, with each card displaying some data (visualized as a table, chart, map, or number) or text (like headings, descriptive information, or relevant links). Cards and questions Cards on a dashboard are more than just mini versions of the questions you’ve asked. If you’re adding saved questions to a dashboard and calling it a day, you may be missing out on a lot that cards can do. You can combine multiple saved questions in a single card, as long as they share a dimension. Your cards don’t have to contain different questions either. It may be useful to put include same card on a dashboard multiple times, like if you want to visualize one question as both a line chart and a bar chart. Editing cards on a dashboard When editing a dashboard, you can: Arrange and resize cards on the dashboard’s grid. Change a card’s visualization options without affecting the underlying question. Connect cards to dashboard filters to filter the question’s results. Set a card’s click behavior to change what happens when someone clicks on a card. And while it’s great for those cards on your dashboards to look slick and visually appealing, it’s more important that you’re conveying the information that people need to see, without too much added fluff. Example of cards on a dashboard The dashboard in figure 1 includes text cards that act as headings and descriptions, as well as cards with numbers, trends, a line chart, and a region map: Fig. 1. Cards with questions and text on a dashboard. Cards and the Metabase API In the Metabase API, the api/card route refers to questions, rather than cards on a dashboard. You can still use the API to edit and get information about the cards on your dashboards (like with api/dashboard), but keep that distinction in mind, and check out the API documentation for a full list of routes and endpoints.

Read More
Collection

A set of items in Metabase, including questions, models, dashboards, and other collections.

What is a collection? In Metabase, a collection is a set of items — questions, models, dashboards, and subcollections — that are stored together for some organizational purpose. You can think of collections like folders within a file system. The root collection in Metabase is called Our Analytics, and it holds every other collection that you and others at your organization create. You may keep a collection titled “Operations” that holds all of the questions, dashboards, and models that your organization’s ops team uses, so people in that department know where to find the items they need to do their jobs. And if there are specific items within a collection that your team uses most frequently, you can pin those to the top of the collection page for easy reference. Pinned questions in a collection will also render a preview of their visualization. Fig. 1. A look at a collection in Metabase. Administrators can set collection permissions to determine which collections (and by extension, their contents) are viewable by which groups in Metabase. In addition to the top-level Our Analytics collection and whatever collections your organization creates, each Metabase user has their own Personal collection. These are semi-private, in that admin users can view the contents of everyone’s personal collections. Personal collections are a good place to store questions and dashboards that you haven’t quite perfected yet, and you can move them to a public collection whenever you’re ready.

Read More
Column

A list of values, usually belonging to a particular field, displayed vertically in a table.

What is a column? A column is a list of values, usually belonging to a particular field, displayed vertically in a table. In a relational database table, values within a column each correspond to a different record. Values in a column share a data type. That is, if a column’s data type is Integer, that means every value within that column must be an integer. There may be other constraints too, related to formatting, character length, or whether or not that value is mandatory. Example column Here’s an image of the Orders table in Metabase’s Sample Database, with the Created At column highlighted. The Created At column’s data type is DateTime, and the values in this column each correspond to the timestamp of a single order. Fig. 1. A look at the Orders table, with the Created At column highlighted. Columns vs fields While columns and fields aren’t technically the same thing, it’s usually okay to use these terms interchangeably. See Columns vs. Fields. However, keep in mind that columns don’t always directly correspond to fields in a database. For example, you may want to create a custom column in Metabase that contains calculated values, like one that displays the percentage discount for each order. You’d create this custom column by telling Metabase to calculate the Discount divided by the Subtotal and display the resulting value in a new column. Columnar storage While many traditional relational databases store data as rows and are typically best suited for holding transactional data, some databases (like data warehouses optimized for analytics) utilize columnar storage. Columnar databases (also known as column-oriented databases) physically store a column’s values together, rather than indexing based on entire rows. This can drastically speed up analytical queries and aggregate functions, since those queries will be able to retrieve similar data from the same location on a disk, rather than executing large reads across the database to pull a column’s values from individual records.

Read More
Concat

A function that links different units of data together to treat it as one unit of data.

What is concat? Concat (or concatenate) is a function that links different units of data together to treat it as one unit of data. The word comes from the Latin root con- for “together”, and caten- for “chain”. For example, you can concatenate the name of a city with the name of a state to make a linked unit of information called a location. In most BI tools, you can assume that the concat function only works with the text data type (often called a string). If you want to use concat with other data types, like numbers or dates, you’ll have to convert them to text first. How to use concat The following examples show you how concat can be used to link together city and state values from two different columns, so they can be treated as a single location in a new column. Concat in a spreadsheet In a spreadsheet app like Excel or Google Sheets, you’ll usually write the concat (or concatenate) formula directly into a spreadsheet cell using cell references, and “fill down” that column with your formula. You can also include other pieces of text inside quotation marks. For example: Cell A1: “San Francisco” Cell B1: “California” Cell C1: =concat(A1, ", ", B1) → “San Francisco, California” Note that the use of ", " in the concat formula to add a space and comma between the city and the state. Concat in Metabase In Metabase, you can write a custom expression for concat inside the field formula for a custom column in the notebook editor. For example, you can use the expression concat([City], ", ", [State]) to add a new column for Location in the People table that comes with the Metabase sample database. Concat in SQL If you’re writing SQL, you can include the CONCAT function inside a SELECT statement: SELECT CONCAT(City, ", ", State) AS "Location" FROM People; The Metabase concat expression is translated into a SQL CONCAT function before running against your database. Note that the SQL functions JOIN, UNION, and MERGE might sound like they work similarly to CONCAT, but they are used to combine rows and columns from different tables, rather than text strings from different columns.

Read More

D

Dashboard

A data visualization tool that holds important charts and text, collected and arranged on a single screen.

What is a dashboard? A dashboard is a data visualization tool that holds important charts and text, collected and arranged on a single screen. Dashboards provide a high-level, centralized look at KPIs and other business metrics, and can cover everything from overall business health to the success of a specific project or campaign. The term comes from the automotive dashboard, which — like its business intelligence counterpart — provides status updates and warnings about important functions (just for things like low brake fluid instead of how your recent marketing campaign performed). Dashboard vs. report Dashboards aren’t exactly the same as reports, though you’ll sometimes hear people refer to dashboards as reports. The difference is that dashboards tend to be easier to read and understand at a glance, while traditional reports provide a more detailed look at a subject. Unlike traditional reports, dashboards are viewable on a single screen and often incorporate some interactive elements. You’re probably not going to print out a dashboard to read, which would make a lot more sense for a traditional report that draws on static, historical data. However, just like with traditional reports, you can send out updated dashboards according to a set schedule, like with dashboard subscriptions in Metabase. Dashboards in Metabase In Metabase, dashboards are made up of cards that contain either questions or text. You have a lot of options when creating and editing dashboards in Metabase, like: Arranging and resizing cards to fit your desired dashboard design. Making your dashboards interactive by setting custom click behavior and linking one dashboard to another. Adding filter widgets and wiring them up to specific fields on individual cards. Using Markdown to annotate your dashboard with text or GIFs. Sharing your dashboard with a link or by embedding it in your website or app. Example dashboard Figure 1 shows an example of a dashboard in Metabase with three question cards and three filter widgets. If someone inputs a customer ID, customer name, or date into one of the filter widgets, the charts will adjust accordingly to fit reflect that added filter. Fig. 1. A dashboard with three question cards, and three filter widgets.

Read More
Data dictionary

A document that describes the tables, fields, and other elements in a database and explains their meaning and origin.

What is a data dictionary? A data dictionary is a document that describes the tables, fields, and other elements in a database and explains their meaning and origin. Data dictionaries are repositories for a database’s metadata, storing the administrative information that people need to understand and make use of that data. Think of them like a typical dictionary, but instead of every word in a language, data dictionaries contains definitions and about the objects that make up your database. An up-to-date and comprehensive data dictionary helps make sure that everyone stays on the same page about what certain fields or tables mean in practice. Data dictionaries can also help make sure that different departments are all using those terms consistently. Data dictionaries are usually a separate file or set of files stored alongside the database they describe. While some aspects of your database’s data dictionary may be accessible to all database users (like important descriptions that everyone needs to know), other parts may only be viewable by database administrators (like technical details about the physical implementation of your database). Data dictionary in Metabase In Metabase, the data reference section acts as a data dictionary. What goes in a data dictionary? Data dictionaries collect and store metadata associated with a database, usually information like: Table and field descriptions Data types Integrity constraints Naming conventions Locations of files While the exact formatting of your data dictionary will depend on your organization and the complexity of your dataset, it’s common for data dictionaries to be formatted as a table or series of tables, with fields for metadata like field name, description, data type, character length, and whether null values are permitted. You can make a data catalog with a simple spreadsheet, within your relational database software, or even as a text document. Data dictionary vs. schema vs. data catalog There’s some overlap here with a database’s schema, but generally speaking a schema defines the structure of the database and how tables and their fields fit together, while a data dictionary provides contextual information about that data. Maybe you’ve heard about data catalogs too, another similar concept. Some organizations utilize data catalogs to better facilitate discovery and analysis of their data; they’re like data dictionaries with some added features and functionality, taking things a step further than the traditional document-based data dictionary.

Read More
Data lake

A data lake is a place to store both structured and unstructured information, typically as files or blobs.

What is a data lake? A data lake is a place to store both structured and unstructured information, typically as files or blobs. You can think of a data lake as a dumping ground for all of your data, regardless of structure, format, or intended use. The idea of a “lake” is largely marketing jargon, but the aquatic comparison comes from the idea that information in a data lake flows in a more “natural” state than that of the more rigid and hierarchical data warehouse. And because they can hold raw data that doesn’t need to adhere to a specific schema, data lakes tend to be cost-effective when scaling to store substantial amounts of information (into the petabytes). Since there’s no need to define a schema at the start, data lakes can be straightforward to set up; you can load data in for a specific use or just to keep it on hand for the future, even if you aren’t yet sure what kinds of queries you’ll need to run on it. However, once you do get things set up, configuring the tools you’ll need to actually make your data lake useful can get complex and expensive — typically requiring the expertise of data engineers. Those engineers will set up ETLs as needed, or even train machine learning models on parts of your data lake. Data lakes rely on a schema-on-read system, meaning data only gets verified against a schema once it’s pulled from that data lake for querying, rather than when it’s first written. This does mean, however, that pulling from and making use of a data lake takes more work. And just because a data lake allows for greater flexibility doesn’t mean you should thrown all data governance out the window; the information that goes into your lake should still be of good quality, cleaned and annotated so that your ETLs or query engine (and by extension, the people who need the data) can make good use of it. When to use a data lake If you need to analyze huge volumes of semi-structured and/or unstructured information (like if you’re an IoT company) then a data lake may be a good fit. Since there’s no need to enforce an overarching schema when data is written, data lakes can also be an effective solution if you’re dealing with many different types of data sources at once — like streaming data, structured application databases, data from IoT devices, social media, or web traffic. Ultimately, organizations with complex data needs may not rely exclusively on a data lake or a data warehouse (or even a data lakehouse), and instead construct data architectures that can incorporate both, taking into account the organization’s overall strategy, the needs of the people who’ll use it, and the types of queries those people will need to execute. Setting up a data lake Let’s say you want to set up a data lake. In broad strokes, the process will look something like this: Choose a cloud storage provider. There are data lake services out there that can help you set up the various layers and tools you’ll need, but at its core your “lake” is your storage layer — wherever you’re keeping that structured and unstructured data together (like in AWS S3 or Microsoft Azure). Identify your data sources. These may be structured (like an application database), semi-structured (like XML or JSON files), or unstructured (like social media posts, images, or text documents). Clean up and ingest data from those sources. At this stage, you’ll annotate those data sources (especially the semi-structured and unstructured ones), adding metadata and tagging and classifying them based on the types of questions you’re likely to ask of that data. Once that data has been cleaned up, those annotated copies get loaded into your data lake, probably in a columnar format like parquet that’s better for analytical queries. Create ETLs as needed and query your data lake. Because of their mix of formats and often-unstructured nature, engineers and data scientists are usually the ones directly accessing a data lake. People like your data analysts will query the data lake through the use of query engines like Presto or SparkSQL, which run ETLs over the data lake, structuring it on a regular schedule so the data can be queried via SQL. Those queries get executed on the cleaned-up, annotated, columnar copies of your data, rather than on the raw data sources themselves (both the raw data and the cleaned-up date are stored in your data lake).

Read More
Data model

Any pattern that organizes and labels information.

What is a data model? The term data model is used to describe any pattern that organizes and labels information. People will use “data model” as a generic way to refer to concepts like schemas, derived tables (views), or ERDs. A good data model helps people find things faster. For example, a mall directory is a data model that organizes information about the stores in the mall. It groups and labels the stores by category or location, and explains how the stores are related to one another by displaying them on a map. This model makes it easier for people to find out where to go, compared to wandering through the mall on their own, or reading through a random list of store names. Data modelling example To make decisions during data modelling, it’s best to start by figuring out what people want to look for, and why. Let’s say we want to create a data model for storing information about movies, to help people look for new things to watch. You can think of this data model as a template that can be filled out for any movie. The template should do two things: Represent the parts of a movie that are useful for finding a specific one. For example, people might search for a movie they want to watch by title, director, genre, or actor. Describe the relationships between the parts, so that it is easy to look up one group of information based on another. For example, the template should make sure that any movie title is associated with at least one director. The simplest type of data model groups related parts together into one template, and includes some information about how to fill it out. For example, the template below can be used as a data model for any movie. Movies Title: Any text (mandatory). Director: List of names (mandatory). Genre: Any text (optional). Cast: List of names (optional). The model can be expanded by adding more parts related to movies such as Release year or Run time. We can also expand on existing parts if they are useful for looking things up. For example, people might want to search for a movie by specific information about the actors in them, such as any acting awards they’ve won. Since Cast only keeps track of actor names, we can split out award information into a new data model. Acting Awards Award: Name of acting award (mandatory). Award year: Year (mandatory). Actor: First name and last name (optional). Since actor names appear in both models (either under Cast or under Actor), there is a relationship to connect the Movies model and the Acting Awards model. When both templates are filled out with real movie and awards information, people will be able to look up a movie by a particular award. The written templates above are a basic way to think about breaking down information for data models, but there are many best practices that you can follow depending on the use case. You can find examples of common data model formats in the next section. Common data models Schemas Schemas are a conceptual data model. They are used by people who work with databases. Information is represented by named columns and data types. Relationships are described by structures such as tables or JSON objects. ERDs ERDs are a visual data model. ERDs are used by people who need to talk about information management and architecture. Information is represented by different shapes, such as rectangles or diamonds. Relationships are described by different lines, such as arrows or dashed lines. Metabase models A Metabase model is a data model that you can create and save from a question or SQL query. Information is represented by named columns and any associated metadata. Relationships are described by the logic used in the question or SQL query. How people actually use the term data model You might find that different teams use the term “data model” informally to mean different things: People who write SQL may use it to refer to derived tables or views. Programmers may use it to refer to a schema or ERD. The Data Model in Metabase If you’re a Metabase admin, you’ll have access to the Data Model page in Metabase. The changes you make here will affect the way that data appears across all of Metabase. What’s the difference between the Data Model page and a Metabase model? The Data Model sits on top of the raw data warehouse tables connected to Metabase. It is a layer of modelling you can use to clean up the tables that your organization can see. You can think of it as a way to “translate” information between the data world and the business world by assigning human-readable names and saving common definitions of segments or metrics. Metabase models sit on top of the Data Model. They can be created by anyone with permissions to use the underlying database tables.

Read More

E

ERD

An ERD, or entity relationship diagram, is a graphical representation of how tables in a database connect to each other.

What is an ERD? An ERD, or entity relationship diagram, is a graphical representation of how tables in a database connect to each other. ERDs show a database’s structure (or schema) at a high-level. ERDs are a useful tool when designing a new data model or identifying issues within an existing schema. Entity relationship diagrams are basically just boxes (your entities, or tables) connected with lines (the relationships between them). Your database software may have some built-in functionality to create ERDs, but you can also use whatever design software you like best or go the analog route and draw out your ERD on a piece of paper. The how is less important; what really matters is making sure that your diagram is accurate and logical, so you can design the most effective database for your specific use case. Example ERD Here’s an example of an ERD showing Metabase’s Sample Database: Fig. 1. A simple ERD of Metabase's Sample Database. The four tables, Orders, Products, People, and Reviews, are our entities, and the connecting lines show the three one-to-many relationships between them. ERD design and notation When sketching out an entity relationship diagram, each box should contain information like that table’s name, field, and key information (primary and foreign keys). You’ll notice in the example above that each table’s key information is indicated with (PK) and (FK) next to field names. The type of line between each entity illustrates the kind of relationship each table has to another. Different organizations and industries use different conventions for ERD notation, but one of the most common is crow’s foot notation, named because the three-pronged symbol (the one that gets used for “to many”) looks a bit like the foot of a bird. Figure 2 shows the common symbols used in crow’s foot notation and their corresponding relationship types: Fig. 2. The lines used in crow's foot notation to represent different types of table relationships.

Read More
Embedding

Placing some functionality of one app inside another. Metabase uses iframes to embed questions, dashboards, or (in some plans) the full Metabase application.

What is embedding? Embedding is the process of placing some functionality of one application inside another. In analytics, this usually means integrating data visualizations into a parent application, allowing people to view charts within the context of the their own application. Embedding can save time and resources for the parent app too, allowing teams to draw on existing analytics tools rather than building everything from scratch themselves. While not the only way to embed something, embedding in Metabase involves using an iframe (an inline frame) to place a question, dashboard, or (in some plans) the full Metabase app within another application. Embedding Metabase charts and dashboards Embedding is more than just placing a static image of a chart into your site or app. Instead, that iframe creates a nested browser within your main browser or app that points to its own, separate URL. This way, the embedded Metabase chart or dashboard stays up to date. When you view an embedded chart, you’re still seeing the Metabase chart itself — just nested in the parent application. Depending on security configurations, your individual embedded charts and dashboards are either public or secure embeds. You can also configure or lock parameters to affect what people see on those charts, like in figure 1: Fig. 1. Making parameters editable and enabling dark mode before publishing a dashboard for embedding. Embedding the full Metabase application ⚠️ Full-app embedding This feature is only available on Pro and Enterprise plans (both self-hosted and on Metabase Cloud). With some plans, you’re able to embed the full Metabase experience within your application. Full-app embedding is particularly useful for multi-tenant analytics, like providing your customers specific reports that they can view and interact with all while remaining in your app.

Read More

F

Filter

A filter is a predicate expression that limits the results of a query based on some stated criteria.

What is a filter? A filter is a predicate expression that limits the results of a query based on some stated criteria. For example, you may want to limit the records in your Orders table so that you only see orders where the value of the Total field is over 100. We can use a predicate expression, Total > 100, to filter the orders. For each record, the query evaluates whether that expression resolves true or false, and narrows the results accordingly. So in this case if the record has a total greater the 100, that record is included in the results. In SQL, queries are filtered using the WHERE clause, like WHERE Total > 100. You can also filter aggregations in SQL using the HAVING clause, like HAVING AVG(rating) > 3.5. Filters in Metabase Filter the results of your questions. Add filters to your dashboards. Set up cross-filtering so dashboard filters update when someone clicks on a card. Link filters on a dashboard to limit results based on the value of another filter. Configure smart field filters in native SQL queries that know which filtering options to present based on field type and column data. Create filter widgets that act as a search function on a dashboard, like for a lookup tool. Example filter in Metabase Figure 1 shows the Products table in Metabase’s Sample Database, with a filter added that narrows the results to only include products where the Title field contains the word “Hat”: Fig. 1. A question in Metabase with one filter added.

Read More

G

H

I

J

Join

The combination of results from two tables in a relational database.

What is a join? A join is the combination of results from two tables in a relational database. While the word “join” makes it sound like you’re merging the tables themselves, a join actually takes the rows from two (or more) different tables and returns a new set of rows that combines columns from those tables, using entity keys and foreign keys to determine which rows are related. Types of joins There are four types of SQL joins: Left outer join: select all records from Table A, along with records from Table B that meet the join condition, if any. Right outer join: select all records from Table B, along with records from Table A that meet the join condition, if any. Inner join: only select the records from Table A and B where the join condition is met. Full outer join: select all records from both tables, whether or not the join condition is met. Example joins in Metabase Metabase defaults to left outer joins for questions asked in the query builder, but inner joins are the default for native SQL queries (that is, if you just use JOIN in your query rather than specifying which type of join). Let’s say we wanted to return results from both the People and Orders tables in Metabase’s Sample Database, like a table of includes an order ID, the name of the person who placed that order, and their user ID. Query builder join Figure 1 shows what this join would look like in Metabase’s notebook editor. We’d also want to pick which columns are visible, so we aren’t shown every column from both tables. Fig. 1. A join in the query builder. Native SQL query join If we were to write this same query in SQL, it may look something like this: SELECT orders.id AS "Order ID", people.name AS "Name", people.id AS "User ID" FROM people JOIN orders ON people.id = orders.user_ID Here we’ve identified where the join happens (in this case, joining at People → ID and Orders → User_ID, an entity key and foreign key).

Read More

K

L

M

Metadata

Information that describes data to make it easier to find, manipulate, and make use of that data.

What is metadata? Metadata is information that describes data to make it easier to find, manipulate, and make use of that data. Metadata examples Think about a file on your computer, like a digital image or text document. Among many other attributes, that file has a name, file type, extension, size, and timestamps noting when it was created, last opened, and last modified. This is all metadata — none of those attributes are really the file itself, but they do tell you important things about the file. Understanding and keeping track of this metadata tells both you and your computer how that file should be sorted and handled, like indicating to your computer what software to use when opening that file. Metadata exists beyond the analytics world, and is found pretty much everywhere. It’s important in a wide variety of industries, from photography to libraries to broadcast television, since any organization that handles or generates data needs to be able to find and organize it. Metadata is sometimes human-readable (like the title of a book or field names in a database), but can also to be machine-readable, like an XML or JSON file. Metadata in relational databases and data warehouses In a relational database, metadata includes all the information that make up that database’s schema, like the following: Table names Field names Entity keys Foreign keys Data types Views Integrity constraints However, there’s more to database metadata than just its schema. User information, business definitions, table and field descriptions, database size, and storage information are all important pieces of metadata too. Depending on how your database is configured, you may store some metadata within the database itself (like table and field names), or in a separate file or set of files that contain all of a database’s metadata. This is known as a data dictionary. In a data warehouse, metadata acts like an index or table of contents, defining all of the objects stored within that data warehouse, as well as information about the various ETL jobs that manipulate data so it can be useful for those who need it. Metadata about an ETL would likely include the name of the job, its purpose, when and how often it runs, which data the job uses, and where that data ends up. And if that job is properly annotated with plenty of useful metadata, it’s then easier for you or a coworker to understand what exactly the job does and why. Using metadata in Metabase Metadata plays a big part in Metabase! For example, designating a column’s field type (itself a form of metadata) gives Metabase an idea of what that field actually means, so Metabase can know how to format that field or what kind of visualization to show you. Models make use of metadata too. Annotating columns with a description when creating a model can go a long way in helping people better understand your data. Figure 1 shows how those descriptions show up when hovering over a column in that model: Fig. 1. Viewing the Products table's metadata in the data reference section. Finally, you can always view table metadata in the data reference section of Metabase’s data browser. Figure 2 shows the how that looks for the Sample Database’s Products table. As you can see, this view provides useful information like column names, descriptions, field types, and data types: Fig. 2. Viewing the Products table's metadata in the data reference section.

Read More
Metric

A metric is a calculation performed on a measure. In Metabase, a capital-M Metric is a saved aggregation with or without filters based on one table.

What is a metric? A metric is a calculation performed on a measure. Metrics are quantitative attributes of data, with some summarization applied. Metric vs. measure You’ll see the terms metric and measure used interchangeably, and they’re pretty similar concepts, both referring to some numerical value that’s part of (or drawn from) your data. However, there’s an important distinction: measures are raw, unaggregated data, while metrics are aggregated (or summarized) data. For example, while a field like Discount is a measure, the standard deviation of that Discount field would be a metric. Some people will also use “metric” to mean a computation of measures that’s specifically related to performance goals, like CRR (customer churn rate) or NRR (net revenue retention). By this definition, a metric is basically a KPI (key performance indicator), depending on whether or not someone has designated that metric as “key.” Example metric If we wanted to determine the average of order subtotals in Metabase’s Sample Database, we’d do so by summarizing, like in figure 1: Fig. 1. Summarizing the Orders table by subtotal average, a metric. In this case, Subtotal is a measure, but average subtotal is our metric. Metrics in Metabase In Metabase, a capital-M Metric is a saved aggregation based on one table, with or without filters applied. If there are certain aggregations that you and your team need to reference and use on a regular basis (like revenue), you may want to create a metric in Metabase so you can access it when asking questions, without rebuilding that aggregation yourself every time.

Read More

N

Normalization

The process of structuring information in a relational database to reduce redundancy.

What is normalization? Data normalization is the process of structuring information within a relational database to reduce redundancy. Normalizing data ensures that the tables in a database function as efficiently as possible, eliminating ambiguity so that each table serve a singular purpose. When moving from a denormalized database to a normalized one, you’ll probably need to break out your existing tables to create additional, smaller tables. Those new tables will have a narrower focus, and will link to other tables through entity and foreign keys. With normalization comes the added benefit of reducing your overall database size and easing database maintenance, since you’re no longer storing the same information in several places. Example of data normalization The normalization process is carried out according to rules that build upon each other, known as normal forms. First normal form (1NF) states that fields should not store multiple values within a single cell, and that each field within a table should be unique. Here’s an example: Denormalized table Product_ID Product_name Product_color1 Product_color2 P001 Knit cardigan Pink Maroon P002 Bootcut jeans Navy   P003 Linen vest Camel Off-white P004 Running sneakers Orange   You’ll notice that we have two fields containing similar information, about product color. To bring this table in accordance with 1NF, we need to break this table out into two separate tables that can be joined together. Normalized product name table Product_ID Product_name P001 Knit blazer P002 Bootcut jeans P003 Linen vest P004 Running sneakers Normalized product color table Product_ID Product_color P001 Pink P001 Maroon P002 Navy P003 Camel P003 Off-white P004 Orange Check out our Learn lesson for examples of 2NF and 3NF. While normal forms beyond these three exist, their use is largely theoretical and the first three should be sufficient for most practical database needs.

Read More
Notebook editor

The GUI interface for asking questions in Metabase, where you can filter, summarize, join tables, and create custom columns.

What is the notebook editor? The notebook editor is the GUI interface for asking questions in Metabase, where you can filter, summarize, join tables, and create custom columns. The notebook editor is part of Metabase’s query builder. There are three main steps to asking a question in Metabase using the notebook editor: Data: Choose the table, model, or saved question that will be the basis of your question. Technically this is the only required step for using the notebook editor. Filter: Add filters to limit the results of your question. Summarize: Aggregate your data and group it by one or more dimensions. You can preview the results of your question after each of these steps. In addition to filtering and summarizing, the notebook editor lets you order and limit the results of your question, create custom columns, and join tables. And if you want make adjustments after visualizing your question, you can always come back to the notebook editor to make those changes. Example question in Metabase’s notebook editor Figure 1 shows and example question in the notebook editor. Starting with the Sample Database’s Products table, we’ve filtered our data so we only see products where the Price is greater than $50, and then grouped that result by Category: Fig. 1. Asking a question in the notebook editor. When we click Visualize, we’ll see a table showing how many products in each category have a price above $50.

Read More

O

P

Parameter

A special type of variable that specifies an input to a query.

What is a parameter? A parameter is a special type of variable that specifies an input to a query. Setting a parameter lets end users input a value (like in a dashboard or report) to change what data that query returns, typically filtering by a measure or dimension. The parameter passes that value through to the query being run, and the results of that query will depend on whatever value that person entered. Parameter vs. variable vs. argument You may see “parameter” used interchangeably with “variable” or “argument,” so it’s worth pointing out some distinctions here. A parameter is a type of variable; it’s just one where some specific input value gets passed along to the program or query being run. Not all variables are parameters, though — you may also have variables that are set within your program or query and can’t be modified by anyone on the other side. An argument refers to the value itself that gets passed along when your program or query runs. For example, if you set a parameter as {{productID}}, and enter a value of 34, your argument is 34. A parameter defines that there will be an input value, but the input value itself is the argument. So yes, technically these terms all differ, but it’s okay to use them interchangeably, as long as you’re generally referring to a place or container to pass values into so you can filter results. Parameters in Metabase In Metabase, you can set a parameter using a filter widget or via a URL. Parameters come into play in Metabase in a few different ways: SQL templates By adding parameters to SQL queries in Metabase, you can create SQL templates that add filter widgets to those queries, allowing people to easily change that parameter’s value when they run that query. If we wanted to created a SQL template on a query that counted the number of customers in each state using the Sample Database’s People table, we’d use: SELECT count(*) FROM people WHERE state = {{State}} By wrapping {{State}} in double curly brackets, we’ve created a parameter that adds a filter widget to this question, letting people input the state they want without needing to change the text of the query itself, like in figure 1: Fig. 1. Creating a SQL template that adds a filter widget to a query. Dashboard filters Dashboard filters let you set parameters that get applied to a dashboard. For example, you can create a dashboard filter that lets people input a State value and link that that filter to the State column in the questions or cards on your dashboard. Then when people enter the value they want (like North Carolina, shown in figure 2), they’ll see those cards change accordingly. Fig. 2. A dashboard with a filter applied on the State column. When you enter a value into a dashboard filter, you’ll notice that the URL changes to include that value. Custom destinations You can also insert parameters into a URL to dictate what happens when people click on a chart in a dashboard. For example, you can set a custom destination by using values from the card’s results to construct a URL that directs people to another dashboard or external site with that ID as part of the URL. Maybe you have a dashboard containing questions that track how different products in your inventory have sold, and a dashboard filter that lets people input the product they want to check in on. You could take things a step further by passing that product’s ID onto a custom destination, parameterizing a URL with that ID value and sending people to that product’s page on your website. When you visit that site, its URL may look something like this: https://www.your-website.com/products/id?productID=34 In this case, that productID=34 in the URL is your parameter. Embedding When embedding Metabase questions and dashboards in your app, you can set parameters to customize what different users see when viewing those embeds.

Read More
Pivot table

A data visualization that summarizes rows and columns of a table and lets you rotate (pivot) the columns.

What is a pivot table? A pivot table is a data visualization tool that summarizes rows and columns of a table and lets you rotate (“pivot”) the columns to view those summaries in different ways. The summary rows are usually subtotals or grand totals, though they can also be other metrics like averages. This ability to rotate columns by 90 degrees, so that the values in that column become the columns themselves for the pivoted tables, can be really helpful when trying to analyze data across multiple dimensions, like time, location, and category. Example pivot table If we want to see how orders perform over the days of the week, broken out by different product categories, a pivot table is a great choice, as it’ll give us an easy-to-digest glimpse at a lot of numerical data. That pivot table may look something like this: Fig. 2. A pivot table containing information about how four product categories performed on different days of the week. It’s nice to have those summary rows that include totals for each day of the week, but it’s still not easy to compare category sales on different days. However, if we pivot the Created At column so that its values become the column headings, our result looks like this: Fig. 2. Our same pivot table, with the Created At column pivoted, giving us a better look at all of the data in our table. Now we can quickly compare orders across day and product category, while still seeing the totals for both, without having to scroll through a long list of rows.

Read More
Predicate

An expression that evaluates to either true or false, like quantity > 0. True and false values are known as Boolean values.

What is a predicate? In SQL, a predicate is a type of conditional expression that evaluates to either true or false, like quantity > 0. Including a predicate in your query narrows down your results by filtering out unwanted rows based on whether that expression returns true or false. Predicate expressions all contain some sort of comparison element, like =, >, or <. When evaluated, the resulting true and false values are known as boolean values, though not all databases support boolean values as a data type. Not all databases support the same list of predicates either, especially predicates beyond mathematical comparisons (like BETWEEN or ISNULL), so check out your database’s documentation to know for sure which predicates will work for your use case. Null values: not zero, just not there While predicates typically evaluate to one of two boolean values (like true or false), if the field being evaluated lacks a value entirely, it’s known as null. That doesn’t mean its value is zero, but that rather that is no value present in that field. If your predicate expression requires that quantity > 0, then a row without values will not return true or false, but rather will return null. Example predicate An example of a predicate is condition that follows WHERE in a simple SQL SELECT query, like so: SELECT * from orders WHERE subtotal > 35 In this case, our predicate expression is subtotal > 35. Each row in the Orders table has a value in the Subtotal field, and for each row, this predicate evaluates whether it’s true or false that the subtotal is greater than $35. From there, our query returns only those rows with a subtotal greater than $35. In Metabase’s query builder, you use predicates when filtering your data. You can also write your own predicates in the notebook editor using custom expressions. In the question below, we’re filtering the People table in the Sample Database to only show us records where the State field equals Montana, or state = MT: Fig. 1. A predicate expression (or filter) in Metabase's query builder that will return only records where the State field equals Montana (MT).

Read More

Q

Query builder

The graphical interface for asking questions in Metabase.

What is the query builder? In Metabase, the query builder is the graphical interface for asking questions. If you aren’t a SQL person or just prefer to analyze your data using buttons and dropdowns instead of code, the query builder’s got you covered. And if you aren’t exactly sure what you’re trying to figure out about that data, those buttons and dropdowns can give you some ideas, like listing options for the filters and groupings you can add to your starting table, model, or saved question. Asking questions with Metabase’s query builder There are a couple ways you can use the query builder to ask questions about your data: Construct a question on the fly by browsing your data and adding filters and summarizations using the sidebars to the right of your data visualization. Create your question from scratch using Metabase’s notebook editor. The notebook editor offers more flexibility for constructing a query: in addition to the regular filtering and summarizing options, you can use custom expressions to create more sophisticated custom filters and summarizations. You can also join tables and create custom columns in the notebook editor, previewing your results at each step before visualizing the final product. These paths aren’t mutually exclusive — you can start in the notebook editor, visualize your data, use the sidebars to tweak your question, return to the notebook editor to make additional changes, and so on. Example: using the query builder We’ll use the query builder to construct a question using Metabase’s Sample Database, starting with the notebook editor, then adding a second filter once we’ve visualized our data. Let’s say we want to know how our large orders (that is, orders with a Subtotal greater than $100) are broken out by Product → Category. Figure 1 shows how we’d construct this question in the notebook editor: Fig. 1. The query builder's notebook editor. Once we visualize that data, let’s add another filter so we’re only viewing full-price orders (orders where there was no discount applied). Figure 2 shows what our query builder looks like just before adding that second filter: Fig. 2. Adding a second filter while visualizing our data in the query builder.

Read More
Question

In Metabase, a question is a query, its results, and its visualization.

What is a question? In Metabase, a question is a query, its results, and its visualization. If you’re trying to figure something out about your data in Metabase, you’re probably either asking a question or viewing a question that someone else on your team created. In everyday usage, question is pretty much synonymous with query. What you can do with questions in Metabase You can ask questions in Metabase using the graphical query builder or the native query editor, and then do things like: Save your question to a collection so that you can come back to or build on it later. Add that question to relevant dashboards. Questions on a dashboard are known as cards. Set up email or Slack alerts on your question. Share the results of your question by sending links to people on your team — even to questions that you haven’t saved. Download the results of your question as CSV, XLSX, or JSON. Convert your saved question to a model. Example question Figure 1 shows a question based on Metabase’s Sample Database — the average rating of our company’s Products, broken out by Category. Here we’ve visualized this question as a bar chart: Fig. 1. An example question with one summarization, visualized as a bar chart. And figure 2 shows what this same question looks like as a table: Fig. 2. The same question, visualized as a table. Questions and the Metabase API In the Metabase API, you can edit and get information about questions in your Metabase instance using the api/card route.

Read More

R

Relational database

A collection of tabular data, or the application that manages the storage and retrieval of tabular data.

What is a relational database? A relational database is a collection of tabular data, or the application that manages the storage and retrieval of tabular data. Relational databases contain tables, made up of columns (also known as fields) and rows (also known as records). You’ll establish relationships between tables in a database by assigning a single field to two or more tables. For one of those tables, that field will be designated as an entity key, while for the other(s) it’ll be a foreign key. With these relationships in place, you can query data (probably using SQL) across tables without having to reorganize or duplicate that data. Introduced in the early 1970s, relational databases remain a (if not the) dominant model for structuring data today. While technically a relational database refers to your data itself and a relational database management system (RDBMS) refers to the software application you use to manage that data, in reality people use the terms interchangeably. The relational model is so prevalent that in many contexts, the word “database” itself implies a relational one, unless otherwise specified. Example relational database Metabase’s Sample Database (the one you see used in examples throughout our docs and tutorials) is an H2 relational database. Figure 1 shows a look at the four tables in the Sample Database: Fig. 1. Metabase's Sample Database (a relational database) contains four tables: Products, Orders, People, and Reviews.

Read More

S

SQL

A standardized and widely-used language for accessing and manipulating data in a relational database.

What is SQL? Structured query language (known as SQL) is a standardized and widely-used language for accessing and manipulating data in a relational databases. Using SQL involves writing and executing structured commands, known as statements, that communicate to a database what information you need or what you want to change. SQL is a published ANSI and ISO standard, meaning there are established rules about what exactly the language includes and how it works. However, SQL-based database systems (like PostgreSQL, MySQL, SQL Server, etc.) each have slightly different functions and their own syntactical quirks — no major databases conform 100% to the official written standard. Using SQL, you can: Create and configure databases, tables, and indexes Insert, update, and delete information in databases Retrieve information from databases (commonly known as querying) Set and adjust database permissions Is it pronounced “S.Q.L.” or “sequel”? Opinions are split on the matter of pronunciation, with some of those opinions very strongly held. When computer scientists Donald Chamberlin and Raymond Boyce first developed the language specification in the early 1970s, they called it “SEQUEL” (pronounced “sequel”), but changed the language’s name to SQL when faced with a trademark dispute. The ANSI and ISO standards dictate that the official pronunciation is the initialism (“S.Q.L.”), but both pronunciations are common today. So go with whatever sounds best to you — just don’t be surprised when someone disagrees with you. Querying databases with SQL No matter how advanced or complex, all SQL queries involve telling a database to return certain columns from a table (or tables), and then optionally specifying conditions about which rows should be included in those results and how they should be presented. SQL is case-insensitive, but you’ll often see people capitalizing reserved words (e.g., functions and clauses like SELECT, WHERE, HAVING, or ORDER BY). You can format your SQL statements as a single line if you like, though people will usually break their queries out onto separate lines for readability. Example SQL query Here’s a SQL query that asks Metabase’s Sample Database to return a table of orders where the order subtotal was greater than $100: SELECT * FROM orders WHERE subtotal > 100 We can break this query down into three statements: SELECT * tells the database to return every column in the table. FROM orders tells the database which table that is. WHERE subtotal > 100 tells the database to filter results and only return rows where the value in the Subtotal field is greater than 100. The example query above is a pretty simple one; more advanced queries can include joins, aggregations, CTEs, and other tools for pulling and organizing data. SQL in Metabase You don’t have to write SQL when asking questions in Metabase (that’s what the query builder is for), but if you prefer SQL queries, the native query editor is there for you, along with features like: SQL variables (including field filters) SQL snippets SQL snippet controls (available in some plans) And if you do opt for using the query builder to ask questions in Metabase, you can always view the underlying SQL that powers your question or convert it to a native SQL query.

Read More
SSO

An authentication (auth) setup that lets people use one login to access independent apps.

What is SSO? SSO is an authentication (auth) setup that lets people use one login to access independent apps. It’s a bit like using your passport to get into different countries. With SSO, you don’t need to use one login for each account, just as you don’t need to travel with different pieces of ID specific to each country. For example, you might have two different logins that are used with two different sign-in pages for your email and your online banking. If the IT teams at your email provider and bank each set up SSO, you would be able to use one sign-in page and login to access both websites. Fig. 1. You'll probably recognize this Sign in with Google prompt from other apps. This is an example of SSO set up with Google sign-in. How does SSO work? Since digital authentication can’t be done in person, your identity on the internet is checked by a service that asks you for proof of something you know (like a password), or proof of something you physically have (like sending a code to your phone). These pieces of proof are called identity factors. The most common way to authenticate is through a single identity factor, such as a password. If you add another factor, like a phone number, you get two-factor authentication (2FA). Naturally, you can continue adding factors, (an email, a verification app, etc.), to get multi-factor authentication (MFA). Instead of using information that you know or have, SSO uses an identity factor called an authentication token that belongs to the SSO provider. An authentication token is a unique, anonymous piece of information that’s generated when you sign in to the SSO provider. The token is temporarily stored in your browser (like a browser cookie) or on the provider’s servers, and is only valid for a period of time — usually until you close the browser, or within an expiry window set by your security team. When you go to an app that is set up with SSO, it’ll automatically ask for the authentication token from the SSO provider, instead of asking you to log in. If the token is still valid (you’ve signed in to the SSO provider in the same session, and it hasn’t expired), your identity is considered authenticated, and you’ll be allowed into the app. If the token has become invalid, you’ll be prompted to sign in with the SSO provider to create a fresh one. Where does SSO fit into the bigger picture? Authentication only deals with who you are (identity). From there, other services keep track of where you are allowed to go, and what you can do once you get there (access management). Identity and access management (IAM) is the umbrella term for these tools and processes. SSO and other parts of an IAM toolkit are often packaged up by identity providers (IdPs), such as Okta, Auth0, or OneLogin, and implemented as part of cloud security. SSO in Metabase Setting up SSO in Metabase means that people won’t need to create a separate Metabase username and password to access your organization’s data. They can simply log in through the same account as your chosen identity provider. The open-source edition of Metabase can be set up with Google SSO or LDAP. Paid editions of Metabase work with SAML and JWT standards (in addition to Google SSO and LDAP). SSO can also be combined with data sandboxing in Metabase paid plans to define the data that people can see and interact with, based on user attributes such as their department or role.

Read More
Schema

The design or structure that defines the organization of a dataset, including its tables, columns, relationships, data types, and integrity constraints.

What is a schema? A schema is the design or structure that defines the organization of a dataset: which columns are grouped into tables, how those tables relate to each other, and the rules and data types that define those columns. Schema is an overloaded term; it’s an abstract word that has accumulated a lot of different definitions, and as a result can be confusing to sort out. Depending on the context, schema can mean: The overall structure, specification, or “blueprints” of your database A diagram that demonstrates how tables in your database relate to each other A single collection of tables (among many) within your database Finally, schema sometimes means something specific to whichever database platform you’re using, like in Oracle, where schema refers to all objects in a database created by the same user. Schema as overall structure: design and implementation Once you’ve figured how your data fits together from a high-level standpoint (that is, your conceptual data model), the next step is creating a schema that reflects that data model, bringing it from the abstract to a database that your organization can use and populate with information. Broadly speaking, this process is made up of two major steps: Design: map out the structure of your database, creating an entity relationship diagram (ERD) in the process. Implementation: Use that ERD to generate the SQL commands that, when run in your database, will create your desired schema. What your schema design process looks like depends on whether you’re dealing with transactional or analytical databases, and whether you’re starting from scratch or have already begun collecting data. Regardless of at which point you’re designing schemas, you’ll have to think deeply about the needs of your organization and what questions you anticipate asking of your data. Schema-on-write vs. schema-on-read Most traditional relational databases use a schema-on-write system, where data gets verified and formatted into a schema before it’s written to that database. Since the data being written must conform to whatever specific data integrity rules you’ve established (like requiring that all values in a field be unique, not accepting null values in a field, or formatting dates a certain way), adding this new data to your database can be slow. However, the read times are fast, since that data has already been verified. In a schema-on-read system, data (like in a data lake) is only verified once it’s been read, or pulled from that database. Schema-on-read systems tend to be more flexible, as you can store unstructured data without worrying about it conforming to a rigid data model. In this case, writing data is faster (since that data doesn’t need to be verified as it gets loaded in), but queries take more time to execute. Whether you opt for a schema-on-write or schema-on-read strategy will depend on your organization’s needs and specific use cases. If having meticulously structured and consistent datasets is important to your organization, a schema-on-write system may be your best bet. By contrast, if you regularly need to pull in a wide variety of data without always knowing exactly what it that data looks like, you may want to use a schema-on-write system. Logical and physical schemas Regardless of whether you’re working with a schema-on-write or schema-on-read system, you’ll also need to think about database structure and its implementation — that is, your logical and physical schemas. Logical schemas define the structure of your data, while the actual implementation of that structure (like how and where you store the files and code that make up your database) belongs to a physical schema. Logical schemas Logical schemas are created by diagramming how tables and their fields relate to each other. In creating a logical schema, you’ll establish tables, relationships, fields, and views, answering questions like: What data are we collecting, or do we want to collect? What tables does your database (or individual schema within it) need? How do those tables relate to each other? What fields does does each table need? What are the data types for each of those fields? Which fields are required? Schema as diagram: mapping out entities and relationships In answering these and other questions, you’ll likely sketch out an entity relationship diagram (ERD) that defines each table, its fields, their integrity constraints, and the relationships between those tables, including the primary and foreign keys that establish those connections and whether those relationships between tables are one-to-one, one-to-many, or many-to-one. Visualizing your tables and how they relate to each other can also bring to light any major omissions or conflicts. And yes, sometimes you’ll see these diagrams themselves referred to as schemas. The image below shows an entity relationship diagram of a schema with two tables, PRODUCTS and MANUFACTURER. The “(PK)” and “(FK)” notations tell us which fields are primary and foreign keys, and the line linking these tables indicates a one-to-many relationship, in that one manufacturer can be linked to many products. Fig. 2. An entity relationship diagram of a schema with two tables. You can map out your schema on paper or using design software that can directly translate your diagram to the SQL commands that you’ll need to implement your database. At this point your schema is platform-agnostic; mapping out those rules and relationships doesn’t tie you to any single database software. Physical schemas Once you’ve identified the logical configuration of your database, you’ll create a physical schema to implement it into a specific RDBMS, defining where your database files will live as well as their storage allocation on a disk. Schema as one collection of tables among many While a single collection of tables may be sufficient if your database only sees a few users and contains data that everyone needs to access, you may find that a relying on a single schema in your database doesn’t cut it for your organization. If you’re handling data across a lot of tables (think in the dozens, hundreds, or thousands), grouping those tables into separate schemas will help from an organizational standpoint, making it so you can store similar information together while retaining the ability to query across schemas when necessary. Keeping multiple schemas within a database can be helpful from a security standpoint as well, like separating tables that hold sensitive information into a schema that only those who need to can access, usually in combination with views. Schema design for transactional vs. analytical databases When thinking about schemas for transactional databases (also known as operational databases), your data will need to be normalized to some extent and adhere to data integrity standards, since efficiency and performance for those small transactions and OLTP are crucial. Designing a schema for an analytical databases will look different. For starters, you’ve probably already collected raw data, possibly from multiple sources, and now need to impose some structure in order to analyze it. In this case, redundancy is okay, as analytical databases place greater emphasis on explorability and less on performance. Here your schema can also be more loosely defined, as no fixed patterns (like normalization) are needed. Schema design for analytical databases is more about understanding where data from your various sources lives, and knowing what tables you’ll need to join to answers questions you have. Star schema One common structure you’ll see applied to analytical databases is a star schema, which separates data into fact tables (that is, quantitative data) that relate to multiple dimension tables describing those facts. In a simple implementation of a star schema, several dimension tables all surround and relate to a single fact table, looking like a star in diagram form with the fact table at its center, like so: Fig. 2. An entity relationship diagram of a simple star schema. Tables within a star schema are typically denormalized, which leads to better performance for analytical queries. Creating a database schema Most database platforms (such as Redshift and PostgreSQL) use “schema” to mean the configuration of a dataset and non-nested groupings of tables and other named objects within that dataset, though Oracle defines schema as all of the objects created by and belonging to a single database user. To create a schema within your RDBMS, use the query CREATE SCHEMA, like in this example, where we create a schema with two tables that are linked by the customer_id field: CREATE SCHEMA new_schema; CREATE TABLE new_schema.orders ( order_id product_id customer_id subtotal order_date ) CREATE TABLE new_schema.customers ( customer_id customer_name customer_address customer_email ); This is a very simple schema; we didn’t specify data types or any other constraints on the fields in our tables. If we wanted to require the customer_id field in the customers table and indicate that its data type is an integer, we’d format that field like this: customer_id INT NOT NULL Note that in MySQL, CREATE SCHEMA is synonymous with CREATE DATABASE.

Read More
Serialization

The process of storing the state of an application or other object as a file so that it can be reloaded (or deserialized) later.

What is serialization? Serialization is the process of storing the state of an application or other object as a file, so that it can be reloaded (or deserialized) later. In the serialization process, that complex object gets converted to a stream of data, one that’s much easier to transfer over a network or copy to a new instance. Think of serialization like taking a snapshot of some application, but with the power to turn that flattened snapshot back into a live object whenever you’re ready. Serialization comes in handy if you need to manage multiple instances of an app, like development or staging instances in addition to the ones you use for production. Serialization can also be used to customize an initial state of an application, like preloading a Metabase instance with connection details to a database, predefined groups, dashboards, and so on. Serialization in Metabase ⚠️ Serialization This feature is only available on Pro and Enterprise plans (both self-hosted and on Metabase Cloud). Metabase’s serialization feature (available in some plans) lets you capture and export the contents of your Metabase so that you can reload them into another instance — or several. Those Metabase artifacts you capture when serializing are collectively known as a data dump, containing YAML files that hold your collections, dashboards, saved questions, and some system settings — check the documentation for the full list.

Read More
Summary table

The result of an aggregation that gets saved in a database or data warehouse so that people can work with those precomputed metrics.

What is a summary table? A summary table is the result of an aggregation that gets saved in a database or data warehouse so that people can work with those precomputed metrics. The term “summary table” can get confusing, since some people use “summary table” to describe any result of an aggregate function, like the table that you get after filtering and grouping by some measures and dimensions. By this definition, a summary table is basically the same thing as a pivot table, minus the pivoting. The difference here comes down to whether or not those tables get saved within your data warehouse. Creating summary tables in your data warehouse can make it easier for people to generate reports without having to query raw data. In this sense, summary tables function a lot like materialized views (which don’t necessarily aggregate data). Example: summary tables in a data warehouse For example, maybe you’re working with an analytical database that uses a star schema setup, with a fact table containing tens of thousands of individual orders records, surrounded by dimension tables that describe those orders. If someone at your organization wants to generate a weekly report containing sales data by product category from the last seven days, calculating that from your raw fact and dimension tables every time will be inefficient and costly. Instead, creating a summary table lets you join those tables and aggregate that data far less often. Then in the future when someone creates that report, they can do so using the summary table as a base, rather than needing to calculate those numbers every time from scratch. While there’s some maintenance associated with summary tables (like making sure your data refreshes on a schedule or adjusting the filters and groupings if they aren’t exactly what people need), they still tend to be a highly efficient way of working with large datasets.

Read More

T

V

View

A query and its results that function like a virtual table in your database.

What is a view? A view is a query and its results that function like a virtual table in your database. Databases compute views on demand, meaning they aren’t precomputed or materialized, and for that reason don’t occupy any storage space in the database. You can think of views as virtual or logical tables. Database views let you combine information from multiple tables and format that information however best makes sense for the people who need to query it. You (or a database administrator) can create a view that hides unnecessary fields in a cluttered table or joins tables to bring together relevant data. By using the view as the starting point, people won’t need to run the same complex queries every time just to get to their actual question about the data. The downside to querying views is that those queries can be time-consuming to run, especially if that view is the result of several tables or multiple joins. Database administrators also use views for security purposes, like creating views that hide certain fields that exist in the base table. That way, other users can still access and query the data they need without gaining access to sensitive fields or rows. View vs. materialized view If views are virtual tables (computed as needed), then materialized views are like regular tables in a database. While views require that a query be rerun every time that view is referenced, a materialized view is a precomputed view that’s saved in the database. So, materialized views take up space in your database, but since the database doesn’t have to compute materialized views each time, they perform much faster when querying than standard database views (it’s like querying a normal table). When you should (and shouldn’t) use database views Creating views in your database is a good idea if: You need to access the results of a complex query on a regular basis and don’t want to type that query out every time. You’re looking to strengthen database security by restricting access to sensitive information. You want to create custom columns without altering the underlying structure of your database. You’d like to simplify the appearance of your tables by hiding fields that are unlikely to be queried. However, if the underlying structure of your database is subject to change, you probably don’t want to rely on views; the moment a field name changes, the query you’ve established as a view may break. Your BI tool probably has features that function sort of like views too, whether it’s a model, saved question, or SQL snippet. The important distinction here is that those are all features that exist with the world of the BI tool, whereas views (materialized or not) are built into your database itself. Example view Using Metabase’s Sample Database, let’s say we wanted to create a view based on the People table that our team in Pennsylvania could use to access information like the names, addresses, birthdays, and emails of our PA-based customers, but not user passwords. We’d create that view in our database by running query shown below, which creates the view, names it pennsylvania_customers, includes only the columns we want from the People table, and only displays records where the value in the State field is the abbreviation for Pennsylvania (PA). CREATE VIEW pennsylvania_customers AS SELECT id address email name city state birth_date zip created_at FROM people WHERE state = 'PA' Then for future queries, our team in Pennsylvania could access the information they need about their customer base by querying pennsylvania_customers as their starting point. While views are a pretty fundamental feature of any SQL-based database or data warehouse, the specifics of creating, materializing, and maintaining them may differ depending on which database software or data warehouse you use.

Read More

W

X