Data Dictionaries Explained
Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.
It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even extremely populated data dictionaries are more useful than nothing at all). Trust me on this.
What Are Data Dictionaries?
Data dictionaries are essentially documentation about the data itself. This documentation includes data elements, such as names, definitions, or various attributes about the data. In simplistic terms, data dictionaries contain metadata: data about the data.
Data dictionaries don’t include the data values themselves, although the data dictionary may define the range of known values for some fixed data sets, such as ZIP codes. The data dictionary helps define the expected range of values and their meaning.
For example, let’s look at a data column of “FirstName” with the value of “Jane.” A data dictionary would not reflect the value of “Jane” but facets of value “Jane.” The value is a string data type with a length of four characters. The data dictionary might look at actual values or use the underlying definition of the storage mechanism holding the value. This choice depends on the organization’s needs.
Data attributes can include several descriptions, including but not limited to the following:
- What data type is it?
- How big is the data?
- How is the data stored?
- When was the data created?
- Can the value be null?
- How does the data relate to other data?
- What is the description of the data?
Figure 1, taken from the AdventureWorks2017 sample database in SQL Server, illustrates a table comprised of various columns. Each column has different names, data types, lengths, and whether the values can be null. In most respects, the database is a self-documenting data dictionary because objects are created with certain attributes, such as name, data type, or length of the column, if applicable.
Though this data dictionary exists within the database itself, if you have a database with many tables and many columns, it can be cumbersome to meaningfully utilize the information. Many organizations extract the data elements into a central repository, which provides a more holistic view of their data landscape.
An extended data dictionary is a specific type of data dictionary built to hold additional metadata beyond the basics. Some examples of this metadata include column updatability, minimum and maximum value, and expected growth rate of the range of data in the column. Having this additional data can improve the development process by allowing developers to make quicker and easier decisions.
Data dictionaries can come in the form of text documents, spreadsheets, and data modeling tools, or they can derive from metadata stored within the source database.
Active vs. Passive Data Dictionaries
Although the data contained within a dictionary varies from organization to organization, there are two essential types of data dictionaries: active and passive. Both types can help provide a more concise view of the data and its associated attributes across the entire enterprise.
Active Data Dictionary Definition
An active data dictionary operates within the context of a particular database. Such dictionaries serve two purposes: a physical definition of the object, which stores the data, and a built-in data dictionary. As the column definitions evolve due to organizational requirements, the data dictionary is updated automatically. An example is the effect on the data dictionary when the following T-SQL statement executes against the table shown in Figure 1:
ALTER TABLE HumanResources.Employee ALTER COLUMN JobTitle NVARCHAR(100)
This command expands the maximum number of characters stored by the JobTitle column from 50 to 100. Once this statement is executed, the data dictionary is automatically updated to reflect the change in the column.
Passive Data Dictionary Definition
A passive data dictionary functions as a central repository for metadata across many databases and tables. Most often, an independently running process synchronizes the data elements of the tables and columns with the dictionary at regular intervals to ensure the document is up-to-date.
A passive data dictionary can be shared more easily among groups within an organization than an active data dictionary. This expanded access allows cross-functional groups, such as data governance groups, to use the information when making decisions about the data and the direction of the organization.
Extended data dictionaries, which retain information about the data not indicated within the active data dictionary, are a form of passive data dictionary. You must store these attributes outside of the physical definition of the database because the relational database management system (RDBMS) can’t indicate them.
Data Dictionary Benefits
Data dictionaries can provide structure to application development and support the implementation and enforcement of data standards across an organization.
Data dictionaries can also offer a means of communication in simple terms. Individuals who don’t know how to look at the structure of the database—or don’t have the means to do so—can consult the data dictionary to explore the elements of the data more easily.
Data dictionaries can also help drive data standards. Data standards play an important part in any enterprise application landscape, ensuring data across multiple systems remains consistent. If an application treats a data value differently than another application, it can become confusing for end users or downstream consumers of the data. If an organization has data standards, it can use the data dictionary to validate whether the data is being maintained properly and accurately. This reassurance can be vital as organizations grow in both employee count and data volume. The more data an organization has, the more crucial data accountability becomes.
For example, suppose an organization has decided all data columns designating an address share these specific attributes:
- A string data type
- A maximum length of 150 characters
- The label “Address1”
Via applications, such as Azure Data Studio or using a script to return the attributes of any given column, organizations can more easily verify these data attributes. Organizations may also choose to track additional data attributes in the same manner to support data integrity efforts. In my experience, knowing what to include within a data dictionary is more of an art than a science, and it’s fine-tuned over time. Data professionals must understand what attributes are required to be included in the dictionary to ensure it applies to the organization. Attributes related to the business, such as owner, consumer, expiration dates, and so on, are all elements capable of being effectively customized for a given application or business requirement.
What Are Data Dictionaries Used For?
Data and system administrators can analyze data dictionaries to extrapolate future storage consumption. For example, suppose a system of record contains data of a certain type, size, and quantity. When supplied with an approximate growth rate (e.g., sales transactions are expected to increase by 25%), you can calculate the amount of disk storage required.
Some data dictionaries can help you better understand the consumers of the data. For example, take an organization built to provide a service for the clients of its clients. The clients of the clients consume data the parent organization provides. A data dictionary—most likely an extended data dictionary type—could contain an appropriate field to track entities consuming the data, how the data is being transported, how often it’s being consumed, and any other appropriate information specific to downstream consumers.
A data dictionary can also serve as a gateway between technologists and business owners to ensure both sides of the equation understand terms and definitions, which can help promote a symbiotic relationship between the two faces of the organization. If the two sides can’t agree on common terminology and concepts, forward progress could be in jeopardy. I’ve seen this happen, especially in larger organizations where business requirements continuously evolve and teams become siloed. As the teams diverge, their unique definitions of the data diverge at the same time. By injecting properly documented data dictionaries, teams can fix the issue and bring all players back to the table with the same definition of the data across the board.
Why Are Data Dictionaries Important?
In today’s world, where raw and processed data is exploding in volume, data dictionaries help data professionals understand at a higher level how the various data attributes are deployed and subsequently used. Data dictionaries will emerge as a more critical part of the data landscape as organizations grow and evolve. After all, the data is the business. Without understanding key facets of the data, the business won’t know what direction to go.
When looking at creating and maintaining data dictionaries for your organization, look at SolarWinds® Database Mapper. This tool can help you organize and streamline the creation and ongoing maintenance of the data dictionary as you move down the path of expanding the data landscape. You can read this detailed guide on how to use the data dictionary in SolarWinds Database Mapper. Database Mapper is also available to try free for 14 days by downloading a free trial of the cloud version or free software trial.
John is a Principal Consultant with Denny Cherry & Associates Consulting holding Microsoft Data Platform MVP and VMware vExpert awards. He specializes in deploying SQL Server related solutions to solve business needs for organizations.