Data Modeling & Database Design for Developers
The classic standards of database design and normalization are a “gold standard” that many developers struggle with because they:
- Don’t know what the standards are
- Don’t want to apply them to their designs because they seem old-fashioned
- Are handed an application that has an already completed but inadequate design
I recently teamed up with Louis Davidson (@drsql), a renowned Microsoft MVP and data architect specializing in relational database design with SQL Server, for the SentryOne webinar, “Data Modeling & Database Design for Developers.”
Louis Davidson and Kevin Kline
During the webinar, Louis shared an overview of how to design a relational database—allowing your data structures to ease your job instead of making it harder—and how to use SQL Server naturally so that the query engine can better optimize your SQL code without you needing to spend much time thinking about it.
If you missed the webinar, you can view it here. During our discussion we received a lot of great questions from attendees, and we’ve captured those questions and their answers below.
Q: What kind of differences are there between the logical and the physical code?
Kline: I assume you mean differences between the logical and physical models. The logical model explains the database design without going all the way down to the level of detail used to specify its physical format or database platform specifics. As such, a logical model might be a full Entity-Relationship Diagram (ERD) but specify only generalized data types along with attributes like NULL/NOT NULL and check constraints. The physical model is likely to go all the way to code, using platform-specific syntax, such that a physical model for Oracle would not be able to run on a SQL Server instance and vice versa.
The physical model will also, of course, have all the details about where the database sits on the file system, the number-location-size of database files and log files, and other characteristics of databases that are specific to the given instance on which you are deploying. Another example might be how each data platform implements a monotonically increasing primary key, for example using an IDENTITY on SQL Server but using a SEQUENCE on Oracle.
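As a concrete sketch (table, column, and constraint names here are hypothetical), the same auto-incrementing key looks different on each platform:

```sql
-- SQL Server: IDENTITY property declared on the column
CREATE TABLE dbo.Customer
(
    CustomerId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKCustomer PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL
);

-- Oracle: a SEQUENCE object (12c and later can also declare
-- the column GENERATED AS IDENTITY)
CREATE SEQUENCE customer_seq START WITH 1 INCREMENT BY 1;

CREATE TABLE customer
(
    customer_id   NUMBER DEFAULT customer_seq.NEXTVAL PRIMARY KEY,
    customer_name VARCHAR2(100) NOT NULL
);
```

Neither script runs on the other platform, which is exactly why the physical model is platform-specific.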
Q: What is a release candidate? I know what it sounds like but would like to be sure.
Davidson: A release candidate (RC) is when a company puts out a release that is “feature complete.” It’s better than a beta release, but it’s still pre-release. With an RC, the company isn’t going to add features anymore, and they want people to test it out and find any straggler bugs that need to be fixed. Usually, RCs are nearly 100% complete, except for work to improve speed and performance of UIs and perhaps to strip out any additional debugging/tracing code used during the alpha and beta stages.
Q: Do you recommend that development and test environments be as robust as production? That way, you can identify real performance issues vs. the development environment not having the same resources found in production.
Davidson: If possible, have at least one test environment that can be used to make sure your code works fast enough before actual production. However, that may not be possible depending on size or load. Don’t build your code, run your tests, then ship it to production on hardware that is 100 times faster than what you’ve tested on, and then get 10,000 times more load than you’ve tested with. The more you can simulate and control, the better results you’ll likely have.
Kline: Usually, a dev/test environment that is comparable to production is known as a User Acceptance Testing (UAT) environment. It is generally considered essential when the software or application in question has a degree of legal liability associated with it. But as Louis says, it’s “hella expensive.”
Q: What are some rookie mistakes that people regret later with their database design?
Davidson: Whoa, that is a big question. Most of the answers are related to what I shared in the webinar: not following what the requirements state.
For example, let’s say the requirements are to store a list of names and unique email addresses. We could build the database and ignore things like uniqueness or allow the customer to store something other than email addresses in the email address column.
Going the other direction is an issue too. We could over-engineer the solution by using five tables: one for the email address domain and the details about the company that hosts the email, one for each part of a person's name (and a person may have unlimited names if you want to support it, and sometimes you may), then add in the mailing address, phone number, height, company they work for, the companies that that company owns, etc.
The bottom line is if you deliver more or less than the customer asked for, you haven't done a great job.
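A minimal sketch of the “just enough” design for that names-and-unique-emails requirement (the names and the format check below are illustrative, not from the webinar):

```sql
CREATE TABLE dbo.Person
(
    PersonId     int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKPerson PRIMARY KEY,
    FullName     nvarchar(100) NOT NULL,
    EmailAddress nvarchar(254) NOT NULL
        -- the requirement said unique email addresses, so enforce it
        CONSTRAINT AKPerson_EmailAddress UNIQUE,
    -- a deliberately coarse pattern check; real validation rules vary
    CONSTRAINT CHKPerson_EmailAddress
        CHECK (EmailAddress LIKE '%_@_%._%')
);
```

It enforces exactly what was asked: uniqueness, plus a basic sanity check that the column holds something shaped like an email address, without the five-table over-engineering described above.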
Q: Most people inherit a database or design that doesn't follow best practices. Do you have references you recommend for creating a migration plan that follows better design practices?
Davidson: If you follow a lousy standard, it’s better than making one up or following none at all. Also, it is harder to follow multiple patterns than it is to follow one standard.
If possible, consider incremental changes such as making physical tables that meet a better standard, and implement views that encapsulate your objects to look like the old systems. At PASS Summit, I learned a bit more about microservices. They allow you to have one database used by one application ideally, so if you do have poor practices, it only affects the one system and fixing things over time is (theoretically) more manageable.
Kline: Your question is a bit vague. I’m not sure if you’re asking for best practices about the migration itself or about the database design. I’ll postulate that you mean, “We’re going to migrate a database, and this might be a good time for us to refactor the design a bit to implement some more best practices.” If that’s true, there are a couple of things that I really, really, really encourage you to do. One, standardize your object names. Two, make sure data types are consistent for columns (in tables and views) as well as parameters and expressions (in procs, triggers, and functions) to minimize the risk of implicit conversions—a ninja stealth performance killer in SQL Server apps. Finally, I encourage you to make sure all tables have a clustered index, a primary key (possibly the same as the clustered index), and fully declare all foreign keys and create non-clustered indexes on them. The performance will improve.
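The foreign key point in particular deserves a sketch, since SQL Server does not automatically index foreign key columns (table and index names below are hypothetical):

```sql
-- Declare the relationship explicitly...
ALTER TABLE dbo.SalesOrder
    ADD CONSTRAINT FKSalesOrder_Customer
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId);

-- ...and index the FK column yourself; SQL Server will not do it for you.
-- This helps joins to Customer and speeds up delete checks on the parent.
CREATE NONCLUSTERED INDEX IXSalesOrder_CustomerId
    ON dbo.SalesOrder (CustomerId);
```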
Q: Is it ok to abbreviate table and column names?
Davidson: It’s ok to do whatever you work out with your team. Build a standard that works for you and stick with it. That is the most important thing.
Just know that you need to be careful with ambiguous abbreviations: for example, table names like PO and POCustomer. If one means Purchase Order and the other means angry customers, then you have a problem. The less you abbreviate, the easier it is to avoid such issues. Even so, naming is one of the hardest parts of any design job, because the names are documentation and will probably be seen by your customers at some point.
Kline: Abbreviations were very important before 32-bit and 64-bit computing, back when every bit of space was at a premium. That’s not the case anymore, so why be stingy, especially with a new database design? On the other hand, I’m ok with abbreviations if they are universal, sensible, and very consistent.
Q: Most often, the data warehouse database design defines data types for every column as varchar or nvarchar. When I had an opportunity, I redesigned the databases and stuck to basic standards, such as storing dates in date columns and integers in int columns, as opposed to storing them in varchar. What is a good practice for data warehouse database design when compared with OLTP database design?
Kline: I have not heard of that practice before. I agree that your approach is better for routine data types like dates and integers, at least for SQL Server. This is probably less of an issue on other data warehouse platforms. But on SQL Server, you'll see many improvements from choosing data types that are specific and appropriate for the data stored within.
Q: Do you have any specific recommendations for designing OLTP vs. OLAP databases?
Kline: They can be quite different. In a broad statement, OLTP should be optimized for a multitude of small, singleton transactions while OLAP should be optimized for long, serial reads. But it's a bit too much for a single response in our Q&A here. There’s also a SentryOne blog post called, "Slides & Video: Tales from Building a SQL Server Data Warehouse in Azure" that points to the video and has accompanying slides available for download.
Q: With your design, could a message be sent at 9:57, and the same message be sent at 10:08 on one day?
Davidson: That was the requirement: no duplicate messages in the same clock hour, not within an hour of each other. It is a lot easier to implement, and the case the customer almost always wants to solve is not sending the same message at 1:00:01, 1:00:02, 1:00:03, 1:00:04, 1:00:05, and so on. If you were to send a very similar message once an hour, that would seem like a deliberate choice. This was just an easy way to make sure it worked, and changing the rule to longer than an hour would be easy too.
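One way to enforce that rule declaratively (a sketch with hypothetical names, not the exact webinar code) is to derive the hour a message was sent in and put a unique index over it:

```sql
CREATE TABLE dbo.MessageLog
(
    MessageLogId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKMessageLog PRIMARY KEY,
    RecipientId  int NOT NULL,
    MessageText  nvarchar(200) NOT NULL,
    SentTime     datetime2(0) NOT NULL,
    -- derive the calendar date and hour-of-day the message was sent
    SentDate AS CAST(SentTime AS date) PERSISTED,
    SentHour AS DATEPART(HOUR, SentTime) PERSISTED
);

-- No duplicate message to the same recipient within the same clock hour
CREATE UNIQUE INDEX AKMessageLog_NoDuplicatePerHour
    ON dbo.MessageLog (RecipientId, MessageText, SentDate, SentHour);
```

Under this scheme, 9:57 and 10:08 fall in different hours, so both sends succeed, while a second identical send at 9:59 would violate the unique index.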
Q: I learned about surrogate keys when I studied Kimball's data warehousing books. Would surrogate keys ever be used in an OLTP design?
Kline: Sure. They are found quite frequently in OLTP designs. Much of this discussion is based on the best results from trying multiple alternative design techniques, but some of it comes down to personal preference. Surrogate keys are usually easy to spot in an OLTP design because they carry the term “ID.” For example, in the demonstration database AdventureWorks, there is a column (and key) called BusinessEntityID. If you want to start an argument, just ask a group of DBAs, “Which is better, surrogate keys or natural keys?” It’ll turn into bedlam quickly.
Q: Is it the PK or the Clustering Key that gets copied into all NC indexes? Scenario: PK is a natural key but backed by NC index.
Davidson: The clustering key. Usually that’s the PK, but not always. However, be careful if you are not using the clustered index for the PK. In OLTP systems, it commonly turns out that while you logically expect to fetch rows by a natural set of values, internally the PK is what’s used to fetch the rows.
In one system we worked on, we discussed changing the clustering key on the invoice line item table to the invoice ID. Logically, you feel like you fetch the rows by invoice, not one at a time. That was true for the one fetch, but the rows were accessed many more times by the surrogate key as processing occurred.
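The scenario in the question, a natural-key PK backed by a nonclustered index with the clustering key elsewhere, looks something like this (hypothetical names):

```sql
CREATE TABLE dbo.Invoice
(
    InvoiceId     int IDENTITY(1,1) NOT NULL,
    -- the natural key is the PK, but enforced with a nonclustered index
    InvoiceNumber char(10) NOT NULL
        CONSTRAINT PKInvoice PRIMARY KEY NONCLUSTERED,
    InvoiceDate   date NOT NULL
);

-- the clustering key is what gets carried into every nonclustered index
CREATE UNIQUE CLUSTERED INDEX CIXInvoice_InvoiceId
    ON dbo.Invoice (InvoiceId);
```

Here, every nonclustered index on dbo.Invoice, including the one enforcing PKInvoice, carries InvoiceId as its row locator, not InvoiceNumber.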
Q: Do you have a recommended modeling tool?
Kline: I previously used the big, expensive tools quite a lot: ERwin and ERStudio. But now I use inexpensive or free modeling tools like Model Magic and SQLDM.com.
Davidson: I agree with Kevin. ERwin and Embarcadero are fantastic and have many features, but SQLDM.com has become a favorite because it is a solid portable tool.
Q: So, what's the benefit of using nvarchar? This webinar makes it sound like you're only asking for trouble by using it, because you don't know how many bytes the incoming characters will take.
Davidson: Not knowing how many bytes the incoming characters will take is always the case for nvarchar, and, as of the UTF-8 collations in SQL Server 2019, for char and varchar as well. The fact is that for most usage, the data can be larger than you expect. This mostly matters if you are building tools, or if you are using any of the special collations that allow supplementary characters. The point is to write your check constraints to limit the data to the size and contents you need it to be.
Kline: In my case, my shorthand rule for using nvarchar is if the database needs to accept multiple languages. However, once you use it in your design, never go back to using plain old char or varchar, not even (or perhaps especially) in code like procs and functions. That’s because char and varchar will throw implicit conversion issues when compared to nchar and nvarchar.
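A sketch of the mismatch Kevin describes (names are hypothetical): comparing a varchar column to an nvarchar parameter forces SQL Server to convert the column side, because nvarchar has higher data type precedence, and that can prevent an efficient index seek, especially under SQL collations:

```sql
-- Person.LastName is varchar(50) in this hypothetical table
DECLARE @LastName nvarchar(50) = N'Davidson';

SELECT PersonId
FROM dbo.Person
WHERE LastName = @LastName;       -- implicit CONVERT on the indexed column

-- Fix: keep the parameter's type identical to the column's type
DECLARE @LastNameTyped varchar(50) = 'Davidson';

SELECT PersonId
FROM dbo.Person
WHERE LastName = @LastNameTyped;  -- straightforward index seek
```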
Q: Can you recommend database design books for OLTP and data warehouse design?
Davidson: Well, I would suggest my book, “Pro SQL Server Relational Database Design and Implementation,” or Michael Hernandez’s book, “Database Design for Mere Mortals,” for OLTP if you want practical information. If you are looking for more academic detail, check out resources by E.F. Codd or C.J. Date. For data warehouse books, I suggest those by Ralph Kimball and his team, which you can find here.
Q: Unfortunately, what I regularly hear from potential clients and other developers is that it’s not worth the effort to design a proper relational (OLTP) model upfront. The attitude seems to be, “We don’t care how the data is captured, let the developer store it however they desire, the OLTP structure is not important, we will ‘fix it’ on the backend (OLAP or BI).” I’m curious what you are hearing. Why is this happening, and what do you tell them to counter?
Davidson: There is so much wrong and right about this at the same time. The crux of this problem is not knowing WHAT a proper relational model is to start with. It is (as I have hopefully covered somewhat well enough in this webinar series) not difficult.
I spend a lot of my time fixing data in the backend that other developers have built (often from third-party systems). Users tend to use it however they want, and the poorer the source is, the slower it is to fix.
At the very same time, it isn't always worth it to spend tons of time designing the "perfect" system, and getting business going is very important for sure. But the problem generally stems from a lack of understanding WHAT a good design is and WHY we want to do it. We need a reasonable balance between perfection and time spent.
Finally, one last note: I asked how many people attending this webinar are still using SQL Server 2008, and someone said they were still using 2005. People typically find themselves in this situation because it’s not easy to change your data structures, and we can’t just replace them when it feels right. No matter what you build, the data is going to be the hardest thing to replace.
Kline: In my opinion, people with that mindset are cheerleading for a NoSQL data platform, like MongoDB, Cassandra, or Cosmos DB. Those data platforms all support the idea of “schema on data extraction” rather than the model used by RDBMSes of “schema on data design.” In the case of the former, you basically just throw all the data you might want into JSON records that you stuff into the data store using a key-value pair. How do you counter this argument? Simply enough, wide and long-term experience has shown that this strategy rarely pans out in terms of performance. You can take the strategy they suggest, quickly build the app, and then spend months or even years on the backend making it perform well. However, I’ve personally seen dozens of cases where it NEVER performed well enough and had to be repatriated to an RDBMS using standard data modeling techniques. Food for thought, eh?
Q: You mentioned the downside of using large Natural Composite Keys, but for the indexes that cover queries, looking up records seem to work very efficiently. What recommendation do you have for deciding when to use surrogate keys or not?
Davidson: There is no problem with using large natural composite keys in general, but using them as your clustering key can be problematic. Every nonclustered index carries the entire clustering key as additional key columns. So, if you have a 30-byte key for your clustered index and then 10 nonclustered indexes, each row will take up at least 330 bytes just for clustering keys. Most of the time, a surrogate primary key of around 4 bytes will be more efficient, and if it is sequential, more efficient still. Hence, the general performance suggestion of a sequential integer surrogate key is a common starting point. Just always keep in mind that this sequential integer value doesn’t really have a meaning and should never acquire one. It is simply there to stand in for the natural key for performance reasons.
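That arithmetic, plus a common compromise, sketched with hypothetical names: cluster on a small surrogate, but keep the natural key enforced with a unique constraint:

```sql
-- 30-byte natural clustering key, 10 nonclustered indexes:
--   11 copies x 30 bytes = 330 bytes of key material per row
-- 4-byte int surrogate clustering key:
--   11 copies x 4 bytes  =  44 bytes per row

CREATE TABLE dbo.InvoiceLineItem
(
    InvoiceLineItemId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKInvoiceLineItem PRIMARY KEY CLUSTERED,
    InvoiceId  int NOT NULL,
    LineNumber smallint NOT NULL,
    -- the natural key still exists and is still enforced;
    -- it just isn't the clustering key
    CONSTRAINT AKInvoiceLineItem UNIQUE (InvoiceId, LineNumber)
);
```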
Kline: A friend of mine once told me about how their team did a major refactoring of a database where they switched from natural, composite keys to surrogate keys for their clustered index. They kept the composite keys for covering queries and to reduce key lookups, if I remember correctly. They ultimately saved about 200 GB on a 1 TB database and reduced their IOPS by about 20%. It was a dramatic improvement and well worth their time.
The Data Modeling & Database Design for Developers webinar covered a lot of great information. Here are some additional resources for further best practices.
- Slides & Video - Tales from Building a SQL Server DW in Azure—This SentryOne webinar recap blog shares a lot of great database design resources and information.
- Architectural Choices That Affect Performance—This SentryOne webinar covers a variety of database design topics, particularly why design impacts performance on Microsoft SQL Server.
- Pro SQL Server Relational Database Design and Implementation—This is a book co-authored by Louis Davidson that shares the effective and scalable database design techniques in a SQL Server 2016 and higher environment.
Kevin (@kekline) serves as Principal Program Manager at SentryOne. He is a founder and former president of PASS and the author of popular IT books like SQL in a Nutshell. Kevin is a renowned database expert, software industry veteran, Microsoft SQL Server MVP, and long-time blogger at SentryOne. As a noted leader in the SQL Server community, Kevin blogs about Microsoft Data Platform features and best practices, SQL Server trends, and professional development for data professionals.