Database Schema Versioning
Storing the source code of an application in a version control system is nothing controversial. It is a universally accepted practice. There is a small detail, however, which makes the version control of applications difficult. I am talking about the database schema. If the application has to manage state, doing it in a database is a very common approach. We must consider the database schema as part of the source code of the application, and as a consequence, it must be stored with the rest of the source code.
Because it is difficult, I’ve seen a few common themes in the wild. Either people don’t do it at all, or they do it just for the initial version of the application, or they tuck it away to a separate repository, and it starts living its own life, disconnected from the application code. Needless to say, none of the above hits the mark for effective version control.
In this article I talk about relational databases. Most of my experience has been with Oracle, but the same principles apply to other relational databases.
Challenges
What is the problem then? The naïve approach of simply storing the DDL scripts defining the desired schema configuration does not work in all situations. It works pretty well when you are in the greenfield phase of the project. You can start with an empty schema, apply the DDL script, load some test data, and you are ready to test or demo. You can do it either in a real database or in an in-memory mock.
Once the project transitions to the brownfield phase, i.e. when you have to deal with deployed versions of the application, the state gets in the way: you must preserve it across production deployments of new versions. A typical suggestion is to anticipate this and not to use DDL scripts to define the target configuration of the schema, but rather use schema migration scripts, growing the schema over time.
This solves the state preservation problem, but you lose the ability to reason about the desired target configuration of the schema. You can see all the changes as they happen, but how should the schema objects look in the end?
The reasonable question to ask at this point is: why not both?
Approach
Why yes, let’s use both. And when I say “both” I don’t mean using DDL scripts in the greenfield phase and then switching to schema migration scripts in the brownfield phase. What I mean is to use DDL scripts to define how the schema should look and then derive the migration scripts from the DDL scripts when we need them.
Object Types
Before we look into the migration script derivation, let’s note that not all database objects are created equal. This will help us to narrow down the scope of the script derivation.
In relation to the state, all database objects can be divided into three categories: the ones that store state, the ones that don’t, and the ones that store state which can be derived from the data stored in other objects.
We need to be very careful with the objects that store state. For such objects we have to derive the migration scripts and to make sure the scripts do not cause data loss. The objects that don’t store state can be simply dropped and recreated. Finally, the objects with derived state can also be dropped and recreated, but if these operations are more expensive than migration, we can opt for migration. The table below summarizes the object types.
| Object Type | Stores State | Approach |
|---|---|---|
| Tables | Yes | Migrate |
| Sequences | Yes¹ | Migrate |
| Indexes and constraints backed by indexes | Derived | Drop and recreate² |
| Views | No | Drop and recreate |
| Synonyms | No | Drop and recreate |
| Database links | No | Drop and recreate |
| Materialized views | Derived | Drop and recreate² |
| Referential integrity constraints | No | Drop and recreate |
| Stand-alone check constraints | No | Drop and recreate |
| Triggers | No | Drop and recreate |
| Stored code | No | Drop and recreate |
| Grants | No | Drop and recreate |
Looking at the table, we can see that only tables and to some extent sequences require migration to preserve the state. All other objects can be safely dropped and recreated without risking any data loss.
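To make the classification operational, the table can be encoded as a simple lookup. Below is a minimal Python sketch; the type names are illustrative and not tied to any particular data dictionary:

```python
# Maps object types to the treatment they allow; "migrate" marks the
# objects whose state must be preserved across deployments.
APPROACH = {
    "table": "migrate",
    "sequence": "migrate",
    "index": "drop-and-recreate",
    "view": "drop-and-recreate",
    "synonym": "drop-and-recreate",
    "database link": "drop-and-recreate",
    "materialized view": "drop-and-recreate",
    "constraint": "drop-and-recreate",
    "trigger": "drop-and-recreate",
    "stored code": "drop-and-recreate",
    "grant": "drop-and-recreate",
}

def needs_migration(object_type: str) -> bool:
    """True for objects whose state cannot be rebuilt from scratch."""
    return APPROACH.get(object_type) == "migrate"
```

Such a lookup becomes handy later, when the drop sequence is implemented programmatically and must skip the stateful objects.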
Drop and Recreate
We do not need to decide which objects to drop and recreate. Because this operation is safe and normally fast, we can drop and recreate all objects allowing such treatment. We just need to be careful about sequencing the operations. The sequences below assume Oracle database objects, but you can easily adapt them to other relational databases.
The drop sequence:
- Revoke object grants.
- Drop triggers.
- Drop implementations of types and packages.
- Drop procedures.
- Drop views and materialized views.
- Drop functions.
- Drop package declarations.
- Drop type declarations.
- Drop synonyms.
- Drop database links.
- Drop referential integrity constraints.
- Drop stand-alone check constraints.
- Drop primary keys and unique constraints together with their backing indexes.
- Drop remaining indexes.
At this point only tables and sequences remain in the database schema. We deal with them depending on whether we need to preserve the state or not. Then, we can create the rest of the objects.
The creation sequence:
- Create indexes.
- Create primary keys and unique constraints.
- Create stand-alone check constraints.
- Create referential integrity constraints.
- Create database links.
- Create synonyms.
- Create type declarations.
- Create package declarations.
- Create functions.
- Create views and materialized views.
- Create procedures.
- Create implementation of types and packages.
- Create triggers.
- Collect schema stats.
- Apply object grants.
The reason for the operations to be sequenced like this is to minimize the chance of dropping objects still referenced by existing objects and attempting to create objects referencing objects that do not exist yet. There is a small area of contention around stand-alone functions, procedures and views. The assumption is that stand-alone functions do not reference views and the views do not reference procedures. If this presents an issue, you can encapsulate the functions and the procedures in packages.
DDL Scripts
You do not need to store any script for the drop sequence. This sequence must be implemented programmatically. The code would access the database schema metadata and issue the statements to drop the objects in the right order.
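The ordering logic at the heart of such code can be sketched as follows. This Python fragment is illustrative: it assumes the (type, name) pairs have already been fetched from the schema metadata (on Oracle, from USER_OBJECTS), and it leaves grants and constraint drops, which need different statements, out of scope:

```python
# Object types in the order they must be dropped, following the
# drop sequence described above.
DROP_ORDER = [
    "TRIGGER", "TYPE BODY", "PACKAGE BODY", "PROCEDURE", "VIEW",
    "MATERIALIZED VIEW", "FUNCTION", "PACKAGE", "TYPE",
    "SYNONYM", "DATABASE LINK", "INDEX",
]

def drop_statements(objects):
    """Given (type, name) pairs from the schema metadata, produce
    DROP statements in the order required by the drop sequence.
    Tables and sequences are deliberately not in DROP_ORDER, so
    they pass through untouched."""
    rank = {t: i for i, t in enumerate(DROP_ORDER)}
    selected = [(t, n) for t, n in objects if t in rank]
    selected.sort(key=lambda tn: rank[tn[0]])
    return [f"DROP {t} {n}" for t, n in selected]
```

In practice the pairs would come from a query like `select object_type, object_name from user_objects`, and each generated statement would then be executed against the schema.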
The creation sequence, on the other hand, needs scripts. The scripts must be stored in the version control system together with the code that uses the database objects. This will ensure that the database schema always matches the expectations of the code that uses it.
Script Naming
Because the sequence creates different types of objects at different times, the scripts need to be split by object type. There may be different ways to achieve this; I’ll just give you an example for a Java project. It assumes that more than one schema definition may be stored: <schema1>, <schema2>, etc. Storing the scripts as resources allows accessing them at runtime, which opens up some interesting possibilities, like schema provisioning in testing environments.
src/main/resources/db/
<schema1>/
packages/
billing.pkb
billing.pks
procedures/
cleanup.prc
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
views/
customers.vw
projects.vw
<schema2>/
...
The table below summarizes the rules I follow for storing the scripts. Grants are likely to be implemented programmatically, with the script containing only one-off additions.
| Object Type | Subfolder | File Name | Suffix | Contains |
|---|---|---|---|---|
| Tables | schema | tables | .sql | Multiple |
| Sequences | schema | sequences | .sql | Multiple |
| Indexes | schema | indexes | .sql | Multiple |
| Primary keys | schema | keys | .sql | Multiple |
| Unique constraints | schema | keys | .sql | Multiple |
| Check constraints | schema | checks | .sql | Multiple |
| Foreign keys | schema | ri | .sql | Multiple |
| Database links | schema | links | .sql | Multiple |
| Synonyms | schema | synonyms | .sql | Multiple |
| Grants | schema | grants | .sql | Multiple |
| Type declarations | types | object name | .tps | Single |
| Type implementations | types | object name | .tpb | Single |
| Package declarations | packages | object name | .pks | Single |
| Package implementations | packages | object name | .pkb | Single |
| Functions | functions | object name | .fnc | Single |
| Procedures | procedures | object name | .prc | Single |
| Views | views | object name | .vw | Single |
| Triggers | triggers | object name | .trg | Single |
You can customize the naming to your taste and to the database code editing tools or the IDE you use. The important thing is to maintain enough separation to allow the creation sequence to proceed in stages as described.
Script Content
The table above shows that some files contain multiple objects, while others contain only one object. The distinction is purely cosmetic. It is there to facilitate editing the scripts: any code-like object is stored in a separate file. This is how IDEs and database code editing tools prefer to have the code organized.
On the other hand, the rest of the objects are bundled together. If the file grows too big, you can split it into multiple files, for example, by subject area. To separate the objects in the file, you need to pick a separator that works equally well with the database native tools, editors, and the tools you’ll use to apply the scripts. In particular, for Oracle, I prefer using a slash on a line by itself over a semicolon. This choice was driven by the script execution tool I used, which did not support semicolons.
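To illustrate the separator choice, here is a minimal Python sketch of the statement splitting a script execution tool would perform; the slash-on-its-own-line convention is the one described above:

```python
def split_script(text: str):
    """Split a script into statements, treating a slash on a line by
    itself as the statement separator."""
    statements, current = [], []
    for line in text.splitlines():
        if line.strip() == "/":
            stmt = "\n".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(line)
    # Pick up a trailing statement without a final separator.
    stmt = "\n".join(current).strip()
    if stmt:
        statements.append(stmt)
    return statements
```

Each resulting statement can then be fed to the database driver one at a time, which sidesteps the semicolon handling quirks of various tools.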
When writing the scripts, assume that the schema is empty. Then use CREATE statements to create stand-alone objects, like tables. Then use ALTER statements to add objects tied to the stand-alone objects, like primary key constraints. The goal is to create the schema from scratch with the least amount of statements to execute. For example, normally you don’t need to create the underlying indexes for primary keys separately, unless you really want to customize them. This approach makes the scripts more readable.
Data
A database schema is not just structure and code, it also includes data (surprise, surprise). Some of the data warrants special treatment when we move from one version of the schema to another.
Static Data
If you inspect closely how the data is accessed by the application code, you will likely find a set of tables that are never updated. As far as the application is concerned, they are essentially read-only. The data stored in these tables usually represents decoding of some values to their human-readable form: ES may mean Spain in COUNTRIES, A may mean Active in STATUSES, etc. There are other possibilities, of course, but the important thing is that the data is read-only and changes quite infrequently, which makes it tempting to tie its state to the schema version.
The way to handle such data is to do truncate and load even when we want to preserve the data state. The master copy of the data can be stored in a structured format, for example, in CSV files, and then loaded into the database tables just before we are ready to execute the creation sequence. There are no referential integrity constraints in the schema at that time, so the truncate and load approach can work.
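The truncate and load step can be sketched as follows. The example uses Python with an in-memory SQLite database purely for illustration; on Oracle, TRUNCATE TABLE would stand in for the DELETE, and the table and column names are hypothetical:

```python
import csv
import io
import sqlite3

def truncate_and_load(conn, table, csv_file, columns):
    """Replace the content of a static-data table with the master copy
    from a CSV file. SQLite has no TRUNCATE, so DELETE stands in."""
    rows = list(csv.reader(csv_file))
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            rows,
        )

# Illustration with an in-memory database and a COUNTRIES-like table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
master = io.StringIO("ES,Spain\nDE,Germany\n")
truncate_and_load(conn, "countries", master, ["code", "name"])
```

Because the load happens before the creation sequence, no referential integrity constraints exist yet, so the load order of the CSV files does not matter.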
Test and Demo Data
Another class of data that can be handled the same way is the data you use to conduct application testing and give application demonstrations to the client and other stakeholders. Having the master copy of the data in the version control is a good way to guarantee the same initial state for the tests and a target-rich environment for the demos.
Below is an example of how the static, test, and demo data may be stored in the version control.
src/main/resources/db/
<schema1>/
data/
demo/
customers.csv
transactions.csv
init/
countries.csv
transaction_types.csv
packages/
billing.pkb
billing.pks
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
<schema2>/
...
The static data here is stored in the init folder, while the test and demo data is combined in the demo folder.
Deployment Scenarios
Building on the approach outlined above, we can consider three possible deployment scenarios. The first scenario is using the scripts to create the schema from scratch. This is what we need to do when we want to run automated acceptance tests using an in-memory database. We start with an empty schema, create the objects, and load the initial set of data to support testing. Each test case will manage its own data, preparing it before the test and cleaning up after.
The second scenario is similar. We deploy to a shared testing environment, and therefore we cannot assume the schema to be empty. Because we still deal with a non-production environment, we do not need to worry about state preservation. Before deploying the new version, we need to execute the drop sequence, and then drop all sequences and tables. This will essentially make the schema empty, and we can proceed to create the schema from scratch as outlined above. In addition to the initial set of data, it would make sense to load specifically prepared data to support demos and exploratory testing.
Finally, the third scenario is deploying to a non-empty schema, but now we must preserve the state. This can be a production deployment, or it can be one of the numerous dress rehearsals we perform before that. In this case, we do not drop tables and sequences. Instead, we execute the prepared migration scripts, which would modify the tables and the sequences without losing the data.
The table below outlines all three scenarios.
| From scratch | Drop and recreate | Preserve state |
|---|---|---|
| | Execute the drop sequence | Execute the drop sequence |
| | Drop tables and sequences | |
| Create tables and sequences | Create tables and sequences | Perform the migration |
| Load static data | Load all data | Reload static data |
| Execute the creation sequence | Execute the creation sequence | Execute the creation sequence |
In situations when the application accesses multiple schemas, for example, when it fetches data from peer databases and loads it into the principal schema, it is possible to apply different deployment scenarios to different schemas. For example, you can preserve state for the principal schema, but use drop and recreate for the mock peer schemas. Or you can use an in-memory database to represent peers, and create them from scratch. The flexibility is certainly there.
Migration Scripts
In order to implement the deployment scenario with state preservation, we need to have migration scripts. Given that we drop and recreate everything except tables and sequences, we only need to produce migration scripts for these two types of objects.
Script Derivation
One way to create the migration scripts is to compare the content of DDL scripts for the source and the target versions of the schema, and to describe the differences in terms of DDL statements. It works, but it is labor-intensive and error-prone. There are database management tools that can compare two schemas and produce the migration script automatically. The script can then be reviewed for potential data loss, adjusted if needed, and stored in the version control. Figure 1 below shows the process.
A much simpler approach would be to have a tool that would compare the DDL scripts directly, without rendering them to database schemas. There is a tool named ddl-compare for MySQL, but I have not seen anything like this for Oracle, SQL Server, or PostgreSQL. It is possible I have not looked hard enough.
Trial Migration
Once you have produced the set of migration scripts, your work is not done yet. At this point you still have the two schemas you used to derive the scripts. You can apply the migration scripts to the source schema, and that should presumably produce the same schema as the target. Compare them. If there are no discrepancies, great. Otherwise, you can use the discrepancies to adjust the migration scripts and try again. This process is illustrated in Figure 2 below.
Moreover, when you execute the real migration in production, you still can use the same verification approach as a quick sanity check that you’ve hit the mark.
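The comparison step can be sketched in Python. SQLite stands in for the real database here, and the catalog query is illustrative; on Oracle you would pull the definitions from the data dictionary (e.g. via DBMS_METADATA) instead:

```python
import sqlite3

def table_definitions(conn):
    """Snapshot the table DDL as recorded in the catalog."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return {name: sql for name, sql in rows}

def discrepancies(source, target):
    """Compare a migrated source schema against the target schema and
    report the tables that still differ."""
    src, tgt = table_definitions(source), table_definitions(target)
    return [
        name
        for name in sorted(set(src) | set(tgt))
        if src.get(name) != tgt.get(name)
    ]

# Illustration: the "migrated" source is missing a column.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INT)")
target.execute("CREATE TABLE customers (id INT, name TEXT)")
```

A real comparison would normalize the DDL text (whitespace, storage clauses) before diffing, but the shape of the check is the same.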
Migration Script Versioning
The first thing you should note is that the migration scripts are expected to be applied on top of a specific version of the schema. Scripts to move from version 1.3.0 to version 1.4.0 are not the same as scripts to move from 1.2.0 to version 1.4.0. In fact, there is a relationship between these scripts: the former, in a way, are a subset of the latter.³
At the very least, the migration scripts must be stored in the version control in a folder named after the version of the schema they are compatible with. If you plan to support multiple versions, you will need to have multiple folders with the scripts. Something along these lines:
src/main/resources/db/
<schema1>/
patches/
1.2.0/
p-001.sql
p-002.sql
1.3.0/
p-002.sql
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
views/
customers.vw
projects.vw
<schema2>/
...
Next, it is a good idea to mark a database schema with its version. This will allow you to implement an easy way to select the right set of migration scripts by querying the schema and selecting the right folder. I implement this with a DDL script, which creates a view named schema$version with the following content (Oracle syntax):
create view schema$version as
select
'${project.version}' as version
from dual
/
All scripts get pre-processed by the build system and ${project.version} gets replaced by the current version of the application. The tool that applies the migration scripts can do a quick select statement and use the version to derive the folder name. Feel free to adjust this as you see fit.
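Selecting the patch folder from the version marker can be sketched as follows; SQLite stands in for the real database, and the folder layout follows the example above:

```python
import sqlite3

def patch_folder(conn, base="db/<schema1>/patches"):
    """Select the right set of migration scripts by querying the
    schema for its version marker view."""
    (version,) = conn.execute('SELECT version FROM "schema$version"').fetchone()
    return f"{base}/{version}"

# Illustration: a schema stamped with version 1.3.0.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE VIEW "schema$version" AS SELECT '1.3.0' AS version""")
```

The same marker also gives the migration tool a cheap guard: if the version it finds is not one it has scripts for, it can refuse to run rather than apply the wrong patches.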
Rollback
When you apply the migration scripts to advance the database schema version while preserving the data, occasionally things may not go as scripted. The common advice given for such situations is to produce the rollback scripts using the same procedure as for the migration scripts, only reversing the source and the target schemas.
I reject this advice. There are a couple of common situations when a rollback would be necessary. One, we migrated the schema and then something else went wrong, for example, we discovered a critical bug during the post-deployment validation of the application. Two, the application of the migration scripts failed midway due to environmental issues, for example, the database ran out of TEMP space.
These situations are fundamentally different. In the first case, we have the database schema in a known consistent state. We can apply the rollback scripts and take it back to where we started. In the second case, the state of the schema is unpredictable. Trying to apply the rollback scripts at this point probably will not work.
Moreover, even when we start applying the rollback scripts from the expected state, there is still a small chance that the rollback script application may fail, and that will also leave the schema in an unpredictable state.
This can be mitigated by making the scripts idempotent, meaning they are coded to sense what is already present in the schema, so they can safely be applied to the same schema multiple times. Idempotent scripts can be applied to a schema in an inconsistent state and take it to a known consistent state.
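For illustration, here is what an idempotent migration step might look like in Python against SQLite; the sensing query is engine-specific, and the table and column names are hypothetical:

```python
import sqlite3

def ensure_column(conn, table, column, ddl_type):
    """An idempotent migration step: sense whether the column already
    exists and only alter the table when it does not, so the step can
    be re-applied to a schema left partially migrated."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT)")
ensure_column(conn, "customers", "name", "TEXT")
ensure_column(conn, "customers", "name", "TEXT")  # safe to repeat
```

On Oracle, the sensing query would go against USER_TAB_COLUMNS instead of the PRAGMA.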
Still, I prefer not to go down that path. Instead, for rollback purposes, I use methods not involving rollback scripts. For small schemas, it can be as simple as taking a snapshot backup just before the deployment. For larger databases, you can opt for pausing the replication to the disaster recovery site before the deployment. Then a rollback would constitute a switchover to the disaster recovery instance of the application.
On the other hand, if you like the rollback script approach, there is nothing here that would stop you from using it.
Operational Considerations
As you’ve already seen, modifying a database schema using the described approach involves many steps: the schema gets stripped to the bare bones, the structure of the tables changes, the data gets modified. Then everything gets assembled back to its running state.
Practically it means an outage. You must shut down the application instances accessing the schema before starting the migration. The duration of the outage really depends on the amount of data you’ll have to pump while making the schema changes. It also depends on the amount of time necessary to recreate derivative objects, such as indexes and materialized views, and to collect the schema statistics in the end. You definitely want to practice the timings on realistic data volumes. Dropping and recreating stateless objects in the schema takes almost no time.
Depending on the data volume, it may be practical to take a snapshot backup of the schema just before you start the migration. I strongly recommend it. If you have data replication configured to your disaster recovery site, you probably want to pause it. This will give you an insurance policy: you will be able to switch to the disaster recovery instance if everything else fails or, perhaps, use it to recover your primary instance. After the smoke tests are done, you can restart the replication.
If having an outage is not an option, you will need to invest in a blue-green deployment setup for your application. For stateful applications this is a non-trivial undertaking, and describing an implementation is way outside the scope of this article.
I can expect some readers to say: “But if we just use the migration scripts without stripping the schema naked, we may avoid the outage.” To that I have to say that it’s an illusion. The knock-on effect of schema changes will get you one day. If you have not experienced it yet, it simply means that you have not been doing schema migrations often enough. Doing it with the application instances up and accessing the schema is just plain dangerous. The outage window can probably be shorter in this case, that is true.
Tools
This is the part where my tone becomes a bit pessimistic. The reason is that I am not aware of a single tool that would do all of the above. There are tools that can do schema comparison efficiently and derive the migration scripts. There are tools that can sequence the application of the scripts to a schema. There are tools that can load data from CSV files. I would be hard-pressed to find tools making the distinction between different deployment scenarios outlined above or otherwise managing the whole thing end-to-end as described. But maybe I have not looked hard enough.
Earlier in my career I made the mistake of not open-sourcing a collection of Ant tasks that could have become a foundation for such a tool. Since then this collection has become a work for hire, and I see no chance of the company ever releasing it as open source. Ant also looks a bit out of vogue now.
The best I can try to do here is to describe the functionality of the tools needed to manage database schemas effectively. Maybe something good will come out of it.
Orchestration
The orchestration tool needs to run the drop sequence, the creation sequence, and the migration scripts in the right order. It also needs to distinguish between different deployment scenarios. Below is a short list of requirements.
- Support connectivity, metadata processing, and command generation for popular database engines. Ideally, the support should be extensible.
- Drop all objects of a given type with the names matching the include pattern and excluding the ones with the names matching the exclude pattern.
- Run SQL scripts in a given folder, either using include/exclude patterns and ordered lexicographically, or listed explicitly.
- Load data from CSV files, optionally truncating the table before load.
- Sequence object destruction, script execution, and data load tasks conditionally, based on environment and deployment configuration properties.
- Sense the current version of the target schema and use it in the migration script selection. Ideally, the tool should support migrating multiple schema versions to the current application version.
As you see, it is really not that much.
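The scenario dispatch such a tool would implement can be sketched as a mapping from scenario to ordered steps, mirroring the deployment scenarios table earlier; the step names are illustrative:

```python
# Each deployment scenario resolves to a sequence of the building
# blocks described earlier in the article.
SCENARIOS = {
    "from-scratch": [
        "create tables and sequences",
        "load static data",
        "run creation sequence",
    ],
    "drop-and-recreate": [
        "run drop sequence",
        "drop tables and sequences",
        "create tables and sequences",
        "load all data",
        "run creation sequence",
    ],
    "preserve-state": [
        "run drop sequence",
        "run migration scripts",
        "reload static data",
        "run creation sequence",
    ],
}

def plan(scenario: str):
    """Resolve a deployment scenario to its ordered list of steps."""
    return SCENARIOS[scenario]
```

In a real tool, each step name would map to a task (object destruction, script execution, data load) parameterized by the environment and deployment configuration.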
Schema Comparison
There are lots of tools available commercially that allow you to compare two database schemas and derive the migration script. Some of them are quite good. Unfortunately, they do not lend themselves to automation very well. They either tend to be UI-driven, or force you into a vendor-specific vision of how the rest of the automation should work. They also tend to work on rendered schemas and not on DDL scripts. The wish list for a schema comparison tool follows.
- CLI-based, with a good way to integrate it into the orchestration tool above, e.g. as a task to confirm that the schemas are identical.
- Concentrating only on objects that store state; the rest we drop and recreate anyway. Comparing derivative objects should come as an option.
- Offering a vendor-independent way to represent such schema objects. Vendor-specific DDL scripts should be derived from it.
- Producing the migration scripts directly, without first rendering the DDL scripts into database schemas.
Again, not much, but far from being trivial.
Conclusion
Here are the main takeaways from the article.
- Versioning of database schemas is a genuinely difficult problem. You have to tackle it though if your application manages state in a relational database.
- You need to keep both the DDL scripts representing the target configuration of the schema and the schema migration scripts if you want both to preserve the state during migrations and be able to reason about the resulting schema.
- When preparing the migration scripts, you only need to concentrate on the objects that store original state. Stateless objects and objects with derived state can be dropped and recreated. Practically it means tables and sequences, and even the latter is questionable.
- There are three principal deployment scenarios: create from scratch, drop and recreate, and migration. They have significantly overlapping parts using the same DDL scripts. You will need to exercise all three depending on your goals and your deployment process must support it.
- Some of the data you try to preserve is actually static. You can manage it differently.
- There are tools that can help with the script derivation and orchestration. Better tools are possible.
- You need to have other means besides the rollback scripts to ensure you can undo any damage to the schema. In my opinion, rollback scripts are just an optimization technique, which may not work out in the end.
¹ Sometimes the next value of a sequence can be derived from the data stored in the tables. In such cases, you can drop and recreate the sequence.
² Optionally, objects with derived state can be migrated. You opt for it when the time to migrate is significantly less than the time to drop and recreate.
³ It is tempting to lean on this relationship to avoid storing the same scripts multiple times. It is pretty clear how this would work for a linear progression of versions. Unfortunately, the most general case is not linear and schema changes are not commutative. It is definitely worth investigating further.