Database Schema Versioning
Storing the source code of an application in a version control system is nothing controversial. It is a universally accepted practice. There is a small detail, however, which makes the version control of applications difficult. I am talking about the database schema. If the application has to manage state, doing it in a database is a very common approach. We must consider the database schema as part of the source code of the application, and as a consequence, it must be stored with the rest of the source code.
Because it is difficult, I’ve seen a few common themes in the wild. Either people don’t do it at all, or they do it just for the initial version of the application, or they tuck it away to a separate repository, and it starts living its own life, disconnected from the application code. Needless to say, none of the above hits the mark for effective version control.
In this article I talk about relational databases. Most of my experience has been with Oracle, but the same principles apply to other relational databases.
Challenges
What is the problem then? The naïve approach of simply storing the DDL scripts defining the desired schema configuration does not work in all situations. It works pretty well when you are in the greenfield phase of the project. You can start with an empty schema, apply the DDL script, load some test data, and you are ready to test or demo. You can do it either in a real database or in an in-memory mock.
Once the project transitions to the brownfield phase, i.e. when you have to deal with deployed versions of the application, the state gets in the way: you must preserve it across production deployments of new versions. A typical suggestion is to anticipate this and not to use DDL scripts to define the target configuration of the schema, but rather use schema migration scripts, growing the schema over time.
This solves the state preservation problem, but you lose the ability to reason about the desired target configuration of the schema. You can see all the changes as they happen, but how should the schema objects look in the end?
The reasonable question to ask at this point is: why not both?
Approach
Why yes, let’s use both. And when I say “both” I don’t mean using DDL scripts in the greenfield phase and then switching to schema migration scripts in the brownfield phase. What I mean is to use DDL scripts to define how the schema should look and then derive the migration scripts from the DDL scripts when we need them.
Object Types
Before we look into the migration script derivation, let’s note that not all database objects are created equal. This will help us to narrow down the scope of the script derivation.
In relation to the state, all database objects can be divided into three categories: the ones that store state, the ones that don’t, and the ones that store state which can be derived from the data stored in other objects.
We need to be very careful with the objects that store state. For such objects we have to derive the migration scripts and to make sure the scripts do not cause data loss. The objects that don’t store state can be simply dropped and recreated. Finally, the objects with derived state can also be dropped and recreated, but if these operations are more expensive than migration, we can opt for migration. The table below summarizes the object types.
| Object Type | Stores State | Approach |
|---|---|---|
| Tables | Yes | Migrate |
| Sequences | Yes¹ | Migrate |
| Indexes and constraints backed by indexes | Derived | Drop and recreate² |
| Views | No | Drop and recreate |
| Synonyms | No | Drop and recreate |
| Database links | No | Drop and recreate |
| Materialized views | Derived | Drop and recreate² |
| Referential integrity constraints | No | Drop and recreate |
| Stand-alone check constraints | No | Drop and recreate |
| Triggers | No | Drop and recreate |
| Stored code | No | Drop and recreate |
| Grants | No | Drop and recreate |
Looking at the table, we can see that only tables and to some extent sequences require migration to preserve the state. All other objects can be safely dropped and recreated without risking any data loss.
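To make the classification operational, the table can be encoded as a simple lookup. Below is a minimal Python sketch; the type names are illustrative and not tied to any particular data dictionary:

```python
# Maps object types to the treatment they allow; "migrate" marks the
# objects whose state must be preserved across deployments.
APPROACH = {
    "table": "migrate",
    "sequence": "migrate",
    "index": "drop-and-recreate",
    "view": "drop-and-recreate",
    "synonym": "drop-and-recreate",
    "database link": "drop-and-recreate",
    "materialized view": "drop-and-recreate",
    "constraint": "drop-and-recreate",
    "trigger": "drop-and-recreate",
    "stored code": "drop-and-recreate",
    "grant": "drop-and-recreate",
}

def needs_migration(object_type: str) -> bool:
    """True for objects whose state cannot be rebuilt from scratch."""
    return APPROACH.get(object_type) == "migrate"
```

Such a lookup becomes handy later, when the drop sequence is implemented programmatically and must skip the stateful objects.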
Drop and Recreate
We do not need to decide which objects to drop and recreate. Because this operation is safe and normally fast, we can drop and recreate all objects allowing such treatment. We just need to be careful about sequencing the operations. The sequences below assume Oracle database objects, but you can easily adapt them to other relational databases.
The drop sequence:
- Revoke object grants.
- Drop triggers.
- Drop implementations of types and packages.
- Drop procedures.
- Drop views and materialized views.
- Drop functions.
- Drop package declarations.
- Drop type declarations.
- Drop synonyms.
- Drop database links.
- Drop referential integrity constraints.
- Drop stand-alone check constraints.
- Drop primary keys and unique constraints together with their backing indexes.
- Drop remaining indexes.
At this point only tables and sequences remain in the database schema. We deal with them depending on whether we need to preserve the state or not. Then, we can create the rest of the objects.
The creation sequence:
- Create indexes.
- Create primary keys and unique constraints.
- Create stand-alone check constraints.
- Create referential integrity constraints.
- Create database links.
- Create synonyms.
- Create type declarations.
- Create package declarations.
- Create functions.
- Create views and materialized views.
- Create procedures.
- Create implementation of types and packages.
- Create triggers.
- Collect schema stats.
- Apply object grants.
The reason for the operations to be sequenced like this is to minimize the chance of dropping objects still referenced by existing objects and attempting to create objects referencing objects that do not exist yet. There is a small area of contention around stand-alone functions, procedures and views. The assumption is that stand-alone functions do not reference views and the views do not reference procedures. If this presents an issue, you can encapsulate the functions and the procedures in packages.
DDL Scripts
You do not need to store any script for the drop sequence. This sequence must be implemented programmatically. The code would access the database schema metadata and issue the statements to drop the objects in the right order.
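The ordering logic at the heart of such code can be sketched as follows. This Python fragment is illustrative: it assumes the (type, name) pairs have already been fetched from the schema metadata (on Oracle, from USER_OBJECTS), and it leaves grants and constraint drops, which need different statements, out of scope:

```python
# Object types in the order they must be dropped, following the
# drop sequence described above.
DROP_ORDER = [
    "TRIGGER", "TYPE BODY", "PACKAGE BODY", "PROCEDURE", "VIEW",
    "MATERIALIZED VIEW", "FUNCTION", "PACKAGE", "TYPE",
    "SYNONYM", "DATABASE LINK", "INDEX",
]

def drop_statements(objects):
    """Given (type, name) pairs from the schema metadata, produce
    DROP statements in the order required by the drop sequence.
    Tables and sequences are deliberately not in DROP_ORDER, so
    they pass through untouched."""
    rank = {t: i for i, t in enumerate(DROP_ORDER)}
    selected = [(t, n) for t, n in objects if t in rank]
    selected.sort(key=lambda tn: rank[tn[0]])
    return [f"DROP {t} {n}" for t, n in selected]
```

In practice the pairs would come from a query like `select object_type, object_name from user_objects`, and each generated statement would then be executed against the schema.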
The creation sequence, on the other hand, needs scripts. The scripts must be stored in the version control system together with the code that uses the database objects. This will ensure that the database schema always matches the expectations of the code that uses it.
Script Naming
Because the sequence creates different types of objects at different times, the scripts need to be split by object type. There may be different ways to achieve this; I’ll just give you an example for a Java project. It assumes that more than one schema definition may be stored: <schema1>, <schema2>, etc. Storing the scripts as resources allows accessing them at runtime, which opens up some interesting possibilities, like schema provisioning in testing environments.
src/main/resources/db/
<schema1>/
packages/
billing.pkb
billing.pks
procedures/
cleanup.prc
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
views/
customers.vw
projects.vw
<schema2>/
...
The table below summarizes the rules I follow for storing the scripts. Grants are likely to be implemented programmatically, with the script containing only one-off additions.
| Object Type | Subfolder | File Name | Suffix | Contains |
|---|---|---|---|---|
| Tables | schema | tables | .sql | Multiple |
| Sequences | schema | sequences | .sql | Multiple |
| Indexes | schema | indexes | .sql | Multiple |
| Primary keys | schema | keys | .sql | Multiple |
| Unique constraints | schema | keys | .sql | Multiple |
| Check constraints | schema | checks | .sql | Multiple |
| Foreign keys | schema | ri | .sql | Multiple |
| Database links | schema | links | .sql | Multiple |
| Synonyms | schema | synonyms | .sql | Multiple |
| Grants | schema | grants | .sql | Multiple |
| Type declarations | types | object name | .tps | Single |
| Type implementations | types | object name | .tpb | Single |
| Package declarations | packages | object name | .pks | Single |
| Package implementations | packages | object name | .pkb | Single |
| Functions | functions | object name | .fnc | Single |
| Procedures | procedures | object name | .prc | Single |
| Views | views | object name | .vw | Single |
| Triggers | triggers | object name | .trg | Single |
You can customize the naming to your taste and to the database code editing tools or the IDE you use. The important thing is to maintain enough separation to allow the creation sequence to proceed in stages as described.
Script Content
The table above shows that some files contain multiple objects, while others contain only one object. The distinction is purely cosmetic. It is there to facilitate editing the scripts: any code-like object is stored in a separate file. This is how IDEs and database code editing tools prefer to have the code organized.
On the other hand, the rest of the objects are bundled together. If the file grows too big, you can split it into multiple files, for example, by subject area. To separate the objects in the file, you need to pick a separator that works equally well with the database native tools, editors, and the tools you’ll use to apply the scripts. In particular, for Oracle, I prefer using a slash on a line by itself over a semicolon. This choice was driven by the script execution tool I used, which did not support semicolons.
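To illustrate the separator choice, here is a minimal Python sketch of the statement splitting a script execution tool would perform; the slash-on-its-own-line convention is the one described above:

```python
def split_script(text: str):
    """Split a script into statements, treating a slash on a line by
    itself as the statement separator."""
    statements, current = [], []
    for line in text.splitlines():
        if line.strip() == "/":
            stmt = "\n".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(line)
    # Pick up a trailing statement without a final separator.
    stmt = "\n".join(current).strip()
    if stmt:
        statements.append(stmt)
    return statements
```

Each resulting statement can then be fed to the database driver one at a time, which sidesteps the semicolon handling quirks of various tools.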
When writing the scripts, assume that the schema is empty. Then use CREATE statements to create stand-alone objects, like tables. Then use ALTER statements to add objects tied to the stand-alone objects, like primary key constraints. The goal is to create the schema from scratch with the least amount of statements to execute. For example, normally you don’t need to create the underlying indexes for primary keys separately, unless you really want to customize them. This approach makes the scripts more readable.
Data
A database schema is not just structure and code, it also includes data (surprise, surprise). Some of the data warrants special treatment when we move from one version of the schema to another.
Static Data
If you inspect closely how the data is accessed by the application code, you will likely find a set of tables that are never updated. As far as the application is concerned, they are essentially read-only. The data stored in these tables usually represents decoding of some values to their human-readable form: ES may mean Spain in COUNTRIES, A may mean Active in STATUSES, etc. There are other possibilities, of course, but the important thing is that the data is read-only and changes quite infrequently, which makes it tempting to tie its state to the schema version.
The way to handle such data is to do truncate and load even when we want to preserve the data state. The master copy of the data can be stored in a structured format, for example, in CSV files, and then loaded into the database tables just before we are ready to execute the creation sequence. There are no referential integrity constraints in the schema at that time, so the truncate and load approach can work.
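The truncate and load step can be sketched as follows. The example uses Python with an in-memory SQLite database purely for illustration; on Oracle, TRUNCATE TABLE would stand in for the DELETE, and the table and column names are hypothetical:

```python
import csv
import io
import sqlite3

def truncate_and_load(conn, table, csv_file, columns):
    """Replace the content of a static-data table with the master copy
    from a CSV file. SQLite has no TRUNCATE, so DELETE stands in."""
    rows = list(csv.reader(csv_file))
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            rows,
        )

# Illustration with an in-memory database and a COUNTRIES-like table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
master = io.StringIO("ES,Spain\nDE,Germany\n")
truncate_and_load(conn, "countries", master, ["code", "name"])
```

Because the load happens before the creation sequence, no referential integrity constraints exist yet, so the load order of the CSV files does not matter.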
Test and Demo Data
Another class of data that can be handled the same way is the data you use to conduct application testing and give application demonstrations to the client and other stakeholders. Having the master copy of the data in the version control is a good way to guarantee the same initial state for the tests and a target-rich environment for the demos.
Below is an example of how the static, test, and demo data may be stored in the version control.
src/main/resources/db/
<schema1>/
data/
demo/
customers.csv
transactions.csv
init/
countries.csv
transaction_types.csv
packages/
billing.pkb
billing.pks
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
<schema2>/
...
The static data here is stored in the init folder, while the test and demo data is combined in the demo folder.
Deployment Scenarios
Building on the approach outlined above, we can consider three possible deployment scenarios. The first scenario is using the scripts to create the schema from scratch. This is what we need to do when we want to run automated acceptance tests using an in-memory database. We start with an empty schema, create the objects, and load the initial set of data to support testing. Each test case will manage its own data, preparing it before the test and cleaning up after.
The second scenario is similar. We deploy to a shared testing environment, and therefore we cannot assume the schema to be empty. Because we still deal with a non-production environment, we do not need to worry about state preservation. Before deploying the new version, we need to execute the drop sequence, and then drop all sequences and tables. This will essentially make the schema empty, and we can proceed to create the schema from scratch as outlined above. In addition to the initial set of data, it would make sense to load specifically prepared data to support demos and exploratory testing.
Finally, the third scenario is deploying to a non-empty schema, but now we must preserve the state. This can be a production deployment, or it can be one of the numerous dress rehearsals we perform before that. In this case, we do not drop tables and sequences. Instead, we execute the prepared migration scripts, which would modify the tables and the sequences without losing the data.
The table below outlines all three scenarios.
| From scratch | Drop and recreate | Preserve state |
|---|---|---|
| | Execute the drop sequence | Execute the drop sequence |
| | Drop tables and sequences | |
| Create tables and sequences | Create tables and sequences | Perform the migration |
| Load static data | Load all data | Reload static data |
| Execute the creation sequence | Execute the creation sequence | Execute the creation sequence |
In situations when the application accesses multiple schemas, for example, when it fetches data from peer databases and loads it into the principal schema, it is possible to apply different deployment scenarios to different schemas. For example, you can preserve state for the principal schema, but use drop and recreate for the mock peer schemas. Or you can use an in-memory database to represent peers, and create them from scratch. The flexibility is certainly there.
Migration Scripts
In order to implement the deployment scenario with state preservation, we need to have migration scripts. Given that we drop and recreate everything except tables and sequences, we only need to produce migration scripts for these two types of objects.
Script Derivation
One way to create the migration scripts is to compare the content of DDL scripts for the source and the target versions of the schema, and to describe the differences in terms of DDL statements. It works, but it is labor-intensive and error-prone. There are database management tools that can compare two schemas and produce the migration script automatically. The script can then be reviewed for potential data loss, adjusted if needed, and stored in the version control. Figure 1 below shows the process.
A much simpler approach would be to have a tool that would compare the DDL scripts directly, without rendering them to database schemas. There is a tool named ddl-compare for MySQL, but I have not seen anything like this for Oracle, SQL Server, or PostgreSQL. It is possible I have not looked hard enough.
Trial Migration
Once you have produced the set of migration scripts, your work is not done yet. At this point you still have the two schemas you used to derive the scripts. You can apply the migration scripts to the source schema, and that should presumably produce the same schema as the target. Compare them. If there are no discrepancies, great. Otherwise, you can use the discrepancies to adjust the migration scripts and try again. This process is illustrated in Figure 2 below.
Moreover, when you execute the real migration in production, you still can use the same verification approach as a quick sanity check that you’ve hit the mark.
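The comparison step can be sketched in Python. SQLite stands in for the real database here, and the catalog query is illustrative; on Oracle you would pull the definitions from the data dictionary (e.g. via DBMS_METADATA) instead:

```python
import sqlite3

def table_definitions(conn):
    """Snapshot the table DDL as recorded in the catalog."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return {name: sql for name, sql in rows}

def discrepancies(source, target):
    """Compare a migrated source schema against the target schema and
    report the tables that still differ."""
    src, tgt = table_definitions(source), table_definitions(target)
    return [
        name
        for name in sorted(set(src) | set(tgt))
        if src.get(name) != tgt.get(name)
    ]

# Illustration: the "migrated" source is missing a column.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INT)")
target.execute("CREATE TABLE customers (id INT, name TEXT)")
```

A real comparison would normalize the DDL text (whitespace, storage clauses) before diffing, but the shape of the check is the same.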
Migration Script Versioning
The first thing you should note is that the migration scripts are expected to be applied on top of a specific version of the schema. Scripts to move from version 1.3.0 to version 1.4.0 are not the same as scripts to move from 1.2.0 to version 1.4.0. In fact, there is a relationship between these scripts: the former, in a way, are a subset of the latter.³
At the very least, the migration scripts must be stored in the version control in a folder named after the version of the schema they are compatible with. If you plan to support multiple versions, you will need to have multiple folders with the scripts. Something along these lines:
src/main/resources/db/
<schema1>/
patches/
1.2.0/
p-001.sql
p-002.sql
1.3.0/
p-002.sql
schema/
indexes.sql
keys.sql
ri.sql
sequences.sql
tables.sql
views/
customers.vw
projects.vw
<schema2>/
...
Next, it is a good idea to mark a database schema with its version. This will allow you to implement an easy way to select the right set of migration scripts by querying the schema and selecting the right folder. I implement this with a DDL script, which creates a view named schema$version with the following content (Oracle syntax):
create view schema$version as
select
'${project.version}' as version
from dual
/
All scripts get pre-processed by the build system and ${project.version} gets replaced by the current version of the application. The tool that applies the migration scripts can do a quick select statement and use the version to derive the folder name. Feel free to adjust this as you see fit.
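Selecting the patch folder from the version marker can be sketched as follows; SQLite stands in for the real database, and the folder layout follows the example above:

```python
import sqlite3

def patch_folder(conn, base="db/<schema1>/patches"):
    """Select the right set of migration scripts by querying the
    schema for its version marker view."""
    (version,) = conn.execute('SELECT version FROM "schema$version"').fetchone()
    return f"{base}/{version}"

# Illustration: a schema stamped with version 1.3.0.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE VIEW "schema$version" AS SELECT '1.3.0' AS version""")
```

The same marker also gives the migration tool a cheap guard: if the version it finds is not one it has scripts for, it can refuse to run rather than apply the wrong patches.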
Rollback
When you apply the migration scripts to advance the database schema version while preserving the data, occasionally things may not go as scripted. The common advice given for such situations is to produce the rollback scripts using the same procedure as for the migration scripts, only reversing the source and the target schemas.
I reject this advice. There are a couple of common situations when a rollback would be necessary. One, we migrated the schema and then something else went wrong, for example, we discovered a critical bug during the post-deployment validation of the application. Two, the application of the migration scripts failed midway due to environmental issues, for example, the database ran out of TEMP space.
These situations are fundamentally different. In the first case, we have the database schema in a known consistent state. We can apply the rollback scripts and take it back to where we started. In the second case, the state of the schema is unpredictable. Trying to apply the rollback scripts at this point probably will not work.
Moreover, even when we start applying the rollback scripts from the expected state, there is still a small chance that the rollback script application may fail, and that will also leave the schema in an unpredictable state.
This can be mitigated by making the scripts idempotent, meaning they are coded to sense what is already present in the schema, so they can safely be applied to the same schema multiple times. Idempotent scripts can be applied to a schema in an inconsistent state and take it to a known consistent state.
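For illustration, here is what an idempotent migration step might look like in Python against SQLite; the sensing query is engine-specific, and the table and column names are hypothetical:

```python
import sqlite3

def ensure_column(conn, table, column, ddl_type):
    """An idempotent migration step: sense whether the column already
    exists and only alter the table when it does not, so the step can
    be re-applied to a schema left partially migrated."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT)")
ensure_column(conn, "customers", "name", "TEXT")
ensure_column(conn, "customers", "name", "TEXT")  # safe to repeat
```

On Oracle, the sensing query would go against USER_TAB_COLUMNS instead of the PRAGMA.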
Still, I prefer not to go down that path. Instead, for rollback purposes, I use methods not involving rollback scripts. For small schemas, it can be as simple as taking a snapshot backup just before the deployment. For larger databases, you can opt for pausing the replication to the disaster recovery site before the deployment. Then a rollback would constitute a switchover to the disaster recovery instance of the application.
On the other hand, if you like the rollback script approach, there is nothing here that would stop you from using it.
Operational Considerations
As you’ve already seen, modifying a database schema using the described approach involves many steps: the schema gets stripped to the bare bones, the structure of the tables changes, the data gets modified. Then everything gets assembled back to its running state.
Practically it means an outage. You must shut down the application instances accessing the schema before starting the migration. The duration of the outage really depends on the amount of data you’ll have to pump while making the schema changes. It also depends on the amount of time necessary to recreate derivative objects, such as indexes and materialized views, and to collect the schema statistics in the end. You definitely want to practice the timings on realistic data volumes. Dropping and recreating stateless objects in the schema takes almost no time.
Depending on the data volume, it may be practical to take a snapshot backup of the schema just before you start the migration. I strongly recommend it. If you have data replication configured to your disaster recovery site, you probably want to pause it. This will give you an insurance policy: you will be able to switch to the disaster recovery instance if everything else fails or, perhaps, use it to recover your primary instance. After the smoke tests are done, you can restart the replication.
If having an outage is not an option, you will need to invest in a blue-green deployment setup for your application. For stateful applications this is a non-trivial undertaking, and describing an implementation is way outside the scope of this article.
I can expect some readers to say: “But if we just use the migration scripts without stripping the schema naked, we may avoid the outage.” To that I have to say that it’s an illusion. The knock-on effect of schema changes will get you one day. If you have not experienced it yet, it simply means that you have not been doing schema migrations often enough. Doing it with the application instances up and accessing the schema is just plain dangerous. The outage window can probably be shorter in this case, that is true.
Tools
This is the part where my tone becomes a bit pessimistic. The reason is that I am not aware of a single tool that would do all of the above. There are tools that can do schema comparison efficiently and derive the migration scripts. There are tools that can sequence the application of the scripts to a schema. There are tools that can load data from CSV files. I would be hard-pressed to find tools making the distinction between different deployment scenarios outlined above or otherwise managing the whole thing end-to-end as described. But maybe I have not looked hard enough.
Earlier in my career I made the mistake of not open-sourcing a collection of Ant tasks that could have become a foundation for such a tool. Since then this collection has become a work for hire, and I see no chance of the company ever releasing it as open source. Ant also looks a bit out of vogue now.
The best I can try to do here is to describe the functionality of the tools needed to manage database schemas effectively. Maybe something good will come out of it.
Orchestration
The orchestration tool needs to run the drop sequence, the creation sequence, and the migration scripts in the right order. It also needs to distinguish between different deployment scenarios. Below is a short list of requirements.
- Support connectivity, metadata processing, and command generation for popular database engines. Ideally, the support should be extensible.
- Drop all objects of a given type with the names matching the include pattern and excluding the ones with the names matching the exclude pattern.
- Run SQL scripts in a given folder, either using include/exclude patterns and ordered lexicographically, or listed explicitly.
- Load data from CSV files, optionally truncating the table before load.
- Sequence object destruction, script execution, and data load tasks conditionally, based on environment and deployment configuration properties.
- Sense the current version of the target schema and use it in the migration script selection. Ideally, the tool should support migrating multiple schema versions to the current application version.
As you see, it is really not that much.
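The scenario dispatch such a tool would implement can be sketched as a mapping from scenario to ordered steps, mirroring the deployment scenarios table earlier; the step names are illustrative:

```python
# Each deployment scenario resolves to a sequence of the building
# blocks described earlier in the article.
SCENARIOS = {
    "from-scratch": [
        "create tables and sequences",
        "load static data",
        "run creation sequence",
    ],
    "drop-and-recreate": [
        "run drop sequence",
        "drop tables and sequences",
        "create tables and sequences",
        "load all data",
        "run creation sequence",
    ],
    "preserve-state": [
        "run drop sequence",
        "run migration scripts",
        "reload static data",
        "run creation sequence",
    ],
}

def plan(scenario: str):
    """Resolve a deployment scenario to its ordered list of steps."""
    return SCENARIOS[scenario]
```

In a real tool, each step name would map to a task (object destruction, script execution, data load) parameterized by the environment and deployment configuration.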
Schema Comparison
There are lots of tools available commercially that allow you to compare two database schemas and derive the migration script. Some of them are quite good. Unfortunately, they do not lend themselves to automation very well. They either tend to be UI-driven, or force you into a vendor-specific vision of how the rest of the automation should work. They also tend to work on rendered schemas and not on DDL scripts. The wish list for a schema comparison tool follows.
- CLI-based, with a good way to integrate it into the orchestration tool above, e.g. as a task to confirm that the schemas are identical.
- Concentrating only on objects that store state; the rest we drop and recreate anyway. Comparing derivative objects should come as an option.
- Offering a vendor-independent way to represent such schema objects. Vendor-specific DDL scripts should be derived from it.
- Producing the migration scripts directly, without first rendering the DDL scripts into database schemas.
Again, not much, but far from being trivial.
Conclusion
Here are the main takeaways from the article.
- Versioning of database schemas is a genuinely difficult problem. You have to tackle it though if your application manages state in a relational database.
- You need to keep both the DDL scripts representing the target configuration of the schema and the schema migration scripts if you want both to preserve the state during migrations and be able to reason about the resulting schema.
- When preparing the migration scripts, you only need to concentrate on the objects that store original state. Stateless objects and objects with derived state can be dropped and recreated. Practically it means tables and sequences, and even the latter is questionable.
- There are three principal deployment scenarios: create from scratch, drop and recreate, and migration. They have significantly overlapping parts using the same DDL scripts. You will need to exercise all three depending on your goals and your deployment process must support it.
- Some of the data you try to preserve is actually static. You can manage it differently.
- There are tools that can help with the script derivation and orchestration. Better tools are possible.
- You need to have other means besides the rollback scripts to ensure you can undo any damage to the schema. In my opinion, rollback scripts are just an optimization technique, which may not work out in the end.
¹ Sometimes the next value of a sequence can be derived from the data stored in the tables. In such cases, you can drop and recreate the sequence.
² Optionally, objects with derived state can be migrated. You opt for it when the time to migrate is significantly less than the time to drop and recreate.
³ It is tempting to lean on this relationship to avoid storing the same scripts multiple times. It is pretty clear how this would work for a linear progression of versions. Unfortunately, the most general case is not linear and schema changes are not commutative. It is definitely worth investigating further.