9 minute read

In the article Database Schema Versioning, I outlined the process for managing versions of a database schema. A significant part of this process is the ability to compare two database schemas—the source and the target—and derive SQL scripts that migrate the source schema to match the target without causing data loss.

Moreover, according to the article, only two types of schema objects keep the data we can never afford to lose: table and, to some extent, sequences. Other objects either do not store any state or their state can be derived from the state stored in other objects. This means that to derive migration scripts we only need to compare tables and sequences.

Finally, the article expresses the desire to perform the comparison without rendering the DDL scripts that define the schema in the actual database first. This is a common approach and there are plenty of tools that support it, yet, if this step can be skipped, it should be skipped.

Trouble with Renames

One obstacle that makes comparing DDL scripts—or rendered schemas, for that matter—difficult is that object names can change. However, we must not treat a rename as dropping the object and creating a new one, because that would cause data loss. The same applies to table column names.

It would be very helpful to use something other than names as identifiers and treat names solely as attributes. If we make this new identifier durable, in other words, make it persist for the whole life of the object or table column, the detection of renames becomes trivial.

Durable Identifiers

What properties do we want these durable identifiers to have? Below is my short list.

  • They must be unique. The goal is to uniquely identify an object in the schema or a column in the table.
  • They should look like valid SQL identifiers. I can envision using them in the migration scripts.
  • They should be reasonably short.
  • Given the same input DDL scripts, durable identifiers should have the same values. Basically, they should be derived from object and column names.
  • At the same time, I don’t want the connection between the original name and the derived durable identifier to be too strong to impede potential renames. In particular, I don’t want the original name to be recoverable from a durable identifier.

The last two points strongly suggest using a hash. SQL names tend to be case-insensitive unless special steps are taken, so it would name sense to normalize names before hashing. Hashes may start with a digit and therefore need a prefix to form valid SQL identifiers.

Derivation for Objects

Here is one way to derive durable identifiers from object names:

  1. Convert the name to lowercase.
  2. Compute the SHA-1 hash and represent it as a lowercase hex string.
  3. Take the first eight characters of the string.
  4. Prepend the id$ prefix.

Below the algorithm is expressed in Bash commands.

OBJECT_NAME=MyTable

echo "id\$$(echo -n "$OBJECT_NAME" |
    tr '[:upper:]' '[:lower:]' |
    sha1sum |
    cut -c -8)"

The table shows durable identifiers produced by the script for different sample object names.

Object NameDurable Identifier
oneid$fe05bcdc
TWOid$ad782ecd
mytableid$eab7bf0b
MyTableid$eab7bf0b
MYTABLEid$eab7bf0b

Truncating the hash to eight characters is somewhat arbitrary. It produces short names with a low probability of collision. The dollar sign is a valid character in SQL identifiers, so adding the id$ prefix yields a valid SQL identifier regardless of the hash value.

Derivation for Columns

For columns, I want the identifier to depend not just on the name of the column but also on the durable identifier of the table. Below is one possible way to achieve that.

  1. Get (or compute, if still unknown) the durable identifier for the table.
  2. Concatenate a dot and the column name to the identifier.
  3. Convert the result to lowercase.
  4. Compute the SHA-1 hash and represent it as a lowercase hex string.
  5. Take the first eight characters of the string.
  6. Prepend the id$ prefix.

Below is a Bash script implementing this sequence.

TABLE_NAME=mytable
COLUMN_NAME=Id

TABLE_ID=$(echo "id\$$(echo -n "$TABLE_NAME" |
    tr '[:upper:]' '[:lower:]' |
    sha1sum |
    cut -c -8)")

echo "$TABLE_ID"
echo "id\$$(echo -n "$TABLE_ID.$COLUMN_NAME" |
    tr '[:upper:]' '[:lower:]' |
    sha1sum |
    cut -c -8)"

The table that follows demonstrates the results of the script application to a few table and column names.

Table NameColumnTable IdentifierColumn Identifier
oneidid$fe05bcdcid$5e1e0b2e
onenameid$fe05bcdcid$c6ff0c0d
mytableIdid$eab7bf0bid$929aa2f4
MyTableIDid$eab7bf0bid$929aa2f4
MYTABLEidid$eab7bf0bid$929aa2f4

A couple of observations. First, the derivation is case-insensitive. Second, the durable identifier for the same column name depends on the durable identifier of the table it belongs to.

Handling Collisions

Using hash for derivation creates the possibility of hash collisions. Moreover, it is possible for object A to be renamed to object B by definition keeping the original durable identifier, and then another object A to be created in the schema. Using the derivation rules described above this new object should receive the same durable identifier as object B.

Regardless of how a collision occurs, durable identifiers must remain unique. Therefore, if a newly created durable identifier collides with an existing one, resolve it by repeating the derivation algorithm on the produced identifier until a unique value is obtained.

The same approach applies to column identifiers: repeat the derivation for the colliding column identifier until a unique one is produced.

The table below shows some examples of higher order identifiers.

RoundObject IdentifierColumn Identifier
0id$eab7bf0bid$929aa2f4
1id$858313ceid$2c1b26d6
2id$6bb3d87did$136b49af
3id$a1b4f4e3id$86b0ac78

One interesting consequence of such collision handling is the following. If we had a table with certain name, dropped it, and then recreated it with the same name, the produced durable identifiers will be the same both times. On the other hand, if we had a table, renamed it, and then created a new one with the same name as the original table, the durable identifier for the new table will be different. This is a nice property to have.

Using Durable Identifiers

Having defined how to produce durable identifiers, let’s look at how they can be of use to us.

Handling Renames and Moves

The first and the most obvious use of durable identifiers is to detect object renames as well as column renames and moves in tables. Durable identifiers would be used as keys in such comparisons, while the names of the objects and the columns and the positions of the columns would then be attributes.

Let’s consider some examples. Imagine we have the following two tables in the schema. I annotated all identifiers with their durable identifiers in the comments.

create table class /* id$8d767bf5 */ (
    id /* id$7e1c372d */ integer not null,
    name /* id$f6654666 */ varchar(128) not null
)
/

create table student /* id$204036a1 */ (
    id /* id$19935226 */ integer not null,
    class_id /* id$73598ce7 */ integer not null,
    first_name /* id$556dfe8b */ varchar(128) not null,
    last_name /* id$4fee3fe6 */ varchar(128) not null
)
/

One obvious defect of this schema is that each student can only attend one class. In the next version of the application we decide to fix this. We also make some other minor changes to the schema in the process. The new schema looks like this.

create table class /* id$8d767bf5 */ (
    id /* id$7e1c372d */ integer not null,
    name /* id$f6654666 */ varchar(256) not null
)
/

create table enrollment /* id$26375e68 */ (
    class_id /* id$c5e74ba0 */ integer not null,
    person_id /* id$71acef0f */ integer not null
)
/

create table person /* id$204036a1 */ (
    id /* id$19935226 */ integer not null,
    family_name /* id$4fee3fe6 */ varchar(128) not null,
    given_name /* id$556dfe8b */ varchar(128) not null
)
/

Note that the tables and their columns retain their durable identifiers as much as possible. This would allow us to derive the migration script that recognizes the renames and column moves correctly.

alter table class /* id$8d767bf5 */ modify
    name /* id$f6654666 */ varchar(256)
/

alter table student rename to id$204036a1
/

create table enrollment /* id$26375e68 */ (
    class_id /* id$c5e74ba0 */ integer not null,
    person_id /* id$71acef0f */ integer not null
)
/

insert into enrollment /* id$26375e68 */ (
    class_id, /* id$c5e74ba0 */
    person_id /* id$71acef0f */
)
select
    class_id as id$c5e74ba0,
    id as id$71acef0f
from id$204036a1
/

create table person /* id$204036a1 */ (
    id /* id$19935226 */ integer not null,
    family_name /* id$4fee3fe6 */ varchar(128) not null,
    given_name /* id$556dfe8b */ varchar(128) not null
)
/

insert into person /* id$204036a1 */ (
    id, /* id$19935226 */
    family_name, /* id$4fee3fe6 */
    given_name /* id$556dfe8b */
)
select
    id as id$19935226,
    last_name as id$4fee3fe6,
    first_name as id$556dfe8b
from id$204036a1
/

drop table id$204036a1
/

A few things worth mentioning here.

  • The column modification in class table is an optimization technique. In this particular case, the column can be extended without rebuilding the table.
  • Table student is renamed to id$204036a1 to allow its full rebuild. In our case it’s not strictly necessary because the table has a different name in the new schema; it’s included here for generality.
  • The full rebuild of student table is necessary because of the column order change. There is no optimization technique that could be used to achieve that.

Migration Script Naming

Another use for durable identifiers is to group all operations concerning one table into a single file and use the durable identifier for the table to name the file. This will make the structure of the change sets stable across multiple branches, which should help with merges in the version control system.

The theory behind such grouping is that each table can be transformed more or less independently of the others, which still remains to be seen. Table splits and merges can be handled by placing all relevant operations in the same script and using the durable identifier of the parent table to name the file.

Using the example above, the single migration script can be split in two:

File 8d767bf5.sql:

alter table class /* id$8d767bf5 */ modify
    name /* id$f6654666 */ varchar(256)
/

File 204036a1.sql:

alter table student rename to id$204036a1
/

create table enrollment /* id$26375e68 */ (
    class_id /* id$c5e74ba0 */ integer not null,
    person_id /* id$71acef0f */ integer not null
)
/

insert into enrollment /* id$26375e68 */ (
    class_id, /* id$c5e74ba0 */
    person_id /* id$71acef0f */
)
select
    class_id as id$c5e74ba0,
    id as id$71acef0f
from id$204036a1
/

create table person /* id$204036a1 */ (
    id /* id$19935226 */ integer not null,
    family_name /* id$4fee3fe6 */ varchar(128) not null,
    given_name /* id$556dfe8b */ varchar(128) not null
)
/

insert into person /* id$204036a1 */ (
    id, /* id$19935226 */
    family_name, /* id$4fee3fe6 */
    given_name /* id$556dfe8b */
)
select
    id as id$19935226,
    last_name as id$4fee3fe6,
    first_name as id$556dfe8b
from id$204036a1
/

drop table id$204036a1
/

From this point on, the files can start living their own lives, since the changes contained in them can never overlap.

Updated: