Migrating Data Structures in DynamoDB

DynamoDB provides a flexible storage solution for web applications. As an application evolves, the data structures stored in DynamoDB change to fit the needs of the system. Managing these data structure changes can be easy with the right patterns in place.

I created a small example of how my team migrates user data in DynamoDB. In this post, I'll highlight the key patterns we use and demonstrate some basic uses of DynamoDB in a NodeJS environment.

1. Scanning for Records

The first step to migrating data structures in DynamoDB is identifying the records we need to update. We can use DynamoDB’s scan method to do this.


const { DynamoDB } = require('aws-sdk');

const migrate = async () => {
  const db = new DynamoDB.DocumentClient();
  let lastEvalKey;
  do {
    // Find the relevant records
    const { Items, LastEvaluatedKey } = await db.scan({
      TableName: process.env.DYNAMODB_TABLE_NAME,
      // Find all records whose RecordId (the primary key column for this table) begins with "User:"
      FilterExpression: 'begins_with(RecordId, :x)',
      ExpressionAttributeValues: {
        ':x': 'User:'
      },
      ExclusiveStartKey: lastEvalKey,
    }).promise();

    lastEvalKey = LastEvaluatedKey;

    // Update the structure and save it to DynamoDB...
  } while (lastEvalKey);
};

There are two things to note here:

  • We scan our table for records that match a particular pattern. Since our table might have records other than user data, we want to find the identifying pattern for just the user-related records (in this case, all RecordIds that begin with "User:").
  • DynamoDB will only return up to 1MB of data at a time. So, depending on how many records are in the table, scan will return one page of data along with a LastEvaluatedKey. (Note that the FilterExpression is applied after the 1MB read, so a page may contain only some of the matching records.) Passing LastEvaluatedKey in as the ExclusiveStartKey makes sure the next scan picks up where the last one left off. We want to continue scanning the table until we no longer get a value for LastEvaluatedKey.

2. Updating Data Structures

With the records now loaded into memory, we can map an update function over the list. In my example, I created a few update functions; each takes in a single record and returns an updated version of that record. The migrate function now receives the update function as an argument:


const migrate = async (updateFn) => {
  const db = new DynamoDB.DocumentClient();
  let lastEvalKey;
  do {
    // Scan for relevant records...

    // Update all the items for the page of data.
    const updatedItems = Items.map(updateFn);

    // Save updated records to DynamoDB...
  } while (lastEvalKey);
};

const updateFn = (record) => {
  // Add a new field so users can have some favorite fruits selected.
  // Everyone loves apples, so make that the default.
  return {
    ...record,
    defaultFruitSelection: ['Apple'],
  };
};

As we add more migrations, the update function is the only thing we'll need to create. None of the work described in the other sections needs to be redone.

3. Applying the Updates to DynamoDB

After generating a list of updates, we can use DynamoDB’s put method to replace the outdated records.


const migrate = async (updateFn) => {
  const db = new DynamoDB.DocumentClient();
  let lastEvalKey;
  do {
    // Scan for relevant records...

    // Update all the items for the page of data...

    // Save updated records to DynamoDB
    await Promise.all(updatedItems.map((item) =>
      db.put({
        TableName: process.env.DYNAMODB_TABLE_NAME,
        Item: item,
      }).promise()
    ));
  } while (lastEvalKey);
};

Since put overwrites any existing item with the same primary key, the outdated records are replaced in place, assuming we haven't changed the primary key field.

It's also worth noting the Promise.all. Initially, I didn't have this in my example, opting to run each put operation sequentially. Running all the put invocations for a page in parallel speeds up the script considerably.
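One caveat: on a large table, firing every put for a page at once can outrun the table's write capacity. A minimal way to cap the concurrency, sketched here with a hypothetical chunk size, is to write the page in slices:


// CHUNK_SIZE is a made-up number; tune it to your table's write capacity.
const CHUNK_SIZE = 25;

const saveItems = async (db, items) => {
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    const chunk = items.slice(i, i + CHUNK_SIZE);
    // Puts within a chunk run in parallel; chunks run one after another.
    await Promise.all(chunk.map((item) =>
      db.put({
        TableName: process.env.DYNAMODB_TABLE_NAME,
        Item: item,
      }).promise()
    ));
  }
};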

4. Creating Migration Files

I manage the migrations for the data structures by creating a folder in the project dedicated to the different versions of the update functions. In the example project, the folder is called migrations. You’ll notice that the migrations directory contains files that all share a similar structure, for example:


// Migrate a record to the latest structure.
const up = (record) => {
  return {
    ...record,
    defaultFruitSelection: ['Apple'],
  };
};

// Reverse the migration by removing the new field.
const down = (record) => {
  const { defaultFruitSelection, ...rest } = record;
  return rest;
};

module.exports = {
  up,
  down,
  sequence: 1,
};

Each migration file contains an up function to migrate a record to the latest structure. I also create a down function in each file to allow changes to roll back if necessary. Finally, there’s a sequence number. I use this number to make sure migrations are applied in the correct order.
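As an illustration of how the sequence numbers play out, here's what a hypothetical second migration could look like (the favoriteFruits rename is invented for this example; it isn't in the project):


// A made-up follow-up migration: rename defaultFruitSelection.
const up = (record) => {
  const { defaultFruitSelection, ...rest } = record;
  return {
    ...rest,
    favoriteFruits: defaultFruitSelection,
  };
};

// Restore the original field name on rollback.
const down = (record) => {
  const { favoriteFruits, ...rest } = record;
  return {
    ...rest,
    defaultFruitSelection: favoriteFruits,
  };
};

module.exports = {
  up,
  down,
  sequence: 2,
};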

The script that runs the migrations will load each module found in the migrations directory:


const fs = require('fs');

const runMigrations = async () => {
  const migrationFiles = fs.readdirSync('./migrations');
  const migrations = migrationFiles
    .map((fileName) => require(`./migrations/${fileName}`))
    .sort((a, b) => a.sequence - b.sequence);
  for (const migration of migrations) {
    console.log('migrating: ', migration.sequence);
    await migrate(migration.up);
  }
};

Whenever we need to update the data structure, all we need to do is add a new file to the migrations directory and run this script.
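Assuming the script is saved as migrate.js (the file name is just for illustration), running the pending migrations is a single command:


node migrate.js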

5. Recording Applied Migrations in Batches

There's one problem with the script from the last section: it will run all the migrations, including ones that have been run previously. To address this, we need to record each migration as it runs. What I've implemented in the example is a batch system. Each run of the migrations creates a brand new batch and records the list of migration sequences that were run. On every run, the migration script checks all previous batches to see which migrations have already been applied.

Here’s the updated script:


const fs = require('fs');

const runMigrations = async () => {
  const migrationFiles = fs.readdirSync('./migrations');
  const migrations = migrationFiles
    .map((fileName) => require(`./migrations/${fileName}`))
    .sort((a, b) => a.sequence - b.sequence);
  const latestBatch = await getLatestBatch();
  const batch = latestBatch ? latestBatch.batchNumber + 1 : 1;
  for (const migration of migrations) {
    console.log('migrating: ', migration.sequence);
    await migrate(batch, migration.sequence, migration.up);
  }
};

Notice that we are now passing a batch number and sequence number into the migrate function. This allows us to record that the migration was run.
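The updated migrate isn't shown in full; a minimal sketch of the change might look like this, with the body unchanged from the earlier sections:


const migrate = async (batch, sequence, updateFn) => {
  // Scan, update, and put, exactly as in sections 1-3...

  // Then record that this sequence ran as part of this batch.
  await recordMigration(batch, sequence);
};

The recordMigration call at the end is the new bookkeeping. Here's what it might look like: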


const recordMigration = async (batch, sequence) => {
  const db = new DynamoDB.DocumentClient();
  const { Item } = await db.get({
    TableName: process.env.DYNAMODB_TABLE_NAME,
    Key: {
      // We are using the same table, but a different primary key value than "User"
      RecordId: 'Migrations'
    },
  }).promise();

  // Find the set of batches, or initialize a new set of batches if none are present.
  const existingBatches = Item ? Item.Batches : {};

  // Check to see if the batch we are running already exists, or initialize a new one.
  const batchToUpdate = existingBatches[batch] || [];

  // Record the sequence we just ran
  batchToUpdate.push(sequence);

  // Update the DynamoDB record with the latest updates.
  const updatedItem = {
    ...Item,
    RecordId: 'Migrations',
    Batches: {
      ...existingBatches,
      [batch]: batchToUpdate,
    },
  };

  // Save the record.
  await db.put({
    TableName: process.env.DYNAMODB_TABLE_NAME,
    Item: updatedItem,
  }).promise();
};
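The getLatestBatch helper called in runMigrations lives in the example project. A minimal sketch, assuming the same Migrations record structure used above, might look like this:


const getLatestBatch = async () => {
  const db = new DynamoDB.DocumentClient();
  const { Item } = await db.get({
    TableName: process.env.DYNAMODB_TABLE_NAME,
    Key: { RecordId: 'Migrations' },
  }).promise();

  // No record yet means no migrations have ever run.
  if (!Item || !Item.Batches) return null;

  // Batch keys are batch numbers; pick the highest one.
  const batchNumber = Math.max(...Object.keys(Item.Batches).map(Number));
  return { batchNumber, sequences: Item.Batches[batchNumber] };
};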

There's a lot more to batches in this migration system. I'd recommend looking through the example project to see the different uses of batches and how they allow us to roll back migrations as needed.
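To give a flavor of what that enables, here's a hedged sketch of a rollback that replays the latest batch's down functions in reverse sequence order. It reuses the simpler migrate(updateFn) from sections 1-3; the example project's version would also clean up the batch record:


const rollbackLatestBatch = async () => {
  const latestBatch = await getLatestBatch();
  if (!latestBatch) return;

  // Load only the migrations recorded in the latest batch, newest first.
  const migrations = fs.readdirSync('./migrations')
    .map((fileName) => require(`./migrations/${fileName}`))
    .filter((migration) => latestBatch.sequences.includes(migration.sequence))
    .sort((a, b) => b.sequence - a.sequence);

  for (const migration of migrations) {
    console.log('rolling back: ', migration.sequence);
    await migrate(migration.down);
  }
};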

More to Explore

Assuming you have your own AWS account, you can continue to explore this example project for some inspiration on how to manage data in DynamoDB. My team’s usage of DynamoDB is fairly minimal, but I think these patterns that we’ve developed are a good foundation. Let me know what works and what could be improved in the comments below.