Automated Data Migration testing for Drupal

Migration testing is to ensure that your application gets migrated from one system to another without affecting the legacy data and the functionalities. While the migration is in process, the testing team needs to ensure the integrity of the system is intact and that there is no data loss while migrating the data.

Data loss is the most common issue we come across in Migration Testing, and when we have huge data to be migrated we cannot completely depend on manual testing. Because it becomes impossible to verify content and compare it with the source data. If we have automation in place to check the vast amount of data, it will help us find data loss in less time and will give us more confidence in the migrated system.

Drupal 7 to Drupal 8 Migration

Let me share an example, we had a requirement to migrate a university site from Drupal 7 to Drupal 8. Since it was a university site it had heavy data/content including various courses, programs, staff, faculty, events etc. and we needed to ensure that all the data gets migrated with no data loss.

We had around 24 content types and each content type having thousands of nodes. It was impossible to verify this number of nodes manually. So we created an automation framework that helped us verify all nodes.

Our Tool Selection

We started exploring various options using existing functional tools like Behat, Selenium, Cypress, etc. but we wanted to avoid the browser interaction in this data verification process because visiting node pages and verifying data would require a lot of time and when it comes to 1000+ nodes it could take hours. Choosing functional tools which use browsers never suffice our needs when it comes to large amounts of data.

We Chose Drupal

Our source (D7) and destination (D8) were in MySql DB format, so we decided to create a custom tool to verify the data migration. We created a Drupal module having a custom Drush, command that would verify the migrated data. Helping us execute test fasters and at the same time utilise the existing drupal functionalities and its framework in our tool.

Our Approach

Creating Drush Command and Managing DB connections

We created the custom Drush 9 command - “drush migration:test <content_type_machine_name>”


drush.services.yml
services:
 migrate_testing.commands:
   class: \Drupal\migrate_testing\Commands\MigrationTestingCommand
   tags:
     - { name: drush.command }
/**
* Drush command to run Migration testing
*
* @command migration:test
* @aliases mt
* @param null $content
* @option limit
*   number of nodes.
* @throws \Masterminds\HTML5\Exception
*/
public function migrationTesting($content = NULL, $options = ['limit' => 0]) {
 // actual code goes here.
}

The Drush command accepts two parameters - a content type name and the number of nodes to check, by default it would check all nodes.

Apart from this, we maintained two DB connections in the Settings.php file by adding another array for D7 - DB connection.

Mapping Files

We required a few inputs before running the migration tests - we needed the information of each content type and its field mapping in D8. This helped us compare the data with the exact node and data inside each field.

We then created Mapping files in YAML format, each of these files was named in <content_type_machine_name>.yml format, each file had the following format:

‍


content: <type_of_content>
type: <content_type_machine_name>
table: <migration_table_name>
fields:
 -
   d7:
     name: <d7_Field_machine_name>
     type: <d7_field_type>

   d8:
     name: <d8_Field_machine_name>
     type: <d8_field_type>
 -
   d7:
     name: <d7_Field_machine_name>
     type: <d7_field_type>
  
   d8:
     name: <d8_Field_machine_name>
     type: <d8_field_type>

The Fields array holds all the information about the fields from D7 to its corresponding D8 field and the Drush command uses these YAMLs as a source of input before querying the DB’s.

Field Types Configurations

Each field type has specific columns in DB, and those particular columns were used to verify the data.

For example, each field having type text (we get this field type from a mapping file) will have a field table with column name as - <field_machine_name_value> as a column that will hold the actual data. We created field_type.yml which holds all the column information of each field type that will be used to fetch specific data for comparison.

‍


d7:

 text:
   - "{fieldname}_value"

 text_long:
   - "{fieldname}_value"

 string:
   - "{fieldname}_value"

 email:
   - "{fieldname}_email"

 link_field:
   - "{fieldname}_url"
   - "{fieldname}_title"

 list_text:
   - "{fieldname}_value"

 file:
  - "{fieldname}_fid"
  - "{fieldname}_description"

 image:
   - "{fieldname}_fid"
   - "{fieldname}_alt"
   - "{fieldname}_title"
   - "{fieldname}_width"
   - "{fieldname}_height"

d8:

 text:
   - "{fieldname}_value"

 text_long:
   - "{fieldname}_value"

 string:
   - "{fieldname}_value"

 email:
   - "{fieldname}_value"

 link_field:
   - "{fieldname}_uri"
   - "{fieldname}_title"

 link_field:
   - "{fieldname}_uri"
   - "{fieldname}_title"
   - "{fieldname}_options"

 string:
   - "{fieldname}_value"

 image:
   - "{fieldname}_target_id"
   - "{fieldname}_alt"
   - "{fieldname}_title"
   - "{fieldname}_width"
   - "{fieldname}_height"

Here the {fieldname} will get replaced with the actual field name from the mapping file. We had to maintain field type for D7 and D8 both because in D8 few column names were updated.

Fetching all D7 and D8 Node values

To get the node data from the DB, we created a custom function that will dynamically create SQL queries based on mapping and config files for both D7 and D8 databases and fetch the values.

‍


public function getNodeValues($fields, $d_version, $nids) {
 $prefix = $d_version == 'd8' ? 'node__' : 'field_data_';
 $node_table = $d_version == 'd8' ? 'node_field_data' : 'node';
 $columns = "nid,title,type";
 $tables = $node_table;
 foreach ($fields as $field) {
   $field_name = $field[$d_version]['name'];
   $field_type = $field[$d_version]['type'];
   $field_columns =  $this->getFieldColumnList($field_name, $field_type, $prefix, $d_version);
   $columns .= $field_columns;
   $tables .= " left join {$prefix}{$field_name} on {$prefix}{$field_name}.entity_id = {$node_table}.nid";
 }
 $query = "SELECT {$columns} from {$tables} WHERE {$node_table}.nid in ({$nids})";
 return $this->getValues($query, $d_version);
}

Since we have different naming conventions for tables in D7 and D8 we require to maintain the version name and change the prefix of the field table based on a version like D8 store field values in node_<field_name> and D7 stores in field_data_<field_name> same goes for storing node values D8 uses node_field_data table where D7 uses node table.

We used the fields option for mapping files and for each field, we use the field config file mentioned above to fetch the column of respective fields. Once all the tables and columns are collected we Command-Line built and executed the query on the respective database.

Comparing Data and Generating HTML Report

Once the data for both D7 and D8 is fetched, both the sets of arrays were compared and failed nodes were logged in the reporter which then generated the HTML report after comparing all the nodes.

‍


foreach ($d7_nodes as $node) {
 $nid = $node['nid'];
 if (key_exists($nid, $d8_nodes)) {
   for ($i = 0; $i < count($column_d7); $i++) {
     $d8_column = $column_d8[$i];
     $d7_column = $column_d7[$i];
     if (is_array($node[$d7_column]) && is_array($d8_nodes[$nid][$d8_column])) {
       if (array_diff($node[$d7_column], $d8_nodes[$nid][$d8_column])) {
         $reporter->failed($nid, $node[$d7_column], $d8_nodes[$nid][$d8_column], $d7_column);
       }
     }
     else {
       if ($node[$d7_column] != $d8_nodes[$nid][$d8_column]) {
         $reporter->failed($nid, $node[$d7_column], $d8_nodes[$nid][$d8_column], $d7_column);
       }
     }
   }
 }
 else {
   $reporter->failed($nid, $node, $d8_nodes[$nid]);
 }
}

Reports

Reporting was important so all stakeholders could understand the reports and identify the issues mentioned in the reports. We created reports in 2 formats:

Command Line - This was easy to understand and see the progress of test execution. It gives an overview of the number of tests that failed and passed and also provides a link to the HTML. report
HTML Report - The HTML report had a detailed analysis of the test run, with all the information of failures and passed test cases.

- The Test Information section provided the basic information of the test and content under the test like content name, migration table, mapping file, etc.

Automated Data Migration testing for Drupal

- The Result Section provided a pie chart with the exact number of pass and failures.

- The Fields Report Section highlights the fields’ columns that failed for a particular number of nodes.

As shown in the above image, the profile visibility field has seen issues with around 10000+ nodes.

- Issues section - This shows the values that are missing or are not matching with the D8 migrated data.

Referring to the above images - we can go through each node and verify the issue and compare the D7 and D8 values that are seen after migration. For example in the above image for field_tags, the tid is missing in D8 which means this tag was missing in the node after migrating it to D8.

Challenges

False Positives

Because of the client requirements and changes in data format, there were various false positives. There were fields from D7 which were stored in different formats in D8. Following are a few examples:

Boolean values 0 and 1 changed to true and false
Date UTC timestamp got changed to other Date formats
URL was prefixed in D8 as “internal:/”
The Name field was getting divided into 2 fields First Name and Last Name
Multiple fields were merged into 1 field

To handle such false positive scenarios we added one more step of preprocessing before comparing the D7 and D8 data. Prepossessing functions were called based on the field type, field name or content type.

‍


public function executePreprocess($name, $fields, $source, $dest) {
 $this->name = $name;
 $this->fields = $fields;
 $this->source = $source;
 $this->dest = $dest;
foreach ($fields as $field) {
  // Preprocessors field types.
  if (method_exists(self::class, $field['type'])) {
    call_user_func(array(self::class, $field['type']), $field);
  }
  // Preprocessors field specific.
  if (method_exists(self::class, $field['name'])) {
    call_user_func(array(self::class, $field['name']), NULL);
  }
}
  // Preprocessors based on the Main type like node, field_collection name.
  if(method_exists(self::class, $name)) {
    call_user_func(array(self::class, $name), NULL);
  }
return ['source' => $this->source, 'dest' => $this->dest];
}

The prepossessing class had all preprocessing functions based on the specific field type, field name or content type. These preprocessors updated the D8 data as per changes required and returned the values of the updated array.

Here’s an example of Link Preprocessing for one field:

‍


public function field_link_unlimited() {
 foreach ($this->dest as $key => $value) {
   $this->dest[$key]['field_link_unlimited_url'] = str_replace("internal:/", "", $value);
 }
}

Field Collections to Paragraph Migration

D7 had a list of field collections that were used in nodes, these field collections were going to get merged as paragraphs in D8 and had around 44 types of field collections, which were directly mapped to paragraph fields. To handle this separately we directly verified field collections data with paragraphs using different Drush commands because the structure and way of storing data in field collections and paragraphs were different from normal fields.

So we created 2 YAML fields which stored a list of field collections fields and another for paragraph fields.

An example of FieldCollection.yml:

‍


field_accordion_item:
 -
   name: field_title
   type: text
 -
   name: field_body
   type: text
field_act_project_team_role:
 -
   name: field_act_alumni_other
   type: text
 -
   name: field_act_full_name
   type: text
 -
   name: field_act_role
   type: text
 -
   name: field_alumni_team_member
   type: entity_reference

Example for ParagraphType.yml

‍


field_accordion_item:
 -
   name: field_title
   type: text
 -
   name: field_text_editor
   type: text
field_act_project_team_role:
 -
   name: field_person_type
   type: text
 -
   name: field_first_name
   type: text
 -
   name: field_last_name
   type: text
 -
   name: field_title_department
   type: text
 -
   name: field_profile_ref
   type: entity_reference

As shown above, each field collection type had a paragraph type mapped and every field in field collection was mapped to fields in paragraph type. The rest of the process was the same as for nodes; only table and column prefix would be changed.

Demo

Outcomes

We were able to find and fix more than 5000+ migration issues, which was difficult to find as part of manual testing.
The creation of the test case was easy and quick, the user had to only create a mapping YAML file.
Test execution was fast, we were able to test around 5000+ nodes in less than 10secs.
Reduced the testing efforts and time is taken for testing.
Good and readable reports helped all stakeholders to understand the issues.
Dev-testing was easy and quick for developers, they were able to execute tests and identify bugs before passing them to the testing team.
This framework can be reused and implemented for various content like taxonomy migrations, user migrations etc.
Improved confidence of stakeholders in delivering huge data migration without data loss.