Remapping Datasets

Remapping Datasets

Background

I've been involved in COVID-19 related data gathering. Data was capured in CASTOR, but needed to be remapped to WHO standard. This meant mapping variables, remapping values (e.g. NO: 2 in the one NO: 0 in the other), adding missing units, substituting keys based on a key table, and some other stuff.  While the script created for CASTOR -> REDCAP is quite complicated, I've made a simpler version for Remapping Datasets in Python that is available on Github.

The challenge was to involve people who could do and check the mapping, but have not so much of a background in scripting (and no desire to learn it). Hence a solution based on Python and Excel. Python to do the stuff, Excel for mapping and configuration in a way that a normal user would find this not too difficult.

Requirements


How to use

For how to use it, see the instructions in the table Explanation in conversion.xlsx (part of the zip / available on GitHub)
  • Jupyter Notebook: it will ask which Excel to use for mapping & configuration
  • From the command line: python conversion.py conversion.xlsx (other Excel can be used)

What the script can do

You can have multiple conversion.xlsx; each with its unique name, own mapping & configuration
  • Substitute IDs based on the key list
  • Remap variables with or without value conversion
  • Swap two variables; A->B & B->A
  • Convert a date format (e.g. %m/%d/%Y) into an ISO8601 date (%Y-%m-%d)
  • Add variable with duration in days between two valid ISO8601 dates
  • Remap Option group Variables to Check Box group Variables (granted only one option can be remapped per entry)
  • Remap Check Box group Variables to Option group Variables (when multi check box, then the 'option' group gets multi options comma separated)
  • Add unit variables (e.g. add variable Temperature_Unit = 1 (could be Celcius) when source variable has a value, empty when empty
  • Removes all 'redundant variables (non mapped variables in the source)
  • Orders the variables based on the Excel order list of target variables

The output

Remapping creates a time-stamped:
  • CSV, and
  • Excel