To achieve deduplication in the data warehouse, patient matching starts at the facility level in the EMR and is strengthened in DWAPI, as illustrated below.
Patient matching in DWAPI (At the facility level)
In the absence of facility-based master patient index (MPI) services, we developed a patient matching algorithm to support de-duplication of patient records at the facility level and to prevent multiple registrations of the same client. DWAPI uses a combination of deterministic and probabilistic matching on patient demographics.
Summary of approaches to deduplication/Matching
- Soundex:
Applied on the first name (Assumption for English Names)
- Double metaphone:
Applied on the second name (Assumption for African Names)
- Output from 1 & 2:
Concatenated with gender and DoB to generate a PKV (Patient Key Value)
- Jaro-Winkler (Probabilistic scoring):
Uses the Patient Key Value to compute a matching score against other PKVs and generate a list of possible duplicates. The score is compared against a threshold to decide between automated and manual merging of patient records
- Deterministic matching
Using other MPI variables to double check & verify the matches generated from the above approach
Preparing the MPI variables for matching:
The MPI data is prepared for parsing as follows:
Step 1: All names (patient names and next-of-kin (NOK) names) are normalized and stripped of leading and trailing spaces, commas, and any special characters
Step 2: All dates are converted to YYYY-MM-DD format
Step 3: The sex variable is mapped to M or F
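The three preparation steps can be sketched in Python as below. This is a minimal illustration, not the DWAPI implementation: the function names, the list of accepted input date layouts, the raw sex values in `SEX_MAP`, and the choice to upper-case names are all assumptions made for the example.

```python
import re
from datetime import datetime

# Input date layouts to try, in order (assumed for illustration).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d")

# Hypothetical raw values mapped to the M/F codes described in Step 3.
SEX_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_name(name):
    """Step 1: drop commas, digits and special characters, then trim spaces."""
    if name is None:
        return ""
    # Keep letters (including accented ones) and whitespace only.
    cleaned = re.sub(r"[^\w\s]|[\d_]", "", name)
    # Collapse internal runs of whitespace and strip both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Upper-casing is an assumption so that comparisons ignore case.
    return cleaned.upper()


def normalize_date(value):
    """Step 2: return the date as YYYY-MM-DD, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_sex(value):
    """Step 3: map a raw sex value to M or F, or None if unrecognised."""
    return SEX_MAP.get(value.strip().lower())
```

Returning `None` for unparseable dates and unrecognised sex values, rather than guessing, keeps bad inputs visible to later steps.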
After the data is normalized and prepared for deduplication, a Patient Key Variable (PKV, referred to above as the Patient Key Value) is created by:
- Prefixing the sex of the patient
- Concatenating the Soundex value of the FirstName
- Concatenating the Double Metaphone value of the LastName
- Concatenating the DOB in ISO format (YYYY-MM-DD)
The resulting PKV is <gender>soundex(firstname)dm(lastName)DoB.
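The PKV construction can be sketched as follows. The Soundex function is a plain implementation of the standard American Soundex rules. Double Metaphone is too long to reproduce here, so the sketch takes the last-name encoder as a parameter. The `consonant_skeleton` function is only a placeholder and is NOT Double Metaphone; in practice a real Double Metaphone implementation would be passed in. The function names and the sample patient are hypothetical.

```python
# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def consonant_skeleton(name):
    """Placeholder last-name encoder. This is NOT Double Metaphone."""
    letters = [c for c in name.upper() if c.isalpha()]
    return "".join(c for i, c in enumerate(letters) if i == 0 or c not in "AEIOU")


def build_pkv(sex, first_name, last_name, dob_iso, last_name_encoder):
    """Concatenate sex, Soundex(first name), encoded last name and ISO DOB."""
    return f"{sex}{soundex(first_name)}{last_name_encoder(last_name)}{dob_iso}"


# Hypothetical patient, using the placeholder encoder for the last name.
pkv = build_pkv("F", "Mary", "Wanjiku", "1985-03-07", consonant_skeleton)
print(pkv)  # FM600WNJK1985-03-07
```

Passing the encoder as an argument keeps the key layout independent of which phonetic library is used for the last name.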
The deduplication algorithm starts with a deterministic pass through the data, as described in the flow below.
- Deterministic Matching
Deterministic matching is carried out in two phases as outlined below.
Phase One: Matching CCC Numbers
- Step 1: Group the data set on CCC Numbers to identify any duplicate CCC numbers. Assumption: CCC number is a Unique number for the HIV program.
- Step 2: For each record found in a group, compare the Patient Key Variable (PKV) – the output of this PKV is stored in a database for purposes of faster querying of the duplicate records.
- Step 3: If the PKV matches, this is a duplicate record; go to step 4. Otherwise go to step 5.
- Step 4: Merge the patient profile and delete one record (log the delete in the Audit table with the reason for delete as “duplicate”).
- Step 5: If the PKV does not match, do nothing to the records.
- Step 6: Move to the next group – see step 2.
Outcome: All duplicate CCC numbers that have a 100% match on PKV are collapsed into one record, and the duplicate record is flagged.
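A minimal Python sketch of Phase One is shown below. It assumes records are dictionaries with hypothetical `id`, `ccc_number` and `pkv` keys, that "merging" means filling the surviving record's empty fields from the duplicate, and that a null PKV never counts as a match. None of these details are specified above; the audit log is represented as a plain list.

```python
from collections import defaultdict


def merge_profile(survivor, duplicate):
    """Fill empty fields on the surviving record from the duplicate (assumed rule)."""
    for field, value in duplicate.items():
        if survivor.get(field) in (None, ""):
            survivor[field] = value


def phase_one(records):
    """Collapse records that share both a CCC number and a PKV.

    Returns (kept_records, audit_log).
    """
    # Step 1: group the data set on CCC number.
    groups = defaultdict(list)
    for record in records:
        groups[record["ccc_number"]].append(record)

    kept, audit_log = [], []
    # Step 6: process each group in turn.
    for group in groups.values():
        first_seen = {}  # PKV -> first record with that PKV in this group
        for record in group:
            pkv = record.get("pkv")
            # Steps 2-3: compare the PKV against earlier records in the group.
            if pkv is not None and pkv in first_seen:
                # Step 4: merge, drop the duplicate and log the delete.
                survivor = first_seen[pkv]
                merge_profile(survivor, record)
                audit_log.append({"deleted_id": record["id"],
                                  "kept_id": survivor["id"],
                                  "reason": "duplicate"})
            else:
                # Step 5: no PKV match, so the record is left untouched.
                if pkv is not None:
                    first_seen[pkv] = record
                kept.append(record)
    return kept, audit_log
```

Keeping the first record seen as the survivor is an arbitrary choice for the sketch; a real system might prefer the most complete or most recent record.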
Phase Two: Matching PKV
Assumptions: (1) The Soundex algorithm detects spelling variations in names that sound the same. (2) The Double Metaphone algorithm detects spelling variations in African names. (3) Phase One flagged all possible CCC number duplicates.
- Step 1: Group the data set on PKV to find any duplicate PKV records.
- Step 2: For each record found in a group, compare the patient telephone numbers. If matched, go to 3; else go to 4.
- Step 3: Flag matches as possible duplicates, generate a Patient UPI Key, and link the UPI key and CCC numbers in the Patient Program Numbers. Go to 7.
- Step 4: If not matched, compare the patient NOK telephone numbers. If matched, go to 3; else go to 5.
- Step 5: If NOK details are not matched, compare the patient Start ART Date and regimen details. If matched, go to 3; else go to 6.
- Step 6: If not matched, flag the record for probabilistic matching using a different algorithm, such as Jaro-Winkler.
- Step 7: Move to the next group – see step 2.
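The Phase Two cascade can be sketched as below. The field names (`phone`, `nok_phone`, `art_start_date`, `regimen`, `upi`) are hypothetical. The sketch assumes that empty values never count as a match, and that the ART check requires both the start date and the regimen to agree. It uses a random UUID for the UPI because the text above does not describe how the key is generated.

```python
import uuid
from collections import defaultdict
from itertools import combinations


def same_value(a, b, field):
    """True when both records hold the same non-empty value for `field`."""
    value = a.get(field)
    return value not in (None, "") and value == b.get(field)


def confirming_check(a, b):
    """Run the step 2, 4 and 5 comparisons in order; return the first that matches."""
    if same_value(a, b, "phone"):
        return "patient_phone"
    if same_value(a, b, "nok_phone"):
        return "nok_phone"
    if same_value(a, b, "art_start_date") and same_value(a, b, "regimen"):
        return "art_details"
    return None


def phase_two(records, new_upi=lambda: uuid.uuid4().hex):
    """Return (links, pairs_for_probabilistic_pass) for records grouped on PKV."""
    # Step 1: group on PKV; records with a null PKV cannot be grouped.
    groups = defaultdict(list)
    for record in records:
        if record.get("pkv"):
            groups[record["pkv"]].append(record)

    links, needs_probabilistic = [], []
    for group in groups.values():  # Step 7: next group
        for a, b in combinations(group, 2):
            matched_on = confirming_check(a, b)
            if matched_on is None:
                # Step 6: hand the pair over to the probabilistic pass.
                needs_probabilistic.append((a["id"], b["id"]))
                continue
            # Step 3: flag as possible duplicate and link CCC numbers to one UPI.
            upi = a.get("upi") or b.get("upi") or new_upi()
            a["upi"] = b["upi"] = upi
            links.append({"upi": upi,
                          "ccc_numbers": [a["ccc_number"], b["ccc_number"]],
                          "matched_on": matched_on})
    return links, needs_probabilistic
```

Recording which check confirmed each link (`matched_on`) is not required by the steps above, but it makes the result easier to review manually.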
- Probabilistic Matching
For probabilistic matching, the project applies the Jaro-Winkler distance algorithm to the PKV to detect possible duplicate patients. A threshold of 0.96 is currently applied. Records identified as possible matches are then passed through the algorithm below:
- Step 1: Group the data by PKV values and Jaro-Winkler score.
- Step 2: For each record found in a group, compare the patient telephone numbers. If matched, go to 3; else go to 4.
- Step 3: Flag matches as possible duplicates, generate a Patient UPI Key, and link the UPI key and CCC numbers in the Patient Program Numbers. Go to 7.
- Step 4: If not matched, compare the patient NOK telephone numbers. If matched, go to 3; else go to 5.
- Step 5: If NOK details are not matched, compare the patient Start ART Date and regimen details. If matched, go to 3; else go to 6.
- Step 7: Move to the next group – see step 2.
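The scoring step that feeds this flow can be sketched as follows. The code implements the standard Jaro-Winkler similarity (prefix scale 0.1, common prefix capped at four characters) and keeps PKV pairs that reach the 0.96 threshold. Treating the threshold as inclusive, comparing every pair of PKVs, and the sample PKV strings are assumptions for illustration.

```python
from itertools import combinations

THRESHOLD = 0.96  # threshold stated in the text; inclusive comparison is assumed


def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def possible_duplicates(pkvs, threshold=THRESHOLD):
    """Return (pkv_a, pkv_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(pkvs, 2):
        score = jaro_winkler(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 4)))
    return pairs


# Hypothetical PKVs: the first two differ only in the last digit of the DOB.
sample = ["FM600WNJK1985-03-07", "FM600WNJK1985-03-08", "MJ500KMN1990-11-21"]
candidates = possible_duplicates(sample)
```

Comparing every pair is quadratic in the number of records, so a production system would normally restrict comparisons to blocks of records that already share some attribute.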
Note: Currently a facility sends data from 3 dockets: HTS, C&T and MPI. Where all 3 datasets exist, the PKV value generated from the MPI docket is matched and appended to the HTS Clients and Stg_Patients tables. Where data for MPI is missing, the resulting PKV value is recorded as null. The MPI docket does not include PII.