Researchers used an ensemble learning approach to develop an automated tool that de-identifies EHR data by flagging personally identifiable information in clinical notes and replacing it with surrogates that conceal the identifiers, according to a study published in Patterns.
Certain types of information constitute personally identifiable information (PII), and using that data in research projects raises privacy and security concerns. Specifically, identifiable information can include names, geographic data (state, address, zip code, etc.), birth dates, telephone numbers, Social Security numbers, and other information that can identify a specific individual.
Individuals are increasingly recognizing the importance of sharing their data to help future research, so long as that information is kept private and secure. With the rise of patient data sharing and the promise of precision medicine, it will be critical for medical professionals to generate de-identified patient EHR data.
Researchers leveraged ensemble learning, which combines several machine learning techniques into one predictive model, and incorporated deep-learning models and rule-based methods to create what they called the nference de-identification system. The system detected identifiers and replaced them with plausible surrogates, which the authors said also helps conceal any residual identifiers.
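To illustrate the general idea of combining detectors and substituting surrogates, here is a minimal Python sketch. It is not the nference system: the regular expressions, the `KNOWN_NAMES` lookup (a stand-in for a trained deep-learning model), the `SURROGATES` table, and all function names are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII label)

# Hypothetical rule-based patterns; a real system would need many more.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Stand-in for a learned model: a simple name lookup (assumption).
KNOWN_NAMES = {"John Smith"}

# Hypothetical fixed surrogates, one per label.
SURROGATES = {"NAME": "Alex Doe", "PHONE": "555-010-0000", "DATE": "01/01/2000"}


def rule_detector(text: str) -> List[Span]:
    """Flag phone numbers and dates with regular expressions."""
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    return spans


def model_detector(text: str) -> List[Span]:
    """Placeholder for a deep-learning tagger; here it only matches known names."""
    spans = []
    for name in KNOWN_NAMES:
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "NAME"))
    return spans


def ensemble_detect(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Take the union of all detectors' spans, merging any that overlap."""
    spans = sorted(s for detect in detectors for s in detect(text))
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), prev[2])
        else:
            merged.append((start, end, label))
    return merged


def deidentify(text: str) -> str:
    """Replace each detected span with a surrogate, working right to left
    so that earlier offsets stay valid."""
    spans = ensemble_detect(text, [rule_detector, model_detector])
    for start, end, label in reversed(spans):
        text = text[:start] + SURROGATES[label] + text[end:]
    return text


print(deidentify("John Smith called 507-555-1234 on 3/14/2019."))
```

Taking the union of the detectors favors recall, since a span is flagged if any detector finds it. That is one plausible design choice for this sketch, not a description of how the study's ensemble weighs its components.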
Researchers tested the nference de-identification system against six other tools. The nference tool achieved higher recall and precision on the i2b2 2014 dataset. It also outperformed the other tools on a dataset of 10,000 notes from the Mayo Clinic.
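For readers unfamiliar with the two metrics, the short Python sketch below shows how precision and recall can be computed for PII detection. It assumes exact-match scoring of labeled spans, which may differ from the evaluation protocol used in the study, and the example spans are invented.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII label)


def precision_recall(predicted: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float]:
    """Precision: share of flagged spans that are truly PII.
    Recall: share of true PII spans that were flagged."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three true PII spans, four flagged spans (one spurious).
gold = [(0, 10, "NAME"), (18, 30, "PHONE"), (34, 43, "DATE")]
predicted = gold + [(50, 54, "NAME")]
print(precision_recall(predicted, gold))  # (0.75, 1.0)
```

In de-identification, recall is usually the more safety-critical number, because a missed identifier leaks PII, while low precision mainly costs data utility.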
“The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries,” the study authors wrote.
The nference de-identification system addressed limitations of other tools and methods while achieving high recall and precision. The authors also identified areas where de-identification could still improve, including making use of existing knowledge graphs and language models, handling poor sentence quality, and integrating unsupervised methods.
In each of these areas, the authors described ways to improve de-identification system performance.
First, existing solutions and models could reduce the incorrect tagging of biological terms as PII.
“For example, if a patient’s note contains the sentences ‘Patient diagnosed with lung cancer’ and ‘ECOG performance status was determined to be 2,’ ECOG would not be treated as PII, since it has a strong biological association with lung cancer based on the knowledge graph,” the study authors explained.
Additionally, the de-identification process could recover biological terms that were incorrectly tagged as PII (false positives).
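The ECOG example can be sketched in Python as a simple post-processing filter. This is only an illustration under strong assumptions: `KNOWLEDGE_GRAPH` is a toy dictionary standing in for a real biomedical knowledge graph, the function name is hypothetical, and substring matching is far cruder than the association strengths the authors describe.

```python
from typing import Dict, List, Set

# Toy stand-in for a knowledge graph: term -> associated biological concepts.
KNOWLEDGE_GRAPH: Dict[str, Set[str]] = {
    "ecog": {"lung cancer", "performance status"},
}


def recover_biological_terms(note: str, pii_candidates: List[str]) -> List[str]:
    """Drop PII candidates that are associated with a biological concept
    that also appears in the note; keep everything else as PII."""
    note_lower = note.lower()
    kept = []
    for term in pii_candidates:
        related = KNOWLEDGE_GRAPH.get(term.lower(), set())
        if any(concept in note_lower for concept in related):
            continue  # likely a biological term, not PII
        kept.append(term)
    return kept


note = (
    "Patient diagnosed with lung cancer. "
    "ECOG performance status was determined to be 2. Seen by Dr. Rivera."
)
print(recover_biological_terms(note, ["ECOG", "Rivera"]))  # ['Rivera']
```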
Next, the solution could detect and repair poor sentence quality, because unstructured clinical text is not always well formatted and often lacks punctuation.
“A case-sensitive pre-trained model along with an MLM objective can be used to train a system capable of correctly introducing punctuation in the right location,” explained the study authors. “Another challenge with the quality of clinical documents is the prevalence of short fragments and bullet points, giving rise to sentences with poor context.”
Last, health IT professionals could accelerate the annotation process for the named-entity recognition (NER) task. NER aims to locate entities in unstructured text and classify them into predefined categories, such as names, locations, or medical facilities.
“Overall, this work implemented an ensemble approach to de-identification of unstructured EHR data incorporating transformer models supported by heuristics for automatically identifying PII across diverse clinical note types,” concluded the study authors. “Upon detection, suitable surrogates replaced PII in the processed text, thereby concealing residual identifiers (HIPS).”