The text anonymization helper tool is used as an aid to anonymise data and remove disclosive information from the dissemination copy of data files. Where informed consent for unedited data sharing was not obtained, pseudonyms or vaguer descriptors should be used to replace any problematic identifying information. To date, the tool has only been used with Microsoft Word (2010 and 2013 versions) and may not work with other versions.
When sensitive data have been collected, information should not simply be removed from the text during the anonymization process. This could result in loss of data accuracy or distortion of data. Instead, suitable replacements that remain within the context of the data should be found for real names, place names, company names and full addresses. Even when the respondent's name, company name and full address are removed, very detailed information on employment, educational institutions, qualifications gained, occupations of other family members and small geographical locations could compromise confidentiality. However, care must be taken when anonymising to avoid removing too much information and making the data unusable. An anonymization log should be created to keep track of all changes made during anonymization.
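One possible way to keep such an anonymization log is as a simple table of changes. The sketch below is only an illustration in Python, not part of the tool: the column names, the `write_log` helper and the example entries are all assumptions made for this example.

```python
import csv
import io


def write_log(entries, handle):
    """Write anonymization log rows (location, original, replacement) as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["location", "original", "replacement"])
    for entry in entries:
        writer.writerow(entry)


# Hypothetical entries, invented purely for illustration.
log_entries = [
    ("page 2, line 14", "Jan", "Mike"),
    ("page 5, line 3", "Leeds", "a city in the north of England"),
]

# An in-memory buffer stands in for a real log file here.
buffer = io.StringIO()
write_log(log_entries, buffer)
log_text = buffer.getvalue()
print(log_text)
```

Because the log records the original values, it is itself disclosive and would need to be handled accordingly.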
The text anonymization helper tool
The text anonymization helper tool is based on Microsoft Word macros. Functions include:
- Finding all numbers and words starting with capital letters
- Resetting highlighting of numbers and words starting with capital letters
The text anonymization helper tool does not anonymise data or change data in any way other than formatting. It does not identify names, organisations, dates or place names as some text analytics tools do. The tool only highlights numbers and words that start with capital letters, to make the anonymization process easier. Words that start with capital letters will include not only the first words of sentences, but also any capitalised names, companies, addresses, educational institutions, countries and other identifiable information. Similarly, numbers may be page numbers or counts, but also ages, dates of birth or parts of addresses. All this information could potentially be disclosive and should be reviewed during the anonymization process.
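The highlighting rule described above can be illustrated outside Word. The following is a minimal Python sketch, not the tool's macro code: the regular expression, the `[[...]]` markers standing in for highlighting, and the function names are assumptions made for this example, and the pattern only recognises ASCII capital letters.

```python
import re

# Assumed pattern: a run of digits, or a word beginning with an ASCII capital letter.
CANDIDATE = re.compile(r"\b(?:\d+|[A-Z]\w*)")


def find_candidates(text):
    """Return every number and capitalised word, in order of appearance."""
    return [match.group() for match in CANDIDATE.finditer(text)]


def mark_candidates(text):
    """Wrap each candidate in [[...]] as a plain-text stand-in for highlighting."""
    return CANDIDATE.sub(lambda match: f"[[{match.group()}]]", text)


sample = "Jan moved to Leeds in 1998 and works as a janitor."
candidates = find_candidates(sample)
marked = mark_candidates(sample)
print(candidates)
print(marked)
```

As with the tool itself, nothing is replaced: the marked items are only candidates for a reader to review, and the sentence-initial "Jan" is flagged whether or not it is a name.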
The text anonymization helper tool does not change or remove any data. After all numbers and words with capital letters have been highlighted, it is the responsibility of the reader to read through the document and make decisions on each highlighted word. Any information that can result in breaches of confidentiality should be anonymised. After checks and replacements have been made, all highlighting produced by the text anonymization helper tool can be removed.
Precautions to consider when using the text anonymization helper tool
It is important to note that the text anonymization helper tool does not search for names, places, dates or organisations. A set of dictionaries could be used to identify words as names, places or dates. However, this is highly impractical for qualitative data anonymization. While there are comprehensive dictionaries for English names, countries, and UK, USA or international organisations, dictionaries in other languages are very limited, if they exist at all. Even if data are collected in English, respondents' answers might include disclosive information in other languages. Also, even if names and places are recorded in a dictionary, they can be misspelled in the data or deliberately spelled differently to preserve the original pronunciation. Information that does not match a dictionary entry exactly would not be picked up by such tools, but would still be potentially disclosive. Therefore, the text anonymization helper tool highlights all numbers and capitalised words to enable the reader to make judgements on preserving confidentiality as they read through.
As the text anonymization helper tool highlights words starting with capital letters only, any names, places, companies or other potentially disclosive information will be missed if not capitalised. Once disclosive information is found, it is important to search for the same information without capitalisation to find known disclosive words that were not capitalised.
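The follow-up search for uncapitalised variants can be sketched as a case-insensitive match. This Python example is illustrative only: the `find_all_cases` helper and the sample sentence are assumptions, and restricting matches to whole words is a choice made for this sketch rather than something the tool does.

```python
import re


def find_all_cases(text, term):
    """Return (position, matched text) for every whole-word match, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(match.start(), match.group()) for match in pattern.finditer(text)]


# Hypothetical sample text, invented for illustration.
sample = "Leeds was busy. She said leeds is home, and LEEDS United won."
hits = find_all_cases(sample, "Leeds")
for position, found in hits:
    print(position, found)
```

Each hit still needs to be reviewed in context before anything is replaced.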
It is important to note that the 'Reset all Nums and Caps' function removes all highlighting from the document. Any text should be saved before using the text anonymization helper tool. A previous version of the file should be used if any undesired changes are made when using the tool.
When using the text anonymization helper tool and making changes to the data, care should be taken to avoid mistakes that can occur if the automated search and replace function is used. The 'Replace all' option in the Microsoft Word search and replace function is not suitable, as it can result in undesired changes to the data, even when the 'Match case' option is chosen. For example, suppose the name Jan is to be anonymised as Mike. With 'Replace all', every instance of the character string 'Jan' will be changed to 'Mike', including those inside longer words, which introduces mistakes such as 'Mikeuary' instead of 'January'. Such mistakes not only produce misspelled or unrecognisable words, but could also allow the anonymization to be reversed, especially if they occur more than once, for example if 'Mikeitor' also appears instead of 'Janitor'. The order of replacements also matters if 'Replace all' is used: if Jan is anonymised before Jane, all instances of Jane will be changed to Mikee. It is suggested to use the 'Replace' option instead of 'Replace all' and to review each word individually before making changes.
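The 'Jan' to 'Mike' problem can be reproduced in a few lines of Python. This is a sketch for illustration, not a description of how Word works internally: the sample sentence and the `replace_whole_word` helper are assumptions, and whole-word matching is shown only as a contrast with plain substring replacement.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Jan met Jane in January. The Janitor waved."

# Plain substring replacement: case-sensitive, but blind to word boundaries.
naive = text.replace("Jan", "Mike")


def replace_whole_word(source, original, pseudonym):
    """Replace `original` only where it appears as a complete word."""
    pattern = re.compile(r"\b" + re.escape(original) + r"\b")
    return pattern.sub(pseudonym, source)


careful = replace_whole_word(text, "Jan", "Mike")

print(naive)    # Mike met Mikee in Mikeuary. The Mikeitor waved.
print(careful)  # Mike met Jane in January. The Janitor waved.
```

Whole-word matching avoids the substring errors in this sample, but it does not remove the need to review each replacement individually, since a matching word may not refer to the person being anonymised.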