Working with text is hard, especially when you need “clean” text to create a resource in a system that doesn’t support special characters. For example, suppose you want to create a site for each of your employees. You can write a simple script that uses the person’s name as the site name, but names may contain accents or other special characters, so we would like to replace each accented character with its plain counterpart rather than remove it.
There are many ways to solve this, so let’s look at two of them.
Nuclear. Remove them all
One solution could be to simply remove them, as demonstrated here with the following function:
function Remove-SpecialCharacters {
    param ([String]$sourceStringToClean = [String]::Empty)
    # Drop every character that is not an ASCII letter or a digit
    return $sourceStringToClean -replace '[^a-zA-Z0-9]', ''
}
If you run it:
Remove-SpecialCharacters "António"
You’ll get:
Antnio
The function removes every character that is not an ASCII letter (A to Z, upper or lower case) or a digit (0 to 9). Accented letters, spaces, and punctuation are all dropped.
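To make the effect of that pattern concrete, here is a minimal sketch that applies the same regular expression to a sample string. The sample value and the variable names are mine, chosen only for illustration.

```powershell
# Same pattern as in Remove-SpecialCharacters: keep only ASCII letters and digits.
$pattern = '[^a-zA-Z0-9]'

# [char]0x00E9 is "é"; building it from its code point avoids file-encoding surprises.
$sample = "Jos$([char]0x00E9) 2nd-floor"

$cleaned = $sample -replace $pattern, ''
$cleaned   # Jos2ndfloor
```

The digit survives, while the accented letter, the space, and the hyphen all disappear.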
This may be what you want, but in this case, it would make more sense to replace the special character with its “not so special” counterpart.
Replace the accents
Accents are also called “diacritics,” so here’s how to replace accented characters with their “normal” counterparts. I built this Frankenstein script over time and use it in my projects.
So we’ll replace, for example, “Á” with “A”, simplifying how it’s displayed.
function Update-SpecialCharacters {
    # From https://stackoverflow.com/questions/7836670/how-remove-accents-in-powershell
    param ([String]$sourceStringToClean = [String]::Empty)
    # Split each accented character into a base letter plus combining marks
    $normalizedString = $sourceStringToClean.Normalize( [Text.NormalizationForm]::FormD )
    $stringBuilder = New-Object Text.StringBuilder
    # Keep everything except the combining marks
    $normalizedString.ToCharArray() | ForEach-Object {
        if ( [Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$stringBuilder.Append($_)
        }
    }
    # From https://lazywinadmin.com/2015/05/powershell-remove-diacritics-accents.html
    # Handles letters such as "Ł" that the normalization step leaves untouched
    [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($stringBuilder.ToString()))
}
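We’re combining two types of cleaning here. The first “block” decomposes each character and drops the combining accent marks, which handles letters such as “ó”. The last line deals with letters such as “Ł” that have no decomposed form: it encodes the string with the “Cyrillic” code page, which cannot represent them and substitutes the closest plain letter, and then reads the bytes back as ASCII. Despite the encoding’s name, this step is not meant for Cyrillic text, and it relies on the encoder’s best-fit substitution, so results may differ between .NET runtimes. Here’s an example: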
Update-SpecialCharacters "António"
Update-SpecialCharacters "Łagiewnicki"
We’ll get:
Antonio
Lagiewnicki
I understand that the name now doesn’t make sense to my Polish friends, but some systems will appreciate it better than its original form.
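If you are curious why “ó” and “Ł” behave differently, the sketch below inspects what FormD normalization does to one accented letter and to “Ł”. The helper name Get-FormDCategory and the variable names are hypothetical and exist only for this illustration; the .NET calls are the same ones the function above uses.

```powershell
# Hypothetical helper: list the Unicode category of each character after FormD.
function Get-FormDCategory {
    param ([string]$Text)
    $Text.Normalize([Text.NormalizationForm]::FormD).ToCharArray() | ForEach-Object {
        [Globalization.CharUnicodeInfo]::GetUnicodeCategory($_)
    }
}

# Build the characters from code points to avoid file-encoding surprises.
$eAcute  = [string][char]0x00E9   # "é"
$lStroke = [string][char]0x0141   # "Ł"

$eCategories = @(Get-FormDCategory $eAcute)
$lCategories = @(Get-FormDCategory $lStroke)

$eCategories -join ', '   # LowercaseLetter, NonSpacingMark
$lCategories -join ', '   # UppercaseLetter
```

“é” splits into a base letter plus a NonSpacingMark, which the loop can filter out. “Ł” stays a single letter with nothing to strip, which is why it needs a different treatment.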
Final thoughts
There are many reasons to strip the accents from a string, such as generating a username or an email address for a person in a system that doesn’t allow special characters but still needs the base letters to be there.
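As a rough illustration of the username case, here is a minimal sketch. The function name ConvertTo-UserName, the dot separator, and the sample name are my own assumptions, not part of any real system. It uses only the normalization step, with the regex class \p{Mn} standing in for the NonSpacingMark check.

```powershell
# Hypothetical helper: turn a display name into a lowercase, dot-separated username.
function ConvertTo-UserName {
    param ([string]$FullName = [string]::Empty)

    # Decompose accented letters, then drop the combining marks (Unicode category Mn).
    $base = $FullName.Normalize([Text.NormalizationForm]::FormD) -replace '\p{Mn}', ''

    # Lowercase, turn every run of other characters into a single dot, trim stray dots.
    ($base.ToLowerInvariant() -replace '[^a-z0-9]+', '.').Trim('.')
}

$name = "Ant$([char]0x00F3)nio Silva"   # "António Silva"
$userName = ConvertTo-UserName $name
$userName                                # antonio.silva
```

Because this sketch skips the code-page step, a letter with no decomposed form, such as “Ł”, would be treated as a separator rather than mapped to a plain letter.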
I hope the Update-SpecialCharacters function helps you, but if you find something that can be improved, please let me know.