Outsource this
90-thousand-line broken CSV file. UTF-8.
Cyrillic and hanzi characters/logograms showing…
… Latin-1 diacritic characters broken.
e.g. Flórez
instead of Flórez
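A rough sketch of how this kind of breakage arises, assuming the file's UTF-8 bytes were at some point read as Windows-1252 (that is a guess about this file, and the variable names `broken` and `fixed` are only for illustration). It uses plain `iconv`:

```shell
# "ó" is the two bytes C3 B3 in UTF-8. Misreading those bytes as
# Windows-1252 yields the two characters "Ã" and "³".
broken=$(printf 'Fl\303\263rez' | iconv -f WINDOWS-1252 -t UTF-8)
printf '%s\n' "$broken"

# Undoing the misread: write the text back out as Windows-1252 bytes,
# which are once again valid UTF-8 for the original name.
fixed=$(printf '%s' "$broken" | iconv -f UTF-8 -t WINDOWS-1252)
printf '%s\n' "$fixed"
```

This whole-text reversal only works while every character fits in Windows-1252. A file that also contains Cyrillic or hanzi would make iconv stop with a conversion error, which may be one reason to fall back on per-character substitutions.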
Attached is a zip file containing two text files.
I’ve removed most character-type garbage and (non-visible) termination characters.
I wouldn’t use it to import into a database, but this should be adequate for your stated requirements.
Should you wish any further alterations, please don’t hesitate to ask.
Final command used to clean, sort, remove duplicates, reshuffle and output to separate file:
⇒ gsed -e 's/[0-9]//g' -e 's/,/ /g' -e 's/Ã©/é/g' -e 's/"//g' -e '/^$/d' -e 's/Â´/´/g' -e 's/Ãº/ú/g' -e 's/Ã±/ñ/g' -e 's/Ã³/ó/g' -e 's/Ã¡/á/g' -e 's/Ã¹/ù/g' -e 's/à /Á/g' -e 's/Ã‘/Ñ/g' -e 's/ / /g' -e 's/[[:space:]]$//' -e 's/Ã²/ò/g' -e 's/Ã/í/g' -e 's/Ã¬/ì/g' -e 's/Ã¼/ü/g' -e 's/Ã€/À/g' -e 's/Ã§/ç/g' -e 's/Ã¨/è/g' -e 's/Ã‰/É/g' -e 's/Ã“/Ó/g' -e 's/Ãª/ê/g' -e 's/Ã¤/ä/g' -e 's/ÃŠ/Ê/g' -e 's/Ã¶/ö/g' -e 's/Ãš/Ú/g' -e 's/Ã^Í/Í/g' -e 's/Ã^Á/Á/g' -e 's/ÃŒ/Ì/g' -e 's/Ã®/î/g' -e 's/Ã¯/ï/g' -e 's/Ã£/ã/g' -e 's/Ã„/Ä/g' -e 's/Ã/à/g' 90K_list_of_names.csv | sort | uniq | gshuf > 80K_clean_list.txt
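To see what the non-encoding steps of that pipeline do, here is a small sketch run on a made-up four-line sample (the rows are hypothetical, not taken from the real file). It uses plain `sed` rather than `gsed`, and leaves out the shuffle so the output stays predictable:

```shell
# Hypothetical sample standing in for 90K_list_of_names.csv:
# two duplicate names, one blank line, quotes and trailing numbers.
cleaned=$(printf '%s\n' '"Ana",Ruiz,1' '' '"Ana",Ruiz,2' '"Luis",Mora,3' |
  sed -e 's/[0-9]//g' \
      -e 's/,/ /g' \
      -e 's/"//g' \
      -e '/^$/d' \
      -e 's/[[:space:]]$//' |
  sort | uniq)
# Steps, in order: strip digits, turn commas into spaces, drop quotes,
# delete empty lines, trim one trailing whitespace character, then
# sort and collapse adjacent duplicates.
printf '%s\n' "$cleaned"
```

On this sample the two "Ana Ruiz" rows collapse into one. `sort | uniq` could equally be written `sort -u`.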
Followed by a command to replace the newline (line feed) characters with two spaces for design use:
⇒ gsed ':a;N;$!ba;s/\n/  /g' 80K_clean_list.txt > two_space_separated.txt
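The sed loop is easier to follow on a tiny input. This sketch uses three made-up names and spells the script as separate `-e` expressions, which should behave the same in GNU sed and is, as far as I know, also accepted by BSD sed:

```shell
# :a    define label "a"
# N     append the next input line to the pattern space
# $!ba  if this is not the last line, branch back to "a"
# s///  once everything is in the pattern space, replace each
#       embedded newline with two spaces
joined=$(printf 'Ana Ruiz\nLuis Mora\nEva Sanz\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/  /g')
printf '%s\n' "$joined"
# prints: Ana Ruiz  Luis Mora  Eva Sanz
```

Note that the loop holds the whole file in memory at once, which is fine for a list of this size.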
Show file with line breaks, tabs and non-printing characters:
⇒ gcat -A thing.txt
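A quick sketch of what that output looks like, on a made-up line containing a UTF-8 "é", a tab and a stray carriage return. GNU `cat -A` is shorthand for `-vET`. The sketch uses `cat -vet`, which should give the same result and is, I believe, also understood by BSD cat:

```shell
# Input bytes: "caf", C3 A9 (UTF-8 "é"), tab, "bar", CR, LF.
visible=$(printf 'caf\303\251\tbar\r\n' | cat -vet)
printf '%s\n' "$visible"
# prints: cafM-CM-)^Ibar^M$
```

Here `M-CM-)` is the two bytes of "é", `^I` is the tab, `^M` is the carriage return and `$` marks the end of the line.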
Great reference: UTF-8 Encoding Debugging Chart