Filename fixer

Recursively rename directories and files with a regexp
Bash JCZD

The story behind fixnames.sh is worth telling: as the FP was working as a network and systems administrator in a small IT company near Lausanne, his boss informed him he was to stay for the entire weekend. This meant from Friday morning to Sunday evening, and possibly until Monday morning, because a client had decided to switch from MacOS systems to Windows. While nobody in their right mind would ever want to do this, the boss’ plan was to use MacOS’s file manager to drag and drop files from one network share to another, after asking the client to stop all activity from Friday noon onwards. The catch was that most employees were basically computer-illiterate, had no file naming convention whatsoever and would frequently insert pseudo-random characters that made M$ Windows go berserk (it’s quite restrictive on filenames). On top of that, Apple’s file manager (at the time, Windows Explorer behaved the same) would just throw an obscure “copy failed” error and halt with no further details; so the boss’ procedure was to recursively (and manually) enter the directory and copy each and every file and subdirectory until he found which filename was causing the error.

My heart almost stopped.

At this point, it was Monday and I had a few days ahead of me, so I told him I was putting aside all my current tasks to write a script to automate that. He was very dubious and tried to object, but I told him I refused to stay for the weekend if I didn’t get a chance to try.

So over the next few days, I wrote this script; what took me the longest was running it on a copy of the client’s 12k or so files and discovering the absurdities users would come up with (such as duplicating files, which this script sort-of fixes) so that I could write a suitable regex.

On Thursday afternoon, after a final check run on other clients’ backups, the script was ready and demonstrated to the boss: more than 12k files were checked and renamed at a rate of around 1k files a minute, freeing some disk space along the way. From then on, my boss never objected to my suggestions and made me join the dev team.

The script follows; its features, arguments and known limitations are listed in its header comment.

Script content

(the above title was added because of an issue while parsing the current page’s .md file)

#!/bin/bash
#
# superb magic script to fix file names that can cause problems
# with Windows but are valid on Mac
#
# created for netatalk -> samba migrations
#
# author:	david.lutolf@adbin.ch
# date:		2008-08-20
# license:	GPL v2
# changes:
#		2008-08-21, david@adbin.ch
#			small fix, yay, for spaces at the end of names
#		2008-08-22, david@adbin.ch
#			use $REPLY for spaces at the end of names
#			replace ' with `
#			--pretend option
#			remove useless logs
#			check whether the destination exists (stops for directories)
#			pretty colors for the output
#		2008-08-25, david@adbin.ch
#			multiple spaces
#			try an additional name if the first one is unavailable
#			check whether files with similar names are identical
#		2009-06-21, david@lutolf.net
#			fix the error and file counters
#			
#
# arguments:
#	--pretend	only simulate (dry run)?
#	target		directory to recurse in
#
# limitations/bugs:
#	problems if the file name consists only of an invalid character
#	the --pretend option will output more results than a real run, simply because
#	 invalid but unmodified dir names will get detected in file paths
# 	does not replace \
#	does not check for case-insensitive duplicate names (on win, Foo = foo)
#	 (use casefix.sh for that) - actually I think it does, please verify
#	path given in argument must not contain spaces or script will break
#	the target directory's name must not be changed by the rules (eg. Tmp > tmp)
#
# list of things that cause problems in names
# characters to replace with -
# 	:2f \
# characters to remove:
# 	? * ' : <leading/trailing spaces> <trailing .> <multiple spaces>
# odd characters that must be left as they are
#	:2e :2f2e
# characters that cause problems:
#	\

# string used by sed for the substitutions. NOT CURRENTLY USED, EDIT THE STRING FURTHER DOWN IN THE CODE
SEDARGS1="-e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/\ \ */\ /g -e s/\!//g -e s/\?//g -e s/[*]//g -e s/://g -e s/[\ ]$// -e s/[.]*$// -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g"
#SEDARGS1="-e s/:2f2e/KEEP_2F2E_KEEP/g"
#SEDARGS1="-e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/!//g -e s/\?//g -e s/[*]//g -e s/://g -e s/[\ ]$// -e s/[.]*$// -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g"
# in case a file name already exists, we try this instead: (most chars are replaced with _ rather than removed)
SEDARGS2="-e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/\ \ */\ /g -e s/!//g -e s/\?/_/g -e s/[*]/_/g -e s/:/_/g -e s/[\ ]$/_/ -e s/[.]*$/_/ -e s/[\/][\ ]/\\\_// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g"
#SEDARGS2="-e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/!/_/g -e s/\?/_/g -e s/[*]/_/g -e s/:/_/g -e s/[\ ]$/_/ -e s/[.]*$/_/ -e s/[\/][\ ]/\\\_// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g"


#echo "$SEDARGS1"
#echo "$SEDARGS2"

if [ $# -lt 1 ] || [ $# -gt 2 ]
then
	echo "usage: fixnames [--pretend] target > logfile"
	exit 1
fi

if [ "$1" == '--pretend' ]
then
	PRETEND=true
	TARGET=$2
else
	PRETEND=false
	TARGET=$1
fi

# files used for the conversion
TMPLIST=/tmp/fixnames_tmp
ERRLOG=/tmp/fixnames.err
TIMESTART=`date +%s`

# step 1: start by looking for directories recursively
echo -e "\033[1;33m*\033[0;37m starting..." 1>&2
DEPTH=1
DIRTOT=0
DIRMOD=0
DIRERR=0
DIRSIM=0
echo -ne "\033[1;33m*\033[0;37m processing directories, level: " 1>&2
while find $TARGET -maxdepth $DEPTH -mindepth $DEPTH -type d | grep \.. > $TMPLIST
do
	echo -n "$DEPTH " 1>&2

	while read
	do
		DIRTOT=`expr $DIRTOT + 1`
		#NEWNAME=`echo "$REPLY" | sed "$SEDARGS1"`
		# ORIGIN/WORKING: NEWNAME=`echo "$REPLY" | sed -e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/\ \ */\ /g -e s/:2f/-/g -e s/\?//g -e s/[*]//g -e s/://g -e s/[\ ]$// -e s/[.]*$// -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g`
		# very simplified version:
		NEWNAME=`echo "$REPLY" | sed -e s/__*/_/g -e s/\ \ */\ /g -e s/\ /_/g -e s/[A-Z]/"\L&"/g -e s/_$//`
		#'` # color fix

		# only run the following tests if the name was changed
		if [ "$REPLY" != "$NEWNAME" ]
		then
			# check whether a directory with the same name exists
			if test -d "$NEWNAME"
			then
				# the name already exists, try the alternative name
				DIRSIM=$(($DIRSIM+1))
				#NEWNAME=`echo "$REPLY" | sed "$SEDARGS2"`
				#NEWNAME=`echo "$REPLY" | sed -e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/\ \ */\ /g -e s/\?/_/g -e s/[*]/_/g -e s/://g -e s/[\ ]$/_/ -e s/[.]*$/_/ -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g`
				NEWNAME=`echo "$REPLY" | sed -e s/__*/_/g -e s/\ \ */\ /g -e s/\ /_/g -e s/[A-Z]/"\L&"/g`_
				#'` # color fix
				if test -d "$NEWNAME"
				then
					# the alternative name exists as well (very unlikely)
					echo -e "\n\033[0;31mE: could not move '$REPLY'\033[0;37m" 1>&2
					echo "alt name already exists for '$REPLY'" >>$ERRLOG
					DIRERR=`expr $DIRERR + 1`
				else
					# ok, we can rename
					if [ $PRETEND == true ]
					then
						echo "'$REPLY' -> '$NEWNAME'"
						DIRMOD=`expr $DIRMOD + 1`
					else
						if mv -v "$REPLY" "$NEWNAME" 2>>$ERRLOG 
						then
							DIRMOD=`expr $DIRMOD + 1`
						else
							DIRERR=$((DIRERR+1))
						fi
					fi
				fi
			else
				# ok, rename
				if [ $PRETEND == true ]
				then
					echo "'$REPLY' -> '$NEWNAME'"
					DIRMOD=`expr $DIRMOD + 1`
				else
					if mv -v "$REPLY" "$NEWNAME" 2>>$ERRLOG 
					then
						DIRMOD=`expr $DIRMOD + 1`
					else
						DIRERR=$((DIRERR+1))
					fi
				fi
			fi
		fi

	done < $TMPLIST

	# stop if some directories still need fixing (rename errors)
	if [ $DIRERR != 0 ] && [ $PRETEND != true ]
	then
		break 2;
	fi
	DEPTH=`expr $DEPTH + 1`
done

# step 2: file names
FILEERR=0
FILECUR=0
FILEMOD=0
FILEREM=0
FILESIM=0
if [ $DIRERR == 0 ] || [ $PRETEND == true ]
then
	echo -en "\n* processing files: " 1>&2
	find $TARGET -type f > $TMPLIST

	# step 3: rename the files by parsing the list of names
	FILETOT=`wc -l $TMPLIST | cut -f 1 -d ' '`
	while read
	do
		#NEWNAME=`echo "$REPLY" | sed "$SEDARGS1"`
		# ORIGINAL/WORKING: NEWNAME=`echo "$REPLY" | sed -e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/\ \ */\ /g -e s/\?//g -e s/[*]//g -e s/://g -e s/[\ ]$// -e s/[.]*$// -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g`
		# very simplified version:
		NEWNAME=`echo "$REPLY" | sed -e s/__*/_/g -e s/\ \ */\ /g -e s/\ /_/g -e s/[A-Z]/"\L&"/g -e s/_$//`
		#'` # color fix

		# only run the following tests if the name was changed
		if [ "$REPLY" != "$NEWNAME" ]
		then
			if test -f "$NEWNAME"
			then
				# the name already exists, check whether the files are identical
				FILESIM=$(($FILESIM+1))
				#echo -ne "\n - diff '$REPLY' '$NEWNAME'" >&2
				if diff "$REPLY" "$NEWNAME" > /dev/null
				then
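					# same file, remove the existing copy
					# (skipped in --pretend mode so that a dry run never deletes data)
				#	echo " SAME FILE, REMOVING" >&2
					if [ $PRETEND != true ]
					then
						rm "$NEWNAME"
					fi
					FILEREM=$(($FILEREM+1))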
				else
					# different content, try the alternative name
				#	echo -n " DIFFERENTS!" >&2
					#NEWNAME=`echo "$REPLY" | sed "$SEDARGS2"`
					#NEWNAME=`echo "$REPLY" | sed -e s/:2f2e/KEEP_2F2E_KEEP/g -e s/:2e/KEEP_2E_KEEP/g -e s/:2f/-/g -e s/\ \ */\ /g -e s/\?/_/g -e s/[*]/_/g -e s/://g -e s/[\ ]$/_/ -e s/[.]*$/_/ -e s/[\/][\ ]/\\\// -e s/\'/\\\`/g -e s/[\\\]/-/g -e s/KEEP_2F2E_KEEP/:2f2e/g -e s/KEEP_2E_KEEP/:2e/g`
					#'` # color fix
					NEWNAME=`echo "$REPLY" | sed -e s/__*/_/g -e s/\ \ */\ /g -e s/\ /_/g -e s/[A-Z]/"\L&"/g`_
					if test -f "$NEWNAME"
					then
						# the alternative name exists as well (very unlikely)
						echo -en "\r\033[0;31mE: could not move '$REPLY'\033[0;37m\n" 1>&2
						echo "alt name already exists for '$REPLY'" >>$ERRLOG
						FILEERR=`expr $FILEERR + 1`
					else
					#	echo " ALTERNATIVE OK!" >&2
						# ok, we can rename
						if [ $PRETEND == true ]
						then
							echo "'$REPLY' -> '$NEWNAME'"
							FILEMOD=`expr $FILEMOD + 1`
						else
							if mv -v "$REPLY" "$NEWNAME" 2>>$ERRLOG 
							then
								FILEMOD=`expr $FILEMOD + 1`
							else
								FILEERR=$((FILEERR+1))
							fi
						fi
					fi
				fi
			else    
				# ok, rename
				if [ $PRETEND == true ]
				then    
					echo "'$REPLY' -> '$NEWNAME'"
					FILEMOD=`expr $FILEMOD + 1`
				else    
					if mv -v "$REPLY" "$NEWNAME" 2>>$ERRLOG 
					then
						FILEMOD=`expr $FILEMOD + 1`
					else
						FILEERR=$((FILEERR+1))
					fi
				fi
			fi
		fi

		FILECUR=`expr $FILECUR + 1`
		echo -en "\r\033[1;33m*\033[0;37m processing files: $FILECUR / $FILETOT" 1>&2
	done < $TMPLIST
fi


# FINAL REPORT #

rm $TMPLIST
TIMESTOP=`date +%s`
echo -e "\n\033[1;33m*\033[0;37m done. Operation completed in $(($TIMESTOP-$TIMESTART)) seconds" 1>&2
echo "  $DIRTOT directories processed, $DIRMOD renamed, $DIRSIM with similar names, $DIRERR errors." 1>&2
echo "  $FILECUR files processed, $FILEMOD renamed, $FILESIM with similar names, $FILEREM removed, $FILEERR errors." 1>&2
if [ $DIRERR -ge 1 ]
then
	echo -e "  \033[0;31msome directories could not be renamed. fix manually and rerun script. see '$ERRLOG' for details\033[0;37m" 1>&2
	exit 3
fi
if [ $FILEERR -ge 1 ]
then
	echo -e "  \033[0;31msome files could not be renamed. see '$ERRLOG' for details\033[0;37m" 1>&2
	exit 2
fi
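
The renaming rule that is currently active in the script (the "very simplified version") can also be tried in isolation before touching real data. The sketch below is not part of fixnames.sh: fix_name and preview_renames are hypothetical helper names, the rule is applied to the last path component only (the script applies it to the whole path), and GNU sed is assumed for the \L lowercase escape. It reads NUL-delimited paths from find, so, unlike the script, it should tolerate spaces in the target path.

```bash
#!/bin/bash
# Sketch only: fix_name and preview_renames are hypothetical helpers,
# not part of fixnames.sh. Assumes GNU sed (for \L) and GNU find.

# Apply the "very simplified" rule set to a single name (no directory part):
# squeeze underscores, squeeze spaces, spaces -> _, lowercase A-Z, drop a trailing _.
fix_name() {
    printf '%s\n' "$1" | sed \
        -e 's/__*/_/g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/ /_/g' \
        -e 's/[A-Z]/\L&/g' \
        -e 's/_$//'
}

# Dry run: print "'old' -> 'new'" for every entry below $1 whose last
# path component would change. Nothing is renamed or deleted.
preview_renames() {
    local path dir base new
    find "$1" -mindepth 1 -depth -print0 |
    while IFS= read -r -d '' path
    do
        dir=${path%/*}      # everything before the last /
        base=${path##*/}    # last path component
        new=$(fix_name "$base")
        if [ "$base" != "$new" ]
        then
            printf "'%s' -> '%s'\n" "$path" "$dir/$new"
        fi
    done
}
```

For example, fix_name 'My  Holiday Photos ' prints my_holiday_photos, and preview_renames /some/share (an illustrative path) lists the renames that would happen without changing anything.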