We’ve recently moved a website to a new server, and are running into an odd issue where some uploaded images with Unicode characters in the filename are giving us a 404 error.
Via SSH/FTP, we can see that the files are definitely there.
For example:
http://sjofasting.no/project/adnoy
none of the images are working:
Code:
<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/>
SSH:
-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg
What is also strange is that if you navigate to the directory you can even click on the image and it works:
http://sjofasting.no/wp/wp-content/uploads/2012/03/
click on ‘ådnøy_1_2.jpg’ and it works.
Somehow WordPress is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg
and copying from the direct folder browse is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg
What is going on??
edit:
If I copy the image URL from the WordPress source I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg
When copied from the Apache directory listing I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg
What could account for this discrepancy between:
%C3%A5 and %cc%8a
??
Unicode normalisation.
0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.
0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.
U+00E5 is the composed (Normal Form C) way of writing an a-ring; an a letter followed by U+030A is the decomposed (Normal Form D) way of writing it. Compare å (composed) with å (decomposed): they should look the same, though they may differ slightly depending on font rendering.
Now normally it doesn’t really matter which one you’ve got because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.
Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you’ll find you’ve actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac because it’ll convert that input into Normal Form D too before looking it up, but you cannot recover the same filename in character sequence terms as you put in: it’s a lossy conversion.
So if you save your files on a Mac and then transfer to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn’t egregiously mangle Unicode characters.
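To see the two forms side by side, here is a small Python sketch using only the standard library. It checks that the composed and decomposed spellings are different code point sequences, prints their UTF-8 bytes and percent-encoded forms, and then rewrites a URL path into Normal Form D. The helper name url_to_nfd is made up for illustration, and it assumes the URL path is UTF-8; it is not anything WordPress or Apache provides.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit

composed = "\u00e5"      # U+00E5: a-with-ring as a single code point (NFC)
decomposed = "a\u030a"   # 'a' followed by U+030A combining ring (NFD)

# Same appearance, different code point sequences.
assert composed != decomposed
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# Different UTF-8 bytes, hence different percent-encoded URLs.
print(composed.encode("utf-8"))    # b'\xc3\xa5'
print(decomposed.encode("utf-8"))  # b'a\xcc\x8a'
print(quote(composed))             # %C3%A5
print(quote(decomposed))           # a%CC%8A


def url_to_nfd(url):
    """Hypothetical helper: re-encode the path of `url` in Normal Form D.

    Assumes the path is UTF-8, either raw or percent-encoded.
    """
    parts = urlsplit(url)
    path = unicodedata.normalize("NFD", unquote(parts.path))
    return urlunsplit(parts._replace(path=quote(path)))


print(url_to_nfd(
    "http://sjofasting.no/wp/wp-content/uploads/2011/11/"
    "Bore-Strand-Hotellg%C3%A5rd-12.jpg"
))
```

The last call should give the Hotellga%CC%8Ard form from the question, with upper-case hex digits (percent-escapes are case-insensitive, so %cc%8a and %CC%8A are the same thing).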
Think Different, Cause Bizarre Interoperability Problems.
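If re-uploading isn't convenient, one possible alternative (not something from the answer above, just a sketch) is to rename the affected files in place on the new server so their names are in Normal Form C, which is the form the WordPress-generated URLs in the question use. This assumes the server's filesystem treats the two forms as different names, as described above. The function names and the dry_run flag are made up for illustration; try it on a copy of the uploads directory first.

```python
import os
import unicodedata


def find_non_nfc(root):
    """Return (current_path, nfc_path) pairs for files whose names are not NFC."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            nfc_name = unicodedata.normalize("NFC", name)
            if nfc_name != name:
                pairs.append((os.path.join(dirpath, name),
                              os.path.join(dirpath, nfc_name)))
    return pairs


def rename_to_nfc(root, dry_run=True):
    """Rename non-NFC files under `root` to their NFC names.

    With dry_run=True (the default) nothing is changed. Returns the list of
    (old, new) pairs that were renamed, or would be renamed in a dry run.
    """
    done = []
    for src, dst in find_non_nfc(root):
        if os.path.exists(dst):
            # An NFC-named file is already there; leave both alone.
            continue
        if not dry_run:
            os.rename(src, dst)
        done.append((src, dst))
    return done


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the uploads actually live.
    for old, new in rename_to_nfc("wp/wp-content/uploads", dry_run=True):
        print(repr(old), "->", repr(new))
```

The dry run only prints what would change; pass dry_run=False once the list looks right.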