Discussion:
Problem with encoding of filenames
Ivan Shmakov
2017-10-26 03:24:37 UTC
Since the switch to a new hoster, the files on
http://hendrikmaryns.name/antro.shtml are no longer downloadable,
due to garbling of utf8 filenames. How to solve this?
That depends on what kind of access you have to your public
directory on the server. A solution that I expect to work for a
variety of cases would be to remove all the files with mangled
filenames and reupload them under proper ones.

If you have SSH (command-line) access, then, depending on the
tools available to you, you may be able to, say, run a Perl
script to rename them on the server.
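
A rough sketch of such a rename script, in Python rather than Perl. It assumes the names were garbled in one particular way: UTF-8 bytes read as Windows-1252 and then stored again as UTF-8. That is only a guess about what the hoster's migration did, and the helper names (repair_name, rename_mangled) are made up for this example. Run it with dry_run=True first and inspect the list it returns.

```python
import os


def repair_name(name: str, wrong: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 read as `wrong`' garbling.

    Returns the name unchanged when the assumption does not fit.
    """
    try:
        return name.encode(wrong).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def rename_mangled(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Collect (old, new) pairs; rename on disk only when dry_run is False."""
    changes = []
    for old in sorted(os.listdir(directory)):
        new = repair_name(old)
        if new != old:
            changes.append((old, new))
            if not dry_run:
                os.rename(os.path.join(directory, old),
                          os.path.join(directory, new))
    return changes
```

Names that are already correct fail the round trip and are left alone, so running the script twice should do no harm under the stated assumption.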
P.S. I just realized this is probably not the right newsgroup for
this. Please refer me to the proper place.
I’m cross-posting to news:comp.infosystems.www.misc just in case.

As for HTML, the page seems to claim HTML4 compliance, but using
“unencoded” UTF-8 in ‘href’ is something that is only allowed in
HTML5. Moreover, even there, spaces need to be encoded as %20,
unless I am mistaken. Cf.:

<a href="https://ru.wikipedia.org/wiki/%D0%9E%D0%BC%D0%BE%D0%BD_%D0%A0%D0%B0"
>(strict HTML4)</a>
<a href="https://ru.wikipedia.org/wiki/Омон_Ра" >(allowed in HTML5)</a>

(Although the browsers seem to be rather forgiving in this regard.)
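
For illustration, the encoded form can be produced with Python's standard urllib.parse.quote, which percent-encodes the UTF-8 bytes of every character outside its safe set and leaves "/" alone by default. The variable names are only for this sketch.

```python
from urllib.parse import quote

# Path as a human would type it (the HTML5-style example above).
path = "/wiki/Омон_Ра"

# quote() encodes as UTF-8 and keeps "/" unescaped by default.
encoded = quote(path)
print(encoded)

# A space becomes %20, not "+", when using quote().
spaced = quote("Waar gaan we")
print(spaced)
```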
It does not appear to be a UTF-8 issue. This is how one of the URLs looks:
http://hendrikmaryns.name/Antroposofie/Valentin%20Wember%20%E2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen?.pdf
In plain text: Antroposofie/Valentin Wember – Waar gaan we eigenlijk
heen?.pdf
Note the “?” at the end. I doubt that is what is supposed to be
printed; it is a replacement character to some other value.
I’m unsure of what you mean by “replacement character” here, but
indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
so it has to be encoded as %3F. Cf.:

https://en.wikipedia.org/wiki/Main_page?
https://en.wikipedia.org/wiki/Main_page?action=history
https://en.wikipedia.org/wiki/Main_page%3F
https://en.wikipedia.org/wiki/Main_page%3Faction=history

(Then, it appears that the Wikimedia servers are slightly
misconfigured in that respect. Admittedly, this behavior may be
rather tricky to get right.)
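
A small sketch of the difference, using Python's standard urllib.parse. It shows how a generic URI parser splits the two Wikipedia forms, and how the filename from the page would look with the "?" escaped. The "Antroposofie/" prefix is taken from the URL quoted earlier; the variable names are just for this example.

```python
from urllib.parse import quote, urlsplit

unescaped = urlsplit("https://en.wikipedia.org/wiki/Main_page?action=history")
escaped = urlsplit("https://en.wikipedia.org/wiki/Main_page%3Faction=history")

# A literal "?" ends the path and starts the query.
print(unescaped.path, "|", unescaped.query)

# "%3F" stays part of the path; the query is empty.
print(escaped.path, "|", escaped.query)

# quote() escapes "?" along with the spaces and the dash.
filename = "Valentin Wember – Waar gaan we eigenlijk heen?.pdf"
href = "Antroposofie/" + quote(filename)
print(href)
```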

That said, replacing ? with %3F in the URI above results in a
surprising 301 “permanent” redirect:

HTTP/1.1 301 Moved Permanently
Date: Thu, 26 Oct 2017 02:53:33 GMT
Server: Apache/2
Location: http://hendrikmaryns.name/Antroposofie/Valentin%20Wember%20%e2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen.shtml?.pdf
Content-Length: 325
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Yet it still doesn’t explain why some other URIs may be
inaccessible; say:

http://hendrikmaryns.name/Antroposofie/Spirituele%20opgaven%20Belgi%C3%AB%20%E2%80%93%20Johan%20Steverlinck.pdf

HTTP/1.1 404 Not Found
Date: Thu, 26 Oct 2017 02:57:57 GMT
Server: Apache/2
Content-Length: 382
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
--
FSF associate member #7257 np. Unforgettable — Illya Leonov
James Moe
2017-10-26 23:43:35 UTC
Post by Ivan Shmakov
Note the “?” at the end. I doubt that is what is supposed to be
printed; it is a replacement character to some other value.
I’m unsure of what you mean by “replacement character” here, but
indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
The "?" replaces some other non-displayable character.
--
James Moe
jmm-list at sohnen-moe dot com
Think.
Thomas 'PointedEars' Lahn
2017-10-27 00:14:44 UTC
Post by James Moe
Post by Ivan Shmakov
Note the “?” at the end. I doubt that is what is supposed to be
printed; it is a replacement character to some other value.
I’m unsure of what you mean by “replacement character” here, but
indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
The "?" replaces some other non-displayable character.
No, it does not. The character you mean is “�”, which looks similar,
but has a different Unicode codepoint (U+FFFD, not U+003F).
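
To make the two cases concrete, here is a minimal Python sketch. Decoding invalid bytes with errors="replace" yields U+FFFD, while encoding text to a smaller character set such as ASCII with errors="replace" yields a plain "?" (U+003F). The byte string and the text are invented test data, not taken from the server.

```python
# Invalid UTF-8 byte, decoded leniently: the decoder inserts U+FFFD.
raw = b"heen\xff.pdf"
decoded = raw.decode("utf-8", errors="replace")
print([hex(ord(c)) for c in decoded if ord(c) > 0x7F])

# Valid text squeezed into ASCII: the encoder substitutes "?" (U+003F).
text = "Wember \u2013 Waar"          # \u2013 is an en dash
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)
```

Which of the two a reader sees therefore depends on where the substitution happened, so a trailing "?" in a URL may be either a genuine question mark or the result of a lossy conversion somewhere upstream.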

Please do not crosspost without Followup-To. F’up2 ciw.misc set.


PointedEars
--
When all you know is jQuery, every problem looks $(olvable)
James Moe
2017-10-27 19:26:31 UTC
Post by Thomas 'PointedEars' Lahn
Post by James Moe
The "?" replaces some other non-displayable character.
No, it does not. The character you mean is “�”, which looks similar,
but has a different Unicode codepoint (U+FFFD, not U+003F).
It does in ASCII text displays.
Post by Thomas 'PointedEars' Lahn
Please do not crosspost without Followup-To. F’up2 ciw.misc set.
I am not crossposting. AFAIK this is the original topic. I am not
subscribed to ciw.misc.
--
James Moe
jmm-list at sohnen-moe dot com
Think.
James Moe
2017-10-26 23:55:58 UTC
Post by Ivan Shmakov
Yet it still doesn’t explain why some other URIs may be
inaccessible; say:
http://hendrikmaryns.name/Antroposofie/Spirituele%20opgaven%20Belgi%C3%AB%20%E2%80%93%20Johan%20Steverlinck.pdf
A couple of possibilities:
- The web server does not understand UTF-8. It decodes, say, "%C3%AB"
(the e umlaut) to binary characters, tests the string for valid ASCII
characters, and rejects the UTF-8 values.
- Or the underlying filesystem does not accept UTF-8 values.
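
As a sketch of what such a decoding step would see, the following Python snippet turns part of the failing URL back into raw bytes and interprets them two ways. In that URL, %C3%AB is the UTF-8 form of "ë" and %E2%80%93 is the en dash. This only illustrates the hypothesis; it says nothing about what the hoster's Apache actually does.

```python
from urllib.parse import unquote_to_bytes

# Fragment of the path from the URL that returns 404.
fragment = "Belgi%C3%AB%20%E2%80%93"

raw = unquote_to_bytes(fragment)
print(raw)

# A pure-ASCII check would reject these bytes.
print(raw.isascii())

# Read as UTF-8, the bytes give the intended name.
as_utf8 = raw.decode("utf-8")
print(as_utf8)

# Read as ISO-8859-1 (the charset in the error page headers),
# every byte becomes its own character and the name is garbled.
as_latin1 = raw.decode("iso-8859-1")
print(len(as_utf8), len(as_latin1))
```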
--
James Moe
jmm-list at sohnen-moe dot com
Think.
Thomas 'PointedEars' Lahn
2017-10-27 00:11:02 UTC
http://hendrikmaryns.name/Antroposofie/Spirituele%20opgaven%20Belgi%C3%AB%20%E2%80%93%20Johan%20Steverlinck.pdf
Post by James Moe
- The web server does not understand UTF-8. It decodes, say, "%C3%AB"
(the e umlaut) to binary characters, tests the string for valid ASCII
characters, and rejects the UTF-8 values.
- Or the underlying filesystem does not accept UTF-8 values.
Both are *very* unlikely.


PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann