String instance is broken with non-ASCII data #28

hsenag · 2012-08-31T21:58:26Z

Although the package works with String, and indeed encourages its use by having that be the default type in some of the API helper functions, the handling of non-ASCII data is completely broken. We should be doing encoding properly to be consistent with the Content-Type header, both when sending and receiving.

This has been reported by a couple of users so far and is clearly pretty nasty, but I don't think it's trivial to fix because presumably the receiving and sending sides can use different encodings and when sending requests we ought to be optionally giving the user control of the encoding.

The alternative of removing the String instances completely is likely to be very disruptive so I don't think it's a reasonable option.

haasn · 2012-08-31T22:05:36Z

Ideally I think the Content-Encoding and Content-Length headers should be set by the HTTP implementation, which would give the String instance liberty to default to something like UTF-8 (modulo user configuration perhaps).

Alternatively, I would just delete the String instance and bump the version number. Applications that rely on this instance should really be fixed to do their own encoding, just recently I had the pleasure of figuring out what was causing package CouchDB to break - it was the broken String instance in here.

If I see a String instance like that I'll assume it's sane - as in actually handles all valid Strings appropriately.

hsenag · 2012-09-01T00:14:20Z

I agree in principle that a broken String instance shouldn't exist, it's just that removing it now could have a very substantial impact and I'm scared of doing that. On the other hand there is all the time people like you will have wasted figuring out the brokenness.

As well as dealing with the sending aspect, don't we need to deal with any encoding the server chooses to return?

singpolyma · 2012-09-01T18:15:04Z

I'm in favour of doing String properly by looking at Content-Type, but removing support for String and just using ByteString or [Word8] or whatever is certainly better than the current situation.

hsenag · 2012-09-10T22:32:36Z

I've spent a while investigating and I currently think doing String properly is too hard :-( I guess removing the String instance is the only remaining option.

http://www.haskell.org/pipermail/libraries/2012-September/018426.html

singpolyma · 2012-09-10T22:44:12Z

You are right that if the library is handed content that does not have a Content-Type with a charset already, then it is rather hard. And, of course, if handed a charset the library does not support (for whatever reason) that is also pretty hard.

Since in practice encodings are sometimes detected from the actual content itself (XML declarations, meta tags), there is in fact no sane simplifying assumption if Content-Type lacks an encoding. While by-the-book such a Content-Type should indicate binary data, in practise it is used to deliver data with all sorts of character encodings (yay browsers magically auto-detecting things!). So, the only sane thing for a generic HTTP library that is not content aware to do in the case of Content-Type with no charset is to throw an error (or at least a very strong warning), which is likely not in the spirit of this library.

Long live ByteString.

atombender · 2013-05-15T18:37:04Z

+1 for moving to ByteString. Any progress on this?

Another option is Either ByteString String, where the Right is decoded properly, but only if the content is actually, unambigiously, text.

I was really dismayed, on my very first HTTP attempt, to find that the HTTP module blatantly ignores Content-Type (and its charset) and assumes everything is a string. This will be confusing a lot of first timers; the additional work required to become encoding-aware will be fairly daunting. And among other things it makes it impossible to work with images.

You are right that if the library is handed content that does not have a Content-Type with a charset already, then it is rather hard.

According to the MIME spec, content without Content-Type is considered by application/octet-stream, ie. a raw stream of 8-bit bytes. Text content that has a Content-Type but no charset is assumed to be 7-bit ASCII, except when the particular MIME type defines a default.

hsenag · 2013-08-23T20:50:35Z

Sorry, I was hoping to do this much more promptly. It's a bit harder than I thought to do nicely because the entire library is based around using Strings :-(

hsenag mentioned this issue Aug 31, 2012

Add Text instance #29

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String instance is broken with non-ASCII data #28

String instance is broken with non-ASCII data #28

hsenag commented Aug 31, 2012

haasn commented Aug 31, 2012

hsenag commented Sep 1, 2012

singpolyma commented Sep 1, 2012

hsenag commented Sep 10, 2012

singpolyma commented Sep 10, 2012

atombender commented May 15, 2013

hsenag commented Aug 23, 2013

String instance is broken with non-ASCII data #28

String instance is broken with non-ASCII data #28

Comments

hsenag commented Aug 31, 2012

haasn commented Aug 31, 2012

hsenag commented Sep 1, 2012

singpolyma commented Sep 1, 2012

hsenag commented Sep 10, 2012

singpolyma commented Sep 10, 2012

atombender commented May 15, 2013

hsenag commented Aug 23, 2013