Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capturing stdout as both a ByteString and a String does not properly decode the String #828

Open
lexi-lambda opened this issue Apr 25, 2022 · 2 comments

Comments

@lexi-lambda
Copy link

Suppose we write the following:

(Stdout (bs :: ByteString), Stdout (str :: String)) <- cmd some_args

What should this do? The documentation for BinaryPipes says this:

Treat the stdin/stdout/stderr messages as binary. By default String results use text encoding and ByteString results use binary encoding.

This suggests the str result should still be properly decoded the same way it would be decoded if the bs result weren’t there. However, it is not: str is merely Data.ByteString.Char8.unpack bs, so non-ASCII characters get garbled. The bug is here:

if isBinary then do
hSetBinaryMode h True
dest<- pure $ flip map dest $ \case
DestEcho -> BS.hPut hh
DestFile x -> BS.hPut (outHandle x)
DestString x -> addBuffer x . (if isWindows then replace "\r\n" "\n" else id) . BS.unpack
DestBytes x -> addBuffer x

Note the call to BS.unpack on line 185. That’s wrong!

This is not just a theoretical issue—I ran into it just now while working on GHC, as Hadrian sometimes ends up generating a use of Stdout like this (albeit less directly).

@ndmitchell
Copy link
Owner

You only have (and only want) one pipe, and if the user has specified Stdout as a ByteString, you have to set it to a binary pipe to get that. But if the user asks for both binary and String, what should we do? We don't know the encoding of the pipe - it might be UTF8, it might be Latin1, so at this point, pretty much everything is wrong. The advantage of BS.unpack is that it's wrong but total, which is better than wrong and partial - but agreed, it's far from desirable.

At least that way my conclusion - if you can think of a better way to solve this, I'd love to hear.

@lexi-lambda
Copy link
Author

The advantage of BS.unpack is that it's wrong but total, which is better than wrong and partial - but agreed, it's far from desirable.

I think it would be better to be correct and partial, or maybe more accurately, partial and not-wrong. That is, combining these two things could raise an error rather than silently producing nonsense.

A more involved possibility would be to manually decode the bytes. It’s possible to acquire the encoding of the pipe by calling hGetEncoding before the call to hSetBinaryMode, and one can obtain a decoder from the resulting TextEncoding value using mkTextDecoder. However, applying this to a ByteString is not entirely straightforward, so it would be more work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants