Unify all Windows build configuration on the same CharacterSet #888

arixmkii · 2023-12-05T15:07:40Z

Right now there are cases, when configuration sets Multibyte

zlib/contrib/vstudio/vc17/miniunz.vcxproj

Line 45 in 643e17b

<CharacterSet>MultiByte</CharacterSet>

but in other places it uses Unicode

zlib/contrib/vstudio/vc17/miniunz.vcxproj

Line 50 in 643e17b

<CharacterSet>Unicode</CharacterSet>

and

zlib/contrib/vstudio/vc17/zlibstat.vcxproj

Line 72 in 643e17b

<CharacterSet>Unicode</CharacterSet>

and in many it is left unset. I believe it would be beneficial to set all build configurations to the same default for the least surprise behavior. It seems that most common explicitly set in Multibyte, so, probably it should be explicit value everywhere.

irwir · 2024-01-20T00:39:20Z

Multibyte is obsolete, Unicode is the right way to go.

Mr-Clam · 2024-01-20T09:02:27Z

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

Neustradamus · 2024-01-20T09:22:47Z

@madler: Can you look this issue?

irwir · 2024-01-20T12:26:17Z

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

This is mostly correct, yet shows zero experience with Visual Studio and Windows.
Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows.
Use of multi-byte (DBCS) was deprecated since VS 2013.

arixmkii · 2024-01-20T14:37:54Z

According to this https://devblogs.microsoft.com/cppblog/mfc-support-for-mbcs-deprecated-in-visual-studio-2013/ it looks like they later undeprecated it, but still recommend Unicode for new code bases. Also, PowerShell team seems to be using Multibyte for their rebuild of ZLib https://github.com/PowerShell/ZLib/blob/20ce7b065269cda22beb1cd9ecc450be5ea99b8f/build/zlib.vcxproj#L56 most notably used by their OpenSSH fork. I don't think they would be using it if it would be completely deprecated.

Any downstream would be able to change the settings to match their specific needs in the rebuild. So, if it will be Unicode across the board for all projects it should be just fine.

madler · 2024-01-21T03:04:35Z

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

irwir · 2024-01-21T11:09:49Z

@arixmkii

it looks like they later undeprecated it, but still recommend Unicode for new code bases.

It was repeated twice that the warning would be removed.
Even with continued support this is not exactly undeprecated.
One incurable flaw is that two differnt Unicode strings could be converted to the same MBCS byte sequence.

irwir · 2024-01-21T11:42:30Z

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

Never had reasons to look at this part of the library because compression and decompression code do not depend on character width choice, and it worked well for me with Unicode.
But there is difference when calling Windows API even indirectly. Explicit use (or mix) of wide and byte characters is always possible, but requires caution.
For example,

zlib/gzlib.c

Line 206 in 88ec246

strcpy(state->path, path);

Copy of wchar_t string to char string usually would be the first character only, because second byte is null.

Another issue is Unix-ism 0666

zlib/gzlib.c

Line 233 in 88ec246

fd == -2 ? _wopen(path, oflag, 0666) :

By some luck this value includes desired flags, but properly it should be _S_IREAD | _S_IWRITE for read and write access.

Also, there are peculiarities when using printf with wide characters.
Standard way of handling character width is using TCHAR and _T (or _TEXT) macros, and macros for calling functions, such as _topen.

In conclusion it might be said that conversion to Unicode is desirable, but cannot be done by simple substitution of parameters in VS projects.

madler · 2024-01-21T22:37:37Z

The question was "Does this choice make any difference at all in the compiled zlib code?"

To be explicit, do any of the bits in the resulting library differ if I change Unicode to MultiByte in the project file?

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

Neither strcpy() nor any printf() variant is ever called with a pointer to wide characters in zlib.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

A Unicode open function in zlib is already provided by gzopen_w().

Mr-Clam · 2024-01-22T09:27:04Z

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.
This is mostly correct, yet shows zero experience with Visual Studio and Windows.
Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows.
Use of multi-byte (DBCS) was deprecated since VS 2013.

Actually, I have plenty of experience with this. I even benchmarked an application that did a lot string process about 20 years ago and found building for Unicode was slower. It seems the conversion in the all the "A" functions in the Win32 API is less expensive than doubling the size of all your strings with mostly null bytes. Of course, if zlib only uses char instead of something portable like TCHAR, the point is moot. Somebody has to convert char to wchar_t if there's a Win32 call - you either do it explicitly or it's done implicitly for you. For most users, using single byte is fine... until of course they have to use strings (e.g. filenames) that contain "foreign" characters that can't be expressed in their current active code page. Even Windows' double-byte Unicode support isn't as simple as you think: how does it handle surrogates? (That's a rhetorical question)

Looking at @madler 's comment, I'm wondering if defining Unicode is necessary at all, and also, what problem occurs if you don't define this?

irwir · 2024-01-23T10:24:19Z

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

In any case, convertion to multi-byte string was not necessary.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

This also sets Execute/search permission and other (possibly reserved for the future or used internally) bits.

arixmkii changed the title ~~Unify all Windows build configuration on the same charset~~ Unify all Windows build configuration on the same CharacterSet Dec 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify all Windows build configuration on the same CharacterSet #888

Unify all Windows build configuration on the same CharacterSet #888

arixmkii commented Dec 5, 2023

irwir commented Jan 20, 2024

Mr-Clam commented Jan 20, 2024

Neustradamus commented Jan 20, 2024

irwir commented Jan 20, 2024

arixmkii commented Jan 20, 2024

madler commented Jan 21, 2024

irwir commented Jan 21, 2024 •

edited

irwir commented Jan 21, 2024

madler commented Jan 21, 2024 •

edited

Mr-Clam commented Jan 22, 2024 •

edited by madler

irwir commented Jan 23, 2024

Unify all Windows build configuration on the same CharacterSet #888

Unify all Windows build configuration on the same CharacterSet #888

Comments

arixmkii commented Dec 5, 2023

irwir commented Jan 20, 2024

Mr-Clam commented Jan 20, 2024

Neustradamus commented Jan 20, 2024

irwir commented Jan 20, 2024

arixmkii commented Jan 20, 2024

madler commented Jan 21, 2024

irwir commented Jan 21, 2024 • edited

irwir commented Jan 21, 2024

madler commented Jan 21, 2024 • edited

Mr-Clam commented Jan 22, 2024 • edited by madler

irwir commented Jan 23, 2024

irwir commented Jan 21, 2024 •

edited

madler commented Jan 21, 2024 •

edited

Mr-Clam commented Jan 22, 2024 •

edited by madler