Bug: Globalize number formatter is incorrect for numeric digits in supplemental plane #922

Open
greghuc opened this issue Aug 16, 2021 · 4 comments

greghuc commented Aug 16, 2021

Hi there

Globalize (v1.7.0) number formatting is incorrect with cldr-data (v36.0.0) when the locale's CLDR numeric digits come from the Unicode supplementary planes (U+10000 to U+10FFFF), which UTF-16 encodes as surrogate pairs.

Short example, discussed below: 44.56 formatted in ccp locale

  • Should be: "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"] (hex codepoints)
  • But returned by Globalize: "��.��" = [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

Based on the formatted value returned by Globalize, I initially suspected that each character is represented inside Globalize as a surrogate pair (two 16-bit code units), but that only the first code unit of each pair is returned. There's a worked example below. However, I now have some doubts about this theory: of the 4 numeric digits returned by Globalize, 3 are the first (high) half of a surrogate pair, but one (dd38) isn't.

Example (no code)

For the "ccp" locale, digits 0-9 are "𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿", which have Unicode hex code points of ["11136", "11137", "11138", "11139", "1113a", "1113b", "1113c", "1113d", "1113e", "1113f"].

So the number 44.56 formatted in ccp should be "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"]

What is actually returned from Globalize is "��.��" = [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

Using the Surrogate Pair Calculator on the individual characters in "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"] gives:

  • 1113a = D804 + DD3A
  • 1113a = D804 + DD3A
  • 2e = 2e (no pair needed)
  • 1113b = D804 + DD3B (but Globalize actually returns dd38)
  • 1113c = D804 + DD3C
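
The pairs listed above can also be reproduced without the online calculator. The following is a minimal sketch of the standard UTF-16 encoding arithmetic; the helper name `toSurrogatePair` is made up for illustration and is not part of Globalize.

```javascript
// Hypothetical helper (not a Globalize API): split a supplementary-plane
// code point into its UTF-16 high and low surrogate code units, as hex.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary-plane code point');
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high.toString(16), low.toString(16)];
}

console.log(toSurrogatePair(0x1113a)); // [ 'd804', 'dd3a' ]
console.log(toSurrogatePair(0x1113b)); // [ 'd804', 'dd3b' ]
console.log(toSurrogatePair(0x11138)); // [ 'd804', 'dd38' ]
```

The last call shows that dd38 is the low (second) half of the pair for U+11138, the ccp digit 2, so it is not the high half of any pair.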

So maybe Globalize is returning the first hex value from each surrogate pair? But for 1113b, dd38 is returned, not D804.

Example (code)

// Assumes Globalize is already loaded with CLDR data for both locales.

// Output the hex code point of each character in a JavaScript string
var asUnicodePoints = function(value) {
  // Array.from iterates by code point, so intact surrogate pairs stay together
  return Array.from(value).map(function(char) {
    return char.codePointAt(0).toString(16);
  });
};

// For us locale, works fine
var result = Globalize('us').numberFormatter()(44.56);
console.log(result);
// => 44.56
console.log(asUnicodePoints(result));
// => [ '34', '34', '2e', '35', '36' ]

// For ccp locale, wrongly returns first hex value from each surrogate pair?
var result = Globalize('ccp').numberFormatter()(44.56);
console.log(result);
// => ��.��
console.log(asUnicodePoints(result));
// => [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

// For ccp locale, the true hex values for formatted 44.56 should be..
console.log(asUnicodePoints("𑄺𑄺.𑄻𑄼"));
// => [ '1113a', '1113a', '2e', '1113b', '1113c' ]
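
One pattern that happens to reproduce the reported output exactly is looking up each digit in the locale's digit string by UTF-16 code unit index instead of by code point. This is only a hypothesis and has not been checked against the Globalize source; the function names below are invented for illustration, and the digit string is built from the code points U+11136 to U+1113F given above.

```javascript
// Hypothetical sketch, NOT Globalize's actual code.
// Build the ccp digit string "0".."9" from code points U+11136..U+1113F.
var ccpDigits = '';
for (var cp = 0x11136; cp <= 0x1113f; cp++) {
  ccpDigits += String.fromCodePoint(cp);
}

// Hex value of every UTF-16 code unit in a string.
function toCodeUnits(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16));
  }
  return units;
}

// Assumed buggy approach: charAt indexes code units, so digit n picks
// unit n of the 20-unit string rather than the n-th character.
function mapDigitsByCodeUnit(formatted, digits) {
  return formatted.replace(/[0-9]/g, function(d) {
    return digits.charAt(Number(d));
  });
}

// Code-point-aware approach: split the digit string into whole characters first.
function mapDigitsByCodePoint(formatted, digits) {
  var table = Array.from(digits);
  return formatted.replace(/[0-9]/g, function(d) {
    return table[Number(d)];
  });
}

var naive = mapDigitsByCodeUnit('44.56', ccpDigits);
var fixed = mapDigitsByCodePoint('44.56', ccpDigits);

console.log(toCodeUnits(naive));
// [ 'd804', 'd804', '2e', 'dd38', 'd804' ]
console.log(Array.from(fixed).map(function(ch) {
  return ch.codePointAt(0).toString(16);
}));
// [ '1113a', '1113a', '2e', '1113b', '1113c' ]
```

With code-unit indexing, even indices land on a high surrogate (d804) and odd indices on a low surrogate, which would explain why the digit 5 yields dd38 (the low half of U+11138) while 4 and 6 both yield d804.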

@rxaviers (Member)

Thanks for filing the issue and for your detailed debugging. I'm open to accepting a fix. Thanks!

rxaviers added the bug label Aug 16, 2021

greghuc commented Aug 16, 2021

@rxaviers I'll see what I can do. Any guidance on roughly where in the code I should be looking?

greghuc commented Aug 16, 2021

OK, this issue isn't going to be my highest priority, though I will hopefully get round to it at some point. I believe the issue only affects 4 locales, all related to the base ccp locale: ccp, ccp-u-nu-native, ccp-IN and ccp-IN-u-nu-native.
