How do I read characters in a string as their UTF-32 decimal values?

Issue

For example, I have this Unicode string, made up of the Cyclone and Japanese Castle emoji, defined in C#. .NET uses UTF-16 for its CLR string encoding:

var value = "🌀🏯";

If you check this, you quickly find that value.Length == 4, because C# strings are UTF-16 encoded and each of these two characters takes two UTF-16 code units. For that reason I can't just loop over each char and get its UTF-32 decimal value: foreach (var character in value) result = (ulong)character;. This raises the question: how can I get the UTF-32 decimal value for each character in any string?

Cyclone should be 127744 and Japanese Castle should be 127983, but I am looking for a general answer that can take any C# string and always produce a UTF-32 decimal value out of each character inside of it.

I’ve also looked at Char.ConvertToUtf32, but it seems problematic with a string such as:

var value = "a🌀c🏯";

This has a length of 6. So, how do I know when a new character begins? For example:

Char.ConvertToUtf32(value, 0)   97  int
Char.ConvertToUtf32(value, 1)   127744  int
Char.ConvertToUtf32(value, 2)   'Char.ConvertToUtf32(value, 2)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}
Char.ConvertToUtf32(value, 3)   99  int
Char.ConvertToUtf32(value, 4)   127983  int
Char.ConvertToUtf32(value, 5)   'Char.ConvertToUtf32(value, 5)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}

There is also this overload:

public static int ConvertToUtf32(
    char highSurrogate,
    char lowSurrogate
)

But to use this overload I still need to figure out when I have a surrogate pair. How can I do that?

Solution

Here is an extension method that illustrates one way to do it. The idea is to loop through the string by index and use char.ConvertToUtf32(string, index) to get the Unicode code point at that position. If the returned value is larger than 0xFFFF, the code point was encoded as a surrogate pair (two chars), so you increment the index once more to skip the second (low) surrogate.
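To see why the 0xFFFF comparison is a reliable test, here is a minimal sketch of the arithmetic that combines a surrogate pair into a code point. The class and method names (SurrogateMath, Combine) are made up for illustration and are not part of the original answer; char.ConvertToUtf32(char, char) is the real .NET call and is only there to cross-check the manual result.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high and a low surrogate by hand.
    // High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF,
    // so the result is always at least 0x10000, i.e. larger than 0xFFFF.
    public static int Combine(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        // The Cyclone emoji is stored as the two UTF-16 code units D83C DF00.
        string cyclone = "\uD83C\uDF00";

        int manual = Combine(cyclone[0], cyclone[1]);
        int builtIn = char.ConvertToUtf32(cyclone[0], cyclone[1]);

        Console.WriteLine(manual);           // 127744
        Console.WriteLine(builtIn);          // 127744
        Console.WriteLine(manual > 0xFFFF);  // True
    }
}
```

Anything that fits in a single char is at most 0xFFFF, so a result above that can only have come from two chars.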

Extension method:

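using System.Collections.Generic;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
    public static IEnumerable<int> GetUnicodeCodePoints(this string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Throws ArgumentException if s[i] is an unpaired surrogate.
            int unicodeCodePoint = char.ConvertToUtf32(s, i);
            if (unicodeCodePoint > 0xffff)
            {
                // Code points above 0xFFFF take two chars; skip the low surrogate.
                i++;
            }
            yield return unicodeCodePoint;
        }
    }
}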

Sample usage:

static void Main(string[] args)
{
    string s = "a🌀c🏯";

    foreach(int unicodeCodePoint in s.GetUnicodeCodePoints())
    {
        Console.WriteLine(unicodeCodePoint);
    }
}
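
On the question of how to tell when a surrogate pair starts: a rough sketch of two other options, not part of the original answer. The first uses char.IsSurrogatePair(string, index) to decide whether to consume one char or two, and passes a lone surrogate through as its raw UTF-16 value instead of throwing. The second assumes .NET Core 3.0 or later, where string.EnumerateRunes() and System.Text.Rune are available. The class and method names (CodePointSketch, GetCodePointsLenient, GetCodePointsViaRunes) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodePointSketch
{
    // Hypothetical helper: checks for a surrogate pair at each index.
    public static IEnumerable<int> GetCodePointsLenient(this string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                // s[i] is a high surrogate followed by a low surrogate.
                yield return char.ConvertToUtf32(s[i], s[i + 1]);
                i++; // skip the low surrogate
            }
            else
            {
                // Ordinary BMP char, or a lone surrogate returned as-is.
                yield return s[i];
            }
        }
    }

    // Hypothetical helper: assumes .NET Core 3.0+ (System.Text.Rune).
    public static IEnumerable<int> GetCodePointsViaRunes(this string s)
    {
        foreach (Rune rune in s.EnumerateRunes())
        {
            yield return rune.Value;
        }
    }

    public static void Main()
    {
        string s = "a🌀c🏯";

        Console.WriteLine(string.Join(", ", s.GetCodePointsLenient()));  // 97, 127744, 99, 127983
        Console.WriteLine(string.Join(", ", s.GetCodePointsViaRunes())); // 97, 127744, 99, 127983
    }
}
```

The two differ on malformed input: the first yields a lone surrogate's own value, while EnumerateRunes substitutes U+FFFD (65533) for it. Which behavior you want depends on whether you need to detect bad data or just get through it.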

Answered By – sstan

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
