Is there an algorithm to detect "isolated combining characters"?

Issue

I am interested in detecting strings that contain "un-combined" or "dangling" combining characters. These are formally known as isolated combining characters.

An example of such a string would be "\u0303 hello", which starts with a COMBINING TILDE that is not actually combined with anything else.

Is there an algorithm for detecting such a thing?

It seems like I can search over the string looking for "combine-able" base characters, and reject any combining character that is not preceded by such a base character. But how do I know what characters are base characters? I imagine that there are also edge cases to worry about.

My objective is to reject such strings as invalid identifiers in a programming language that supports Unicode identifiers, but this might also be useful for other text processing tasks.

Solution

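Unicode 14.0 definitions D50 (graphic character), D51 (base character) and D52 (combining character) seem relevant. In short: a graphic character has General Category L, M, N, P, S or Zs; a combining character is anything in category M; and a base character is any graphic character that is not in M.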
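Those three definitions translate fairly directly into predicates on a code point's General Category. The following is a minimal sketch, assuming that `java.lang.Character.getType` is an acceptable source of category data (it follows the Unicode tables shipped with the JDK, which may not be version 14.0). The object name `UnicodeDefs` and its method names are made up for illustration.

```scala
object UnicodeDefs {
  // Categories that are NOT graphic under D50: all of C (control, format,
  // surrogate, private use, unassigned) plus the Zl and Zp separators.
  private val nonGraphic: Set[Int] = Set(
    Character.CONTROL, Character.FORMAT, Character.SURROGATE,
    Character.PRIVATE_USE, Character.UNASSIGNED,
    Character.LINE_SEPARATOR, Character.PARAGRAPH_SEPARATOR
  ).map(_.toInt)

  // General Category M: Mn, Mc and Me.
  private val marks: Set[Int] = Set(
    Character.NON_SPACING_MARK, Character.COMBINING_SPACING_MARK,
    Character.ENCLOSING_MARK
  ).map(_.toInt)

  // D52: a combining character has General Category M.
  def isCombining(cp: Int): Boolean = marks.contains(Character.getType(cp))

  // D50: a graphic character has General Category L, M, N, P, S or Zs.
  def isGraphic(cp: Int): Boolean = !nonGraphic.contains(Character.getType(cp))

  // D51: a base character is a graphic character that is not a combining mark.
  def isBase(cp: Int): Boolean = isGraphic(cp) && !isCombining(cp)
}
```

With these, `UnicodeDefs.isBase('a'.toInt)` holds, while U+0303 is combining and U+001F is neither graphic nor base.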

You could find the first isolated combining character in an uninterrupted sequence of possibly multiple isolated combining characters by searching for combining characters (M) that

  • sit at the very start of the string, or immediately follow something that is not a Letter (L), Number (N), Punctuation (P), Symbol (S), Space Separator (Zs) or another combining character (M).

In Java regex syntax, that would be:

(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}

Full runnable example (Scala):

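// Matches a combining mark (M) that is not preceded by a graphic character.
val rgx = """(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}""".r

val examples = List(
  "\u0303bad",        // mark at the start of the string: isolated
  "ok\u0303",         // mark follows the letter 'k': fine
  "ok\u0303\u0303",   // second mark follows another mark: fine
  "bad\u001F\u0303"   // mark follows a control character (Cc): isolated
)

// Prints true when the string contains an isolated combining character.
for (e <- examples) {
  println(rgx.findFirstIn(e).nonEmpty)
}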

prints:

true
false
false
true
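For rejecting identifiers it is often more useful to know where the offending mark sits than merely whether one exists. Below is a minimal sketch that reuses the same pattern with `findAllMatchIn` from `scala.util.matching.Regex`. The object name `IsolatedMarks` and the methods `positions` and `isAcceptable` are hypothetical helpers, and the reported offsets are UTF-16 indices into the string rather than code point counts.

```scala
import scala.util.matching.Regex

object IsolatedMarks {
  // Same pattern as above: a mark (M) not preceded by a graphic character.
  private val rgx: Regex =
    """(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}""".r

  // Start index of every run of isolated combining marks in `s`.
  // Only the first mark of each run matches, because later marks in the
  // run are preceded by another mark.
  def positions(s: String): List[Int] =
    rgx.findAllMatchIn(s).map(_.start).toList

  // True when `s` contains no isolated combining mark at all.
  def isAcceptable(s: String): Boolean = positions(s).isEmpty
}
```

A compiler front end could call `IsolatedMarks.positions(name)` and point its diagnostic at the first returned index.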

Answered By – Andrey Tyukin

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
