ID: 2519335
 
Resolved
Character-based versions of many built-in text procs have been added (as opposed to the current byte-based versions), although they come with a performance cost. These are:
  • length_char()
  • text2ascii_char()
  • copytext_char()
  • findtext_char()
  • findtextEx_char()
  • findlasttext_char()
  • findlasttextEx_char()
  • replacetext_char()
  • replacetextEx_char()
  • spantext_char()
  • nonspantext_char()
  • split_char()
  • regex.Find_char()
  • regex.Replace_char()
The performance cost can be mitigated slightly: in cases where you used to use length(text)-n as an index, you can simply pass -n instead of length_char(text)-n, since most of these procs support negative indexes. The read-only string index [] operator still uses byte positions.
Applies to: DM Language
Status: Resolved (513.1493)
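As a quick illustration of the difference (a minimal sketch; the string and proc name are just examples), the byte-based and character-based procs disagree as soon as multi-byte characters show up:

/proc/char_proc_demo()
    var/t = "Привет"               // 6 characters, 12 bytes in UTF-8
    world.log << length(t)         // byte count: 12
    world.log << length_char(t)    // character count: 6
    // copytext() takes byte positions; copytext_char() takes character positions.
    world.log << copytext_char(t, 1, 4)   // first three characters: "При"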

This issue has been resolved.
There are so many (literal) edge cases that can mangle UTF-8 when copytext() counts bytes.

If copytext()'s start position lands in the middle of a UTF-8 character, or if a multi-byte character gets cut off at the end, it's gonna get mangled. ascii2text(text2ascii(stuff)) can't solve everything here.
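A rough sketch of the kind of slice that goes wrong (the string and indexes are just an example; each Cyrillic character here is two bytes in UTF-8):

/proc/mangle_demo()
    var/t = "Привет"
    // Byte 2 is the continuation byte of "П", so this slice starts in the
    // middle of a character and the copied bytes are not valid UTF-8.
    var/broken = copytext(t, 2, 5)
    world.log << broken   // mangled output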

The fact that text indexes are no longer 1:1 with characters is going to screw a LOT of things up, and is less sane. I understand that counting bytes is faster, but sometimes you need to count characters, too. And I doubt parsing every single bit of text with regexes is the best solution for this.
Alright... it appears these edge cases result in the output being blank!
What are some cases that come up that can't be worked around?

Generally there aren't many situations where this should come up at all, because findtext() returns byte indexes and length() returns bytes, so most of the index values you'll work with are already correct.
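For instance (a minimal sketch; the helper name is hypothetical), chaining the byte-based procs together stays consistent on its own:

/proc/split_key_value(t)
    var/pos = findtext(t, "=")   // byte index of "="
    if(!pos)
        return null
    // copytext() also takes byte indexes, so these slices line up correctly
    // even when the key or value contains multi-byte characters.
    var/key = copytext(t, 1, pos)
    var/value = copytext(t, pos + 1)
    return list(key, value)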
There are more situations like this.
In response to Lummox JR
For example, if I understood the principle correctly, it is now almost impossible to parse text from the end. Of course, you can get by with the tools that exist now, but it may not be practical. Right now it's much easier to stay on the 512 version than to rewrite a lot of different code. It would help a lot if there were a function that checks whether a byte is the beginning of a character or not. Two weeks have passed, and not a single game with Cyrillic support has switched to the new version.
In response to SolarK
SolarK wrote:
For example, if I understood the principle correctly, it is now almost impossible to parse text from the end. Of course, you can get by with the tools that exist now, but it may not be practical. Right now it's much easier to stay on the 512 version than to rewrite a lot of different code. It would help a lot if there were a function that checks whether a byte is the beginning of a character or not. Two weeks have passed, and not a single game with Cyrillic support has switched to the new version.

There are other reasons Cyrillic has had issues, of course, some of which are pending fixes. But I think you make a really good point about parsing from the end of the string. While most copy/find operations can and should be reworked to be byte-aware, the particular case you mentioned is a good candidate for applying some other method.
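A minimal sketch of the kind of helper SolarK describes, assuming the byte-based text2ascii(): in UTF-8, continuation bytes always fall in the range 0x80-0xBF, so any byte outside that range starts a character. The proc names here are hypothetical.

/proc/is_char_start(t, i)
    var/b = text2ascii(t, i)            // byte value at byte position i
    return !(b >= 0x80 && b <= 0xBF)    // continuation bytes are 0x80-0xBF

// Walking backwards to find where the last character starts:
/proc/last_char_start(t)
    var/i = length(t)                   // length() is a byte count
    while(i > 1 && !is_char_start(t, i))
        i--
    return i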
Lummox JR resolved issue with message:
Character-based versions of many built-in text procs have been added (as opposed to the current byte-based versions), although they come with a performance cost. These are:
  • length_char()
  • text2ascii_char()
  • copytext_char()
  • findtext_char()
  • findtextEx_char()
  • findlasttext_char()
  • findlasttextEx_char()
  • replacetext_char()
  • replacetextEx_char()
  • spantext_char()
  • nonspantext_char()
  • split_char()
  • regex.Find_char()
  • regex.Replace_char()
The performance cost can be mitigated slightly: in cases where you used to use length(text)-n as an index, you can simply pass -n instead of length_char(text)-n, since most of these procs support negative indexes. The read-only string index [] operator still uses byte positions.
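For example, a sketch of that mitigation (the proc name and offset are hypothetical):

/proc/tail_demo(t)
    // Instead of: copytext_char(t, length_char(t) - 3)
    // the negative index stands in for the same position, per the note above:
    return copytext_char(t, -3)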