ID: 2519335
 
Resolved
Character-based versions of many built-in text procs have been added (as opposed to the current byte-based versions), although they come with a performance cost. These are:
  • length_char()
  • text2ascii_char()
  • copytext_char()
  • findtext_char()
  • findtextEx_char()
  • findlasttext_char()
  • findlasttextEx_char()
  • replacetext_char()
  • replacetextEx_char()
  • spantext_char()
  • nonspantext_char()
  • split_char()
  • regex.Find_char()
  • regex.Replace_char()
The performance cost can be mitigated slightly: in cases where you used to use length(text)-n as an index, you can simply pass -n instead of length_char(text)-n, since most of these procs support negative indexes. The read-only string index [] operator still uses byte positions.
Applies to: DM Language
Status: Resolved (513.1493)
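As a quick illustration of the difference (a minimal sketch; the string and proc name are just examples), the byte-based and character-based procs disagree as soon as multi-byte characters show up:

/proc/char_proc_demo()
    var/t = "Привет"               // 6 characters, 12 bytes in UTF-8
    world.log << length(t)         // byte count: 12
    world.log << length_char(t)    // character count: 6
    // copytext() takes byte positions; copytext_char() takes character positions.
    world.log << copytext_char(t, 1, 4)   // first three characters: "При"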

This issue has been resolved.
There are so many (literal) edge cases that can mangle UTF-8 when copytext() counts bytes.

If copytext()'s start position lands in the middle of a UTF-8 character, or if a multi-byte character gets cut off at the end, it's gonna get mangled. ascii2text(text2ascii(stuff)) can't solve everything here.
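A rough sketch of the kind of slice that goes wrong (the string and indexes are just an example; each Cyrillic character here is two bytes in UTF-8):

/proc/mangle_demo()
    var/t = "Привет"
    // Byte 2 is the continuation byte of "П", so this slice starts in the
    // middle of a character and the copied bytes are not valid UTF-8.
    var/broken = copytext(t, 2, 5)
    world.log << broken   // mangled output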

The fact that text indexes are no longer 1:1 with characters is going to screw a LOT of things up, and is less sane. I understand that counting bytes is faster, but sometimes you need to count characters, too. And I doubt parsing every single bit of text with regexes is the best solution for this.
Alright... it appears these edge cases result in the output being blank!
What are some cases that come up that can't be worked around?

Generally there aren't many situations where this should come up at all, because findtext() returns byte indexes and length() returns bytes, so most of the index values you'll work with are already correct.
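For instance (a minimal sketch; the helper name is hypothetical), chaining the byte-based procs together stays consistent on its own:

/proc/split_key_value(t)
    var/pos = findtext(t, "=")   // byte index of "="
    if(!pos)
        return null
    // copytext() also takes byte indexes, so these slices line up correctly
    // even when the key or value contains multi-byte characters.
    var/key = copytext(t, 1, pos)
    var/value = copytext(t, pos + 1)
    return list(key, value)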
There are more situations like this.
In response to Lummox JR
For example, if I understood the principle correctly, it is now almost impossible to parse text from the end. Of course, you can get by with the tools that exist now, but it may not be practical. Right now it's much easier to stay on the 512 version than to rewrite a lot of different code. It would help a lot if there were a function that checks whether a byte is the beginning of a character or not. Two weeks have passed, and not a single game with Cyrillic support has switched to the new version.
In response to SolarK
SolarK wrote:
For example, if I understood the principle correctly, it is now almost impossible to parse text from the end. Of course, you can get by with the tools that exist now, but it may not be practical. Right now it's much easier to stay on the 512 version than to rewrite a lot of different code. It would help a lot if there were a function that checks whether a byte is the beginning of a character or not. Two weeks have passed, and not a single game with Cyrillic support has switched to the new version.

There are other reasons Cyrillic has had issues, of course, some of which are pending fixes. But I think you make a really good point about parsing from the end of the string. While most copy/find operations can and should be reworked to be byte-aware, the particular case you mentioned is a good candidate for applying some other method.
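A minimal sketch of the kind of helper SolarK describes, assuming the byte-based text2ascii(): in UTF-8, continuation bytes always fall in the range 0x80-0xBF, so any byte outside that range starts a character. The proc names here are hypothetical.

/proc/is_char_start(t, i)
    var/b = text2ascii(t, i)            // byte value at byte position i
    return !(b >= 0x80 && b <= 0xBF)    // continuation bytes are 0x80-0xBF

// Walking backwards to find where the last character starts:
/proc/last_char_start(t)
    var/i = length(t)                   // length() is a byte count
    while(i > 1 && !is_char_start(t, i))
        i--
    return i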
Lummox JR resolved issue with message:
Character-based versions of many built-in text procs have been added (as opposed to the current byte-based versions), although they come with a performance cost. These are:
  • length_char()
  • text2ascii_char()
  • copytext_char()
  • findtext_char()
  • findtextEx_char()
  • findlasttext_char()
  • findlasttextEx_char()
  • replacetext_char()
  • replacetextEx_char()
  • spantext_char()
  • nonspantext_char()
  • split_char()
  • regex.Find_char()
  • regex.Replace_char()
The performance cost can be mitigated slightly: in cases where you used to use length(text)-n as an index, you can simply pass -n instead of length_char(text)-n, since most of these procs support negative indexes. The read-only string index [] operator still uses byte positions.
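For example, a sketch of that mitigation (the proc name and offset are hypothetical):

/proc/tail_demo(t)
    // Instead of: copytext_char(t, length_char(t) - 3)
    // the negative index stands in for the same position, per the note above:
    return copytext_char(t, -3)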