Regex support for matching Unicode categories

BYOND Forums


ID:2729476

Oct 30 2021, 2:17 am

LemonInTheDark

Applies to:

DM Language

Status:

Open

Issue hasn't been assigned a status value.

It'd be very nice to have support for \p{} and \P{} in BYOND's regex.
I've run into issues filtering Unicode; there's just no good way to do some things without being able to check for categories, especially when dealing with marks and such.
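For readers unfamiliar with the notation: \p{M} matches any character whose Unicode general category is a Mark, and \P{M} matches any character that is not. Since BYOND has no such feature yet, the sketch below is Python rather than DM and only emulates the effect of the filter using the standard unicodedata module. The function names are made up for illustration.

```python
import unicodedata

def strip_categories(text, prefixes=("M",)):
    """Remove characters whose general category starts with any prefix.

    Roughly what replacing \\p{M} with "" would do for prefixes=("M",).
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )

def keep_categories(text, prefixes):
    """Keep only characters in the given categories (the inverse filter)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(prefixes)
    )

# "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn)
decomposed = "e\u0301"
print(strip_categories(decomposed))      # the combining mark is dropped
print(keep_categories("aB1", ("Lu",)))   # only uppercase letters survive
```

A one-letter prefix such as "M" covers the whole group (Mn, Mc, Me), while a two-letter code such as "Lu" selects a single category.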

Nov 1 2021, 1:21 am
Lummox JR	Hrm. This notation is very new to me so I'll have to study it. In principle the main thing is just having a way of converting those categories to character ranges.
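As a rough sketch of what "converting categories to character ranges" could look like, the Python below walks the code points and collapses consecutive members of a category into inclusive (first, last) pairs. It leans on Python's unicodedata module instead of anything in BYOND, and category_ranges is a hypothetical helper name.

```python
import sys
import unicodedata

def category_ranges(prefix, limit=sys.maxunicode + 1):
    """Return inclusive (first, last) ranges of code points below `limit`
    whose general category starts with `prefix`."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp          # open a new range
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges

# Restricting to ASCII keeps the output small and easy to check.
print(category_ranges("Nd", 0x80))  # decimal digits
print(category_ranges("Lu", 0x80))  # uppercase letters
```

A regex engine could then test membership by searching the resulting range list, which works well only when a category produces a modest number of ranges.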

Nov 4 2021, 3:59 am
LemonInTheDark	I suppose so, yeah. It's partly a way of offloading work from the end developer, since it means not needing to maintain a list of the thousand-odd Unicode chars to filter for each group.

Nov 4 2021, 8:56 pm

Lummox JR

The more I've looked at this, the more I think my idea of breaking this out into ranges won't work. I think it'll need some better way of handling it. Lowercase chars, for instance, don't always fall into neat ranges; in many language blocks they alternate with the uppercase chars.
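The alternation is easy to see at the start of Latin Extended-A (U+0100 onward), where each uppercase letter is followed directly by its lowercase form. This is a small Python illustration using the standard unicodedata module, with hypothetical helper names; it only demonstrates the fragmentation and does not suggest a fix.

```python
import unicodedata

def categories(first, last):
    """General category of each code point in [first, last]."""
    return [unicodedata.category(chr(cp)) for cp in range(first, last + 1)]

def count_runs(first, last, category):
    """Count maximal runs of consecutive code points in `category`.

    Each run would need its own entry in a range table.
    """
    runs = 0
    in_run = False
    for cat in categories(first, last):
        match = cat == category
        if match and not in_run:
            runs += 1
        in_run = match
    return runs

print(categories(0x0100, 0x0105))        # Lu and Ll alternate
print(count_runs(0x0100, 0x0105, "Ll"))  # one run per lowercase letter
print(count_runs(0x0061, 0x007A, "Ll"))  # ASCII a-z is a single run
```

In a stretch like this, a range table for Ll is no more compact than listing every character individually.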

Parsing the UnicodeData.txt file isn't going to be a problem though. The current source actually uses a BYOND project to parse the file and create C++ code that's statically linked into the project.
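For reference, UnicodeData.txt is a semicolon-separated file where the first field is the code point in hex, the second is the name, and the third is the general category. Large blocks are given as a pair of lines whose names end in ", First>" and ", Last>". The Python sketch below parses a few sample lines in that format; it is not the BYOND-based generator described above, and parse_unicode_data and SAMPLE are illustrative names.

```python
from collections import defaultdict

# A handful of lines in UnicodeData.txt format, embedded for illustration.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_unicode_data(lines):
    """Map each general category to a list of inclusive (first, last) pairs."""
    by_category = defaultdict(list)
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, cat = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp            # remember where the block begins
        elif name.endswith(", Last>"):
            by_category[cat].append((range_start, cp))
            range_start = None
        else:
            by_category[cat].append((cp, cp))
    return dict(by_category)

table = parse_unicode_data(SAMPLE.splitlines())
for cat, spans in sorted(table.items()):
    print(cat, [(hex(a), hex(b)) for a, b in spans])
```

A generator along these lines could emit the resulting table as static data in whatever language the engine is built in.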

Of course, UnicodeData.txt does not define the various ranges/blocks/scripts, which is disappointing.

Nov 4 2021, 10:17 pm
LemonInTheDark	The categories are defined in there, but by nature they're not really next to each other, so ranges would be somewhat painful to implement, I'd think. I have a feeling it'd just end up being a list of all the chars in the category, since they can be very spread out.

Mar 11 2022, 6:47 pm
LemonInTheDark	bump