It'd be very nice to have support for \p{} and \P{} in BYOND's regex.
I've run into issues filtering Unicode; there's just no good way to do some things without being able to check for categories, especially when dealing with marks and such.
ID:2729476
Oct 30 2021, 2:17 am
Nov 1 2021, 1:21 am
Hrm. This notation is very new to me so I'll have to study it. In principle the main thing is just having a way of converting those categories to character ranges.
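For illustration, here is a rough sketch of what converting a category to character ranges could look like. It is written in Python only because its standard `unicodedata` module exposes general categories; it is not BYOND code, and the helper name `category_ranges` is made up for this example.

```python
import sys
import unicodedata


def category_ranges(prefix):
    """Return inclusive (start, end) code point ranges whose general
    category starts with `prefix` (e.g. "M" for all marks, "Lu" for
    uppercase letters). Hypothetical helper, not a BYOND API."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp  # a new run begins here
        elif start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


digit_ranges = category_ranges("Nd")  # what \p{Nd} would have to match
upper_ranges = category_ranges("Lu")  # what \p{Lu} would have to match
```

ASCII digits and ASCII capitals each come out as a single range here, but how compact the full list is depends heavily on the category.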
I suppose so, yeah. It's partly a way of offloading work from the end developer, since it means not having to maintain a list of the thousand or so Unicode chars to filter for each group.
The more I've looked at this, the more I think my idea of breaking this out into ranges won't work. I think it'll need some better way of handling it. Lowercase chars for instance don't always fall into neat ranges, but in many language blocks they alternate with the uppercase chars.
Parsing the UnicodeData.txt file isn't going to be a problem though. The current source actually uses a BYOND project to parse the file and create C++ code that's statically linked into the project. Of course, UnicodeData.txt does not define the various ranges/blocks/scripts, which is disappointing.
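As a rough idea of what pulling categories out of UnicodeData.txt involves, here is a minimal Python sketch. This is not the BYOND parser mentioned above, and `parse_categories` is a made-up name. It assumes the usual layout of the file: semicolon-separated fields with the code point in field 0, the name in field 1, and the general category in field 2, plus `<..., First>` / `<..., Last>` pairs for large blocks. The sample lines are illustrative, not a full copy of the file.

```python
from collections import defaultdict

# A few lines in the shape of UnicodeData.txt (illustrative sample only).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_categories(text):
    """Map each general category to a list of inclusive (start, end) entries."""
    by_category = defaultdict(list)
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp  # remember where the block begins
        elif name.endswith(", Last>") and range_start is not None:
            by_category[category].append((range_start, cp))
            range_start = None
        else:
            by_category[category].append((cp, cp))  # ordinary single-character line
    return dict(by_category)


categories = parse_categories(SAMPLE)
```

Block and script names would still have to come from somewhere else, since this file only carries per-character properties.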
The categories are defined in there, but by nature they're not really next to each other, so ranges would be somewhat painful to implement, I'd think. I have a feeling it'd just end up being a list of all the chars in the category, since they can be very spread out.
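To make the "list of all the chars in the category" idea concrete, here is a small Python sketch that stores a category as a flat set of code points and uses it to filter marks, roughly what \P{M} would allow. It relies on Python's standard `unicodedata` module rather than anything in BYOND, and the names `MARKS` and `strip_marks` are invented for the example.

```python
import sys
import unicodedata

# Every code point whose general category is a mark (Mn, Mc or Me),
# kept as a flat set rather than as ranges.
MARKS = frozenset(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("M")
)


def strip_marks(text):
    """Drop every character that belongs to the mark categories."""
    return "".join(ch for ch in text if ord(ch) not in MARKS)


# "e" followed by U+0301 COMBINING ACUTE ACCENT, twice
cleaned = strip_marks("e\u0301te\u0301")
```

A set lookup doesn't care how scattered the members are, which is the appeal over ranges; the cost is memory for the larger categories.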