ID:1768158
 
Keywords: 1103, 255, ja, russian
Applies to: DM Language
Status: Open

Good day. To begin with: this is an old problem for the Russian community, and perhaps for the Turkish one too (if there even is such a thing).
We know that 0xFF is a reserved character and that you can't simply give us the letter, but it is quite a significant problem for us.
Replacing the letter 'я' with the HTML entity &#255 is an easy task when you don't have to do it on a large project like SS13. There are plenty of things that need to be considered:
many ways of inputting information into the game, multiple text-processing paths (which don't recognize the &#255 entity as a single letter!), and on top of this, the necessity of outputting the same text in popup windows, where the symbol 'я' has to be replaced with &#1103 instead.
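For reference, the two numbers in those entities come straight from the encodings involved; a quick Python check (illustrative only, not BYOND code) confirms both:

```python
# 'я' is Unicode code point 1103, so its HTML entity is &#1103;.
# In the windows-1251 code page the same letter is the single byte
# 0xFF (255), which is exactly the byte BYOND reserves internally.
ya = "я"

print(ord(ya))              # 1103 -> the entity &#1103;
print(ya.encode("cp1251"))  # b'\xff' -> the reserved 255 byte
```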

I would like to propose simply replacing 0xFF with something else, but after a little research I don't think it is so easy, as it seems to affect both the server side and the client side.
Perhaps some sort of solution can be found?
For example, this symbol could be made a server-side configurable parameter that is synchronized with the client after connection.
For those who don't need it, nothing would change. For non-English communities it would be just a one-symbol change in the configuration that fixes tons of code-maintenance trouble.

Maybe there are other solutions that would satisfy everyone? We hope to see a discussion about this issue.
It's kind of an iffy thing. There are a few problems in play:

1) Some of the routines that handle formatting codes exist on both the server and client. For example, there's one that strips out formatting codes entirely, which is used on the client end quite a lot. This means a .dmb flag saying it uses a different formatting character or string type would fail.

2) BYOND uses strict ASCII; it does not use UTF-8 or other more advanced character schemes. It would be fairly difficult--although honestly, I don't think impossible--to adjust for that on the server end. On the client end it's a lot iffier.
UTF-8 is a solution to the problem too, most likely.
Or something else that would ease the problem with 'я'.
We would be glad to see any progress in this direction.
Up! Any good news?
I wouldn't hold your breath; UTF-8 conversion is a boring and tedious task.
Someday it should happen.
It's a pity that we can't promote things like this, as on bountysource.com or somewhere similar.
Please do it! <3
Another up
HOT OFF THE PRESS:

I just got UTF-8 to work for SS13 chat, so Russians should have no issues with that anymore. The system really only works as a replacement for regular output controls.

https://github.com/d3athrow/vgstation13/pull/13537

Here's how it works.

Lummox claims BYOND is strict ASCII, which is a blatant lie. The client renders characters with the code page Windows assigns to it (locale dependent). For English users, for example, this is windows-1252; for Russians, windows-1251. This means that if you enter a character like the Cyrillic capital A on a Russian computer, people in America will see it as a Latin capital A with a grave accent. (I can't use the actual UTF-8 characters because of course the forums can't handle this shit either...)

More troubling is that BYOND uses the upper character ranges as special codes for text macros, so those get filtered out. The backwards R, for example, is 0xFF in windows-1251, and 0xFF is what BYOND uses for \improper...
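The locale mismatch is easy to reproduce outside BYOND; here is a Python sketch of how one raw byte renders under the two code pages mentioned:

```python
# The same raw byte means different letters under different Windows
# code pages, which is why text typed on a Russian client can look
# wrong on an English one.
raw = b"\xc0"
print(raw.decode("cp1251"))  # Cyrillic capital A (windows-1251)
print(raw.decode("cp1252"))  # Latin capital A with grave (windows-1252)

# And the collision with BYOND's macro byte:
print(b"\xff".decode("cp1251"))  # 'я' -- the byte BYOND uses for \improper
```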

So how do you get around this?

About half a year ago, we ported the chat replacement made by Goonstation over to /vg/. It's effectively the same as the regular output control, except written entirely in HTML, CSS, and JS so it can run in a browser element, and it's a TON more flexible and powerful.

Once upon a time, while writing the coding standard for the code base, I came to HTML documents and wondered: "wait, what character encoding should be used? UTF-8 doesn't work, right?"

So I tried UTF-8 in an HTML document in BYOND. It turns out it works if you set the charset in the <head> of the HTML document. I soon figured out that goonchat works with it too, but sending the UTF-8 strings to the chat was failing because BYOND decodes the message in the output() call...

So then I thought of something: what if I URL-encode the message twice and let the JavaScript decode it itself, so BYOND doesn't fuck up the message?

And success! It works. Because BYOND strings are pure byte strings, they are sent 100% literally to the client, and it all works great.
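A minimal Python model of that double-encoding round trip (the real code lives in the linked PR; this only illustrates the layers):

```python
from urllib.parse import quote, unquote

msg = "привет"

# Server side: percent-encode the UTF-8 bytes twice.
wire = quote(quote(msg))

# BYOND's output() call strips one layer of encoding...
after_byond = unquote(wire)

# ...and decodeURIComponent() in the browser undoes the second,
# so the original UTF-8 text survives untouched.
restored = unquote(after_byond)
assert restored == msg
```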

So that's piece two of the puzzle.

The problem now, though, is that the URL decoding on the JavaScript side breaks on non-UTF-8 input. The primary source of such messages being... client input.

So we need to detect the encoding of the client and transform the windows-xxxx text into UTF-8.

Detecting the client encoding was surprisingly easy: the non-standard document.defaultCharset variable on the JavaScript side in the client's browser actually reports the client's encoding.

Now, BYOND being BYOND, it has no utilities for converting strings. Puzzled by this problem, I took the logical option: a DLL!

I made a DLL in Rust that can convert the encoding to UTF-8. Works great.
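The conversion step the DLL performs can be sketched in Python (the actual implementation is in Rust; the function name here is illustrative):

```python
# Decode raw client input from its local Windows code page (detected
# via document.defaultCharset) into text, which can then be handled
# everywhere as UTF-8.
def to_utf8(raw: bytes, client_charset: str) -> str:
    # client_charset would be e.g. "windows-1251" for a Russian client
    return raw.decode(client_charset)

print(to_utf8(b"\xff", "windows-1251"))  # 'я'
```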

And that was everything; we'll soon be able to convert client input directly to UTF-8 and operate on UTF-8 everywhere with DLL functions. I'm still working on UTF-8 string functions such as find.

BYOND doesn't mangle everything as long as we don't let its string functions even lay a bloody finger on my precious UTF-8.
Wow, PJB3005, awesome work. It's a workaround, but it looks very good! We should definitely try this.

Also, how does it work with BYOND output like << to a file (logs, etc.), or as default text in inputs (character records in the lobby, for example)?
For us that was an unsolvable problem; BYOND "eats" 0xFF (and perhaps not only that) there.
It's not a lie; \xFF is used for WAY more than just \improper. UTF-8 should never contain a byte equal to \xFF either, so the other string functions (especially the hard-coded ones, e.g. text2ascii, which never care about \xFF) shouldn't be interfered with by UTF-8.
Although BYOND does have a fun bug in that it only uses the ASCII MFC functions (not UTF-8!), so you'll end up killing some clients by passing UTF-8 to the browser because of an "Improper Argument" error.
In response to PJB3005
Excellent work on that UTF-8 converter. I should clarify a couple of things, though.

PJB3005 wrote:
Lummox claims BYOND is strict ASCII which is a blatant lie. The client will render characters with the CodePage windows assigns to it (locale dependent).

As you know, "strict ASCII" isn't really so much a thing unless you're only counting 7-bit characters. When I say BYOND uses ASCII only, I'm really using that as a verbal shorthand to say BYOND only supports up to 8-bit character encodings. Or in other words, not Unicode-aware.

For example for English people this is windows-1252, Russians 1251. This means that if you enter a character like the thing that looks like a capital A on a Russian computer, people in America will see that as a capital A with a grave. (I can't use the actual UTF-8 characters because of course the forums can't handle this shit either...

The forums can handle this a little bit, at least under the hood. Dealing with encodings is often a nightmare in web settings, though, and any issues the forums have with that are happening because of a breakdown somewhere in the process of going from user input to script to database. The database itself handles UTF-8 just fine.

More troubling is that BYOND uses upper character ranges as special code for text macros so those get filtered out. backwards R for example is 0xFF in windows-1251 and 0xFF is what BYOND uses for \improper...

0xFF is actually used by BYOND as a marker for all format codes.

So those are the nuts and bolts of it.

For some time now I've actually been pondering the possibility of some internal UTF-8 support, and trying to develop more support for it in the external code. The frontend is obviously a big problem because all of it has been written with ANSI characters in mind, and even significant parts of the backend are ANSI.

On the backend, UTF-8 support would mean a few changes under the hood. Let's say for instance that strings had a dual format where they could flag themselves as being in UTF-8 if they had Unicode characters (or a non-format 0xFF). While that would generally be an efficient format to store them in, and would likely be the format I'd want them to use for output, indexing and searches would be useless because those rely on random access. So it'd be desirable in those cases to have some kind of hybrid storage that could store the expanded, wide-character string. It gets very tricky.
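The random-access problem can be seen in a few lines of Python: byte offsets and character offsets diverge as soon as a multi-byte character appears.

```python
# In UTF-8, indexing by character is no longer a constant-time byte
# lookup: 'я' occupies two bytes, so character and byte indices drift.
s = "яblock"
b = s.encode("utf-8")

print(len(s))  # 6 characters
print(len(b))  # 7 bytes ('я' takes two)
print(b[0:2])  # the two bytes of 'я'
# A naive byte-indexed lookup at offset 1 would land in the middle
# of 'я', which is why a fixed-width (wide-character) copy is
# attractive for indexing and searching.
```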
I'd like to say that the UTF-8 spec does not allow any byte of an encoded character to be \xFF. I've softcoded a fully functional UTF-8 string library to prove it's possible, too. UTF-8 is FULLY ANSI-compatible (given that the 8th bit isn't used in ANSI but is used to distinguish UTF-8 characters).
UTF-8 is so nice because it's easy to implement on top of other things. You wouldn't have to change the \xFF formatting setup because it's an invalid UTF-8 byte (you'd just have to make sure the code-point converter you use wouldn't barf at seeing it and/or would simply ignore it).
You'd have to normalize UTF-8 strings for purposes of comparison, but that's not very hard either.
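Both claims check out in Python: no valid UTF-8 sequence contains the byte 0xFF (an exhaustive scan of every encodable code point confirms it), and normalization for comparison is a one-call affair. The example uses 'й', which has both a composed and a decomposed form.

```python
import unicodedata

# (1) 0xFF never appears in valid UTF-8: continuation bytes stop at
# 0xBF and lead bytes at 0xF4, so an exhaustive check passes.
assert all(
    0xFF not in chr(cp).encode("utf-8")
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF  # surrogates are not encodable
)

# (2) Normalizing for comparison: 'й' composed vs decomposed.
composed = "\u0439"           # single code point
decomposed = "\u0438\u0306"   # 'и' + combining breve
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
print("both claims hold")
```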

You'd make extensive use of MultiByteToWideChar, but that's fine as long as you're smart with your memory tracking (e.g. it's only necessary for what would appear on MFC forms).
In response to Somepotato
The biggest deals to me IMO are the problems of upgrading the frontend to use it, and dealing with random access issues on the backend.
In response to Lummox JR
Lummox JR wrote:
The biggest deals to me IMO are the problems of upgrading the frontend to use it, and dealing with random access issues on the backend.

Hands down, the most obnoxious part would definitely be getting Windows to cooperate (I can already foresee the number of MultiByteToWideChar calls being ridiculous).
The backend issues won't be as big of a deal as you'd think, all things considered, IMO; you wouldn't have to change much, in theory.
I just remembered suggesting a conversion for macro mode in UTF-16. Though this is a duplicate request, here are my old details for UTF-16 (and possibly UTF-32) support: http://www.byond.com/forum/?post=108199
UTF-16/32 support is unnecessary and gives no real benefit over UTF-8 (other than UTF-16 being easier to implement on Windows).
Yeah, UTF-32 would definitely be unnecessary (in fact, overkill) considering very few real applications ever use it.

Its being a very old version of the request makes me want to revise it. :p

Edit: Actually, I almost forgot that I was thinking of some other new plans for it as well: something to deal with localization. I haven't quite planned it out yet, though.