ID:2842286
 
(See the best response by Ter13.)
Code:
proc/SanitizeText(syntax)
syntax = replacetext(syntax," "," ")
syntax = replacetext(syntax,"\n"," ")
syntax = replacetext(syntax," "," ")
syntax = replacetext(syntax," "," ")
syntax = trimtext(syntax)
return syntax

proc/Encapsulate(syntax,mram,counter,start,end=start)
var a,b,list/tram = new
for()
a = findlasttext(syntax,start)
b = findtext(syntax,end,a+length(end))
if(!a||!b){break}
tram["_[++*counter]"] = trimtext(copytext(syntax,a+1,b))
syntax = copytext(syntax,1,a) + " _[*counter] " + copytext(syntax,b+length(end))
*mram = *mram + tram
return syntax

proc/Compile(syntax)
var list/ram = new, counter = 0
syntax = SanitizeText(syntax)
syntax = Encapsulate(syntax,&ram,&counter,"(",")")
syntax = Encapsulate(syntax,&ram,&counter,"{","}")
syntax = SanitizeText(syntax)
world << "<b>Syntax</b><br><tt>[syntax]<br><b>RAM</b>"
for(var/i in ram)
world << "<tt>Token([i]) = Value([ram[i]])"

var syntax = @{" i = 123 j = 321 if(j >= i){ print("[j] >= [i]") } else{ print("[i] > [j]") } "}

mob/Login()
Compile(syntax)


Problem description:
What would be the best coarse of action to go from here, How would you guys read the tokens, in what order, how should they be contained? How do I make trees?


Currently I'm encapsulating it all in order starting with () and then {}

The final results are



The tokens below the syntax are the base tokens that get read resulting in the other tokens being read.
Best response
When compiled, high level patterns have low-level analogue patterns. Almost all of them using what are called jump points.

There are two jump point instructions that you will be interested in, because you can emulate almost all high level patterns using them.

Let's call them jmp and cndjmp.

Jmp is a simple command. jmp tells the interpreter to move the spindle position (read index) to a specific instruction.

This covers the high level pattern: goto.

Cndjmp is a little more complicated. A cndjmp can take two values. One is a boolean conditional. The second is an instruction address. If the boolean condition evaluates to true, jump to the specified instruction address. If it is not true, discard the condition and the jump address.

This is the magic that makes if, else, switch, while, and for patterns work.

if(1>0)
return 2
return 3


This would boil down to, in instructions:

1: push 0
2: push 1
3: op greater (pop pop)
4: op not (pop push)
5: cndjmp 8 (pop)
6: push 2
7: return (pop)
8: push 3
9: return (pop)


Note that a cndjmp can be tuned to jump only when a value is NOT true, which will remove the not operation here, or you can have a cndjmp that has an else jump point included in the bytecode. The trick is, while you are parsing your tree, you need to be keeping track of label points and scope traversals, and then loop back to fill in the jump points after you have completed compiling the scope.

In essence, trees of logic patterns aren't really trees when they hit the interpreter. It's like a choose your own adventure book the computer is reading.
If anyone has additional comments to leave, I'm still trying to figure out where to go from here.
I didn't really go into stacks. Do you get the push/pop actions in my notation?

Also, have you heard of the shunting yard algorithm? That was a big help for me figuring out how to make virtual machines.
In response to Ter13
Ter13 wrote:
I didn't really go into stacks. Do you get the push/pop actions in my notation?

I "understand" it, but getting all the pieces like I am above, from there I do not know how I'd go about translating that into what you've posted.
You'll need a stack while you are compiling that stores information about the scope you are in.

You do need to keep in mind that you can't just treat each block as a separate element, because multiple if/elseif/else blocks can be chained, and should all share the same endif position. You will need to update all of the endif jmp commands to the same offset once you have reached the end of the chained statement. This means that an if() block doesn't actually end until you process the first command that isn't an else-if or else statement. However, each if / else if block will update its jump point to the next block's condition instructions immediately upon ending if the next statement isn't an else or unrelated unscoped statement.
A tokenizer is your next step, which is basically just reading characters from a certain position to turn them into an identifier or operator. In BYOND this is called the lexer. This sort of thing can be context-dependent to a degree, but its main job is just to advance the cursor and spit out a token.

In DM's lexer, the Token object is basically a type, optional ID, and a text buffer (string object). The type corresponds to a primitive type like number, string, word (any higher-level operator or identifier), raw string, and then a number of low-level operators like left and right parentheses, brackets, braces.

The bigger difficulty is how to turn tokens into an expression tree as you receive them. DM uses a recursive descent parser, meaning it starts out on a function that looks for the lowest possible operator precedence and calls higher and higher precedences after that. The downside to this is that it's a heck of a lot of recursion; the upside it's relatively easy to understand and it maintains stacks very well.

A recursive descent parser basically looks like this:
// assignment would be the lowest level
ParseBinaryAsn()
// parse one level higher in precedence
Expression expr = ParseTernary()
// look at the next token (that's already been read)
if(token.token_type == OPERATOR_TOKEN && token.precedence = OPERATOR_ASSIGNMENT)
parser.NextToken() // read next token and store it in token
Expression e = new(token)
e.AddChild(expr)
expr = e
// assignments are right-to-left precedence, so recurse this way
expr.AddChild(ParseBinaryAsn())
return expr

Left-to-right precedence, which is more common, is similar.

ParseBinaryAdd()
// parse one level higher in precedence
Expression expr = ParseBinaryMul()
// look at the next token (that's already been read)
while(token.token_type == OPERATOR_TOKEN && token.precedence = OPERATOR_ADD)
parser.NextToken() // read next token and store it in token
Expression e = new(token)
e.AddChild(expr)
expr = e
// assignments are right-to-left precedence, so recurse this way
expr.AddChild(ParseBinaryMul())
return expr

And for unary operators, it depends on if you have a prefix or postfix.
ParseUnaryPrefix()
if(token.token_type == OPERATOR_TOKEN && token.precedence = OPERATOR_PREFIX)
parser.NextToken() // read next token and store it in token
Expression e = new(token)
// recurse
e.AddChild(ParseUnaryPrefix())
return e
return ParseUnaryPostfix()

ParseUnaryPostfix()
Expression expr = ParseValue() // identifier, string, number, etc.
while(token.token_type == OPERATOR_TOKEN && token.precedence = OPERATOR_POSTFIX)
parser.NextToken() // read next token and store it in token
Expression e = new(token)
e.AddChild(expr)
expr = e
return expr

Mind you recursion is not always going to work great in BYOND, but this is an example of how these sorts of parsers work.
Can't seem to find a better way to do it besides this.

var syntax = @{"
i = 123
j = 321
k = 0
print("i = [i], j = [j], k = [k]")
if(j >= i){
        print("[j] >= [i]")
        if(k == 1){
                print("k = [k] - pass")
        } else {
                print("k = [k] - fail")
        }
}
else{
        print("[i] > [j]")
}
"}

mob/Login()
Execute(arglist(Compile(syntax)))

proc/ReadToken(token,list/ram,list/variables)
while(ram.Find(token))
token = ram[token]
if(token[1]=="\"")
var a,b
for()
a = findtext(token,"\[")
b = findtext(token,"]")
if(!a||!b){break}
token = "[copytext(token,1,a)][ReadToken(copytext(token,a+1,b),ram,variables)][copytext(token,b+1)]"
return copytext(token,2,length(token))
return variables.Find(token) ? variables[token] : (text2num(token) ? text2num(token) : token)


proc/Execute(list/tokens,list/ram,list/carryover)
if(istext(tokens)){tokens = splittext(tokens," ")}
var list/variables = new,sp = 1 // Starting Position
if(carryover){variables += carryover}
var list/tempdata, tempval // Temp Storage
#define SKIP(i) sp = p+i+1; continue RESET
RESET
for()
for(var/p = sp to tokens.len)
switch(tokens[p])
if("=")
variables[tokens[p-1]] = text2num(tokens[p+1]) == null ? tokens[p+1] : text2num(tokens[p+1])
SKIP(1)
if("print")
world << ReadToken(tokens[p+1],ram,variables)
SKIP(1)
if("if")
tempval = ReadToken(tokens[p+1],ram,variables)
tempdata = splittext(tempval," ")
switch(tempdata.len)
if(3)
tempdata[1] = ReadToken(tempdata[1],ram,variables)
tempdata[3] = ReadToken(tempdata[3],ram,variables)
switch(tempdata[2])
if(">=")
if(tempdata[1] >= tempdata[3])
Execute(ReadToken(tokens[p+2],ram,variables),ram,variables)
SKIP(2)
else if(tokens[p+3] == "else")
Execute(ReadToken(tokens[p+4],ram,variables),ram,variables)
SKIP(3)
if("==")
if(tempdata[1] == tempdata[3])
Execute(ReadToken(tokens[p+2],ram,variables),ram,variables)
SKIP(2)
else if(tokens[p+3] == "else")
Execute(ReadToken(tokens[p+4],ram,variables),ram,variables)
SKIP(3)
break
Parsing a language like C is complex enough that it can fill a university course. Which is not to say it's particularly *hard* - a university student does about 8 courses in a year, so it's not like a giant amount of study, but it is to say that there's a lot of stuff to learn to get from here to there, and it's definitely way too complex to properly explain on BYOND forums.

Parsing a language like BYOND might be simpler because you have the indentation to guide you, but only a bit. The syntax inside if() in BYOND isn't any simpler than the syntax inside if() in C.

To make things worse, BYOND is generally not a great language for writing heavily algorithmic code, like parsers or pathfinding. You *can* do it, but it's a pain.

If you are willing to accept the scripting language being much uglier, you could make the parsing much simpler. A good example is Reverse Postfix Notation. Instead of writing 1 + (3 * 4), in RPN you write 1 3 4 * +. Now the computer processes this strictly from left to right - look at the number 1, the number 3, the number 4, multiply the last 2 things together (that's 3 and 4), add the last 2 things together (that's 1 and 12, because the 3 and 4 are no longer considered because they were multiplied. It uses a stack to do this). Because the computer just goes from left to right, the parsing is very easy. You just find the next space, then do that thing until the space, then set the current position after the space and repeat. To make things like if/then you could have jmp and cndjmp instructions like in Ter13's answer.

Then your script could be something like this:
1. 123 set:i
2. 321 set:j
3. get:j get:i >= iftrue:4:else:5
4. get:j " >= " get:i concat:3 print goto:6
5. get:i " > " get:j concat:3 print
6. end

(iftrue:4:else:5 meaning go to 4 if true else go to 5, and concat:3 meaning concatenate 3 thing to make a string)

I know that's not what you're looking for - I'm just trying to suggest a thing that might be easier to create and still be useful.

---

What you have already written is pretty clever, but it's not the standard way to parse things and I'm afraid you might run into a dead end somewhere. E.g. you will have trouble parsing things without parens this way: 1 + 3 * 4.