Using the new regex stuff to extract all occurrences of a substring between two substrings

BYOND Forums

ID:2028692

Feb 1 2016, 2:39 pm

Metamorphman

Code:

client/verb/test()
    var/regex/r = regex("[REGEX_QUOTE("((")](.*)[REGEX_QUOTE("))")]")
    r.Find("((this)) ((that)) ((something)) ((else))")

    for(var/i = 1 to length(r.group))
        src << "[i]: [r.group[i]]"

Problem description:
Given the string: "((this)) ((that)) ((something)) ((else))",

What I'd like to extract is:
1: this
2: that
3: something
4: else

What I actually get is:
1: this)) ((that)) ((something)) ((else

Any ideas?

Feb 1 2016, 3:47 pm (Edited on Feb 1 2016, 4:31 pm)
Super Saiyan X	var/regex/r = regex("\\((\[^()]+)\\)")

Feb 1 2016, 4:32 pm

Metamorphman

That only returns the first one, 'this'. Specifically, I'm looking for something generalizable to replace/update these procs:

proc/between(t, a, b)
    var/p1 = findtext(t,a)
    if(p1)
        p1+=length(a)
        var/p2 = findtext(t,b,p1)
        if(p2)
            return copytext(t,p1,p2)

proc/between_all(t, a, b)
    var/p1 = findtext(t,a)
    var/l1 = length(a)
    var/l2 = length(b)
    .=list()
    while(p1)
        var/pk = p1+l1
        var/p2 = findtext(t,b,pk)
        if(p2) .+=copytext(t,pk,p2)
        else return
        p1 = findtext(t,a,p2+l2)

between("((this)) ((that)) ((something)) ((else))", "((", "))") = "this"

between_all("((this)) ((that)) ((something)) ((else))", "((", "))") = ["this", "that", "something", "else"]

Feb 1 2016, 5:14 pm (Edited on Feb 1 2016, 6:04 pm)

Best response

Multiverse7

Try these:

proc/between(t, a, b)
    a = REGEX_QUOTE("[a]")
    b = REGEX_QUOTE("[b]")
    var/regex/r = regex("(?:\[[a]]+)(\[^[a][b]]*)(?:\[[b]]+)")
    r.Find("[t]")
    if(length(r.group))
        . = r.group[1]

proc/between_all(t, a, b)
    a = REGEX_QUOTE("[a]")
    b = REGEX_QUOTE("[b]")
    var/regex/r = regex("(?:\[[a]]+)(\[^[a][b]]*)(?:\[[b]]+)", "g")
    . = list()
    while(r.Find("[t]") && length(r.group))
        . += r.group[1]

Calling between_all("((this)) ((that)) ((something)) ((else))", "(", ")") should return the list that you expect.

It would be nice if modifiers could be used to make recursive capture groups, so that we wouldn't need a loop, but I don't know how feasible that is.

Feb 1 2016, 5:34 pm (Edited on Feb 1 2016, 11:51 pm)

Multiverse7

These should use exact matches, if you need them to be more specific.

All that was needed was to remove some brackets.

Edit: I corrected the matching so that the group list isn't used.

proc/betweenXact(t, a, b)
    a = REGEX_QUOTE("[a]")
    b = REGEX_QUOTE("[b]")
    var/regex/r = regex("(?<=[a])\[^[a][b]]*(?=[b])")
    r.Find("[t]")
    . = r.match

proc/between_allXact(t, a, b)
    a = REGEX_QUOTE("[a]")
    b = REGEX_QUOTE("[b]")
    var/regex/r = regex("(?<=[a])\[^[a][b]]*(?=[b])", "g")
    . = list()
    while(r.Find("[t]"))
        . += r.match

Feb 1 2016, 5:35 pm

Metamorphman

Thanks, that does the trick. As for updating those procs: I just wrote another version using splittext and after doing some speed tests it seems to be much faster than using regex. In case anyone's interested (credit to ssx for some speed boosts):

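proc/between_2(s, a, b)
    var/k[] = splittext(s, a)
    // k[2] only exists if the opening delimiter was found
    if(k.len > 1)
        k = splittext(k[2], b)
        // require the closing delimiter too, as between() does
        if(k.len > 1) return k[1]

proc/between_all_2(s, a, b)
    . = list()
    var/i[] = splittext(s, a)
    // i[1] is the text before the first opening delimiter, so start at 2
    for(var/j = 2 to length(i))
        var/k[] = splittext(i[j], b)
        if(length(k) > 1) . += k[1]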

Feb 1 2016, 5:39 pm
Multiverse7	That's cheating though! lol

Feb 1 2016, 6:10 pm
Multiverse7	I cleaned them up and reduced the proc calls. Regex is probably better for more complex tasks though, where it can end up being much more efficient.

Feb 1 2016, 7:31 pm
Lummox JR	The .* in your original regex is greedy by default. What you want is .*? instead, so it's non-greedy. Alternatively, you could use [^)]* which would be just as good.
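
For anyone who wants to see the non-greedy version spelled out, here is a minimal sketch. It assumes both delimiters are non-empty and that the lazy quantifier .*? behaves in DM as it does in other regex flavors. The proc name between_all_lazy is made up for this example and is not part of BYOND or of the procs posted above.

```dm
// Hypothetical helper: collect every substring found between a and b.
proc/between_all_lazy(t, a, b)
    . = list()
    // REGEX_QUOTE escapes the delimiters so "((" and "))" match literally.
    // (.*?) is lazy, so it stops at the first closing delimiter, not the last.
    var/pattern = "[REGEX_QUOTE(a)](.*?)[REGEX_QUOTE(b)]"
    var/regex/r = regex(pattern, "g")
    // With the "g" flag, each Find() resumes where the previous match ended.
    while(r.Find(t))
        . += r.group[1]

client/verb/test_lazy()
    var/list/found = between_all_lazy("((this)) ((that)) ((something)) ((else))", "((", "))")
    for(var/i = 1 to length(found))
        src << "[i]: [found[i]]"
```

Unlike the character-class versions, this treats "((" and "))" as whole delimiters, so it is intended to give the four items from the original post without having to pass single characters.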