ID:1569706
 
Lists sometimes contained bogus references.
BYOND Version:506
Operating System:FreeBSD
Web Browser:Firefox 32.0
Applies to:Dream Daemon
Status: Resolved (506.1247)

This issue has been resolved.
We decided to try the latest beta on our servers; one is still running it (very unstably).

The end result is that everything down to simple loops, arithmetic, and list operations is utterly busted. There are crashes everywhere, and the error log, currently 27MB, is a shower of refcount errors, bad objs, undefined global procs, bad images, crashes involving premature return values, et al.

This issue also seems to be isolated to the FreeBSD (and possibly Linux) version, as I cannot seem to reproduce it locally.
(Also noteworthy: the FreeBSD and Linux versions seem to be one build behind Windows.)

I can upload the error log somewhere, but I'm not sure if it would be of much help.
We have a fairly noddy app (read: chat program, with some persistent backing through MySQL calls) running on the 506.1245 Linux build here, fairly happily.

What are you hosting, and is that something I can grab and have a crack at hosting on the Linux build?
The FreeBSD/Linux builds are still at 1245 because the changes in 1246 didn't impact servers, so no worries there.

Are you experiencing the problems with threads on or off? (They're now off by default, unless overridden by command line or daemon.txt.)

Slurm mentioned to me he was getting some obj ref issues, but I still need to track those down with him. (He had issues in sd_DAL but it appears to be heavily modified, not the actual library at all.) Interestingly, that too was on the FreeBSD build. I wonder if something about the FreeBSD build just didn't build correctly, or if there's something on that architecture messing with the new code. Doesn't make a great deal of sense to me that that would be the case, but it's odd that the only known issues with the new stuff have shown up on that platform.

A test case I can run myself would help, though not being able to run on FreeBSD does make things a smidge problematic--I may simply never see the issues you're seeing if it's FreeBSD-specific. I did test the new optimization code on an SS13 variant, so it seems like I should have seen an issue there.
Threads off. It wouldn't get past the lobby with threads on.

I'm the one who wrote that sd_DAL modification (which was then modified again by a bunch of other people). It's not limited to there, however; communications code and pooling code are also affected (both of these use pooling).

I'm not sure how SS13 variants keep working, however; even chemistry stopped working properly and didn't remove things from the reagents list properly.

I'm writing this since it didn't seem Slurm had brought it up at all, but apparently he has.

I'll see if I can find a specific test case and whether it's only a FreeBSD issue.
All Slurm saw were some bad ref error messages, but he didn't mention a crash with threads off.

This is high priority as we want to move 506 into stable, but I hate to do that if FreeBSD is choking.
It reportedly crashed several times. I sadly wasn't around for the test, but both servers are seeing crashes with BUG: Doubly premature return value (followed by a few refcount errors), even when not on 506.

I suspect some of them were simply crashing from overload after the server was knocked into infinite loops by the bugs this created (i.e. since chemistry was slightly broken, life smoke would create an infinite avalanche of NPCs).
Running Beta, threads off on SS13 and having no issues here, at least nothing I didn't cause myself.

(It's a Linux dedicated server.)
Seems to be happening on Linux too: threads on instantly crashes, threads off has a descent into madness.

Note that Laser50 and I are running radically different branches of SS13, however.
Tobba, it might help if I can have a copy of the source to compile and test. Chances are if your code is having issues on Linux, it's having issues on Windows too.

I've tested the new performance code with a relatively recent Baystation 12 build (threads off). However, it was not a test with other players; I logged in, ran around, made sure nothing cropped up. Given the level of background activity done by the game, I would tend to expect that even that much would have triggered any bad ref issues.
I had Baystation running (latest code, customized quite a bit), and managed to get to 60,000 MC controller iterations with threading enabled without a lot of trouble.

Obviously the inactivity kick kicked me off and there's a possibility sleep_offline kicked in, but I doubt that changes a lot.

Although, Tobba, what source code are you running? TG or Bay?
Goon. Anything that involves pooling seems to cause a hellfire for some reason.
Ah, like that. Actually took me a bit to figure that one out.

To be honest, I believe I have experienced this issue once or twice before, long ago. I suppose you've already tried restarting the entire server, updating it, etc.?
Besides that, have you tried running it on another machine/OS?
Slurm's servers both died horribly when we tried it (FreeBSD).
When I was setting up the new EU server I decided to try 506, and it died equally horribly: an instant crash with -threads on -map-threads on. Without threading it sorta runs, but everything is messed up in strange ways, and runtimes and BUG: messages are coming from everywhere.
If you're able to give it a try, use another Linux distro; Ubuntu's working great for my server. Unless, of course, you have a whole setup going, which would make a transition rather difficult.
Again, it's high-priority to make sure that the performance changes work, at least with thread mode off. If you're having problems there must be something triggering that, and I really need something I can run tests on to narrow down the issue.
This is happening persistently with 506.1245 on FreeBSD 10.0-RELEASE-p1 amd64.

Log: http://sprunge.us/dVie
Ignore the runtimes, those are known and fixable.

First run crashed before world/New() finished. I can throw the core up somewhere if that's a help.

The refcount errors on the second one mostly reference this line:
https://github.com/Baystation12/Baystation12/blob/dev/code/modules/reagents/Chemistry-Colours.dm#L28

Another one references this:
https://github.com/Baystation12/Baystation12/blob/dev/code/game/mecha/mecha.dm#L1439
After a bunch of confused-as-hell messages to Lummox, I somehow had 506 working for a short round with only a single refcount error, instead of the old avalanche of them, and that error stemmed from a ..() line in /datum/Del.

The only differences are that LD_LIBRARY_PATH was set to .: and that 506 had been installed on top of 504.
In response to Mloc
That's quite helpful. I think I can use that to run some tests on the Baystation code I already have and see what pops up.
It's starting to seem like del() fails to clean out certain references properly, leading to the bad ref errors. Especially when one of those bad refs sneaks into a pool, it seems to wreak havoc.
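
To make the suspected failure mode concrete, here is a minimal sketch in Python (not DM, and not BYOND's actual internals). Every name in it (PooledObject, Pool, buggy_delete, correct_delete, demo) is hypothetical and exists only for illustration. It assumes a "delete" that marks an object dead but may or may not scrub the reference a pool still holds; when the reference is left behind, the pool later hands the dead object back out.

```python
class PooledObject:
    """Hypothetical stand-in for a pooled game object."""
    def __init__(self, name):
        self.name = name
        self.deleted = False


class Pool:
    """Hypothetical free-list pool that reuses released objects."""
    def __init__(self):
        self._free = []

    def release(self, obj):
        self._free.append(obj)

    def acquire(self):
        # Reuse a pooled object if one exists, otherwise make a new one.
        return self._free.pop() if self._free else PooledObject("fresh")


def buggy_delete(obj, pool):
    # Assumed bug: the object is destroyed, but the pool's reference survives.
    obj.deleted = True


def correct_delete(obj, pool):
    # Expected behaviour: destroying the object also clears remaining references.
    obj.deleted = True
    pool._free[:] = [o for o in pool._free if o is not obj]


def demo(delete):
    pool = Pool()
    obj = PooledObject("smoke")
    pool.release(obj)
    delete(obj, pool)
    reused = pool.acquire()
    # True means a dead object was handed back out of the pool.
    return reused.deleted
```

With buggy_delete, demo returns True, which is the "bad ref sneaks into a pool" situation; with correct_delete, the pool is empty again and a fresh object is created instead.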
BUG: Bad ref (4:1870225967) in DecRefCount(DM _base_os.dm:416)

I'm starting to think something has deeply exploded.
(That line contains a simple string format and list assignment.)