Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLiM fails inscrutibly on Windows when there are funny characters in the script at treeSeqOutput(..., includeModel=T) #488

Open
petrelharp opened this issue Jan 5, 2025 · 8 comments

Comments

@petrelharp
Copy link
Collaborator

petrelharp commented Jan 5, 2025

This just took me way too long to figure out. A stdpopsim test was failing on windows because, turns out, we were calling slim with a script containing this line in the header (in a block comment):

 * Tremblay and Vézina, 2000: https://doi.org/10.1086/302770

... but the script that was actually getting passed through to SLiM, as produced by python, is printed out in the Windows bash shell as

 * Tremblay and V<E9>zina, 2000: https://doi.org/10.1086/302770

... I'm working on getting the actual character there. The result is:

  • if I run the script like slim test.slim then it exits at the treeSeqOutput( ) line with code 127 and no error message; run via python with subprocess.Popen the proc.returncode is 3221226505, which seems to mean "something bad happened"
  • if I change the arguments to includeModel=F then nothing bad happens

Note that running the following script in Windows works fine; the issue is what é is turned into by python, in windows, somehow.

// here is a "special" character: é
initialize() {
	initializeTreeSeq();
	initializeMutationRate(0.0);
	initializeMutationType("m1", 0.5, "f", 0.0);
	initializeGenomicElementType("g1", m1, 1.0);
	initializeGenomicElement(g1, 0, 99999);
	initializeRecombinationRate(1e-8);
}

1 early() {
	sim.addSubpop("p1", 5);
}

10 late() { 
	catn("about to write out trees without script");
	sim.treeSeqOutput("test_noscript.trees", includeModel=F); 
	catn("about to write out trees with script");
	sim.treeSeqOutput("test.trees"); 
	catn("done writing out trees");
}
@petrelharp
Copy link
Collaborator Author

Ah-ha! I got the file in question, and ran it locally, where it gives an informative error:

terminate called after throwing an instance of 'nlohmann::detail::type_error'
  what():  [json.exception.type_error.316] invalid UTF-8 byte at index 158: 0x7A
Aborted

I think one of the meanings of the inscrutible Windows error code was "unhandled exception"; so that's probably what's going on. And, here's a script that triggers it:
test.txt

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

Hmm. I'm not sure what exactly the bug even is here. :-> JSON should support UTF-8, so I'm surprised the JSON library is throwing an exception. SLiM should catch that exception; I guess I didn't realize that the call in question could even throw, so it isn't inside any try block. But really this shouldn't happen at all.

Ah, I'd check the character encoding of the input file. It is probably in some Windows-specific character encoding, whereas modern software (including SLiM) tends to assume UTF-8 encoding unless told otherwise (and there is no way to tell SLiM otherwise, since who but a crazy person would use anything but UTF-8 nowadays? :->). I don't think there's a reliable way for SLiM to detect that a different character encoding is being used. So. I'll add a try block around the JSON encoding call, so then you should get a normal error message from SLiM; but the test would still fail, just in a more transparent way. The way to fix the test is to use UTF-8 for input files, always; and then I think everything ought to work properly anyway. The JSON library is probably throwing the error because the é character in the Windows encoding is not valid UTF-8. Does that sound reasonable?

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

Probably if that é was anywhere other than in a comment SLiM would've thrown an error itself. You can use accented characters or even emojis, in places like strings and even variable names in Eidos; but UTF-8 is assumed, so I would think/hope that SLiM would throw an error if it saw a character that wasn't valid UTF-8; there is code in EidosTokenizer to check for that, I think. But characters inside a comment just get zipped over without checking, so this invalid encoding character got past that line of defense.

Ah, Windows. Always two steps behind the rest of the world.

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

If you open test.txt in SLiMgui, the bad encoding character is immediately visible; it draws as "�". (Not sure what that will look like on your end, but it isn't an é. :->).

Interestingly, it looks like Eidos is tolerant of this character, whatever it might be in UTF-8:

	v�zina = 10;
	vézina = 5;
	print("v�zina == " + v�zina);
	print("vézina == " + vézina);

This produces output of 10 and 5, with no problems. Again, I'm not sure whether to regard that as a bug or not. :-> Probably it ought to be; if SLiM is going to assume UTF-8 (and I think it should; I don't want to get into supporting other character encodings, that's crazy-town), then probably it should error if it hits any character that violates UTF-8, even in a comment, but certainly outside a comment. That won't be 100% reliable, though; just a heuristic check. There are lots of character encodings with special characters that would pass a UTF-8 sanity check, but would render incorrectly when interpreted as UTF-8 versus when interpreted in whatever the original encoding is; there's no avoiding that.

UTF-8 files are supposed to start with a special byte-order marker or something, though (which is also just an unreliable heuristic check, but is usually accurate); I guess it looks like SLiM isn't checking for that, and probably it should. So that's probably a bug I ought to fix, too, and would've caught this problem even if the � character is in fact valid in UTF-8. [EDIT: The UTF-8 BOM marker is actually not recommended (https://stackoverflow.com/a/2223926/2752221). So SLiM really shouldn't check for it or require it. So never mind, on that front.]

Character encodings are one of the huge anachronistic tar pits of programming, along with endianness, and signed vs. unsigned integers. Things computer science ought to have handled differently, in hindsight; and endless sources of bugs...

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

Note that running the following script in Windows works fine; the issue is what é is turned into by python, in windows, somehow.

So I guess it sounds like there's a character-encoding bug in Python on Windows, too, maybe, as a part of this big ball of ick? You could report that, I suppose, if you want to. But really most software in the world contains character-encoding bugs if you poke at it. They're extremely hard to find, debug, and fix, so they are more or less omnipresent. (But it sounds like this one in Python is maybe particularly bad, if indeed you're sending proper UTF-8 into Python, and it is spitting back out non-UTF-8 text, downgrading from UTF-8 to some Windows encoding? That is a Very Bad thing to do, since it manifests as a bug even in a modern all-UTF-8 world.)

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

Also a bug, of course, that the Windows error code is so ludicrously uninformative and apparently undocumented (at least through a Google search). :-> No end of bugs here!

@petrelharp
Copy link
Collaborator Author

Ah, right. The problem is that in Windows the default writing mode is ISO-8859. Here's a script that produces the bad slim script, even on other platforms:

import os

script = """
/*
 * Tremblay and Vézina, 2000: https://doi.org/10.1086/302770
 */

initialize() {
	initializeTreeSeq();
	initializeMutationRate(0.0);
	initializeMutationType("m1", 0.5, "f", 0.0);
	initializeGenomicElementType("g1", m1, 1.0);
	initializeGenomicElement(g1, 0, 99999);
	initializeRecombinationRate(1e-8);
}

1 early() {
	sim.addSubpop("p1", 5);
}

10 late() { 
	catn("about to write out trees without script");
	sim.treeSeqOutput("test_noscript.trees", includeModel=F); 
	catn("about to write out trees with script");
	sim.treeSeqOutput("test.trees"); 
	catn("done writing out trees");
}
"""

with open("written.slim", "w", encoding="ISO-8859-1") as f:
    f.write(script)

I think that if there is anything to do here, it is to catch the json exception and pass it along in the appropriate way, so that it's visible in Windows. But feel free to close if you would rather.

@bhaller
Copy link
Contributor

bhaller commented Jan 5, 2025

No, there is definitely at least one bug here for me to fix, maybe more. Thanks for the script, that will be useful. :->

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants