choice randomization: better approximation of JR behaviour, fixes #49 #241

Open
wants to merge 11 commits into
base: main
from
7 changes: 7 additions & 0 deletions .changeset/strange-brooms-rush.md
@@ -0,0 +1,7 @@
---
"@getodk/xpath": patch
---

Choice list order randomization seed handling: better correspondence with JavaRosa behaviour,
including the addition of derivation of seeds from non-numeric inputs.
Previously, entering a non-integer in a form field seed input would result in an exception being thrown.
39 changes: 36 additions & 3 deletions packages/xpath/src/functions/xforms/node-set.ts
@@ -1,3 +1,5 @@
import sha256 from 'crypto-js/sha256';

import type { XPathNode } from '../../adapter/interface/XPathNode.ts';
import type { XPathDOMProvider } from '../../adapter/xpathDOMProvider.ts';
import { LocationPathEvaluation } from '../../evaluations/LocationPathEvaluation.ts';
@@ -384,8 +386,39 @@ export const randomize = new NodeSetFunction(

const nodeResults = Array.from(results.values());
const nodes = nodeResults.map(({ value }) => value);
const seed = seedExpression?.evaluate(context).toNumber();

return seededRandomize(nodes, seed);
if (seedExpression === undefined) return seededRandomize(nodes);
const seed = seedExpression.evaluate(context);
const asNumber = seed.toNumber(); // TODO: There are some peculiarities to address: https://github.com/getodk/web-forms/issues/240
Member

I'm not sure this comment belongs here. It isn't specific to this cast, it's specific to casting to XPath number throughout. Fine to leave since we have an issue tracking it, but we'll probably just find it went stale some time after we address the issue.

Contributor Author

This is intended for someone reading the randomization code when trying to figure out why WF and JR still produce different sort orders. If it goes stale (when the issue is resolved) then following the link to the issue will make that apparent. I don't see a big problem.

let finalSeed: number | bigint | undefined;
if (Number.isNaN(asNumber)) {
// Specific behaviors for when a seed value is not interpretable as numeric.
// We still want to derive a seed in those cases, see https://github.com/getodk/javarosa/issues/800
const seedString = seed.toString();
if (seedString === '') {
finalSeed = 0; // special case: JR behaviour
} else {
// any other string, we'll convert to a number via a digest function
finalSeed = toBigIntHash(seedString);
}
} else {
finalSeed = asNumber;
}
Comment on lines +394 to +407
Member

Doesn't "special case: JR behavior" apply to all of this?

Contributor Author

Not really. Some of the behaviour is in the odk spec. The "zero-length-string becomes 0" behaviour was surprising though.

return seededRandomize(nodes, finalSeed);
}
);
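
The seed handling in the hunk above can be read as a three-way decision. The following is a minimal sketch of that decision only, not the PR's code: `resolveSeed`, `toXPathNumberApprox`, and `toyHash` are hypothetical names, plain strings stand in for XPath evaluation results, and the number cast is a rough approximation of XPath casting (see the issue linked in the TODO).

```typescript
// Hypothetical helper, not part of @getodk/xpath: mirrors the branch structure
// of the hunk above using plain strings instead of XPath evaluation results.
type Seed = number | bigint;

// Rough stand-in for the XPath number cast (assumption: the real cast has
// more edge cases). Unlike Number(''), an XPath cast of '' yields NaN.
const toXPathNumberApprox = (text: string): number => {
  const trimmed = text.trim();
  return trimmed === '' ? NaN : Number(trimmed);
};

const resolveSeed = (seedString: string, hash: (text: string) => bigint): Seed => {
  const asNumber = toXPathNumberApprox(seedString);
  if (!Number.isNaN(asNumber)) return asNumber; // numeric-looking input: use as-is
  if (seedString === '') return 0; // empty string: the JavaRosa special case
  return hash(seedString); // any other string: derive a seed from a digest
};

// Toy hash for demonstration only; the PR uses the first 64 bits of SHA-256.
const toyHash = (text: string): bigint => BigInt(text.length);

console.log(resolveSeed('42', toyHash)); // 42
console.log(resolveSeed('', toyHash)); // 0
console.log(resolveSeed('mirror', toyHash)); // 6n
```

Passing the hash function as a parameter keeps the branching testable on its own, independent of which digest library is used.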

function toBigIntHash(text: string): bigint {
// hash text with sha256, and interpret the first 64 bits of output
// (the first and second int32s ("words") of CryptoJS digest output)
// as a BigInt. Thus the entropy of the hash is reduced to 64 bits, which
// for some applications is sufficient.
// The underlying representations are big-endian regardless of the endianness
// of the machine this runs on, as is the equivalent JavaRosa implementation
// at https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726
Member

Suggested change
function toBigIntHash(text: string): bigint {
// hash text with sha256, and interpret the first 64 bits of output
// (the first and second int32s ("words") of CryptoJS digest output)
// as a BigInt. Thus the entropy of the hash is reduced to 64 bits, which
// for some applications is sufficient.
// The underlying representations are big-endian regardless of the endianness
// of the machine this runs on, as is the equivalent JavaRosa implementation
// at https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726
/**
* Hash text with sha256, and interpret the first 64 bits of output (the first
* and second int32s ("words") of CryptoJS digest output) as a BigInt. Thus the
* entropy of the hash is reduced to 64 bits, which for some applications is
* sufficient. The underlying representations are big-endian regardless of the
* endianness of the machine this runs on, as is the
* {@link https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726 | equivalent JavaRosa implementation}.
*/
const toBigIntHash = (text: string): bigint => {

As a JSDoc comment, this allows the same documentation to be accessed at the call site.

Switching to an arrow function is somewhat of a nit, but it's generally preferable to avoid unnecessary `function` functions, as they have confusing behavior. (Maybe that's also a thing we could lint?)

Contributor Author

ea5c499 removes the function keyword.

As for multiline comments: I don't like them. My editor is not supremely ergonomic with them, especially with the decorative * in front of each line. That, anyway, diminishes the advantages of multiline comments: now one has to prefix each line with * instead of //, PLUS still manage the actual comment start and end markers. How is that a win over just plain simple // line comments, I wonder?
GitHub is also not super smart with them; look at the "keyword" syntax highlighting it applied to the diff just above! So I don't like to use that comment style myself, but if someone else does, they're welcome to ;-)

As for JSDoc links, I don't like them. They move the description of the link to after the link (cf. Markdown). So then to read what the link is doing there, what it's for, I first need to scan to the end of a long URL. The hypothetical usability gain is that if you have an IDE that is smart with specifically JSDoc comments, you can click the link? Copy-pasting isn't so bad and anyway most things — my editor, my terminal — already make http(s)-URLs clickable (or ctrl-clickable). Not worth the disruption of the natural text reading flow to me, but I won't complain if someone else makes {@link https://asdsadsoewofihewofbcewnco.ewewrewrewrewconlnzc.cwefewr.few.cpoqwjeansls | these kind of links}, I just won't emit them myself ;-)

Less is more!

Member

I'll try to make the case for JSDoc. If you're open to reconsidering here, that would be excellent. If not, I think we should come back to this discussion as a team.


JSDoc comments are a standard designed to encode structured documentation about any symbol they're attached to.

The editor support alone goes well beyond linking to URLs. For example, the ability to reference documentation across modules is invaluable. Linking to other symbols (both within and across modules) is also invaluable, both as a navigation tool and because the links can be kept up to date as those symbols change.

Beyond editor support, being a standard for structured documentation and association with symbols, JSDoc can be used for documentation output. We are already using this to generate documentation for @getodk/xforms-engine. I'd quite like to expand that to other use cases.

I share your distaste for some of the syntax minutiae of JSDoc, and @link is particularly weird (I suspect this is because inline tags are relatively rare). But that distaste doesn't outweigh the overwhelming benefits of an extensible documentation standard which is widely adopted in tooling we already use. It is also widely adopted throughout this project, and across the ecosystem; which is to say, it is both locally and globally idiomatic.

I also noticed GitHub's odd presentation in a couple diff suggestions in this PR. It's worth noting that:

  • That's not representative of how GitHub presents JSDoc in complete source
  • It's not representative of how GitHub presents JSDoc in diffs broadly
  • GitHub's syntax highlighting is notoriously inconsistent across various views
  • The highlighting is applied to an incomplete (i.e. syntactically invalid) chunk of code, which likely exacerbates potential issues

Lastly, I am sensitive to poor authoring ergonomics. I'm a bit surprised to hear that your editor doesn't make adding/editing JSDoc comments easier than single line comments, as that's my experience in the editors I'm familiar with. If this is a major hangup, I'd be happy to help look into ways to make the authoring experience nicer for you.
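
To illustrate the symbol-linking point made in this comment, here is a minimal sketch. `MAX_SEED` and `clampSeed` are hypothetical names invented for the example and do not exist in this PR; the `{@link MAX_SEED}` tag is the part editors and documentation generators can resolve and follow.

```typescript
/** Largest seed the (hypothetical) generator accepts. */
const MAX_SEED = 2147483646;

/**
 * Clamps a seed into the range accepted by the generator.
 *
 * @param seed - Any finite number.
 * @returns An integer between 1 and {@link MAX_SEED}, inclusive.
 */
const clampSeed = (seed: number): number =>
  Math.min(Math.max(Math.trunc(seed), 1), MAX_SEED);

console.log(clampSeed(0)); // 1
```

Hovering `clampSeed` at a call site shows the comment, and the link target follows `MAX_SEED` if it is renamed with tooling.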

Contributor Author

I'll make the change hoping that the benefits will become apparent at some point 😆

Contributor Author

Updated!

const buffer = new ArrayBuffer(8);
const dataview = new DataView(buffer);
sha256(text)
.words.slice(0, 2)
.forEach((val, ix) => dataview.setInt32(ix * 4, val));
return dataview.getBigInt64(0);
}
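
For readers who want to reproduce the digest-to-seed step outside the browser bundle, here is a minimal sketch of the same idea using Node's built-in `crypto` module in place of `crypto-js`. That substitution is an assumption made for illustration, not what the PR does. It should yield the same value as the function above, since both read the first 8 bytes of the SHA-256 digest of the UTF-8 text as a big-endian signed 64-bit integer.

```typescript
import { createHash } from 'node:crypto';

// Sketch only: relies on Node's crypto module and Buffer, so it will not run
// in a browser without a polyfill.
const toBigIntHash = (text: string): bigint => {
  // SHA-256 digest of the UTF-8 encoded text, as a 32-byte Buffer.
  const digest = createHash('sha256').update(text, 'utf8').digest();
  // First 8 bytes, read as a big-endian signed 64-bit integer.
  return digest.readBigInt64BE(0);
};

console.log(toBigIntHash('mirror'));
```

Because the read is signed, roughly half of all inputs produce a negative seed; the generator's constructor is responsible for normalizing those.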
40 changes: 31 additions & 9 deletions packages/xpath/src/lib/collections/sort.ts
@@ -15,11 +15,18 @@ class UnseededPseudoRandomNumberGenerator implements PseudoRandomNumberGenerator
}

class SeededPseudoRandomNumberGenerator implements PseudoRandomNumberGenerator {
// Park-Miller PRNG
protected seed: number;

constructor(seed: Int) {
let initialSeed = seed % SEED_MODULO_OPERAND;

constructor(seed: Int | bigint) {
let initialSeed: number;
if (typeof seed === 'bigint') {
// the result of the modulo operation is always smaller than Number.MAX_SAFE_INTEGER,
// thus it's safe to convert to a Number.
initialSeed = Number(BigInt(seed) % BigInt(SEED_MODULO_OPERAND));
} else {
initialSeed = Number(seed) % Number(SEED_MODULO_OPERAND);
}
if (initialSeed <= 0) {
initialSeed += MAX_INT_32 - 1;
}
@@ -38,17 +45,32 @@ class SeededPseudoRandomNumberGenerator implements PseudoRandomNumberGenerator {
}
}
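
The constructor change above reduces a possibly 64-bit seed into the generator's state range before any floating-point arithmetic happens. The following is a minimal, self-contained sketch of a Park-Miller generator with that kind of normalization. The constants are assumptions: the diff does not show the values of `SEED_MODULO_OPERAND` or `MAX_INT_32`, nor the multiplier, so the classic "minimal standard" parameters are used here, and `ParkMillerSketch` is a hypothetical name. Its output is not claimed to match Web Forms or JavaRosa.

```typescript
// Assumed constants (not taken from the diff): the classic "minimal standard"
// Park-Miller parameters.
const MODULUS = 2147483647; // 2^31 - 1, a Mersenne prime
const MULTIPLIER = 16807; // 7^5

class ParkMillerSketch {
  private state: number;

  // Accepts a finite number or a bigint.
  constructor(seed: number | bigint) {
    // Reduce first, so a 64-bit bigint becomes a small, exactly representable
    // number before it is converted.
    let initial =
      typeof seed === 'bigint' ? Number(seed % BigInt(MODULUS)) : Math.trunc(seed) % MODULUS;
    // Move zero and negative remainders into the valid state range.
    if (initial <= 0) initial += MODULUS - 1;
    // A remainder of -(MODULUS - 1) lands on 0, a fixed point of the
    // recurrence; nudge it to a valid state.
    if (initial === 0) initial = 1;
    this.state = initial;
  }

  // Returns a float in [0, 1).
  next(): number {
    // state * MULTIPLIER stays below 2^46, so the product is exact in a double.
    this.state = (this.state * MULTIPLIER) % MODULUS;
    return (this.state - 1) / (MODULUS - 1);
  }
}

const generator = new ParkMillerSketch(BigInt('9223372036854775807'));
console.log(generator.next(), generator.next());
```

Doing the modulo in BigInt arithmetic is what makes the later `Number(...)` conversion lossless, which is the point of the comment in the diff.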

const isInt = (value: number): value is Int => value % 1 === 0;

export const seededRandomize = <T>(values: readonly T[], seed?: number): T[] => {
export const seededRandomize = <T>(values: readonly T[], seed?: number | bigint): T[] => {
let generator: PseudoRandomNumberGenerator;

if (seed == null) {
generator = new UnseededPseudoRandomNumberGenerator();
} else if (!isInt(seed)) {
throw 'todo not an int';
} else {
generator = new SeededPseudoRandomNumberGenerator(seed);
let finalSeed: number | bigint;
// Per issue #49 (https://github.com/getodk/web-forms/issues/49) this is intended to be "bug-or-feature-compatible"
// with JavaRosa's implementation; org.javarosa.core.model.ItemsetBinding.resolveRandomSeed takes the .longValue() of
// the double produced by randomSeedPathExpr.eval() — see https://github.com/getodk/javarosa/blob/6ce13527c/src/main/java/org/javarosa/core/model/ItemsetBinding.java#L311:L317 .
// That results in a 0L when the double is NaN, which happens (for instance) when there
// is a string that does not look like a number (which is a problem in itself, as any non-numeric
// looking string will then result in the same seed of 0 — see https://github.com/getodk/javarosa/issues/800).
// We'll emulate Java's Double -> Long conversion here (for NaN and some other double values)
// so that we produce the same randomization as JR.

// In Java, a NaN double's .longValue is 0
if (Number.isNaN(seed)) finalSeed = 0;
// In Java, an Infinity double's .longValue() is 2**63 -1, which is larger than Number.MAX_SAFE_INTEGER, thus we'll need a BigInt.
else if (seed === Infinity) finalSeed = 2n ** 63n - 1n;
// Analogous with the above conversion, but for -Infinity
else if (seed === -Infinity) finalSeed = -(2n ** 63n);
// A Java double's .longValue drops the fractional part.
else if (typeof seed === 'number' && !Number.isInteger(seed)) finalSeed = Math.trunc(seed);
else finalSeed = seed;
generator = new SeededPseudoRandomNumberGenerator(finalSeed);
}

const { length } = values;
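
The comments in this hunk describe emulating Java's `double` to `long` conversion. Below is a minimal standalone sketch of that conversion, with some stated differences from the diff: `doubleToLong` is a hypothetical helper, it always returns a `bigint` (the diff keeps ordinary numbers where it can), and it also saturates large finite values, which the diff does not handle explicitly.

```typescript
// Assumption: models Java's narrowing conversion from double to long
// (as performed by Double.longValue()), which saturates instead of wrapping.
const LONG_MAX = BigInt('9223372036854775807'); // 2^63 - 1
const LONG_MIN = -LONG_MAX - BigInt(1); // -2^63
const TWO_POW_63 = Math.pow(2, 63); // exactly representable as a double

const doubleToLong = (value: number): bigint => {
  if (Number.isNaN(value)) return BigInt(0); // NaN becomes 0
  if (value >= TWO_POW_63) return LONG_MAX; // includes Infinity
  if (value <= -TWO_POW_63) return LONG_MIN; // includes -Infinity
  return BigInt(Math.trunc(value)); // fractional part is dropped, toward zero
};

console.log(doubleToLong(NaN)); // 0n
console.log(doubleToLong(1.9)); // 1n
console.log(doubleToLong(-1.9)); // -1n
console.log(doubleToLong(Infinity)); // 9223372036854775807n
console.log(doubleToLong(-Infinity)); // -9223372036854775808n
```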
31 changes: 21 additions & 10 deletions packages/xpath/test/xforms/randomize.test.ts
@@ -18,6 +18,9 @@ describe('randomize()', () => {
});

const SELECTOR = '//xhtml:div[@id="FunctionRandomize"]/xhtml:div';
const MIRROR = 'mirror';
const MIRROR_HASH_VALUE = 5989458117437254; // in Python: "from struct import unpack; from hashlib import sha256; unpack('>Q', sha256(b'mirror').digest()[:8])[0]"
const MIRROR_HASH_SORT_ORDER = 'ACBEDF';

describe('shuffles nodesets', () => {
beforeEach(() => {
@@ -44,7 +47,10 @@ describe('randomize()', () => {
<p>3</p>
<p>4</p>
</div>
</body>
<div id="testFunctionNodeset3">
<p>${MIRROR}</p>
</div>
</body>
</html>`,
{ namespaceResolver }
);
@@ -74,8 +80,15 @@ describe('randomize()', () => {
{ seed: 1, expected: 'BFEACD' },
{ seed: 11111111, expected: 'ACDBFE' },
{ seed: 'int(1)', expected: 'BFEACD' },
{ seed: 1.1, expected: 'BFEACD' },
{ seed: 0, expected: 'CBEAFD' },
{ seed: NaN, expected: 'CBEAFD' },
{ seed: Infinity, expected: 'CBEAFD' },
{ seed: -Infinity, expected: 'CFBEAD' },
{ seed: 'floor(1.1)', expected: 'BFEACD' },
{ seed: '//xhtml:div[@id="testFunctionNodeset2"]/xhtml:p', expected: 'BFEACD' },
{ seed: MIRROR_HASH_VALUE, expected: MIRROR_HASH_SORT_ORDER },
{ seed: '//xhtml:div[@id="testFunctionNodeset3"]/xhtml:p', expected: MIRROR_HASH_SORT_ORDER },
].forEach(({ seed, expected }) => {
it(`with a seed: ${seed}`, () => {
const expression = `randomize(${SELECTOR}, ${seed})`;
@@ -88,15 +101,13 @@ describe('randomize()', () => {
});
});

[
{ expression: 'randomize()' },
{ expression: `randomize(${SELECTOR}, 'a')` },
{ expression: `randomize(${SELECTOR}, 1, 2)` },
].forEach(({ expression }) => {
it.fails(`${expression} with invalid args, throws an error`, () => {
testContext.evaluate(expression);
});
});
[{ expression: 'randomize()' }, { expression: `randomize(${SELECTOR}, 1, 2)` }].forEach(
({ expression }) => {
it.fails(`${expression} with invalid argument count, throws an error`, () => {
testContext.evaluate(expression);
});
}
);

it('randomizes nodes', () => {
testContext = createXFormsTestContext(`