
[Nodes] Add Prebatch setting to ParallelMapper #1417

Merged: 7 commits into main on Jan 2, 2025

Conversation

@andrewkho (Contributor) commented on Dec 26, 2024

When ParallelMapper is used for very cheap operations, the overhead of sending items over queues can quickly add up. This is a nice parameter to be able to tune.

Fixes #1415

A few notes about the implementation:

  • I chose to compose three nodes (Batcher, ParallelMapper, Unbatcher) into one to implement this; a rough sketch of the composition follows this list. This is the first time we're composing BaseNodes with other BaseNodes, so we'll need to figure out graph traversal for these composed nodes (see footnote).
  • This required _ParallelMapperIter to implement BaseNode. However, getting reset to work correctly there is a bigger problem, so for now I created an intermediate class with essentially the current implementation of ParallelMapper, which lets us use torchdata.nodes composition to get things working easily.
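
Conceptually, the composition looks something like the sketch below, written against the public torchdata.nodes API (the actual implementation wires up an internal _ParallelMapperImpl instead, and prebatched_mapper is just an illustrative helper name; exact Batcher/Unbatcher signatures may differ slightly):

import torchdata.nodes as tn


def prebatched_mapper(source, map_fn, num_workers, prebatch):
    # Group `prebatch` items per queue transfer to amortize per-item overhead.
    node = tn.Batcher(source, batch_size=prebatch, drop_last=False)
    # Workers apply map_fn to every item of a batch, so each queue hop
    # carries `prebatch` results instead of one.
    node = tn.ParallelMapper(
        node,
        map_fn=lambda batch: [map_fn(x) for x in batch],
        num_workers=num_workers,
        method="thread",
    )
    # Flatten the mapped batches back into a stream of single items.
    return tn.Unbatcher(node)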

Test Plan:

  • Unit tests
  • Ran a simple script to test this, output:
python examples/nodes/test_prebatch.py
[9999400009, 9999600004, 9999800001]
baseline: dt=3.0651697060093284s
[9999400009, 9999600004, 9999800001]
prebatch=16: dt=0.454918147996068s
[9999400009, 9999600004, 9999800001]
prebatch=256: dt=0.13740589004009962s
[9999400009, 9999600004, 9999800001]
prebatch=1024: dt=0.22711888700723648s
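
For this workload, prebatch=256 is roughly a 22x speedup over the baseline, while going up to 1024 regresses slightly, so the value is worth tuning per workload.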

test script:

import time
import torchdata.nodes as tn


def run(prebatch):
    # Square 100k ints through an 8-thread ParallelMapper, optionally prebatching.
    node = tn.IterableWrapper(range(100000))
    node = tn.ParallelMapper(node, map_fn=lambda x: x**2, prebatch=prebatch, method="thread", num_workers=8)
    loader = tn.Loader(node)
    x = list(loader)
    print(x[-3:])


if __name__ == "__main__":
    t0 = time.perf_counter()
    run(None)
    dt = time.perf_counter() - t0
    print(f"baseline: {dt=}s")

    for prebatch in (16, 256, 1024):
        t0 = time.perf_counter()
        run(prebatch)
        dt = time.perf_counter() - t0
        print(f"{prebatch=}: {dt=}s")

Footnote: Example of where this is a problem: in the ParallelMapper case here, traversing the DAG with reflection (e.g. scanning instance.__dict__ for BaseNode instances) would find two references to the source, since self.source points to it and self._it would eventually point to it as well. One way we could handle this is with an optional "get_source"/"get_parent" method on BaseNode, which returns the instance where graph traversal should begin; in this case it would return self._it, not self.source.
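
A minimal sketch of that idea (get_parent is the hypothetical method named in the footnote; it is not part of this PR):

from typing import Optional


class BaseNode:
    def get_parent(self) -> Optional["BaseNode"]:
        # Default: graph traversal continues from self.source, if present.
        return getattr(self, "source", None)


class ParallelMapper(BaseNode):
    def get_parent(self) -> Optional["BaseNode"]:
        # Start traversal at the composed internal pipeline (self._it) rather
        # than self.source, so the source isn't reached along two paths.
        return self._it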


pytorch-bot bot commented Dec 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1417

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a99917 with merge base 88c7b96:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -272,6 +281,77 @@ def _shutdown(self):
t.join(timeout=QUEUE_TIMEOUT * 5)


class _ParallelMapperImpl(BaseNode[T]):
"""This class implements _ParallelMapperIter as a BaseNode, allowing it
Contributor:

Nit: This class implements _ParallelMapperIter and _InlineMapperIter as a BaseNode, ....

@andrewkho (Author) replied:

good catch!

Base automatically changed from andrewkh/unbatcher to main December 30, 2024 17:11
@andrewkho merged commit 0d2b0a0 into main on Jan 2, 2025
39 checks passed