[Nodes] Add Prebatch setting to ParallelMapper #1417
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1417
Note: Links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 6a99917 with merge base 88c7b96. This comment was automatically generated by Dr. CI and updates every 15 minutes.
torchdata/nodes/map.py (Outdated)

@@ -272,6 +281,77 @@ def _shutdown(self):
        t.join(timeout=QUEUE_TIMEOUT * 5)

class _ParallelMapperImpl(BaseNode[T]):
    """This class implements _ParallelMapperIter as a BaseNode, allowing it
Nit: This class implements _ParallelMapperIter and _InlineMapperIter as a BaseNode, ....
good catch!
When ParallelMapper is used for very cheap operations, the overhead of sending items over queues can quickly add up. This is a nice parameter to be able to tune.
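To make the overhead argument concrete, here is a minimal plain-Python sketch (not the actual torchdata.nodes implementation) of how sending items to a worker in batches amortizes the per-item cost of crossing a queue; the `parallel_map` helper and its `prebatch` parameter are hypothetical stand-ins for illustration:

```python
import queue
import threading


def worker(in_q, out_q, fn):
    # Pull batches (lists of items) off the input queue and map fn over
    # each item; a None sentinel signals shutdown.
    while True:
        batch = in_q.get()
        if batch is None:
            break
        out_q.put([fn(x) for x in batch])


def parallel_map(items, fn, prebatch=1):
    # With prebatch=N, each queue transfer carries N items instead of one,
    # so the queue is crossed len(items)/N times rather than len(items) times.
    in_q, out_q = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(in_q, out_q, fn))
    t.start()
    n_batches = 0
    for i in range(0, len(items), prebatch):
        in_q.put(items[i : i + prebatch])
        n_batches += 1
    in_q.put(None)
    results = []
    for _ in range(n_batches):
        results.extend(out_q.get())
    t.join()
    return results


print(parallel_map(list(range(6)), lambda x: x + 1, prebatch=3))
# → [1, 2, 3, 4, 5, 6], using two queue transfers instead of six
```

For a cheap `fn`, the queue put/get (and, with multiprocessing, serialization) dominates per-item cost, which is exactly the overhead a prebatch setting lets you tune away.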
Fixes #1415
A few notes about the implementation:
Ideally _ParallelMapperIter would implement BaseNode directly; however, getting reset to work correctly there is going to be a bigger problem. So for now, this PR creates an intermediate class with basically the current implementation of ParallelMapper, which lets us use torchdata.nodes composition to get things working easily.

Test Plan:
test script:
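A plain-Python sketch of the composition idea from the notes above (this is an illustration, not the PR's test script; the Batcher and Unbatcher classes here are hypothetical stand-ins, not the actual torchdata.nodes API): prebatching can be expressed as batch → map over batches → unbatch.

```python
class Batcher:
    """Group items from `source` into lists of size `batch_size`."""

    def __init__(self, source, batch_size):
        self.source, self.batch_size = source, batch_size

    def __iter__(self):
        batch = []
        for item in self.source:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch  # emit the final partial batch


class Unbatcher:
    """Flatten batches back into a stream of individual items."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        for batch in self.source:
            yield from batch


def prebatched_map(source, fn, prebatch):
    # Map fn over whole batches, so each hop between stages carries
    # `prebatch` items at a time instead of one.
    batched = Batcher(source, prebatch)
    mapped = ([fn(x) for x in batch] for batch in batched)
    return Unbatcher(mapped)


print(list(prebatched_map(range(5), lambda x: x * 2, prebatch=2)))
# → [0, 2, 4, 6, 8]
```

The appeal of this shape is that the prebatch behavior falls out of composing existing pieces rather than threading a new code path through the mapper internals.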
Footnote: An example of where this is a problem: in the ParallelMapper case here, traversing the DAG with reflection (e.g. scanning instance.__dict__ and checking for BaseNode instances) would generate two sinks for the source, since self.source points to it, and self._it would eventually point to it as well. One way we could handle this is with an optional "get_source/get_parent" method on BaseNode, which returns the instance where graph traversal should begin; in this case it would return self._it, not self.source.
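The footnote's traversal issue can be sketched as follows (a minimal illustration; the get_parent hook and these toy node classes are hypothetical, not the actual torchdata.nodes API). Naive reflection over __dict__ sees the source twice, while an explicit get_parent hook yields each node once:

```python
class BaseNode:
    def get_parent(self):
        # Default: traversal continues from self.source if present.
        return getattr(self, "source", None)


class Source(BaseNode):
    pass


class ParallelMapper(BaseNode):
    def __init__(self, source):
        self.source = source
        self._it = source  # internal iterator chain also reaches the source

    def get_parent(self):
        # Direct traversal through the internal chain, avoiding the
        # duplicate edge via self.source.
        return self._it


def count_refs_to(node, target):
    # Naive reflection: scans __dict__ and finds BOTH self.source and
    # self._it pointing at the same source, i.e. two sinks.
    return sum(1 for v in vars(node).values() if v is target)


def walk(node):
    # Traversal using the explicit hook visits each node exactly once.
    chain = []
    while node is not None:
        chain.append(type(node).__name__)
        node = node.get_parent()
    return chain


src = Source()
pm = ParallelMapper(src)
print(count_refs_to(pm, src))  # → 2 (the duplicate-sink problem)
print(walk(pm))                # → ['ParallelMapper', 'Source']
```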