Atomate v2: high-level unit testing strategy #289
Comments
I am fully on board with this, strategy and examples both. Doing integration tests with real VASP is also the only way we'll spot issues due to compilation errors or regressions in future VASP versions. I think having real VASP integration tests is a necessary requirement to ensure the quality of MP data in the future. The only question for me is how we test builders or ToDb tasks. But I think the builders are perhaps a separate discussion.
I would say that there are three levels. Tier 1 is as defined by Anubhav. Tier 2 should be “fake” VASP. The idea here is to ensure that the logic of the code is correct and would always pass given a constant VASP version. Tier 3 is testing with actual VASP to ensure that VASP v20000 gives the same results with the same parameters as VASP v5.x.
I would exclude Tier 3 from CI testing.
I think fake VASP adds a large engineering burden for minimal reward, and the problems it finds would mostly be caught by actual VASP integration tests regardless. The important thing, I think, is to write the integration tests in a way that is sensible and robust to unimportant changes in output.
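One way to read "robust to unimportant changes in output" is to compare a handful of key quantities within tolerances instead of diffing whole output files. The following is only a sketch under that assumption; the function name `compare_results`, the keys `energy_per_atom` and `bandgap`, and the tolerance values are all hypothetical and not taken from atomate.

```python
import math

# Hypothetical per-quantity absolute tolerances; keys and values are illustrative only.
TOLERANCES = {"energy_per_atom": 1e-3, "bandgap": 1e-2}


def compare_results(reference, computed, tolerances=TOLERANCES):
    """Return a list of human-readable mismatches; an empty list means a pass.

    Only keys listed in `tolerances` are compared, so extra fields in
    `computed` (timings, version strings, ...) are ignored.
    """
    problems = []
    for key, tol in tolerances.items():
        if key not in computed:
            problems.append(f"{key}: missing from computed results")
        elif not math.isclose(reference[key], computed[key], abs_tol=tol, rel_tol=0.0):
            problems.append(
                f"{key}: expected {reference[key]}, got {computed[key]} (tol {tol})"
            )
    return problems
```

A test would then assert that `compare_results(reference, computed)` is an empty list, and the returned messages say which quantity drifted when it is not.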
@computron Totally agree.

@mkhorton I think testing the API on a set of example files, or on an example database built as part of the test suite, is reasonable for builders. ToDb tasks could be tested by returning a dict if no DB file is specified. We could just check for key properties of interest or properties that are guaranteed to be pulled in.

@shyuep Yes, in the discussions at the weekly meetings we decided that your Tier 3 basically cannot be run on CI, because of the VASP install and the time required to run all workflows, even for simple cases, within the time window of the CI system (e.g. 50 mins on Travis). We have been discussing that your Tier 2 is what we have now. The tests are hard to write and easy to break, and the idea is that they don't actually test that much (except for things being wired up correctly, which probably won't break). Better coverage of the low-level stuff should make your Tier 2 tests redundant, because your Tier 3 tests make sure everything is wired up correctly.

pytest

Can we switch to pytest for all new tests? It's so much more convenient for writing tests, and it's backwards compatible with unittest, so we don't have to change any old code (though for maintainability's sake it might be nice to transition whenever code is refactored). A key benefit of pytest is that we can include the integration tests in this repo and mark them as integration tests. Running the tests for real, with all the integration tests, would then use the same script as regular CI, except that we pass a flag to pytest to run the integration tests (easier to maintain).
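The "mark them and pass a flag" idea could look roughly like the `conftest.py` below. This is a sketch, not an agreed design: the marker name `integration` and the flag `--run-integration` are assumptions made for illustration, while `pytest_addoption`, `pytest_configure` and `pytest_collection_modifyitems` are standard pytest hooks.

```python
# conftest.py (sketch; the marker and flag names are hypothetical)
import pytest


def pytest_addoption(parser):
    # Adds the opt-in flag used on machines that can actually run VASP.
    parser.addoption(
        "--run-integration",
        action="store_true",
        default=False,
        help="also run tests marked as integration tests",
    )


def pytest_configure(config):
    # Registers the marker so that `pytest --strict-markers` accepts it.
    config.addinivalue_line("markers", "integration: test that runs real VASP")


def pytest_collection_modifyitems(config, items):
    # Without the flag, every test carrying the marker is skipped.
    if config.getoption("--run-integration"):
        return
    skip = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

A test would opt in with `@pytest.mark.integration`. Regular CI runs plain `pytest`, and the VASP-capable machine runs `pytest --run-integration` with the same script otherwise.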
nose is deprecated anyway, right?
Yes, it is not supported anymore.
@mkhorton I disagree that "fake" VASP adds engineering burden. I would say that a basic integration test would be running a full Firework, but with "fake" output files being put into the directory. Given that the Tier 1 tests do not test integration, the fake VASP tests minimally ensure that if you have a FW that does an OptimizeTask, moves some files, and then runs a StaticTask, all three components still work when they are combined. I strongly disagree with the notion that because FWs are "trivially" stringing FireTasks together, this trivial code does not need to be tested. Far too often, we assume something is trivial (e.g., the existence of a certain output file, or that files are moving to the correct location) when it is not. The fact that LogicA and LogicB are sound does not mean that LogicA+B is sound. There is of course no need to go overboard with a massively overengineered solution involving different randomized files or output.
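The "LogicA and LogicB are sound, but is LogicA+B?" point can be illustrated with a toy chain. In this sketch plain functions stand in for the optimize, move-files and static steps; none of the names are atomate or FireWorks APIs, and the "optimization" only copies canned files, as a fake VASP would.

```python
import shutil
import tempfile
from pathlib import Path


def fake_optimize(run_dir, canned_dir):
    """Stand-in for an optimization step: copy canned output files into run_dir."""
    for f in Path(canned_dir).iterdir():
        shutil.copy(f, run_dir)


def pass_structure(src_dir, dst_dir):
    """Stand-in for a file-moving step: the relaxed CONTCAR becomes the next POSCAR."""
    shutil.copy(Path(src_dir) / "CONTCAR", Path(dst_dir) / "POSCAR")


def fake_static(run_dir):
    """Stand-in for a static step: it only requires that POSCAR is present."""
    poscar = Path(run_dir) / "POSCAR"
    if not poscar.exists():
        raise FileNotFoundError(f"static step found no POSCAR in {run_dir}")
    return poscar.read_text()


def run_chain(canned_dir):
    """Run optimize -> move files -> static in fresh directories."""
    with tempfile.TemporaryDirectory() as tmp:
        opt_dir, static_dir = Path(tmp) / "opt", Path(tmp) / "static"
        opt_dir.mkdir()
        static_dir.mkdir()
        fake_optimize(opt_dir, canned_dir)
        pass_structure(opt_dir, static_dir)
        return fake_static(static_dir)
```

Each function is easy to unit test alone, yet `run_chain` fails as soon as the canned output lacks a CONTCAR, which is exactly the kind of joining assumption a connection test is meant to catch.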
@shyuep We initially had the "intermediate" tier of fake VASP tests for all the reasons you mention here. I agree completely that testing connections is important and often the most important test for these workflows (note that these tests are generally most useful for testing connections between Fireworks, rather than between Firetasks). It is also nice to have connection tests that are independent of VASP (unlike the full running VASP integration test). Finally, another reason is that such a tier can at least run on CircleCI, which makes the tests more automatic, whereas actually running VASP needs some kind of separate testing framework. Several people also did work to try to make writing such tests easier, like adding a "use_fake_vasp" powerup and creating a new UnitTest base class to inherit from for these fake tests.

Yet, in the past couple of years, it's become clear that it's just not working out in practice. Whether or not the idea of fake VASP tests is conceptually simple to you, people struggle with it and delay writing them. It's just another set of things for them to learn: not only how to write a "real" VASP workflow (which they can often barely do), but afterward also how to write a proper "fake" VASP workflow (which they usually keep delaying, and which requires a lot more code and file commits). People are just not good at writing this code. In contrast, I think people will be open to having a full integration test for all workflows, since the usage and point of that is clear to everyone.

Here is also a specific example of what I think will happen if we keep around "fake" VASP tests and also add the "real" VASP test as requirements:
Essentially, I agree with @mkhorton that people consider "fake VASP" tests to be burdensome and it has really slowed down people's development. Some kind of integration test is definitely essential, however. I think the unit test + "real VASP" strategy is the best way forward, which will allow us to go faster. If we really get hit by a lot of cases where having the fake VASP test would have avoided a problem, then we need to either add those tests back, or we need to see what's going on in our "real VASP" strategy that is not working to capture those cases.
@computron RealVASP is no easier to write than FakeVASP. And I would say running RealVASP violates a critical principle of testing, which is that developers are responsible for the code they manage, not code written by someone else. RealVASP tests not just atomate's logic, but also VASP's coding, which is entirely up to the powers that be at VASP. It would also mean that if you run a test using VASP 5.1 and I run the same test using VASP 5.2, there is a real possibility that a test would succeed for you and fail for me. That essentially means that atomate is no longer dependent just on Python and PMG versions (and FireWorks and other dependencies), but also on the VASP version.

My suggestion would be that you insist on someone giving:

Ultimately, testing is about reading in the output files (whether they come from fake or real VASP) and checking that the outputs make sense (with assert statements). If the so-called REAL VASP tests amount to no more than running the workflow and checking that it has completed, that is not testing.

Maybe my simpler question to you is: what is it about real VASP that you think would make it easier to write a test for than fake VASP? A person who thinks FakeVASP is difficult is never going to find real VASP tests more palatable, especially if your real VASP tests are only going to be run once a day (or a week) and the developer has to wait for real VASP to complete (which minimally takes hours) before knowing whether the test has passed.

That said, I do think FakeVASP can be made a lot easier if it is integrated better. There is no reason to use a powerup. You can simply provide a command line script named "vasp" that simulates vasp-like behavior. That way, all the fireworks remain the same.
Let me perhaps sketch out why I think FakeVASP is perfectly doable and readily substitutable with RealVASP. Let's start by defining a helper testing class:

    import os
    import unittest


    class AtomateWFTest(unittest.TestCase):
        # Subclasses point this at a directory of sample VASP output.
        VASP_OUTPUT_DIR = None

        @classmethod
        def setUpClass(cls):
            if not os.environ.get("USE_REAL_VASP") and cls.VASP_OUTPUT_DIR is not None:
                os.environ["VASP_OUTPUT"] = cls.VASP_OUTPUT_DIR
                # Alternatively, this can simply be set as part of CircleCI.
                os.environ["PATH"] = "/path/to/fake/vasp/:" + os.environ["PATH"]

An actual WF test will look something like this:

    class WFTest(AtomateWFTest):
        VASP_OUTPUT_DIR = "example_output"

        def test_run_workflow(self):
            # <run the workflow, where somewhere a vasp call will be run>
            # self.assertTrue(<some checks of the output>)
            ...

You then have a script named vasp (the fake executable, placed in the directory added to PATH above):

    #!/bin/bash
    # Abort if VASP_OUTPUT is unset; otherwise the glob would expand to /*.
    cp -r "${VASP_OUTPUT:?VASP_OUTPUT is not set}"/* .

In this way, developers just need to concern themselves with defining the sample VASP output location and the specific things the workflow needs to test for. When you want to do the REAL VASP tests, you simply set USE_REAL_VASP to True in the environment and have the vasp executable somewhere. The above will only work with serial testing of course, but as far as I know, atomate does not do parallel testing.
Hmm, I may be coming around to the idea of fake VASP. I think perhaps my issue is more where the fake VASP tests are in the repo/CI, and how it's integrated. However, to respond to this point:
I think it's essential we test real VASP. In general, I completely agree that we do not test code we don't manage ourselves. However, for this specific use case, I think if we don't have integration tests that involve actually running VASP, then atomate CI can give us a false sense of security. If we're running 100,000 calculations for Materials Project but something has changed in VASP and these are bad calculations, we need to know that and be able to detect it. Pragmatically, I think this just has to be a part of how we approach testing. I do think that maybe there's a way we can incorporate fake and real VASP in a similar integration testing framework however, so that we can write a single set of tests. And this could live somewhere outside the core atomate repo. I will take some time to read the comments above and think some more before responding further.
@mkhorton Just a note that I am completely on board for REAL vasp tests. Given the use of atomate as the backbone for MP, it is, of course, critical that when MP updates VASP, the results do not change. I am only objecting to it being the only integration test. Also, if tests pass on CI but fail on real vasp test machine, we immediately know that real vasp changes are likely the issue.
Isn't what @shyuep proposed basically how it works now? Except that fake VASP is currently a powerup, versus an environment variable here. I see some validity in the idea of a fake VASP, in that you want to see that the workflow graphs are executed as expected, but I don't think there's a maintainable way to check "results" in the way suggested and the way we are doing it now. @mkhorton I agree. @shyuep mentioned
As we discussed during the atomate meeting a week or two ago: we actually do want to test these things. The integration tests should prove and give confidence that atomate gives you scientifically correct results consistently over time. Ultimately, things that break in the integration tests might actually mean that a fix is needed for a dependency, but it's within scope of atomate to promise that the results will be stable across dependency versions. The regularly run unit tests should give faith that everything runs as intended.
I like the idea of keeping fake vasp, but I'd like to lighten the contract relative to what we do now. We have "inputs" and "outputs" for every vasp firework that's run, which seems unnecessary. It's not particularly painful to generate it, but it can be to organize all of it. It would be great if there were a way to automate generation/organization of test data in the prescribed directory structure.
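Automating the organization step might be as small as a helper that files a finished run into the reference layout. The sketch below assumes a layout of `<ref_root>/<name>/inputs` and `<ref_root>/<name>/outputs`; that layout, the function name `collect_reference_data` and the list of input file names are assumptions for illustration, not the prescribed structure itself.

```python
import shutil
from pathlib import Path

# Illustrative list of file names treated as inputs; adjust per workflow.
INPUT_FILES = ("INCAR", "KPOINTS", "POSCAR")


def collect_reference_data(run_dir, ref_root, name, input_files=INPUT_FILES):
    """Copy a finished run into ref_root/name/inputs and ref_root/name/outputs.

    Files named in `input_files` go to 'inputs'; every other regular file
    goes to 'outputs'. Returns the copied file names for each subdirectory.
    """
    run_dir, target = Path(run_dir), Path(ref_root) / name
    copied = {"inputs": [], "outputs": []}
    for sub in copied:
        (target / sub).mkdir(parents=True, exist_ok=True)
    for path in sorted(run_dir.iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        sub = "inputs" if path.name in input_files else "outputs"
        shutil.copy(path, target / sub / path.name)
        copied[sub].append(path.name)
    return copied
```

Run once after a real calculation, this would give every Firework's test data the same shape without anyone arranging files by hand.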
@bocklund My final note on this subject: the whole "we hate fake VASP tests" position is premised on the false idea that real VASP tests are somehow "easier" for the developer than fake VASP tests. All I am pointing out is that (a) they are not easier - if you have to write real VASP tests, those can easily be used in fake VASP environments, and (b) fake VASP runs a lot faster than real VASP, and as a developer who is just tweaking code, I would like an answer on whether I broke anything to come in O(minutes) rather than O(hours/days/weeks).

Also, let me paint the picture of what is likely to happen. Some group of people is going to spend an inordinate amount of time setting up the REAL VASP test machine (since it can't be done on CircleCI). I would say weeks to get it working as you want it to, and given MP's list of priorities, we will see how that goes. They compile VASP v5.4.current on that fancy test machine. All tests run on VASP v5.4.current and they go on their merry way. Five years down the road, every single person is busy coding new Science-paper feature X and has forgotten all about updating that v5.4.current, even though the real world has moved to VASP v10.0.

So while in an ideal nirvana I am on board with the idea that if atomate is tested on v5.4, the workflows are guaranteed to give a certain result with v5.4, i.e., atomate essentially pegs itself to a certain VASP version like all other dependencies, I am not convinced that we are attacking a real problem that urgently needs a solution, given the amount of effort that will be required to make it happen. I would think something more obvious, like me moving from PSP PBE52 to PBE54, is more likely to cause issues. There are literally a gazillion things we can peg dependencies to, and VASP version is very low on that list in my mind.
Hi all
Overall here are my major conclusions:
So I would lean towards going with what @shyuep said. We will essentially stick to what we have today, with maybe some minor modifications and cleanups. But the big change will be much better documentation on how to write atomate tests. Thoughts?
For atomate v2, we all agreed that we want to ditch the current strategy of having "fake" VASP runs in a sort of pseudo-integration test. The tests are hard to write, hard to debug, and cause bloating in the repo.
Moving forward, we propose a two-tier testing strategy:
Tier 1: Atomate contains unit tests for anything that requires a unit test. This means that any new code or functions generally require a unit test, but simply connecting existing functions together does not.
Example 1: You write a custom FireTask that copies VASP files according to some rules. This requires a unit test since it is new functionality.
Example 2: You write a FireTask that simply loads an existing VaspInputSet and writes VASP input files. This does not require a unit test since it is just calling existing functions and classes that are already tested elsewhere, e.g. in pymatgen.
Example 3: You write a Firework that just connects together a bunch of Firetasks. This does not require a unit test since the Firework itself is just trivial code based on other code that is tested (the Firetasks).
Example 4: You write a powerup that modifies some aspect of a workflow. This requires a unit test to make sure that the desired modification occurred after running the powerup.
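For Example 1, a Tier 1 unit test might look like the sketch below. The function `copy_selected_files` is a hypothetical stand-in for the copying logic of such a FireTask, not an atomate function; the test checks the two rules it implements (rename on copy, and error on a missing file) in a temporary directory.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


def copy_selected_files(src_dir, dst_dir, names, rename=None):
    """Copy the files in `names` from src_dir to dst_dir, renaming via `rename`.

    `rename` maps source names to destination names, e.g. {"CONTCAR": "POSCAR"}.
    A missing source file raises FileNotFoundError instead of being skipped.
    """
    rename = rename or {}
    for name in names:
        src = Path(src_dir) / name
        if not src.is_file():
            raise FileNotFoundError(src)
        shutil.copy(src, Path(dst_dir) / rename.get(name, name))


class CopySelectedFilesTest(unittest.TestCase):
    def setUp(self):
        # Fresh source and destination directories for every test.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.src = Path(self._tmp.name) / "src"
        self.dst = Path(self._tmp.name) / "dst"
        self.src.mkdir()
        self.dst.mkdir()
        (self.src / "CONTCAR").write_text("relaxed")
        (self.src / "WAVECAR").write_text("large file")

    def test_copies_and_renames_only_requested_files(self):
        copy_selected_files(self.src, self.dst, ["CONTCAR"], rename={"CONTCAR": "POSCAR"})
        self.assertEqual([p.name for p in self.dst.iterdir()], ["POSCAR"])
        self.assertEqual((self.dst / "POSCAR").read_text(), "relaxed")

    def test_missing_file_is_an_error(self):
        with self.assertRaises(FileNotFoundError):
            copy_selected_files(self.src, self.dst, ["CHGCAR"])
```

No VASP, database or FireWorks launchpad is involved, so a test of this kind runs in milliseconds on any CI system.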
Tier 2: Each workflow in atomate must have an integration test that runs "real" VASP. These tests would need to be run periodically, on one or more machines that are capable of running VASP. Some kind of system would need to be in place to make sure these tests get run. A side benefit of these tests is that they will provide a real working example of how to run a workflow of that type.
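A Tier 2 test has to stay in the test suite without failing on machines that have no VASP. One possible shape, sketched here with the standard library only, is to skip the whole test class unless an executable is configured. The environment variable `ATOMATE_TEST_VASP_CMD` is a hypothetical name, and the test body is a placeholder because building and running a workflow is project-specific.

```python
import os
import shutil
import unittest

# Hypothetical environment variable naming the VASP executable on the test machine.
# It is expected to hold a bare executable name or path, not a full mpirun command.
VASP_CMD = os.environ.get("ATOMATE_TEST_VASP_CMD", "")


def real_vasp_available(cmd=VASP_CMD):
    """True only when cmd is non-empty and resolves to an executable."""
    return bool(cmd) and shutil.which(cmd) is not None


@unittest.skipUnless(real_vasp_available(), "real VASP is not configured on this machine")
class RealVaspWorkflowTest(unittest.TestCase):
    def test_workflow_runs_and_results_are_sane(self):
        # Placeholder: build the workflow, run it with the real executable,
        # then assert on key outputs rather than only on completion.
        self.skipTest("workflow construction is project-specific and omitted here")
```

On a developer laptop the class is reported as skipped; on the periodic VASP-capable machine the same file runs for real once the variable is set.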
So essentially:
Comments on general strategy?