Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RepairManager::add_path_to_goal_WMEs sometimes fails to ground dangling symbols #520

Open
analog-cbarber opened this issue Nov 12, 2024 · 8 comments

Comments

@analog-cbarber
Copy link

I don't have an small reproducible example I can post, but I am getting a segfault when I turn on chunking using 9.6.3. on MacOS M1:

I get a stack dump like:

Thread 4 Crashed:
0   libSoar.dylib                 	       0x10a3c3798 Repair_Manager::add_path_to_goal_WMEs(chunk_element_struct*, unsigned long long) + 48
1   libSoar.dylib                 	       0x10a3c3f9c Repair_Manager::repair_rule(condition_struct*&, std::__1::list<chunk_element_struct*, soar_module::soar_memory_pool_allocator<chunk_element_struct*>>*) + 368
2   libSoar.dylib                 	       0x10a3bac80 Explanation_Based_Chunker::reorder_and_validate_chunk() + 608
3   libSoar.dylib                 	       0x10a3b8ed8 Explanation_Based_Chunker::learn_rule_from_instance(instantiation_struct*, instantiation_struct**) + 1408
4   libSoar.dylib                 	       0x10a3acc54 create_instantiation(agent_struct*, production_struct*, token_struct*, wme_struct*) + 2680
5   libSoar.dylib                 	       0x10a38bd84 do_preference_phase(agent_struct*) + 1136
6   libSoar.dylib                 	       0x10a42acbc do_one_top_level_phase(agent_struct*) + 1688
7   libSoar.dylib                 	       0x10a42c6d8 run_for_n_modifications_of_output(agent_struct*, long long) + 160
8   libSoar.dylib                 	       0x10a33ba1c sml::AgentSML::Step(sml::smlRunStepSize) + 804
9   libSoar.dylib                 	       0x10a347d34 sml::RunScheduler::RunScheduledAgents(bool, sml::smlRunStepSize, unsigned long long, sml::smlRunFlags, sml::smlRunStepSize, bool) + 2144
10  libSoar.dylib                 	       0x10a4abde8 cli::CommandLineInterface::DoRun(std::__1::bitset<10ul> const&, int, cli::eRunInterleaveMode) + 536
11  libSoar.dylib                 	       0x10a4e1d54 cli::RunCommand::Parse(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>&) + 656
12  libSoar.dylib                 	       0x10a4dadf0 soar::tokenizer::parse_command() + 836
13  libSoar.dylib                 	       0x10a4bc8dc cli::CommandLineInterface::Source(char const*, bool) + 116
14  libSoar.dylib                 	       0x10a4d7e44 cli::CommandLineInterface::DoCommand(sml::Connection*, sml::AgentSML*, char const*, bool, bool, soarxml::ElementXML*) + 496
15  libSoar.dylib                 	       0x10a33d740 sml::KernelSML::ProcessCommand(char const*, sml::Connection*, sml::AnalyzeXML*, soarxml::ElementXML*) + 696
16  libSoar.dylib                 	       0x10a3374ac sml::KernelSML::ProcessIncomingSML(sml::Connection*, soarxml::ElementXML*) + 144
17  libSoar.dylib                 	       0x10a373528 sml::Connection::InvokeCallbacks(soarxml::ElementXML*) + 232
18  libSoar.dylib                 	       0x10a3754a8 sml::EmbeddedConnectionAsynch::ReceiveMessages(bool) + 84
19  libSoar.dylib                 	       0x10a341fdc sml::ConnectionManager::ReceiveAllMessages() + 232
20  libSoar.dylib                 	       0x10a34cb58 sml::ReceiverThread::Run() + 104
21  libSoar.dylib                 	       0x10a3795c8 ThreadStartFunction(void*) + 28
22  libSoar.dylib                 	       0x10a379594 LinuxThreadFunc(void*) + 24
23  libsystem_pthread.dylib       	       0x18301a034 _pthread_start + 136
24  libsystem_pthread.dylib       	       0x183014e3c thread_start + 8

Thread 4 crashed with ARM Thread State (64-bit):
    x0: 0x0000000000000000   x1: 0x0000600003e00c30   x2: 0x0000000000000001   x3: 0x0000600003e00ab0
    x4: 0x0000000000000005   x5: 0x0000000000000a80   x6: 0x000000000000003e   x7: 0x0000600003e00d50
    x8: 0x73e21f51cb4e0035   x9: 0x73e21f51cb4e0035  x10: 0x000000013d878c30  x11: 0x000000009668e839
   x12: 0x00000000000007fb  x13: 0x00000000000007fd  x14: 0x000000009688f042  x15: 0x0000000000000042
   x16: 0x000000009668e839  x17: 0x000000000008f000  x18: 0x0000000000000000  x19: 0x00000000000005a9
   x20: 0x000000014d60efb0  x21: 0x0000600002574f40  x22: 0x00000000000005a9  x23: 0x000000014d616d58
   x24: 0x0000600003e009c0  x25: 0x0000000000000003  x26: 0x0000000000000172  x27: 0x0000600002574f50
   x28: 0x000000013d069bf0   fp: 0x000000016b9d6410   lr: 0x000000010a3c3798
    sp: 0x000000016b9d63c0   pc: 0x000000010a3c3798 cpsr: 0x60001000
   far: 0x0000000000000008  esr: 0x92000006 (Data Abort) byte read Translation fault

The disassembled add_path_to_goal_WMEs function starts with:

000000000017b768 <__ZN14Repair_Manager21add_path_to_goal_WMEsEP20chunk_element_structy>:
  17b768: a9ba6ffc     	stp	x28, x27, [sp, #-96]!
  17b76c: a90167fa     	stp	x26, x25, [sp, #16]
  17b770: a9025ff8     	stp	x24, x23, [sp, #32]
  17b774: a90357f6     	stp	x22, x21, [sp, #48]
  17b778: a9044ff4     	stp	x20, x19, [sp, #64]
  17b77c: a9057bfd     	stp	x29, x30, [sp, #80]
  17b780: 910143fd     	add	x29, sp, #80
  17b784: aa0203f3     	mov	x19, x2
  17b788: aa0103f4     	mov	x20, x1
  17b78c: aa0003f5     	mov	x21, x0
  17b790: f9400421     	ldr	x1, [x1, #8]
  17b794: 97fffa96     	bl	0x17a1ec <__ZN14Repair_Manager28find_path_to_goal_for_symbolEP13symbol_struct>
  17b798: f9400418     	ldr	x24, [x0, #8].  <<< HERE IS WHERE IT SEGFAULTS
  17b79c: eb00031f     	cmp	x24, x0
  17b7a0: 54000760     	b.eq	0x17b88c <__ZN14Repair_Manager21add_path_to_goal_WMEsEP20chunk_element_structy+0x124>

So you can see that it is getting a null pointer back from Repair_Manager::find_path_to_goal_for_symbol.

I don't know what else is wrong, but it seems that Repair_Manager::add_path_to_goal_WMEs should do a null check.

@garfieldnate
Copy link
Collaborator

Thank you for the thorough explanation! I did not know you could coun by 4's to a specified offset to find the offending instruction in this way.

Here's the method we're talking about: https://github.com/SoarGroup/Soar/blob/development/Core/SoarKernel/src/explanation_based_chunking/ebc_repair.cpp#L49

It doesn't have a comment, but going by the name it seems like it takes a symbol that's present in a matched production and finds the WME path from the current goal to the symbol.

If a path is not found, it does indeed return null! The one caller, add_path_to_goal_WMEs, assumes that it doesn't return null, which is equivalent to assuming that pTargetSym->instantiated_sym has a path from the current goal. For your segfault, we do know that this value is not null, but it must instead be dangling.

All of this logic occurs within repair_rule, within reorder_and_validate_chunk where the reorder result is either reorder_failed_unconnected_conditions or reorder_failed_reordering_rhs. At that point it's apparently assumed that the ungrounded symbols in the chunk can be grounded by doing a search for a path from the current goal.

I can't tell if there's some logic issue with finding the path to the symbol, or if it's actually not possible to guarantee that a path will be found, and we need to check for this case and reject the chunk. Having your code to debug with would help.

Comments from others in this area would help, as well (Steven or John, perhaps, since I know they have stories of hitting chunking bugs).

analog-cbarber added a commit to analog-cbarber/Soar that referenced this issue Nov 13, 2024
@analog-cbarber
Copy link
Author

So adding the simple null check at least keeps my program from crashing. I have no idea if anything else needs to be done but this is at least an improvement.

@garfieldnate garfieldnate changed the title Segfault in RepairManager::add_path_to_goalWMEs RepairManager::add_path_to_goal_WMEs sometimes fails to ground dangling symbols Nov 14, 2024
@garfieldnate
Copy link
Collaborator

Good to know! 🤔 Trying to think of what the implication here is. If the symbol cannot be grounded, then wouldn't the resulting chunk be ungrounded? I think we might already have some warnings in there with a message to that effect. I think an ungrounded chunk should be rejected outright. The PR improves Soar's stability by eliminating a segfault, but I think we should add an else{} branch that prints an informative warning here.

@garfieldnate
Copy link
Collaborator

Is a chunk still learned after this fix? If so, can you share it? I want to see if it is ungrounded.

@garfieldnate
Copy link
Collaborator

Mazin's thesis, page 74 footnote:

There are no known situations where repair currently fails.

So I want to know what situation caused this failure!

@garfieldnate
Copy link
Collaborator

Maybe one way that this could happen: the symbol is connected to one of the supergoals. I actually thought this was handled in Mazin's thesis already, but perhaps at this point in the code, the specified symbol is not grounded in the lowest goal. If line 85 does not include the superstate, then it would never find a path to the symbol. That might be a stretch, though.

@analog-cbarber
Copy link
Author

There are chunks that are learned but I don't know if any of them are the one that triggered this because I didn't bother to try to get this running in a c++ debugger. I guess if the code outputted a suitable warning I could tell.

@garfieldnate
Copy link
Collaborator

Can you share all the learned chunks here? Or, if you paste the chunks into a new agent, do any give you a warning about ungrounded symbols?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants