Intermittent crash in LLVM_Util::getPointerToFunction(llvm::Function* func) #1712

Open
ZapAndersson opened this issue Aug 18, 2023 · 14 comments

@ZapAndersson

Problem

In 3ds Max, we have lots of users crashing with a call stack that seems to be caused by this problem.
We have a scene that "reproduces" the problem, but the reproduction is intermittent and seems to be a race condition of sorts.
Basically, you load a particular file, start an interactive render and the material editor at the same time, then start changing parameters in the material many, many times. Eventually, we get this crash. Or not. Depending on the phase of the moon, the wind direction, humidity, etc.

Crash is reported on this line:

[screenshot]

...i.e. in the case this function is reached before the shader has been optimized. Somehow, it seems like the call to exec->finalizeObject(); crashes.

The call stack is something like this:

oslexec.dll!OSL_v1_12::pvt::LLVM_Util::getPointerToFunction(llvm::Function * func) Line 1712 C++
oslexec.dll!OSL_v1_12::pvt::BackendLLVM::run() Line 1674 C++
oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::optimize_group(OSL_v1_12::ShaderGroup & group, OSL_v1_12::ShadingContext * ctx, bool do_jit) Line 3595 C++
oslexec.dll!OSL_v1_12::ShadingContext::execute_init(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 91 C++
oslexec.dll!OSL_v1_12::ShadingContext::execute(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 217 C++
oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::execute(OSL_v1_12::ShadingContext & ctx, OSL_v1_12::ShaderGroup & group, int index, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 3264 C++
[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext &) Line 688 C++
[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext *) Line 695 C++
OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc, int output, bool bump) Line 3227 C++
OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc) Line 2936 C++
3dsmax.exe!RenderTexmapRange::__l5::<lambda_1>::operator()(const tbb::blocked_range<int> & rng) Line 1951 C++
[Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::run_body(tbb::blocked_range<int> &) Line 115 C++
3dsmax.exe!tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>,tbb::blocked_range<int>>(tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const> & start, tbb::blocked_range<int> & range) Line 439 C++
3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::execute() Line 143 C++
[External Code]
[Inline Frame] 3dsmax.exe!tbb::task::spawn_root_and_wait(tbb::task &) Line 809 C++
[Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &) Line 95 C++
[Inline Frame] 3dsmax.exe!tbb::parallel_for(const tbb::blocked_range<int> &) Line 201 C++
3dsmax.exe!RenderTexmapRange(HWND__ * hwnd, Texmap * tx, Bitmap * bm, FBox2 * range, float scale3d, int filter, int display, int t, const wchar_t * name, float z, int mono, bool disableBitmapProxies, bool bake) Line 1925 C++
3dsmax.exe!RenderTexmap(HWND__ * hwnd, Texmap * tex, Bitmap * bm, float scale3d, int filter, int display, int t, const wchar_t * name, float z, int mono, bool disableBitmapProxies, bool bake) Line 1877 C++
3dsmax.exe!InterfaceImp::Execute(int cmd, unsigned __int64 arg1, unsigned __int64 arg2, unsigned __int64 arg3, unsigned __int64 arg4, unsigned __int64 arg5, unsigned __int64 arg6) Line 6844 C++
core.dll!Texmap::GetVPDisplayDIB(int t, TexHandleMaker & thmaker, Interval & valid, int mono, int forceW, int forceH) Line 3851 C++

Expected behavior:

For it not to crash?

Actual behavior:

It crashes. Sometimes.

Steps to Reproduce

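  1. Load the particular scene file.
  2. Start an interactive render and the material editor at the same time.
  3. Change parameters in the material many times until the crash occurs (it does not always occur).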

Versions

  • OSL branch/version: Internal Autodesk fork of OSL 1.12.13
  • OS: Windows
  • C++ compiler: Visual Studio 2019 / 2022
  • LLVM version: llvm-14.0.6-3dsmax-001-osl_subset-vc142.7z
  • OIIO version: OpenImageIO-2.4.5.0-3dsmax-002-vc142.zip
@ZapAndersson
Author

Because this is so intermittent, it's hard to debug, and often I get a crash with no useful call stack, only an "abort was called" exception. I will try to figure out more, but if the above gives you any "Eureka" ideas, @lgritz, let me know.

@ZapAndersson
Author

I'm wondering if it could have anything to do with issue #1427?

@ZapAndersson
Author

ZapAndersson commented Sep 4, 2023

Better call stack, with some of the LLVM stuff untangled: @lgritz

 	oslexec.dll!llvm::report_fatal_error(const llvm::Twine & Reason, bool GenCrashDiag) Line 122	C++
 	oslexec.dll!llvm::report_fatal_error(const char * Reason, bool GenCrashDiag) Line 83	C++
>	oslexec.dll!llvm::RuntimeDyldCOFFX86_64::resolveRelocation(const llvm::RelocationEntry & RE, unsigned __int64 Value) Line 117	C++
 	oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocationList(const llvm::SmallVector<llvm::RelocationEntry,64> & Relocs, unsigned __int64 Value) Line 1106	C++
 	oslexec.dll!llvm::RuntimeDyldImpl::resolveLocalRelocations() Line 149	C++
 	oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocations() Line 145	C++
 	oslexec.dll!llvm::MCJIT::finalizeLoadedModules() Line 244	C++
 	oslexec.dll!llvm::MCJIT::finalizeObject() Line 270	C++
 	oslexec.dll!OSL_v1_12::pvt::LLVM_Util::getPointerToFunction(llvm::Function * func) Line 1714	C++
 	oslexec.dll!OSL_v1_12::pvt::BackendLLVM::run() Line 1674	C++
 	oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::optimize_group(OSL_v1_12::ShaderGroup & group, OSL_v1_12::ShadingContext * ctx, bool do_jit) Line 3595	C++
 	oslexec.dll!OSL_v1_12::ShadingContext::execute_init(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 91	C++
 	oslexec.dll!OSL_v1_12::ShadingContext::execute(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 217	C++
 	oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::execute(OSL_v1_12::ShadingContext & ctx, OSL_v1_12::ShaderGroup & group, int index, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 3264	C++
 	[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext &) Line 688	C++
 	[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext *) Line 695	C++
 	OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc, int output, bool bump) Line 3227	C++
 	OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc) Line 2936	C++
 	3dsmax.exe!RenderTexmapRange::__l5::<lambda_1>::operator()(const tbb::blocked_range<int> & rng) Line 1950	C++
 	[Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::run_body(tbb::blocked_range<int> &) Line 115	C++
 	3dsmax.exe!tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>,tbb::blocked_range<int>>(tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const> & start, tbb::blocked_range<int> & range) Line 439	C++
 	3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::execute() Line 143	C++

@ZapAndersson
Author

The actual abort is here:
[screenshot]

Called from here:
[screenshot]

Called from here:
[screenshot]

Called from here:
[screenshot]

Called from here:
[screenshot]

Called from here:
[screenshot]

Called from OSL here (as per the original message above):
[screenshot]

@ZapAndersson
Author

This line in particular stands out to me:
[screenshot]

@ThiagoIze
Contributor

Also, the llvm comment says 2GB and yet the check is done with UINT32_MAX which is 4GB. Is the comment wrong or should the code be changed to INT32_MAX? I don't know if that's the source of these problems (if anything it would make the errors happen more often if changed to 2GB).
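To make the 2GB versus 4GB distinction concrete, here is a minimal sketch of the two candidate limits. The helper names are hypothetical and this is not LLVM code; it only shows which offsets each constant would accept.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not taken from LLVM: they only illustrate the two limits.
constexpr std::uint64_t kTwoGiB  = 1ull << 31;  // 2 GiB
constexpr std::uint64_t kFourGiB = 1ull << 32;  // 4 GiB

// UINT32_MAX is one byte short of 4 GiB; INT32_MAX is one byte short of 2 GiB.
static_assert(static_cast<std::uint64_t>(UINT32_MAX) == kFourGiB - 1, "UINT32_MAX is 4 GiB - 1");
static_assert(static_cast<std::uint64_t>(INT32_MAX) == kTwoGiB - 1, "INT32_MAX is 2 GiB - 1");

// True if `offset` is accepted by a check written against UINT32_MAX.
constexpr bool passes_uint32_check(std::uint64_t offset) {
    return offset <= static_cast<std::uint64_t>(UINT32_MAX);
}

// True if `offset` is accepted by a check written against INT32_MAX.
constexpr bool passes_int32_check(std::uint64_t offset) {
    return offset <= static_cast<std::uint64_t>(INT32_MAX);
}
```

Under this sketch, an offset of 3 GiB passes the UINT32_MAX check but fails the INT32_MAX one, which is why tightening the check to 2GB would make the error fire more often, not less.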

@ZapAndersson
Author

So it seems this IMAGE_REL_AMD64_ADDR32NB mode is based on a 32-bit offset, but the one at the end of the above screenshot, IMAGE_REL_AMD64_ADDR64, is true 64-bit.
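As a rough illustration of the difference between the two modes, the sketch below models an ADDR32NB-style relocation as a 32-bit offset from an image base and an ADDR64-style relocation as a full 64-bit address. The function names are hypothetical and this is not LLVM's implementation; it only shows why the former can fail when the target is too far from the base while the latter cannot.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of the two relocation kinds; not LLVM's code.

// ADDR32NB-style: the target is stored as a 32-bit offset from the image base.
// Returns std::nullopt when the target lies below the base or more than
// UINT32_MAX bytes above it, i.e. when the offset does not fit in 32 bits.
std::optional<std::uint32_t> encode_addr32nb(std::uint64_t target,
                                             std::uint64_t image_base) {
    if (target < image_base) return std::nullopt;
    const std::uint64_t offset = target - image_base;
    if (offset > UINT32_MAX) return std::nullopt;
    return static_cast<std::uint32_t>(offset);
}

// ADDR64-style: the full 64-bit address is stored, so it is always representable.
std::uint64_t encode_addr64(std::uint64_t target) {
    return target;
}
```

In this model, any target within 4 GiB above the base encodes fine, and anything outside that window has no valid 32-bit encoding, which would correspond to the fatal error seen in the call stack.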

I made a Godawful Hack(tm) in the LLVM code like so, so that any function that decided to use the former mode used the latter mode instead:

[screenshot]

....and the problem disappeared.

Now is this a good fix?

I highly doubt it, but.....??

/Z

@lgritz

@lgritz
Collaborator

lgritz commented Sep 7, 2023

I think we should report this on the llvm-dev forums, probably in the "code generation" board?

Zap, can you take care of that? I feel like it's more efficient for you to do that communication rather than me having to be the go-between. You're much more familiar with the relevant LLVM stack traces and internals than I am at this point.

I think there are three things to try to get out of that interaction:

  1. Have somebody on the LLVM team confirm that we're on the right track, that this patch is essentially correct and does no additional harm, or else that we're totally misguided and there is something different we should be doing to address the problem.
  2. Convince somebody there to take the ball and turn this (or any other approach they prefer) into a patch that will permanently fix future LLVM releases.
  3. If they have a suggestion for something we can do on the OSL side to avoid this, that's even better. Like, are we hitting a 32 bit limit only because we are being exceptionally silly about what we're handing LLVM, forgetting to clear something between shader group builds, or something like that?

Now, on our end, we are in a bit of a pickle in that we still have a lot of work to make OSL work with LLVM 16+. They are close to releasing 17, and definitely will not backport fixes as far back as 15. So you may be forced to maintain those patches on your end at Autodesk (you seem to be the only ones running into this problem) until we can all upgrade to the latest LLVM that would have a fix. But like I said, if they have a suggestion for how to ameliorate the problem from our side, that's the best option.

@ZapAndersson
Author

ZapAndersson commented Sep 8, 2023 via email

@ZapAndersson
Author

ZapAndersson commented Sep 8, 2023

Yes, lots of good replies at llvm/llvm-project#65641 ...

OSL has a line that reads (in llvm_util.cpp https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1442)

//engine_builder.setCodeModel(llvm::CodeModel::Default);

I'll try setting it to "::Large" or "::Medium" and see if this changes things (apparently ::Small is the default?). Does this make sense?

@ZapAndersson
Author

> 1. Have somebody on the LLVM team confirm that we're on the right track, that this patch is essentially correct and does no additional harm, or else that we're totally misguided and there is something different we should be doing to address the problem.

Well, we have that already. My hack is most certainly WRONG :)

@lgritz
Collaborator

lgritz commented Sep 8, 2023

Exceptions: we're definitely not relying on them. But perhaps there is a way to explicitly turn them off, which we have neglected to do?

setCodeModel: that may be fruitful. What happens if you make this call, and pass llvm::CodeModel::Large?

@ZapAndersson
Author

In my quick test, setting CodeModel::Large did not change anything, but it was a very late Friday, semi-aborted test, so I will double-check. I could see the condition for this fatal error still getting hit, though (I didn't spend enough time to truly get the crash, I just verified that the "type" of relocation block was still in use).

Note the latest post on the LLVM project here llvm/llvm-project#65641 (comment) in reply to my question about "Memory Managers"

If the "MemoryManager" is what doles out this memory to LLVM, then maybe that is the problem....? According to them, OSL is using its own "MemoryManager" because....(?)

@ZapAndersson
Author

Okay.... some new info....

OSL uses a custom memory manager that is held in each rendering thread's per-thread info. And this memory manager is kept around until the last rendering thread dies.

Sounds reasonable on paper....

Except... we use TBB for rendering. TBB actually has a set of worker threads that are always in flight. So those threads never die. So no destructor is ever hit on the per-thread data.

So the memory manager ends up being kept around forever.
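The lifetime issue can be shown with a small standalone sketch. It uses a `thread_local` object as a stand-in for the per-thread info (an assumption; the actual OSL mechanism may differ) and a plain `std::thread` that idles after its work, like a pooled TBB worker. `PerThreadInfo` and `destructions_while_worker_alive` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<int> g_destroyed{0};  // counts PerThreadInfo destructions

// Hypothetical stand-in for per-thread data that owns a memory manager.
struct PerThreadInfo {
    int jobs_done = 0;
    ~PerThreadInfo() { ++g_destroyed; }
};

// Runs `jobs` jobs on one long-lived worker, then returns the value of
// g_destroyed observed after the jobs finished but before the thread exited.
int destructions_while_worker_alive(int jobs) {
    std::mutex m;
    std::condition_variable cv;
    bool jobs_finished = false;
    bool may_exit = false;

    std::thread worker([&] {
        thread_local PerThreadInfo info;  // constructed on first use in this thread
        for (int i = 0; i < jobs; ++i) ++info.jobs_done;
        std::unique_lock<std::mutex> lock(m);
        jobs_finished = true;
        cv.notify_all();
        cv.wait(lock, [&] { return may_exit; });  // idle, like a pooled worker
    });

    int seen = 0;
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return jobs_finished; });
        seen = g_destroyed.load();  // the worker is still alive at this point
        may_exit = true;
    }
    cv.notify_all();
    worker.join();  // only thread exit runs ~PerThreadInfo
    return seen;
}
```

The destructor count does not change while the worker is merely idle; it only increases once the thread actually exits. With a thread pool whose workers live for the whole process, that exit never happens until shutdown.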

That wouldn't be a big deal, in the normal case. Except I also see this in the OSL wrapped memory manager (https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L244):

[screenshot]

Okay, so if memory is never ever thrown away, of course we can get beyond a 2GB limit.

I tested it, and in 3ds Max, the memory manager isn't destroyed until the app closes.....
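A back-of-the-envelope sketch of how a never-freeing manager could drift past a 32-bit limit follows. It is an idealized model that assumes allocations are laid out contiguously and all have the same size, which real allocations will not; `NeverFreeingManager` and `allocations_until_overflow` are hypothetical names, and `size` must be non-zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: memory is handed out but never released, so every new
// block starts at a higher offset from the first allocation than the last.
struct NeverFreeingManager {
    std::uint64_t bytes_handed_out = 0;

    // Returns the offset (from the first allocation) where the new block starts.
    std::uint64_t allocate(std::uint64_t size) {
        const std::uint64_t offset = bytes_handed_out;
        bytes_handed_out += size;
        return offset;
    }
};

// Counts how many blocks of `size` bytes (size > 0) can be handed out before
// a block would start at an offset greater than `limit`.
std::uint64_t allocations_until_overflow(std::uint64_t size, std::uint64_t limit) {
    NeverFreeingManager mgr;
    std::uint64_t count = 0;
    while (mgr.allocate(size) <= limit) ++count;
    return count;
}
```

In this model, with 1 MiB per shader group build, 2048 builds stay within a 2 GiB limit and 4096 within a 4 GiB limit. An interactive session that rebuilds on every parameter change could plausibly reach that order of magnitude, though the real per-build size is not known from this thread.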
