Making memray third-party allocator-aware #577

Open
1 task done
pitrou opened this issue Apr 10, 2024 · 24 comments
Labels
enhancement New feature or request

Comments

@pitrou

pitrou commented Apr 10, 2024

Is there an existing proposal for this?

  • I have searched the existing proposals

Is your feature request related to a problem?

It seems that memray currently reports the different "kinds" of allocations based on which libc function was called (malloc, mmap...). (*) However, third-party allocators such as mimalloc and jemalloc are growing in use because of their desirable performance characteristics. When those are used instead of the system allocator, allocations which are logically malloc-like are reported as mmap calls with very large allocation sizes.

There is an example in this issue report where a bunch of 64 MiB blocks are reported by memray as allocated (roughly one per thread), resulting in a large reported footprint of more than 1 GiB, while those are page reservations made by mimalloc and the corresponding allocations on the application side are tiny (1 KiB each).
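
As a rough illustration of the gap between the two views, here is a small worked calculation. The thread count is a made-up figure chosen for illustration (the report only says "one per thread, roughly"); the 64 MiB and 1 KiB sizes are the ones quoted above.

```python
MIB = 1024 * 1024
KIB = 1024

threads = 17                          # hypothetical thread count, not from the report
reservation_per_thread = 64 * MIB     # mimalloc reservation, visible as one mmap call
requested_per_thread = 1 * KIB        # what the application actually asked for

reported = threads * reservation_per_thread    # what an mmap-level view adds up
requested = threads * requested_per_thread     # what a malloc-level view would add up

print(f"mmap-level view:   {reported // MIB} MiB")
print(f"malloc-level view: {requested // KIB} KiB")
```

With these assumed numbers the mmap-level total already exceeds 1 GiB, while the application only asked for a few KiB.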

This is a problem that is bound to produce many user reports of memory leaks or overconsumption, while the program is actually operating normally.

(*) I may be wrong in this interpretation of mine, in which case please do correct me.

Describe the solution you'd like

Ideally, memray would also detect calls to third-party allocator routines and report a mi_malloc(1024) as allocating 1024 bytes, not 64 MiB :-)

Several technical solutions can be considered and I'm not an expert in the field. Here are two that come to mind:

  • Hard-code support for the most popular 3rd-party allocators, by looking at their respective API names. This seems conceptually easy but will have limited benefits, because those allocators are often privately vendored and sometimes their symbols are mangled to avoid symbol clashes. Also, this means that less popular allocators will not get any coverage.

  • Devise some sort of runtime protocol where the allocators themselves may tag API functions (how? I have no idea :-)) as being malloc-like, realloc-like, etc. This is obviously more complex technically and requires cooperation to come up with a suitable protocol, but would work better in the long term.

Alternatives you considered

No response

@pablogsal
Member

Thanks @pitrou for bringing this to us. This is a very interesting problem indeed.

The key to either method is that we need:

  • A specific name for a symbol. We could mangle it ourselves according to the Itanium ABI but we need something to override that's constant and predictable.
  • We need the symbol to have a PLT/GOT entry. This basically means that the symbol is in the dynamic table of the executable or shared library. I assume that this will usually be the case, but not always. For instance, if someone (looking at CPython) were to compile mimalloc statically, the symbols wouldn't be exposed and there would be no way for us to properly override them. This has an easy fix for CPython because we can ensure these have semantic interposition for this purpose, but anything else has the same problem.

If we have these two things, we could offer a way to either override automatically by having constant symbol names or to offer some kind of dynamic naming via some configuration.

I suppose that the next step is for us to investigate how some of the applications/libraries out there are interacting with these allocators. Do you think you can give us an example with pyarrow that uses mimalloc or jemalloc?

@pitrou
Author

pitrou commented Apr 10, 2024

Here is a quick REPL example:

>>> import pyarrow as pa

# mimalloc
>>> pool = pa.mimalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000

# jemalloc
>>> pool = pa.jemalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000

Note that mimalloc_memory_pool and jemalloc_memory_pool return singleton instances.

You'll find the corresponding C++ code here:

Note that jemalloc symbols are mangled to avoid polluting the standard libc namespace (malloc etc.) so it's probably easier to look at mimalloc first.

We need the symbol to have a PLT/GOT entry. This basically means that the symbol is in the dynamic table of the executable or shared library.

Ah, interesting. So it must appear in nm --dynamic otherwise memray wouldn't find it?
To avoid potential name clashes, we un-expose most third-party symbols from libarrow.so.

For example:

$ nm libarrow.so.1500 | rg -w mi_malloc
0000000001d3c210 t mi_malloc
$ nm --dynamic libarrow.so.1500 | rg -w mi_malloc
$

$ nm libarrow.so.1500 | rg "je_arrow_" | head -n 4
0000000001cc7bb0 t je_arrow_aligned_alloc
0000000001cc8180 t je_arrow_calloc
0000000001ccd070 t je_arrow_dallocx
0000000001cc9a30 t je_arrow_free
$ nm --dynamic libarrow.so.1500 | rg "je_arrow_"
$

@pablogsal
Member

pablogsal commented Apr 10, 2024

Ah, interesting. So it must appear in nm --dynamic otherwise memray wouldn't find it?

That is a sufficient condition but not a necessary one. The other option is that it should have a symbol called mi_malloc@plt or similar (in the normal symbol table). Otherwise it seems that you may be statically compiling against mimalloc (all the allocator code is within the shared lib) and in that case all bets are off because we cannot relocate the symbol (it could even be inlined, for what it's worth).
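
The `nm --dynamic` check can be approximated from Python, since dlsym() only sees symbols that are present in a dynamic symbol table. The sketch below is a rough probe under a few assumptions: a POSIX system where `ctypes.CDLL(None)` opens the running process image, and a helper name (`is_dynamically_visible`) and missing-symbol name that are invented here for illustration. It only tests the sufficient condition discussed above, not the `@plt` case.

```python
import ctypes


def is_dynamically_visible(library_path, symbol_name):
    """Return True if dlsym() can find symbol_name through the given library.

    library_path=None means the already-loaded process image (dlopen(NULL)),
    which is an assumption that holds on Linux and macOS but not on Windows.
    """
    lib = ctypes.CDLL(library_path)
    try:
        # ctypes resolves attributes with dlsym(); a miss raises AttributeError.
        getattr(lib, symbol_name)
    except AttributeError:
        return False
    return True


# malloc is exported by libc, so it is reachable from the process image.
print(is_dynamically_visible(None, "malloc"))
# A made-up name stands in for a symbol that is local ("t" in nm output).
print(is_dynamically_visible(None, "hypothetical_symbol_that_does_not_exist"))
```

Passing the path to a shared library instead of `None` probes that library; a symbol that `nm --dynamic` does not list would be expected to come back as not visible.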

@pitrou
Author

pitrou commented Apr 10, 2024

The other option is that it should have a symbol called mi_malloc@plt or similar (in the normal symbol table).

Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably) or perhaps compiler/linker flag?

Also, yes, we are statically compiling mimalloc and jemalloc.

@pablogsal
Member

pablogsal commented Apr 10, 2024

Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably) or perhaps compiler/linker flag?

I think you can do it with __attribute__((visibility("default"))) but that has other effects (like exporting the symbol).

@pitrou
Author

pitrou commented Apr 10, 2024

Hmm, actually, a function attribute wouldn't work, because we would have to patch the mimalloc source code for that...

(also, we use -fno-semantic-interposition and I'm unsure how it influences __attribute__((visibility("default"))))

@pablogsal
Member

An alternative view of this problem is that code loaded with LD_PRELOAD should be able to interpose the symbol. We do the same, but by reimplementing the linker.

(also, we use -fno-semantic-interposition and I'm unsure how it influences __attribute__((visibility("default"))))

That deactivates PLT entries for intra-library calls in the shared library. This means that if the definition of the symbol is inside the executable/shared lib, there won't be a PLT entry, which is faster and maybe inlinable, but it means the symbol cannot be interposed.

@pablogsal
Member

pablogsal commented Apr 10, 2024

It looks like if you statically compile the allocator and use -fno-semantic-interposition, you are preventing any memory profiler from interposing calls to the allocators. (This also includes LD_PRELOAD-based ones like https://github.com/KDE/heaptrack/). This is because it's impossible to interpose the call without rewriting the machine code. And sometimes even that won't be enough, because the call may be inlined.

I am afraid this is the classic compromise between performance and observability.

@pitrou
Author

pitrou commented Apr 10, 2024

I am afraid this is the classic compromise between performance and observability.

I agree. We could definitely make an exception for mimalloc and jemalloc calls, however, it's just that I don't know how to do that without affecting other symbols.

Also, a radical solution might be to first try dlsym-ing the symbols, and then fall back on the local symbol.

@pablogsal
Member

however, it's just that I don't know how to do that without affecting other symbols.

I think trying to use a __attribute__((visibility("default"))) or marking the symbol as weak (__attribute__((weak))) may be worth a try.

@pablogsal
Member

A quick check you can do when trying things out is to load a library with the same definition via LD_PRELOAD and check whether it's interposed or not.

@pitrou
Author

pitrou commented Apr 10, 2024

I think trying to use a __attribute__((visibility("default"))) or marking the symbol as weak (__attribute__((weak))) may be worth a try.

I thought so, but I realized it required patching the mimalloc or jemalloc source, something we'd like to avoid if possible (also, it could be pre-compiled and we would be linking against an existing libmimalloc.a).

That said, the dlsym route would probably be ok for us. I might give it a quick try.

@pablogsal
Member

pablogsal commented Apr 10, 2024

Some interesting info: apparently the way Qt does this is to use -Bsymbolic-functions and:

--dynamic-list=dynamic-list-file
Specify the name of a dynamic list file to the linker. This is typically used when creating shared libraries to specify a list of global symbols whose references shouldn’t be bound to the definition within the shared library, or creating dynamically linked executables to specify a list of symbols which should be added to the symbol table in the executable. This option is only meaningful on ELF platforms which support shared libraries.

The format of the dynamic list is the same as the version node without scope and node name. See [VERSION Command](https://sourceware.org/binutils/docs/ld/VERSION.html) for more information.

Example: https://github.com/qt/qtbase/blob/aa896ca9f51252b6d01766e19a03e41bd49857f3/src/gui/CMakeLists.txt#L324

@pablogsal
Member

Also, a radical solution might be to first try dlsym-ing the symbols, and then fall back on the local symbol.

I think that won't work for profilers that attach or that don't use LD_PRELOAD because the interposition will happen at arbitrary late points (after the initial relocation has been made).
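
The timing problem can be shown with a toy model. This is only a sketch: a plain dict stands in for the set of symbols dlsym() can see, and `Pool`, `local_mi_malloc` and `profiler_mi_malloc` are hypothetical names, not Arrow or memray APIs. The point is that a lookup done once at initialization only picks up a replacement that was already in place at that moment.

```python
global_symbols = {}   # stands in for what dlsym() could find (toy model)
recorded = []         # allocation sizes seen by the pretend profiler


def local_mi_malloc(size):
    """Stand-in for the statically linked, non-exported allocator."""
    return bytearray(size)


def profiler_mi_malloc(size):
    """Stand-in for a profiler's replacement allocator entry point."""
    recorded.append(size)
    return local_mi_malloc(size)


class Pool:
    def __init__(self):
        # Resolved once: dlsym-style lookup first, local symbol as fallback.
        self._malloc = global_symbols.get("mi_malloc", local_mi_malloc)

    def allocate(self, size):
        return self._malloc(size)


# Case 1: the replacement exists before the pool resolves (LD_PRELOAD-like).
global_symbols["mi_malloc"] = profiler_mi_malloc
early_pool = Pool()
early_pool.allocate(1024)

# Case 2: the pool resolves first and the profiler attaches afterwards.
global_symbols.clear()
late_pool = Pool()
global_symbols["mi_malloc"] = profiler_mi_malloc
late_pool.allocate(2048)

print(recorded)  # [1024]
```

Only the first allocation is recorded; the pool that resolved before the profiler attached keeps calling the local symbol.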

@pablogsal
Member

Maybe you can wrap the allocator in some call that's exported and use that internally and mark that wrapper as __attribute__((visibility("default"))). We could override the wrapper.

@pitrou
Author

pitrou commented Apr 10, 2024

I think that won't work for profilers that attach or that don't use LD_PRELOAD because the interposition will happen at arbitrary late points (after the initial relocation has been made).

I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?

@pablogsal
Member

pablogsal commented Apr 10, 2024

I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?

No, they patch the Global Offset Table at runtime. All call sites point to a PLT entry. For calls that have a PLT/GOT pair, the code normally trampolines through a small piece of assembly code that grabs an address from the Global Offset Table and calls it. Call sites point to the trampoline, and the trampoline grabs the address on every call. At first, the address in the GOT points to the linker's resolution routine, and once the linker finds the real address (lazy binding) the GOT is updated.

Profilers like memray and heaptrack work by locating the GOT and overwriting the address with that of their own functions. This can be done at runtime, so it allows attaching and activating/deactivating.

LD_PRELOAD works the same way, except that it interposes the symbol when the linker resolves it, so it ends up in the first GOT update. It has several disadvantages, though (for example, it cannot be deactivated and attaching won't work).

The mechanism needs your function to have a PLT/GOT pair.
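
The mechanism can be pictured with a toy model. This is a sketch only, not how memray is implemented: real GOT slots hold machine addresses and the trampoline is assembly, whereas here a dict holds Python callables. All names (`got`, `malloc_plt`, `attach_profiler`, `detach_profiler`) are invented for the illustration.

```python
allocations = []   # sizes recorded while the pretend profiler is attached


def real_malloc(size):
    """Stand-in for the real allocator."""
    return bytearray(size)


# The "GOT": one slot per imported symbol, already resolved to the real function.
got = {"malloc": real_malloc}


def malloc_plt(size):
    """The "PLT trampoline": reads the GOT slot on every call, then jumps to it."""
    return got["malloc"](size)


def attach_profiler():
    """Overwrite the GOT slot with a tracing function; return the old value."""
    original = got["malloc"]

    def traced_malloc(size):
        allocations.append(size)
        return original(size)

    got["malloc"] = traced_malloc
    return original


def detach_profiler(original):
    """Restore the GOT slot, deactivating tracing."""
    got["malloc"] = original


malloc_plt(16)               # before attaching: not recorded
saved = attach_profiler()
malloc_plt(1024)             # while attached: recorded
detach_profiler(saved)
malloc_plt(32)               # after detaching: not recorded

print(allocations)  # [1024]
```

Because every call site goes through the same slot, one write is enough to redirect all of them, and restoring the slot turns tracing off again.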

@pablogsal
Member

With this explanation you can see the cost: PLT trampolines require an extra read from the GOT and an extra jump, which makes every call a bit less efficient.

@pablogsal
Member

pablogsal commented Apr 10, 2024

-fno-semantic-interposition deactivates this mechanism for intra-library calls. For example, malloc in libc needs to be exposed so that other libraries can call it. Libraries linking to malloc need a PLT/GOT entry because they don't know where malloc lives, so the linker has to resolve the address at load time. (The linker could resolve every call site instead of trampolining, but that requires as many relocations as call sites, which is very inefficient. Instead, the linker relocates one indirect slot and every call site reads from it.) libc itself doesn't really need this mechanism, because malloc lives inside it. You could still use PLT jumps to allow interposing malloc inside libc (so profilers and debuggers work), or you could use -fno-semantic-interposition to keep internal malloc calls from going through the indirection, but then profilers won't see those calls.
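
A toy model of the difference between the two kinds of call, as a sketch only: a dict stands in for the GOT, and all function names are hypothetical. Calls routed through the slot can be seen by a profiler that patches it; a call bound directly to the local definition cannot.

```python
seen_by_profiler = []


def lib_malloc(size):
    """Stand-in for malloc as defined inside the library itself."""
    return bytearray(size)


got = {"malloc": lib_malloc}   # the indirection slot used by interposable calls


def call_from_other_library(size):
    # A caller in another library does not know where malloc lives,
    # so it always goes through the slot.
    return got["malloc"](size)


def internal_call_interposable(size):
    # Internal call compiled with semantic interposition: still uses the slot.
    return got["malloc"](size)


def internal_call_direct(size):
    # Internal call as with -fno-semantic-interposition: bound straight to
    # the local definition, bypassing the slot.
    return lib_malloc(size)


def traced_malloc(size):
    seen_by_profiler.append(size)
    return lib_malloc(size)


got["malloc"] = traced_malloc   # the profiler patches the slot

call_from_other_library(100)
internal_call_interposable(200)
internal_call_direct(300)

print(seen_by_profiler)  # [100, 200]
```

The 300-byte allocation never shows up, which is the observability cost of the faster direct call.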

@pitrou
Author

pitrou commented Apr 10, 2024

Ok, so --dynamic-list doesn't work for a statically linked mimalloc:

ld.gold: warning: Cannot export local symbol 'mi_malloc'

I think this might work, though it would be worse performance-wise:

Maybe you can wrap the allocator in some call that's exported and use that internally and mark that wrapper as __attribute__((visibility("default"))). We could override the wrapper.

@pablogsal
Member

ld.gold: warning: Cannot export local symbol 'mi_malloc'

You may need to mark it as __attribute__((visibility("default"))) I am afraid :(

@pitrou
Author

pitrou commented Apr 10, 2024

Ok, I've got a PR which creates such interposable wrappers in Arrow. I've checked that they can be interposed using LD_PRELOAD:
apache/arrow#41128

@pablogsal
Member

Ok, I will discuss with @godlygeek what's the best way to support something like this soon.

@pitrou
Author

pitrou commented Apr 10, 2024

Also note you can download prebuilt wheels from the aforementioned PR using these links. Click on one of the green "Crossbow" badges, then click on the "Summary" link on the GitHub Actions page, then download the artifact at the bottom of the summary page.
