Refactor handling of embedded magic.mgc. #4989

Open
wants to merge 10 commits into
base: dev

teo-tsirpanis
Member

SC-25167
SC-47655
SC-47656
SC-47657
SC-47658

This PR overhauls the facilities to embed and load the magic.mgc file that is needed by libmagic:

  • The most important change is the removal of magic_mgc_gzipped.bin.tar.bz2. This file contained a copy of magic.mgc that was compressed, converted to escaped C characters, then packed and compressed again to save space. It was stored in source control so that it could be unpacked at build time and #included in mgc_dict.cc. Because the file was prepared ahead of time by a manually invoked C++ program, this approach tied the Core to a specific version of libmagic. This became evident in Update libmagic to version 5.45 #4673, where updating libmagic alone was not enough; we also had to update magic_mgc_gzipped.bin.tar.bz2.

    What we do now is rely on CMake to find magic.mgc and perform its entire preparation at build time. The C++ program was rewritten as a CMake script, which makes it much simpler and lets it run in cross-compilation scenarios. The script accepts the uncompressed magic.mgc file, compresses it, and produces a header file of the following format:

    static const unsigned char magic_mgc_compressed_bytes[] = {
    0x28, 0xb5, 0x2f, 0xfd, …
    };
    constexpr size_t magic_mgc_compressed_size = sizeof(magic_mgc_compressed_bytes);
    // Editorial note: we used to prepend the decompressed size at the start of the
    // binary blob, but this was non-standard and could not be easily done by CMake.
    constexpr size_t magic_mgc_decompressed_size = 7041352;
  • The algorithm used to compress magic.mgc was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 to 273500 bytes).

  • Tests for mgc_dict were also updated to use Catch2, and were wired to run along with the other standalone unit tests.

    • This required creating an object library for mgc_dict, which was also done.
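
For illustration only, the header format described in the first bullet can be reproduced with a small C++ function. This is a sketch, not the PR's implementation: the PR generates the header with a CMake script, and the function name `render_embedded_header` and its parameters are hypothetical. It assumes the input is already compressed and that the decompressed size is known by the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: renders an already-compressed byte buffer as a C++
// header in the format shown in the PR description. `name` is the identifier
// prefix (for example "magic_mgc"). `compressed` must not be empty, because an
// empty initializer list is not valid for an array of unknown bound.
std::string render_embedded_header(
    const std::string& name,
    const std::vector<std::uint8_t>& compressed,
    std::size_t decompressed_size) {
  std::string out =
      "static const unsigned char " + name + "_compressed_bytes[] = {\n";
  char buf[8];
  for (std::size_t i = 0; i < compressed.size(); ++i) {
    // Each byte becomes "0xNN,"; a trailing comma after the last element is
    // legal in a C++ array initializer.
    std::snprintf(
        buf, sizeof(buf), "0x%02x,", static_cast<unsigned>(compressed[i]));
    out += buf;
    // Wrap after every 16 bytes and after the final byte.
    const bool end_of_line =
        (i + 1) % 16 == 0 || i + 1 == compressed.size();
    out += end_of_line ? "\n" : " ";
  }
  out += "};\n";
  out += "constexpr size_t " + name + "_compressed_size = sizeof(" + name +
         "_compressed_bytes);\n";
  // The decompressed size is emitted as a separate constant rather than being
  // prepended to the binary payload.
  out += "constexpr size_t " + name + "_decompressed_size = " +
         std::to_string(decompressed_size) + ";\n";
  return out;
}
```

Calling `render_embedded_header("magic_mgc", bytes, 7041352)` yields text with the same three declarations as the format above. Keeping the payload as a plain byte array means the consumer needs no unescaping step before handing it to the decompressor.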

Validated by successfully running unit_mgc_dict locally.


TYPE: BUILD
DESC: Improve embedding of magic.mgc and allow compiling with any libmagic version.

It was rewritten from C++ to CMake, and compression is now being done with CMake's commands.
This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control.
The size of the uncompressed file used to be kept at the start of the binary file. Because CMake cannot easily modify binary files, the script now generates a complete header, alongside a constexpr variable with the uncompressed size.
It was also simplified a bit, and `gzip_wrappers.cc` is now unused and was removed.
Compressed size dropped from 333067 to 270578 bytes.
Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.
Higher levels require CMake 3.26+.
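
The commit notes above contrast two ways of carrying the uncompressed size: prepended to the binary blob (old) and as a separate constexpr constant (new). The sketch below shows what a consumer has to do in each case. The prefix width and byte order of the old format are not stated in this PR, so the 8-byte little-endian prefix, the `PrefixedBlob` struct, the `split_size_prefix` function, and the `demo_*` names and values are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Old-style layout, as assumed here: an 8-byte little-endian size prefix
// followed by the compressed payload. The real prefix format may differ.
struct PrefixedBlob {
  std::size_t decompressed_size;
  const std::uint8_t* payload;
  std::size_t payload_size;
};

// Splits a size-prefixed blob into its declared size and its payload.
PrefixedBlob split_size_prefix(const std::uint8_t* data, std::size_t size) {
  constexpr std::size_t kPrefix = 8;
  if (size < kPrefix) {
    throw std::invalid_argument("blob is shorter than its size prefix");
  }
  std::uint64_t n = 0;
  for (std::size_t i = 0; i < kPrefix; ++i) {
    n |= static_cast<std::uint64_t>(data[i]) << (8 * i);
  }
  return {static_cast<std::size_t>(n), data + kPrefix, size - kPrefix};
}

// New-style layout: the payload is a plain array and both sizes are
// compile-time constants, so no parsing step is needed. The bytes and the
// decompressed size below are placeholder values.
static const unsigned char demo_compressed_bytes[] = {0x28, 0xb5, 0x2f, 0xfd};
constexpr std::size_t demo_compressed_size = sizeof(demo_compressed_bytes);
constexpr std::size_t demo_decompressed_size = 16;
static_assert(demo_compressed_size == 4, "size is derived from the array");
```

With the new layout the blob is a standard compressed stream that any compatible tool can read, and the output buffer can be sized from a constant instead of from bytes parsed at run time.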

add_custom_command(
OUTPUT "${MGC_GZIPPED_H_OUTPUT_FILE}"
DEPENDS "${libmagic_DICTIONARY}" "${PROJECT_SOURCE_DIR}/scripts/generate_embedded_data_header.cmake"
Member Author

Suggested change
- DEPENDS "${libmagic_DICTIONARY}" "${PROJECT_SOURCE_DIR}/scripts/generate_embedded_data_header.cmake"
+ DEPENDS "${libmagic_DICTIONARY}"

The Docker build apparently does not like the dependency on generate_embedded_data_header.cmake. I tried to reproduce the failure locally on WSL (with similar versions of CMake and make) but could not. I can simply remove the dependency, but then magic_mgc.zst.h_ will not be regenerated in the (rare) case that the script changes.

Member Author

I managed to reproduce this locally on Docker, and spent a lot of time trying to figure it out, without success. I even copied the generated build tree out of the container with docker cp and compared the makefiles to those from my plain WSL environment (where it builds, for some reason), but could not find anything suspicious. 😕
