{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":16143904,"defaultBranch":"master","name":"blis","ownerLogin":"flame","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2014-01-22T15:58:24.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/6494486?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1716236783.0","currentOid":""},"activityList":{"items":[{"before":null,"after":"2307a4be4555ff1192f908e047402a09092371ba","ref":"refs/heads/stable-feb19-cand0","pushedAt":"2024-05-20T20:26:23.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Restored ArmSVE general storage case. (#708)\n\nDetails:\n- Restored general storage case in armsve kernels.\n- Reason for doing this: Though real `g`-storage is difficult to\n speedup, `g`-codepath here can provide a good support for\n transposed-storage. i.e. at least good for `GEMM_UKR_SETUP_CT_AMBI`.\n- By experience, this solution is only *a little* slower than in-reg\n transpose. Plus in-reg transpose is only possible for a fixed VL in\n our case.\n- (cherry picked from commit 4e18cd34f909c5045597f411340ede3a5e0bc5e1)\n\nRefined emacs handling of indentation. (#717)\n\nDetails:\n- This refines the emacs autoformatting to be better in line with\n contribution guidelines.\n- Removed a stray shebang in a .mk file which confuses emacs about the\n file mode, which should be makefile-mode. (emacs also removes stray\n whitespace at the ends of lines.)\n- (cherry picked from 0ba6e9eafb1e667373d9dbc2aa045557921f33e2)\n\nUpdated hpx namespace for make_count_shape. (#725)\n\nDetails:\n- The hpx namespace for *counting_shape changed. 
This PR updates the use\n of counting_shape in blis to comply with the change in hpx.\n- Co-authored-by: ctaylor \n- (cherry picked from 059f15105b1643fe56084f883c22b3cadf368b39)\n\nAdded an 'arm64' entry to `.travis.yml`. (#726)\n\nDetails:\n- Added a new 'arm64' entry to the .travis.yml file in an attempt to get\n Travis CI to compile both NEON and SVE kernels, even if only NEON\n kernels are exercised in the testing. With this new 'arm64' entry, the\n 'cortexa57' entry becomes redundant and may be removed. Thanks to\n RuQing Xu for this suggestion.\n- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in\n bli_kernels_arm64.h, which meant that the default value of 64 was\n being used. This caused a runtime consistency check to fail in\n bli_gks.c (in Travis CI), one which requires that\n\n mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE\n\n for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is\n defined as\n\n BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2\n\n This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'\n configuration, thus overriding the default and (hopefully) avoiding\n the aforementioned consistency check failures.\n- Appended '|| cat ./output.testsuite' to all 'make' commands in\n travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.\n- Whitespace changes.\n- (cherry picked from 0b421eff130b5c896edcc09e7358d18564d177e9)\n\nRedirect grep stderr to /dev/null. (#723)\n\nDetails:\n- In common.mk, added a redirection of stderr to /dev/null for the grep\n command being used to gather a list of header files #included from\n bli_cntx_ref.c. 
The redirection is desirable because as of grep 3.8,\n regular expressions with \"stray\" backslashes trigger warnings [1].\n But removing the backslash seems to break the BLIS build system when\n using pre-3.8 versions of grep, so this seems to be the easiest way to\n satisfy the BLIS build system for both pre- and post-3.8 grep\n environments.\n\n [1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html\n- (cherry picked from b1d3fc7e5b0927086e336a23f16ea59aa3611ccb)\n\nAdded runtime selection of 'power' config family. (#718)\n\nDetails:\n- Created a 'power' umbrella configuration family, which, when targeted\n at configure-time, will build both 'power9' and 'power10' subconfigs.\n (With this feature, a BLIS shared library could be compiled on a\n power9 system and run on power10 and vice-versa. Unoptimised code\n will execute if it is linked and run on any other generic system.)\n- This new configuration family will only work with gcc, since that is\n the only compiler supported by both power9 and power10 subconfigs in\n BLIS.\n- Documented power9 and power10 as supported microarchitectures in the\n docs/HardwareSupport.md document.\n- (cherry picked from e3d352f1fcc93e6a46fde1aa4a7f0a18fb27bd42)\n\nDefine `BLIS_VERSION_STRING` in `blis.h`. (#720)\n\nDetails:\n- Previously, the version string was communicated from configure to\n config.mk (via the config.mk.in template), where it was included via\n the top-level Makefile, where it was then used to define the\n preprocessor macro BLIS_VERSION_STRING via a command line argument to\n the compiler (via -D). This macro is then used within bli_info.c to\n initialize a static string which can then be queried via the\n bli_info_get_version_str() function. 
However, there are some\n applications that may find utility in being able to access the version\n string by inspecting the monolithic (flattened) blis.h header file\n that is created at compile time and installed alongside the library.\n This commit moves the definition of BLIS_VERSION_STRING into\n bli_config.h (via the bli_config.h.in template) so that it is\n embedded in blis.h. The version string is now available in three\n places:\n - the static/shared library, which is installed in the 'lib'\n subdirectory of the install prefix (query-able via the\n bli_info_get_version_str() function);\n - the config.mk makefile fragment, which is installed in the 'share'\n subdirectory of the install prefix (in the VERSION variable);\n - the blis.h header file, which is installed in the 'include'\n subdirectory of the install prefix (via the BLIS_VERSION_STRING\n macro constant).\n Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this\n change.\n- CREDITS file update.\n- (cherry picked from e730c685d09336b3bd09e86c94330c4eba967f3e)\n\nTypecast printf() args to avoid compiler warnings. (#716)\n\nDetails:\n- In bli_thread_range_tlb.c, typecast integer arguments passed to\n printf() -- which are typically disabled unless debugging -- to type\n \"long\" to guarantee a match to the \"%ld\" format specifiers used in\n those calls. This avoids spurious warnings with certain compilers in\n certain toolchain environments, such as 32-bit RISC-V (rv32iv).\n- (cherry picked from dc5d00a6ce0350cd82859d8c24f23d98f205d8db)\n\nUse here-document for 'configure --help' output. (#714)\n\nDetails:\n- Changed the configure script function that outputs \"--help\" text to do\n so via so-called \"here-document\" syntax for improved readability and\n maintainability. 
The change eliminates hundreds of echo statements and\n makes it easier to change existing configure options' help text, along\n with other benefits such as eliminating the need to escape double-\n quote characters (\").\n- (cherry picked from ecbcf4008815035c695822fcaf106477debff89a)\n\nMerge tlb- and slab/rr-specific gemm macrokernels. (#711)\n\nDetails:\n- Merged the tlb-specific gemm macrokernel (_var2b) with the slab/rr-\n specific one (var2) so that a single function can be compiled with\n either tlb or slab/rr support, depending on the value of the\n BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR. This is done by incorporating\n information from both approaches: the start/end/inc for the JR and IR\n loops from slab or rr partitioning; and the number of assigned\n microtiles, plus the starting IR dimension offset for all iterations\n after the first (ir_next). With these changes, slab, rr, and tlb can\n all be parameterized by initializing a similar set of variables prior\n to the jr loop.\n- Removed the wrap-around logic that sets the \"b_next\" field of the\n auxinfo_t struct, which executes during the last IR iteration of the\n last JR iteration. The potential benefit of this code is so minor\n (and hinges on the microkernel making use of the b_next field) that\n it's arguably not worth including. The code also does the wrong\n thing for some threads whenever JR_NT > 1, since only thread 0 (in the\n JR group) would even compute with the first micropanel of B.\n- Re-expressed the definition of bli_is_last_iter_slrr so that slab and\n tlb use the same code rather than rr and tlb.\n- Adjusted the initialization of the gemm control tree accordingly.\n- (cherry picked from c334ec278f5e2a101625629b2e13bbf1b38dede5)\n\nFixed mis-mapped instruction for VEXTRACTF64X2. (#713)\n\nDetails:\n- This commit fixes a typo in the macro definition for the extended\n inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. 
The macro\n was previously defined (incorrectly) in terms of the vextractf64x4\n instruction rather than vextractf64x2.\n- CREDITS file update.\n- (cherry picked from 5793a77937aee9847a5692c8e44b36a6380800a1)\n\nDefined lt, lte, gt, gte + misc. other updates. (#712)\n\nDetails:\n- Changed invertsc operation to be a non-destructive operation; that is,\n it now takes separate input and output operands. This change applies\n to both the object and typed APIs.\n- Defined an alternative square root operation, sqrtrsc, which, when\n operating on complex scalars, assumes the imaginary part of the input\n to be zero.\n- Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym\n so that when the source matrix has an implicit unit diagonal, the\n operation leaves the diagonal of the destination matrix untouched.\n Previously, the operations would interpret an implicit unit diagonal\n on the source matrix as a request to manifest the unit diagonal\n *explicitly* on output (either as something to copy in the case of\n copym, or something to compute with in the cases of addm, subm, axpym,\n scal2m, and xpbym). It turns out that this behavior was too cute by\n half and could cause unintended headaches for practical use cases.\n (This change in behavior also required small modifications to the trmv\n and trsv testsuite modules so that they would properly test matrices\n with unit diagonals.)\n- Added missing dependencies for copym to gemv, ger, hemv, trmv, and\n trsv testsuite modules.\n- Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in\n frame/util, which use lt, lte, gt, and gte level-0 scalar macros.\n- Trivial variable rename in bli_part.c to harmonize with other\n variable naming conventions.\n- (cherry picked from 16d2e9ea9ca0853197b416eba701b840a8587bca)\n\nImplement cntx_t pointer caching in gks. 
(#709)\n\nDetails:\n- Refactored the gks cntx_t query functions so that: (1) there is a\n clearer pattern of similarity between functions that query a native\n context and those that query its induced (1m) counterpart; and (2)\n queried cntx_t pointers (for both native and induced cntx_t pointers)\n are cached (by default), or deep-queried upon each invocation,\n depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined.\n- Refactored query-related functions in bli_arch.c to cache the queried\n arch_t value (by default), or deep-query the arch_t value upon each\n invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is\n defined.\n- Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named\n bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is\n repopulated each time the function is called. (It is still only\n allocated once on first call.) This was mostly done in preparation for\n some future in which the arch_t value might change at runtime. In such\n a scenario, the induced method context would need to be recalculated\n any time the native context changes.\n- Added preprocessor logic to bli_config_macro_defs.h to handle enabling\n or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING).\n- For now, cntx_t pointer caching is enabled by default and does not\n correspond to any official configure option. Disabling can be done\n by inserting a #define for BLIS_DISABLE_GKS_CACHING into the\n appropriate bli_family_*.h header file within the configuration of\n interest.\n- Thanks to Harihara Sudhan S (AMD) for suggesting that cntxt_t pointers\n (and not just arch_t values) be cached.\n- Comment updates.\n- (cherry picked from 9a366b14fe52c469f4664ef5dd93d85be8d97baa)\n\nFixing type-mismatch errors in power10 sandbox (#701)\n\nDetails:\n- This commit fixes a mismatch between the function type signature of\n bli_gemm_ex() required by BLIS and the version of the function defined\n within the power10 sandbox. 
It also performs typecasting upon calling\n bli_gemm_front() to attain type consistency with the type signature\n defined by BLIS for bli_gemm_front().\n- (cherry picked from b895ec9f1f66fb93972589c06bff171337153a31)\n\nDefine new global scalar (obj_t) constants. (#703)\n\nDetails:\n- This commit defines the following new global scalar constants:\n - BLIS_ONE_I: This constant encodes the imaginary unit.\n - BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit.\n - BLIS_NAN: This constant encodes a not-a-number value. Both real and\n imaginary parts are set to NaN for complex datatypes.\n- (cherry picked from 38d88d5c131253066cad4f98eea06fa9299cae3b)\n\nDisable power10 kernels other than sgemm, dgemm. (#705)\n\nDetails:\n- There is a power10 sandbox which uses microkernels for datatypes other\n than float and double (or scomplex/dcomplex). In a regular power10-\n configured build (that is, with the sandbox disabled), there were\n compile errors for some of these other non-sgemm/non-dgemm\n microkernels. This commit protects those kernels with a new cpp macro\n guard (which is defined in sandbox/power10/bli_sandbox.h) that\n prevents that kernel code from being compiled for normal, non-sandbox\n power10 builds.\n- (cherry picked from cdb22b8ffa5b31a0c16ac1a7bcecefeb5216f669)\n\nFix k = 0 edge case in power10 microkernels (#706)\n\nDetails:\n- When power10 sgemm and dgemm microkernels are called with k = 0, they\n become caught in infinite loops and segfault. This is fixed now via an\n early exit in the case of k = 0.\n- (cherry picked from d220f9c436c0dae409974724d42ab6c52f12a726)","shortMessageHtmlLink":"Restored ArmSVE general storage case. (#708)"}},{"before":null,"after":"8c29b37d7405ac37aab537a8071b7fa9b7d01dc3","ref":"refs/heads/stable-jan10-cand0","pushedAt":"2024-05-20T20:04:15.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Tile-level partitioning in jr/ir loops (ex-trsm). (#695)\n\nDetails:\n- Reimplemented parallelization of the JR loop in gemmt (which is\n recycled for herk, her2k, syrk, and syr2k). Previously, the\n rectangular region of the current MC x NC panel of C would be\n parallelized separately from the diagonal region of that same\n submatrix, with the rectangular portion being assigned to threads via\n slab or round-robin (rr) partitioning (as determined at configure-\n time) and the diagonal region being assigned via round-robin. This\n approach did not work well when extracting lots of parallelism from\n the JR loop and was often suboptimal even for smaller degrees of\n parallelism. This commit implements tile-level load balancing (tlb) in\n which the IR loop is effectively subjugated in service of more\n equitably dividing work in the JR loop. This approach is especially\n potent for certain situations where the diagonal region of the MC x NR\n panel of C is significant relative to the entire region. However, it\n also seems to benefit many problem sizes of other level-3 operations\n (excluding trsm, which has an inherent algorithmic dependency in the\n IR loop that prevents the application of tlb). For now, tlb is\n implemented as _var2b.c macrokernels for gemm (which forms the basis\n for gemm, hemm, and symm), gemmt (which forms the basis of herk,\n her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and\n trmm3). Which function pointers (_var2() or _var2b()) are embedded in\n the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp\n macro is defined, which is controlled by the value passed to the\n existing --thread-part-jrir=METHOD (or -r METHOD) configure option.\n This script adds 'tlb' as a valid option alongside the previously\n supported values of 'slab' and 'rr'. 
('slab' is still the default.)\n Thanks to Leick Robinson for abstractly inspiring this work, and to\n Minh Quan Ho for inquiring (in PR #562, and before that in Issue #437)\n about the possibility of improved load balance in macrokernel loops,\n and even prototyping what it might look like, long before I fully\n understood the problem.\n- In bli_thread_range_weighted_sub(), tweaked the way we compute the\n area of the current MC x NC trapezoidal panel of C by better taking\n into account the microtile structure along the diagonal. Previously,\n it was an underestimate, as it assumed MR = NR = 1 (that is, it\n assumed that the microtile column of C that overlapped with microtiles\n exactly coincided with the diagonal). Now, we only assume MR = NR.\n This is still a slight underestimate when MR != NR, so the additional\n area is scaled by 1.5 in a hackish attempt to compensate for this, as\n well as other additional effects that are difficult to model (such as\n the increased cost of writing to temporary tiles before finally\n updating C). The net effect of this better estimation of the\n trapezoidal area should be (on average) slightly larger regions\n assigned to threads that have little or no overlap with the diagonal\n region (and correspondingly slightly smaller regions in the diagonal\n region), which we expect will lead to slightly better load balancing\n in most situations.\n- Spun off the contents of bli_thread.[ch] that relate to computing\n thread ranges into one of three source/header file pairs:\n - bli_thread_range.[ch], which define functions that are not specific\n to the jr/ir loops;\n - bli_thread_range_slab_rr.[ch], which define functions that implement\n slab or round-robin partitioning for the jr/ir loops;\n - bli_thread_range_tlb.[ch], which define functions that implement\n tlb for the jr/ir loops.\n- Fixed the computation of a_next in the last iteration of the IR loop\n in bli_gemmt_l_ker_var2(). 
Previously, it always \"wrapped\" back around\n to the first micropanel of the current MC x KC packed block of A.\n However, this is almost never actually the micropanel that is used\n next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next\n correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use\n in the upper-stored case (which *does* actually always choose the\n first micropanel of A as its a_next at the end of the IR loop).\n- Removed adjustments for a_next/b_next (a2/b2) for the diagonal-\n intersecting case of gemmt_l_ker_var2() and the above-diagonal case\n of gemmt_u_ker_var2() since these cases will only coincide with the\n last iteration of the IR loop in very small problems.\n- Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of\n which explicitly considers whether the current microtile is the last\n tile that intersects the diagonal. (The former does the same, but the\n computation coincides with the original bli_is_last_iter().) These\n functions are now used in gemmt to test when a_next (or a2) should\n \"wrap\" (as discussed above). 
Also defined bli_is_last_iter_tlb_l()\n and bli_is_last_iter_tlb_u(), which are similar to the aforementioned\n functions but are used when employing tlb in gemmt.\n- Redefined macros in bli_packm_thrinfo.h, which test whether an\n iteration of work is assigned to a thread, as static inline functions\n in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).\n In the process of redefining these macros, I also renamed them from\n bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().\n- Renamed\n bli_thread_range_jrir_rr() -> bli_thread_range_rr()\n bli_thread_range_jrir_sl() -> bli_thread_range_sl()\n bli_thread_range_jrir() -> bli_thread_range_slrr()\n- Renamed\n bli_is_last_iter() -> bli_is_last_iter_slrr()\n- Defined\n bli_info_get_thread_jrir_tlb()\n and renamed:\n - bli_info_get_thread_part_jrir_slab() ->\n bli_info_get_thread_jrir_slab()\n - bli_info_get_thread_part_jrir_rr() ->\n bli_info_get_thread_jrir_rr()\n- Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism\n into the JR loop when tlb is enabled for non-trsm level-3 operations.\n- Added a sanity check to prevent bli_prune_unref_mparts() from being\n used on packed objects. This prohibition is necessary because the\n current implementation does not take into account the atomicity of\n packed micropanel widths relative to the diagonal of structured\n matrices. That is, the function prunes greedily without regard to\n whether doing so would prune off part of a micropanel *which has\n already been packed* and assigned to a thread for inclusion in the\n computation.\n- Further restricted early returns in bli_prune_unref_mparts() to\n situations where the primary matrix is not only of general structure\n but also dense (in terms of its uplo_t value). The addition of the\n matrix's dense-ness to the conditional is required because gemmt is\n somewhat unusual in that its C matrix has general structure but is\n marked as lower- or upper-stored via its uplo_t. 
By only checking\n for general structure, attempts to prune gemmt C matrices would\n incorrectly result in early returns, even though that operation\n effectively treats the matrix as symmetric (and stored in only one\n triangle).\n- Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges\n were computed when 1 < bf. Thankfully, this bug was not yet\n manifesting since all current invocations used bf == 1.\n- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2()\n that would perform incorrect pruning of unreferenced regions above\n where the diagonal of a lower-stored matrix intersects the right edge.\n Thankfully, the bug was not harming anything since those unreferenced\n regions were being pruned prior to the macrokernel.\n- Rewrote slab/rr-based gemmt macrokernels so that they no longer carved\n C into rectangular and diagonal regions prior to parallelizing each\n separately. The new macrokernels use a unified loop structure where\n quadratic (slab) partitioning is used.\n- Updated all level-3 macrokernels to have a more uniform coding style,\n such as wrt combining variable declarations with initializations as\n well as the use of const.\n- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and\n bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and\n bli_thrinfo_thread_id(), respectively. This change probably should\n have been included in aeb5f0c.\n- Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that\n corresponded to functions that were removed in aeb5f0c.\n- Other very minor cleanups.\n- Comment updates.\n- (cherry picked from commit 2e1ba9d13c23a06a7b6f8bd326af428f7ea68c31)","shortMessageHtmlLink":"Tile-level partitioning in jr/ir loops (ex-trsm). (#695)"}},{"before":null,"after":"656463948e236e529e9c9cf48fe60d5a62da2119","ref":"refs/heads/stable-jan6-cand0","pushedAt":"2024-05-20T20:04:10.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Refactor structure awareness in packm_blk_var1.c. (#707)\n\nDetails:\n- Factored some of the structure awareness out of the loop in\n bli_packm_blk_var1(). So instead of having a single loop with\n conditionals in the body to handle various kinds of structure (and\n stored/unstored submatrix placement), we now have a conditional branch\n to handle various structure/storage scenarios with a loop in each\n section. This change was originally motivated to choose slab or round-\n robin partitioning (in the context of triangular matrices) based on\n the structure of the entire block (or panel) being packed rather than\n each micropanel individually. Previously, the code would attempt to\n limit rr to the portion of the block that intersects the diagonal and\n use slab for the remainder. However, that approach was not well-thought\n out and in many situations this would lead to inferior load balancing\n when compared to using round-robin for the entire block (or panel).\n This commit has the added benefit of incurring less overhead during\n the packing process now that each of the new loops is simpler.\n- (cherry picked from commit b6735ca26b9d459d9253795dc5841ae8de9e84c9)\n\nSwitch to l3 sup decorator in gemmlike sandbox. (#704)\n\nDetails:\n- Modified the gemmlike sandbox to call bli_l3_sup_thread_decorator()\n rather than a local analogue of that code. This reduces redundant\n logic and makes it easier for the sandbox to inherit future\n improvements to the framework's threading code.\n- Moved addon/gemmd to addon/old/gemmd. This code has fallen out of date\n and is taking too much effort to maintain. We will very likely\n reimplement it completely once future changes are made to the\n framework proper.\n- (cherry picked from f956b79922da412791e4c8b8b846b3aafc0a5ee0)","shortMessageHtmlLink":"Refactor structure awareness in packm_blk_var1.c. 
(#707)"}},{"before":null,"after":"b8ffda1e2df635b6c624917bd0ef8de629756691","ref":"refs/heads/stable-dec16-cand0","pushedAt":"2024-05-20T19:12:05.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Skip 1m optimization when forcing hemm_l/symm_l. (#697)\n\nDetails:\n- Fixed a bug in right-sided hemm when:\n - using the 1m method,\n - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration,\n and\n - the storage of C matches the gemm microkernel IO preference PRIOR to\n the right-sidedness being detected and recast in terms of the left-\n side code path.\n It turns out that bli_gemm_ind_recast_1m_params() was applying its\n optimization (recasting a complex-domain macrokernel calling a 1m\n virtual microkernel to a real-domain macrokernel calling the real-\n domain microkernel) in situations in which it should not have. The\n optimization was silently assuming that the storage of C always\n matched that of the microkernel preference, since the front-end (in\n this case, bli_hemm_front()) would have already had a chance to\n transpose the operation to bring the two into agreement. However, by\n disabling right-sided hemm, we deprive BLIS of that flexibility (as a\n transposed left-sided case would necessarily have to become a right-\n sided case), and thus the assumption was no longer holding in all\n cases. Thanks to Nisanth M P for reporting this bug in Issue #621.\n- The aforementioned bug, and its bugfix, also apply to symm when\n BLIS_DISABLE_SYMM_RIGHT is defined.\n- Comment updates.\n- CREDITS file update.\n- (cherry picked from commit 3accacf57d11e9b109339754f91bf22329b6cb6a)\n\nFixed perf of mt sup with packing, and mt gemmlike. 
(#696)\n\nDetails:\n- Brought the gemmsup code path up to date relative to the latest\n thrinfo_t semantics introduced in the October Omnibus commit\n (aeb5f0c). This was done by passing the prenode (instead of the\n current node) into the packm variant within bli_l3_sup_packm.c as well\n as creating the prenodes and attaching them to the thrinfo_t tree in\n bli_l3_sup_thrinfo_create(). These changes erase the performance\n degradation introduced in the omnibus when running multithreaded sup\n with optional packing enabled. Special thanks to Devin Matthews for\n sussing out this fix in short order.\n- Fixed the gemmlike sandbox in a manner similar to that of sup with\n packing, described above. This also involved passing the prenode into\n the local gemmlike packm variant. (Recall that gemmlike recycles the\n use of bli_l3_sup_thrinfo_create(), so it automatically inherits that\n part of the sup fix described above.)\n- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and\n bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and\n bli_thrinfo_thread_id(), respectively.\n- (cherry picked from 4833ba224eba54df3f349bcb7e188bcc53442449)\n\nFixed _gemm_small() prototype; disabled gemm_small.\n\nDetails:\n- Fixed a mismatch between the prototype for bli_gemm_small() in\n bli_gemm_front.h and the actual definition of bli_gemm_small() in\n kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring\n the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting\n this issue.\n- Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM\n macro definitions in config/zen3/bli_family_zen3.h. 
AMD's small matrix\n implementation should probably remain disabled in vanilla BLIS, at\n least for now.\n- (cherry picked from db10dd8e11a12d85017f84455558a82c0093b1da)\n\nTrivial whitespace/comment tweaks.\n\nDetails:\n- Trivial whitespace and comment changes, most of which ideally would\n have been part of the previous commit pertaining to HPX (2b05948).\n- (cherry picked from f0337b784d164ae505ca0e11277a1155680500d1)\n\nblis support for hpx (#682)\n\n- Implement threading backend via HPX.\n- HPX is an asynchronous many task runtime system used in high\n performance computing applications. The runtime implements the\n ISO C++ parallelism specification and provides a user-space\n thread implementation.\n- This PR provides BLIS a thread backend implementation using HPX\n and resolves feature request #681. The configuration script,\n makefiles, and testsuite have been updated to support an HPX\n build option. The addition of HPX support provides other\n developers an exemplar for integrating other C++ threading\n backends into BLIS.\n- (cherry picked from 2b05948ad2c9785bc53f376d53a7141cbc917447)\n\nFixed subtle barrier_fpa bug in bli_thrcomm.c. (#690)\n\nDetails:\n- In bli_thrcomm.c, correctly initialize the BLIS_OPENMP element of the\n barrier function pointer array (barrier_fpa) to NULL when\n BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the\n BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is\n not enabled. This bug was introduced in a1a5a9b and was likely the\n result of an incomplete edit. 
The effects of the bug would have\n likely manifested when querying a thrcomm_t that was initialized with\n a timpl_t value corresponding to a threading implementation that was\n omitted from the -t option at configure-time.\n- (cherry picked from e1ea25da43508925e33d4e57e420cfc0a9de793f)\n\nEnhance emacs formatting of C files to remove trailing whitespace and ensure\n a newline at the end of file\n- (cherry picked from dc6e5f3f5770074ba38554541b8b64711a68c084)\n\nDelete mpi_test garbage. (#689)\n\nDetails:\n- tlrmchlsmth: \"What even is this? No comments, no commit message, not\n used by anything. Trash.\"\n- (cherry picked from 713d078075a4a563a43d83fd0880ab5091c2e4a4)\n\nSome decluttering of the top-level directory.\n\nDetails:\n- Relocated 'mpi_test' directory to test/mpi_test.\n- Relocated 'so_version' and 'version' files from top-level directory to\n 'build' directory.\n- Updated build/bump-version.sh script to accommodate relocation of\n 'version' file to 'build' directory.\n- Updated configure script to accommodate relocation of 'so_version'\n file to 'build' directory.\n- Updated INSTALL file to replace pointers to blis-devel mailing list\n with a pointer to docs/Discord.md.\n- Updated RELEASING file to contain a reminder to consider whether the\n so_version file should be updated prior to the release.\n- (cherry picked from 8d813f7f12732d52c95570ae884d5defbfd19234)\n\nFix typo in configure --help text. (#686)\n\nDetails:\n- Fixed a misspelling in the --help description for the --int-size (-i)\n configure option.\n- (cherry picked from 6774bf08c92fc6983706a91bbb93b960e8eef285)\n\nSupport --nosup, --sup configure options. (#684)\n\nDetails:\n- Added --nosup and --sup as alternative ways of requesting that sup be\n disabled or enabled. These are analogous to --disable-sup-handling and\n --enable-sup-handling, respectively. 
(I got tired of typing out\n --disable-sup-handling and needed a shorthand notation.)\n- Tweaked message output by configure when sup is enabled/disabled for\n clarity and specificity.\n- Whitespace changes.\n- (cherry picked from edcc2f9940449f7d9cefcfc02159d27b013e7995)\n\nAdd mention of Wilkinson Prize to README.md. (#683)\n\nDetails:\n- Added blurbs and links to Wilkinson Prize to README.md.\n- Added mention of both Best Paper and Wilkinson Prizes to the top of\n README.md.\n- Other minor tweaks.\n- (cherry picked from 5eea6ad9eb25f37685d1ae4ae08c73cd1daca297)","shortMessageHtmlLink":"Skip 1m optimization when forcing hemm_l/symm_l. (#697)"}},{"before":"aa62b3dd8d6be878d48329845636799e49c610c7","after":null,"ref":"refs/heads/plugins","pushedAt":"2024-05-15T18:38:00.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":"4cf2a99832c7e2c572493d358d972ed3da3b0f4e","after":null,"ref":"refs/heads/stable-oct27-cand4","pushedAt":"2024-05-15T18:27:23.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":"a72680080dc446ec4f948a9b6be114f77d5ed8b1","after":null,"ref":"refs/heads/stable-oct27-cand3","pushedAt":"2024-05-15T18:27:03.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":"efebd1fe46ecd6b814922551ffdb6fc9e936b6e9","after":null,"ref":"refs/heads/stable-oct27-cand2","pushedAt":"2024-05-15T18:26:58.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":"01e151a9658cbe07ee0cac8b03fa13fef26df19e","after":"6d0ab74f6975fdf4d19cee06d946b09b6ca89656","ref":"refs/heads/master","pushedAt":"2024-05-06T21:07:08.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Updates to README.md section on downloading.\n\nDetails:\n- Updated the text in README.md in the \"How to Download BLIS\" section.\n The new text no longer recommends that the reader use the 'master'\n branch over official releases, as the previous text did. The text was\n tweaked since (a) the 'master' branch is now akin to a development\n branch, and (b) the reader will no longer forgo bugfixes by sticking\n to official releases since we will (going forward) publish bugfix\n releases for the most recent version.","shortMessageHtmlLink":"Updates to README.md section on downloading."}},{"before":"06dddf1e51ccff70d77ee8cb731c3217e70eb730","after":"01e151a9658cbe07ee0cac8b03fa13fef26df19e","ref":"refs/heads/master","pushedAt":"2024-05-06T20:40:08.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Updated RELEASING file; fixes to ReleaseNotes.md.\n\nDetails:\n- Updated RELEASING file to reflect new release protocols, given the\n more sophisticated policy of maintaining release candidate branches\n separate from 'master' (which is now more akin to a development\n branch). Further refinements to this file will likely follow.\n- Fixed typos in ReleaseNotes.md. 
Thanks to Robert van de Geijn for\n reporting these.","shortMessageHtmlLink":"Updated RELEASING file; fixes to ReleaseNotes.md."}},{"before":null,"after":"49af2243c2a60ed8fedb44f237f4ec100465cd89","ref":"refs/heads/1.0-final","pushedAt":"2024-05-06T19:19:01.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"ReleaseNotes.md update.\n\nDetails:\n- (cherry picked from commit 06dddf1e51ccff70d77ee8cb731c3217e70eb730)\n\nCHANGELOG update (1.0)\n\nDetails:\n- (cherry picked from commit a876918c8c79a1c3d3d95de1f283350b7249b8ae)\n\nVersion file update (1.0)\n\nDetails:\n- (cherry picked from commit c2af113c7ba6d0dcc128ba36ec6e140d89180cf3)","shortMessageHtmlLink":"ReleaseNotes.md update."}},{"before":"49af2243c2a60ed8fedb44f237f4ec100465cd89","after":null,"ref":"refs/heads/1.0","pushedAt":"2024-05-06T19:18:05.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":null,"after":"49af2243c2a60ed8fedb44f237f4ec100465cd89","ref":"refs/heads/1.0","pushedAt":"2024-05-06T19:09:13.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"ReleaseNotes.md update.\n\nDetails:\n- (cherry picked from commit 06dddf1e51ccff70d77ee8cb731c3217e70eb730)\n\nCHANGELOG update (1.0)\n\nDetails:\n- (cherry picked from commit a876918c8c79a1c3d3d95de1f283350b7249b8ae)\n\nVersion file update (1.0)\n\nDetails:\n- (cherry picked from commit c2af113c7ba6d0dcc128ba36ec6e140d89180cf3)","shortMessageHtmlLink":"ReleaseNotes.md update."}},{"before":"a876918c8c79a1c3d3d95de1f283350b7249b8ae","after":"06dddf1e51ccff70d77ee8cb731c3217e70eb730","ref":"refs/heads/master","pushedAt":"2024-05-06T18:47:52.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"ReleaseNotes.md update.","shortMessageHtmlLink":"ReleaseNotes.md update."}},{"before":"5ab286f61525f8ead35ecc258305a5ccd4ee096b","after":"a876918c8c79a1c3d3d95de1f283350b7249b8ae","ref":"refs/heads/master","pushedAt":"2024-05-06T18:38:32.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"CHANGELOG update (1.0)","shortMessageHtmlLink":"CHANGELOG update (1.0)"}},{"before":"cad51491e8a0b306015a5a02881dc2a9b60dd8d9","after":"5ab286f61525f8ead35ecc258305a5ccd4ee096b","ref":"refs/heads/master","pushedAt":"2024-05-06T18:22:25.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Added a script to help create new rc branches.\n\nDetails:\n- Added a new script, build/start-new-rc.sh, which:\n 1. Updates the version file with a new version string.\n 2. 
Commits (locally) the version string update.\n 3. Updates the CHANGELOG file with the output of 'git log'.\n 4. Commits (locally) the CHANGELOG file update.\n 5. Creates a new branch whose name is equal to \"-rc0\" where\n is the new version string.\n 6. Reminds the user to execute some final steps if everything looks\n good.\n This new script will help in the future when it's time to start a new\n release candidate branch/lineage off of 'master'. Note that this\n script is based on build/bump-version.sh (which itself may change in\n the future due to changes in the way versions/releases will be handled\n going forward).","shortMessageHtmlLink":"Added a script to help create new rc branches."}},{"before":null,"after":"7d486312c8c04afb81e2e424daf25aa65f758069","ref":"refs/heads/1.0-rc2","pushedAt":"2024-04-30T22:13:34.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Use \"-i auto\" by default in test/3 drivers.\n\nDetails:\n- Request default induced method behavior of BLIS via \"-i auto\" when\n running the standalone performance drivers in test/3 via the runme.sh\n script present in that directory. (Previously, the runme.sh script\n would use \"-i native\" by default.) This change was originally intended\n for fd1a7e3.\n- (cherry picked from commit cad51491e8a0b306015a5a02881dc2a9b60dd8d9)","shortMessageHtmlLink":"Use \"-i auto\" by default in test/3 drivers."}},{"before":null,"after":"c5ed72aac20aaf89052f5742769219a4f3efc41e","ref":"refs/heads/stable-oct27-cand5","pushedAt":"2024-04-30T22:04:17.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Omnibus PR - Oct 2023 (#678)\n\nDetails:\n- This is an \"omnibus\" commit, consisting of multiple medium-sized\n commits that affect non-trivial aspects of BLIS. The major highlights:\n - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the\n cntl_t) to the thrinfo_t object. This allows the rntm_t to be\n effectively const (although it is sometimes copied internally and\n modified to reflect different ways of parallelism). Moving the mem_t\n sets the stage for sharing a global control tree amongst all\n threads.\n - De-templatized the macrokernels for gemmt, trmm, and trsm to match\n the macrokernel for gemm, which has been de-templatized since\n 54fa28b.\n - Reimplemented bli_l3_determine_kc() by separating out the logic for\n adjusting KC based on MR/NR for triangular A and/or B into a new\n function, bli_l3_adjust_kc(). For now, this function is still called\n from bli_l3_determine_kc(), but in the future we plan to have it\n called once when constructing the control tree.\n - Refactored the level-3 thread decorator into two parts:\n - One part deals only with launching threads, each one calling a\n generic thread entry function. This code resides in frame/thread\n and constitutes the definition of bli_thread_launch(). Note that\n it is specific to the threading implementation (OpenMP, pthreads,\n single, etc.)\n - The other part deals with passing the matrix operands and related\n information into bli_thread_launch(). This is the \"l3 decorator\"\n and now resides in frame/3. It is agnostic to the threading\n implementation.\n - Modified the \"level\" of the thread control tree passed in at each\n operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was\n passed in a communicator representing the active thread teams which\n would share the available work. Now, the *parent* thread comm is\n passed in. 
The operation then grabs the child comm and uses it to\n partition the work. The difference is in bli_trsm_blk_var1(), where\n there are now two children nodes for this single operation (i.e. the\n thread control tree is split one level above where the control tree\n is). The sub-prenode is used for the trsm subproblem while the\n normal sub-node is used for the gemm part. Importantly, the parent\n comm is used for the barrier between them.\n- Removed cntl_t* arguments from bli_*_front() functions. These will be\n added back in the future when the control tree's creation is moved so\n that it happens much sooner (provided that bli_*_front() have not been\n absorbed into their respective bli_*_ex() functions).\n- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),\n for consistency. This includes _num_threads(), _thread_id(), _n_way(),\n _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and\n _am_chief().\n- Removed extraneous barrier from _blk_var3() of gemm and trsm.\n- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was\n misspelled.\n- (cherry picked from commit aeb5f0cc19665456e990a7ffccdb09da2e3f504b)\n\nFixed performance bug caused by redundant packing. (#680)\n\nDetails:\n- Fixed a performance bug whereby multiple threads were redundantly\n packing the same (rather than separate) micropanels. This bug was\n caused by different parts of the code using the num_threads/thread_id\n field of the thrinfo_t vs. the n_way/work_id fields. The fix was to\n standardize on the latter and provide a \"fake\" thrinfo_t sub-prenode\n in the thrinfo tree which consists of single-member thread teams. The\n single team with multiple threads node is still required since it and\n only it can be used to perform barriers and broadcasts (e.g. of the\n packed buffer pointer).\n- (cherry picked from commit 29f79f030e939969d4f3876c4fdaac7b0c5daa63)\n\nFixed random segfault in test/3 drivers. 
(#788)\n\nDetails:\n- Fixed a segfault in the non-gemm test drivers in test/3 that was the\n result of sometimes leaving either .n_str or .k_str fields of the\n params_t struct uninitialized, depending on the operation in question.\n For example, in test_hemm.c, init_def_params() would only initialize\n the .m_str and .n_str fields, but not the .k_str field. Even though\n hemm doesn't use a 'k' dimension, the proc_params() function (called\n via parse_cl_params()) universally attempts to convert all three into\n integers via sscanf(), which was understandably failing when one of\n those strings was a NULL pointer. I'm not sure how this code ever\n worked to begin with. Special thanks to Leick Robinson for finding and\n reporting this bug.\n- (cherry picked from commit 1236ddab455ef3a6293ab394ff06b3a19c2913d9)\n\nFixed staleness in kernels/zen/3/bli_gemm_small.c.\n\nDetails:\n- Added missing 'const' keyword in function prototypes for\n bli_gemm_small() and friends.\n- Updated pba usage to reflect new APIs.\n- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128\n conditional of test/3/runme.sh.\n- Thanks to Jeff Diamond for reporting these issues.","shortMessageHtmlLink":"Omnibus PR - Oct 2023 (#678)"}},{"before":null,"after":"110430f337b1db0b0d5737dc9dfc2e9004b3b2d6","ref":"refs/heads/stable-oct26-cand1","pushedAt":"2024-04-30T22:03:09.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. 
Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Add check to disable armsve on Apple M1.\n\n- (cherry picked from commit c803b03e52a7a6997a8d304a8cfa9acf7c1c555b)\n\nFix auto-detection of firestorm (Apple M1).\n\n- (cherry picked from commit 2dd692b710b6a9889f7ebdd7934a2108be5c5530)\n\nAdded Discord documentation (#677)\n\nDetails:\n- Added a docs/Discord.md markdown document that walks the reader\n through creating a Discord account, obtaining the invite link, and\n using the link to join the BLIS Discord server.\n- Updated README.md to reference the new Discord.md document in multiple\n places, including via the official Discord logo (used with explicit\n permission from representatives at Discord Inc.).\n- (cherry picked from commit 88105dbecf0f9dfbfa30215743346e8bd6afb971)\n\nShuffled checked properties in bli_l3_check.c. (#676)\n\nDetails:\n- Added certain checks for matrix structure to the level-3 operations'\n _check() functions, and slightly reorganized existing checks.\n- (cherry picked from commit 23f5b8df3e802a27bacd92571184ec57bbdfa646)\n\nCREDITS file update.\n\nDetails:\n- This attribution was intended to go in PR #647.\n- (cherry picked from commit 9453e0f163503f64a290256b4be53d8882224863)\n\nReinstate sanity check in bli_pool_finalize. (#671)\n\nDetails:\n- Added a reinit argument to bli_pool_finalize(). This bool will signal\n whether or not the function is being called from bli_pool_reinit(). If\n it is not being called from _reinit(), we can safely check to confirm\n that .top_index == 0 (i.e., all blocks have been checked in). 
But if\n it *is* being called from _reinit(), then that check will be skipped\n since one of the predicted use cases for bli_pool_reinit() anticipates\n that some blocks are (probably) checked out when the pool_t is\n reinitialized.\n- Updated existing invocations of bli_pool_finalize() to pass in either\n FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or\n TRUE (from bli_pool_reinit()) for the new reinit argument.\n- (cherry picked from commit 76a23bd8c33e161221891935a489df9a9fb9c8c0)\n\nFix some bugs in bli_pool.c (#670)\n\nDetails:\n- Add a check for premature pool exhaustion when checking in blocks via\n bli_pool_checkin_block(). This detects \"double-free\" and other bad\n conditions that don't necessarily result in a segfault.\n- Make sure to copy all block pointers when growing the pool size.\n Previously, checked-out block pointers (which are guaranteed to be set\n to NULL) were not being copied, leading to the presence of\n uninitialized data.\n- (cherry picked from commit 63470b49e3b9b15e00a8f666e86ccd70c6005fe9)\n\nAdd AddressSanitizer (-fsanitize=address) option. (#669)\n\nDetails:\n- Added support for AddressSanitizer (ASan), a compiler-integrated\n memory error detector. The option (disabled by default) enables\n compiling and linking with the -fsanitize=address flag supported by\n clang, gcc, and probably others. This flag is employed during\n compilation of all BLIS source files *except* for optimized kernels,\n which are exempted because ASan usually requires an extra register,\n which violates the constraints for many gemm microkernels.\n- Minor whitespace, comment, ordering, and configure help text updates.\n- (cherry picked from commit 42d0e66318b186d25eeb215b40ce26115401ed8b)\n\nAdd consistent NaN/Inf handling in sumsqv. 
(#668)\n\nDetails:\n- Changed sumsqv implementation as follows:\n - If there is a NaN (either real or imaginary), then return a sum of\n NaN and unit scale.\n - Else, if there is an Inf (either real or imaginary), then return a\n sum of +Inf and unit scale.\n - Otherwise behave as normal.\n- (cherry picked from commit b861c71b50c6d48cb07282f44aa9dddffc1f1b3f)\n\nParameterized test/3 drivers via command line args. (#667)\n\nDetails:\n- Rewrote the drivers in test/3, the Makefile, and the runme.sh script\n so that most of the important parameters, including parameter combo,\n datatype, storage combo, induced method, problem size range, dimension\n bindings, number of repeats, and alpha/beta values can be passed in\n via command line arguments. (Previously, most of these parameters were\n hard-coded into the driver source, except a few that were hard-coded\n into the Makefile.) If no argument is given for any particular option,\n it will be assigned a sane default. Either way, the values employed at\n runtime will be printed to stdout before the performance data in a\n section that is commented out with '%' characters (which is used by\n matlab and octave for comments), unless the -q option is given, in\n which case the driver will proceed quietly and output only performance\n data. Each driver also provides extensive help via the -h option, with\n the help text tailored for the operation in question (e.g. gemm, hemm,\n herk, etc.). In this help text, the driver reminds the user which\n implementation it was linked to (e.g. 
blis, openblas, vendor, eigen).\n Thanks to Jeff Diamond for suggesting this CLI-based reimagining of\n the test/3 drivers.\n- In the test/3 drivers: converted cpp macro string constants, as well\n as two string literals (for the opname and pc_str) used in each test\n driver, to global (or static) const char* strings, and replaced the\n use of strncpy() for storing the results of the command line argument\n parsing with pointer copies from the corresponding strings in argv.\n This works because the argv array is guaranteed by the C99 standard\n to persist throughout the life of the program. This new approach uses\n less storage and executes faster. Thanks to Minh Quan Ho for\n recommending this change.\n- Renamed the IMP_STR cpp macro that gets defined on the command line,\n via the test/3/Makefile, to IMPL_STR.\n- Updated runme.sh to set the problem size ranges for single-threaded\n and multithreaded execution independently from one another, as well as\n on a per-system basis.\n- Added a 'quiet' variable to runme.sh that can easily toggle quiet mode\n for the test drivers' output.\n- Very minor typecast fix in call to bli_getopt() in bli_utils.c.\n- In bli_getopt(), changed the nextchar variable from being a local\n static variable to a field of the getopt_t state struct. (Not sure why\n it was ever declared static to begin with.)\n- Other minor changes to bli_getopt() to accommodate the rewritten test\n drivers' command line parsing needs.\n- (cherry picked from commit ee81efc7887374c974a78bfb3e0865776b2f97a8)\n\nAllow test/3 drivers to use default ind_t method. (#804)\n\nDetails:\n- Previously, the standalone performance drivers in test/3 were written\n under the assumption that the user would want to explicitly test\n either native execution *or* 1m. 
But because the accompanying runme.sh\n script defaults to passing \"native\" in for the -i command line option\n (which explicitly sets the induced method type), running the script\n without modification causes the test drivers to use slow reference\n microkernels on systems where native complex-domain microkernels are\n not registered -- which will yield poor performance for complex-domain\n level-3 operations. Furthermore, even if a user was aware of this, the\n test drivers did not support any single value for the -i option that\n would test BLIS using the library's default behavior -- that is, using\n 1m on systems where it is needed and native execution on systems that\n have native microkernels implemented and registered.\n- This commit addresses the aforementioned issue by supporting a new\n value for the -i option: \"auto\". The \"auto\" value causes the driver\n to avoid explicitly setting the induced method altogether, leaving\n BLIS's default behavior in place. This \"auto\" option is also now the\n default setting within the runme.sh script. Thanks to Leick Robinson\n for finding and reporting this issue.\n- Also added support for \"nat\" as a shorthand for \"native\", which\n the help text already (erroneously) claimed was supported.\n- (cherry picked from commit fd1a7e3ca9547718aa61c806848099705216182b)\n\nUse \"-i auto\" by default in test/3 drivers.\n\nDetails:\n- Request default induced method behavior of BLIS via \"-i auto\" when\n running the standalone performance drivers in test/3 via the runme.sh\n script present in that directory. (Previously, the runme.sh script\n would use \"-i native\" by default.) 
This change was originally intended\n for fd1a7e3.\n- (cherry picked from commit cad51491e8a0b306015a5a02881dc2a9b60dd8d9)","shortMessageHtmlLink":"Add check to disable armsve on Apple M1."}},{"before":"fd1a7e3ca9547718aa61c806848099705216182b","after":"cad51491e8a0b306015a5a02881dc2a9b60dd8d9","ref":"refs/heads/master","pushedAt":"2024-04-30T21:51:08.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Use \"-i auto\" by default in test/3 drivers.\n\nDetails:\n- Request default induced method behavior of BLIS via \"-i auto\" when\n running the standalone performance drivers in test/3 via the runme.sh\n script present in that directory. (Previously, the runme.sh script\n would use \"-i native\" by default.) This change was originally intended\n for fd1a7e3.","shortMessageHtmlLink":"Use \"-i auto\" by default in test/3 drivers."}},{"before":"685dcb53e2d04ba9879360f2a2da831c88274ee6","after":"3b0e244e2e98e90f9aa68a4c987fd3b0d66d6b0b","ref":"refs/heads/adat","pushedAt":"2024-04-27T05:16:12.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"devinamatthews","name":"Devin Matthews","path":"/devinamatthews","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5246113?s=80&v=4"},"commit":{"message":"Implement A*D*B (gemdm), A*D*A^{T,H} (syrkd/herkd), and A*D*B^{T,H}+B*D*A^{T,H} (syrk2d/her2kd) operations (and gemdmt for good measure). 
Some rough edges still:\n\n- Complex herk2 and her2kd will not work.\n- No mixed-type/mixed-domain or 1m.\n- Not integrated into testsuite yet.","shortMessageHtmlLink":"Implement A*D*B (gemdm), A*D*A^{T,H} (syrkd/herkd), and A*D*B^{T,H}+B…"}},{"before":"06829d0fdbfa442741585c9ce215ac466b417acd","after":"129ce041808a49855934427109bda3b02dbd8b8e","ref":"refs/heads/skew-blas","pushedAt":"2024-04-26T23:10:37.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"devinamatthews","name":"Devin Matthews","path":"/devinamatthews","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5246113?s=80&v=4"},"commit":{"message":"Trigger CI build","shortMessageHtmlLink":"Trigger CI build"}},{"before":null,"after":"06829d0fdbfa442741585c9ce215ac466b417acd","ref":"refs/heads/skew-blas","pushedAt":"2024-04-26T22:56:17.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"devinamatthews","name":"Devin Matthews","path":"/devinamatthews","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5246113?s=80&v=4"},"commit":{"message":"Typo fix in `configure`\n\n[ci skip]","shortMessageHtmlLink":"Typo fix in configure"}},{"before":"f51d4739d7bfeb253eb044fdaae658e274c47f10","after":"453c60b6ab59af65b0306d75a9b904ab9be362a6","ref":"refs/heads/sk","pushedAt":"2024-04-26T22:55:51.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"devinamatthews","name":"Devin Matthews","path":"/devinamatthews","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5246113?s=80&v=4"},"commit":{"message":"Update `gemmt` test\n\nThe `gemmt` test in the testsuite currently doesn't detect improper modification of the unstored region. 
This commit attempts to correct this problem.","shortMessageHtmlLink":"Update gemmt test"}},{"before":"005dcce97b96114183d409e1b7e0301510b0e0c6","after":"f51d4739d7bfeb253eb044fdaae658e274c47f10","ref":"refs/heads/sk","pushedAt":"2024-04-26T19:16:50.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"devinamatthews","name":"Devin Matthews","path":"/devinamatthews","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5246113?s=80&v=4"},"commit":{"message":"Update `gemmt` test\n\nThe `gemmt` test in the testsuite currently doesn't detect improper modification of the unstored region. This commit attempts to correct this problem.","shortMessageHtmlLink":"Update gemmt test"}},{"before":null,"after":"4cf2a99832c7e2c572493d358d972ed3da3b0f4e","ref":"refs/heads/stable-oct27-cand4","pushedAt":"2024-04-25T20:14:17.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Omnibus PR - Oct 2023 (#678)\n\nDetails:\n- This is an \"omnibus\" commit, consisting of multiple medium-sized\n commits that affect non-trivial aspects of BLIS. The major highlights:\n - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the\n cntl_t) to the thrinfo_t object. This allows the rntm_t to be\n effectively const (although it is sometimes copied internally and\n modified to reflect different ways of parallelism). Moving the mem_t\n sets the stage for sharing a global control tree amongst all\n threads.\n - De-templatized the macrokernels for gemmt, trmm, and trsm to match\n the macrokernel for gemm, which has been de-templatized since\n 54fa28b.\n - Reimplemented bli_l3_determine_kc() by separating out the logic for\n adjusting KC based on MR/NR for triangular A and/or B into a new\n function, bli_l3_adjust_kc(). 
For now, this function is still called\n from bli_l3_determine_kc(), but in the future we plan to have it\n called once when constructing the control tree.\n - Refactored the level-3 thread decorator into two parts:\n - One part deals only with launching threads, each one calling a\n generic thread entry function. This code resides in frame/thread\n and constitutes the definition of bli_thread_launch(). Note that\n it is specific to the threading implementation (OpenMP, pthreads,\n single, etc.)\n - The other part deals with passing the matrix operands and related\n information into bli_thread_launch(). This is the \"l3 decorator\"\n and now resides in frame/3. It is agnostic to the threading\n implementation.\n - Modified the \"level\" of the thread control tree passed in at each\n operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was\n passed in a communicator representing the active thread teams which\n would share the available work. Now, the *parent* thread comm is\n passed in. The operation then grabs the child comm and uses it to\n partition the work. The difference is in bli_trsm_blk_var1(), where\n there are now two children nodes for this single operation (i.e. the\n thread control tree is split one level above where the control tree\n is). The sub-prenode is used for the trsm subproblem while the\n normal sub-node is used for the gemm part. Importantly, the parent\n comm is used for the barrier between them.\n- Removed cntl_t* arguments from bli_*_front() functions. These will be\n added back in the future when the control tree's creation is moved so\n that it happens much sooner (provided that bli_*_front() have not been\n absorbed into their respective bli_*_ex() functions).\n- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),\n for consistency. 
This includes _num_threads(), _thread_id(), _n_way(),\n _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and\n _am_chief().\n- Removed extraneous barrier from _blk_var3() of gemm and trsm.\n- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was\n misspelled.\n- (cherry picked from commit aeb5f0cc19665456e990a7ffccdb09da2e3f504b)\n\nFixed performance bug caused by redundant packing. (#680)\n\nDetails:\n- Fixed a performance bug whereby multiple threads were redundantly\n packing the same (rather than separate) micropanels. This bug was\n caused by different parts of the code using the num_threads/thread_id\n field of the thrinfo_t vs. the n_way/work_id fields. The fix was to\n standardize on the latter and provide a \"fake\" thrinfo_t sub-prenode\n in the thrinfo tree which consists of single-member thread teams. The\n single team with multiple threads node is still required since it and\n only it can be used to perform barriers and broadcasts (e.g. of the\n packed buffer pointer).\n- (cherry picked from commit 29f79f030e939969d4f3876c4fdaac7b0c5daa63)\n\nFixed random segfault in test/3 drivers. (#788)\n\nDetails:\n- Fixed a segfault in the non-gemm test drivers in test/3 that was the\n result of sometimes leaving either .n_str or .k_str fields of the\n params_t struct uninitialized, depending on the operation in question.\n For example, in test_hemm.c, init_def_params() would only initialize\n the .m_str and .n_str fields, but not the .k_str field. Even though\n hemm doesn't use a 'k' dimension, the proc_params() function (called\n via parse_cl_params()) universally attempts to convert all three into\n integers via sscanf(), which was understandably failing when one of\n those strings was a NULL pointer. I'm not sure how this code ever\n worked to begin with. 
Special thanks to Leick Robinson for finding and\n reporting this bug.\n- (cherry picked from commit 1236ddab455ef3a6293ab394ff06b3a19c2913d9)\n\nFixed staleness in kernels/zen/3/bli_gemm_small.c.\n\nDetails:\n- Added missing 'const' keyword in function prototypes for\n bli_gemm_small() and friends.\n- Updated pba usage to reflect new APIs.\n- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128\n conditional of test/3/runme.sh.\n- Thanks to Jeff Diamond for reporting these issues.\n\nAllow test/3 drivers to use default ind_t method. (#804)\n\nDetails:\n- Previously, the standalone performance drivers in test/3 were written\n under the assumption that the user would want to explicitly test\n either native execution *or* 1m. But because the accompanying runme.sh\n script defaults to passing \"native\" in for the -i command line option\n (which explicitly sets the induced method type), running the script\n without modification causes the test drivers to use slow reference\n microkernels on systems where native complex-domain microkernels are\n not registered -- which will yield poor performance for complex-domain\n level-3 operations. Furthermore, even if a user was aware of this, the\n test drivers did not support any single value for the -i option that\n would test BLIS using the library's default behavior -- that is, using\n 1m on systems where it is needed and native execution on systems that\n have native microkernels implemented and registered.\n- This commit addresses the aforementioned issue by supporting a new\n value for the -i option: \"auto\". The \"auto\" value causes the driver\n to avoid explicitly setting the induced method altogether, leaving\n BLIS's default behavior in place. This \"auto\" option is also now the\n default setting within the runme.sh script. 
Thanks to Leick Robinson\n for finding and reporting this issue.\n- Also added support for \"nat\" as a shorthand for \"native\", which\n the help text already (erroneously) claimed was supported.\n- (cherry picked from commit fd1a7e3ca9547718aa61c806848099705216182b)","shortMessageHtmlLink":"Omnibus PR - Oct 2023 (#678)"}},{"before":"2eb98b0764a92730e5d97545f03e4e963a845ffc","after":null,"ref":"refs/heads/test3-default-ind-fix","pushedAt":"2024-04-25T20:01:03.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":"a49238e6141c96a41aa3c2a4adb0b0663d0b4968","after":"fd1a7e3ca9547718aa61c806848099705216182b","ref":"refs/heads/master","pushedAt":"2024-04-25T20:00:59.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Allow test/3 drivers to use default ind_t method. (#804)\n\nDetails:\r\n- Previously, the standalone performance drivers in test/3 were written\r\n under the assumption that the user would want to explicitly test\r\n either native execution *or* 1m. But because the accompanying runme.sh\r\n script defaults to passing \"native\" in for the -i command line option\r\n (which explicitly sets the induced method type), running the script\r\n without modification causes the test drivers to use slow reference\r\n microkernels on systems where native complex-domain microkernels are \r\n not registered -- which will yield poor performance for complex-domain\r\n level-3 operations. 
Furthermore, even if a user was aware of this, the \r\n test drivers did not support any single value for the -i option that \r\n would test BLIS using the library's default behavior -- that is, using \r\n 1m on systems where it is needed and native execution on systems that \r\n have native microkernels implemented and registered.\r\n- This commit addresses the aforementioned issue by supporting a new\r\n value for the -i option: \"auto\". The \"auto\" value causes the driver\r\n to avoid explicitly setting the induced method altogether, leaving\r\n BLIS's default behavior in place. This \"auto\" option is also now the\r\n default setting within the runme.sh script. Thanks to Leick Robinson\r\n for finding and reporting this issue.\r\n- Also added support for \"nat\" as a shorthand for \"native\", which\r\n the help text already (erroneously) claimed was supported.","shortMessageHtmlLink":"Allow test/3 drivers to use default ind_t method. (#804)"}},{"before":"a18252d28de85b4bf738aee6e60c4adcb66cf9cc","after":null,"ref":"refs/heads/new_control_trees","pushedAt":"2024-04-25T19:56:42.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"}},{"before":null,"after":"2eb98b0764a92730e5d97545f03e4e963a845ffc","ref":"refs/heads/test3-default-ind-fix","pushedAt":"2024-04-24T22:12:55.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fgvanzee","name":"Field G. Van Zee","path":"/fgvanzee","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5487570?s=80&v=4"},"commit":{"message":"Allow test/3 drivers to use default ind_t method.\n\nDetails:\n- Previously, the standalone performance drivers in test/3 were written\n under the assumption that the user would want to explicitly test\n either native execution *or* 1m. 
But because the accompanying runme.sh\n script defaults to passing \"native\" in for the -i command line option\n (which explicitly sets the induced method type), using it without\n modification causes the test drivers to use reference microkernels on\n systems where native complex-domain microkernels are not registered.\n Furthermore, even if a user was aware of this, the test drivers did\n not support any single value for the -i option that would test BLIS\n using the library's default behavior -- that is, using 1m on systems\n where it is needed and native execution on systems that have native\n microkernels.\n- This commit addresses the aforementioned issue by supporting a new\n value for the -i option: \"auto\". The \"auto\" value causes the driver\n to avoid explicitly setting the induced method altogether, leaving\n BLIS's default behavior in place. This \"auto\" option is now the\n default setting within the runme.sh script. Thanks to Leick Robinson\n for finding and reporting this issue.\n- Also added support for \"nat\" as a shorthand for \"native\", which\n the help text already (erroneously) claimed was supported.","shortMessageHtmlLink":"Allow test/3 drivers to use default ind_t method."}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAETyokqgA","startCursor":null,"endCursor":null}},"title":"Activity · flame/blis"}