Proposal to add SIMD enhancements to FreeBSD's libc. ==================================================== Robert Clausecker 2023-04-04 DRAFT Summary ------- Modern computer architectures provide SIMD (single instruction multiple data) instruction set extensions to operate on multiple data at once. Commonly used for numerical applications such as video codecs, graphics rendering, and scientific computing, use of SIMD techniques also aids in basic data processing tasks such as those implemented by libc functions. While other libc implementations already provide SIMD enhanced variants of standard libc functions [12][13][14], the FreeBSD libc largely does not. The objective of this proposal is to provide such SIMD-enhanced versions of relevant libc library functions and thus improving the performance of software linked against it. As these libc functions are used by most software available for FreeBSD, these enhancements are projected to give a broad benefit for a wide range of programs. Scope ----- The primary focus of this project is amd64. We aim to produce SIMD optimised implementations based on the architecture levels defined by the x86_64 psABI [1]: Level Features baseline CMOV CX8 FPU FXSR MMX OSFXSR SCE SSE SSE2 x86-64-v2 CMPXCHG16B LAHF-SAHF POPCNT SSE3 SSE4_1 SSE4_2 SSSE3 x86-64-v3 AVX AVX2 BMI1 BMI2 F16C FMA LZCNT MOVBE OXSAVE x86-64-v4 AVX512F AVX512BW AVX512DQ AVX512VL Multiple implementations of each routine are planned to be provided, if the routine benefits from additional instructions available at higher architecture levels. This usually means one routine for baseline or x86-64-v2 and one routine each for x86-64-v3 and x86-64-v4. A benchmark suite is planned to ascertain the impact of these routines on the performance of the libc. In future work, these routines could be adapted for i386 or ported to other architectures including arm64 (ASIMD, SVE) and ppc64/ppc64le if there is sufficient interest. Technical Details ----------------- It is planned to implement the optimised routines in assembly to ensure toolchain independency. For dynamically linked executables, the ifunc mechanism [2] is planned to be used to select the best implementation of each routine at runtime. If possible, an environment variable will be queried to permit the user to select a different architecture level or to disable SIMD enhancements altogether. For statically linked executables or when the function is called directly (e.g. from inside the libc through a hidden alias), we plan to provide dispatch trampolines. In the first call to the trampoline, the call resolves to a dispatch function determining which implementation to use. The dispatch function writes the target of the dispatch into the function pointer used by the trampoline and then tail calls the selected routine. Next iteration, the correct function is then directly called. Both mechanisms will be implemented in a thread-safe and async-signal-safe manner. The best implementation is usually the one using the highest architecture level supported by the CPU. However, hardware constraints such as thermal licensing [3] and AVX-SSE transition penalties [4] may render architecture levels v3 and v4 unattractive on some processors. To be decided: implementations may be written such that they overrun the end of strings during reads, but ensure that no page boundary is crossed. Such an overrun is harmless unless a segment limit is set but may confuse analysis tools such as valgrind(1). This is especially required for fast performance on NUL-terminated strings. Functions to be Enhanced ------------------------ HEADER FUNCTION NOTES string.h stpcpy String copy functions stpncpy strcat strncat strcpy strncpy strlcpy strlcat strstr String search functions strnstr strcasestr *for C locale only strchr strchrnul strrchr strcspn strspn strpbrk strsep String tokenisation functions strtok_r strcmp String comparison functions strncmp memcpy Memory copy functions memccpy memset Memory initialisation functions memchr Memory search functions memrchr memmem memcmp Memory comparison function swab Byte swap strlen String length timingsafe_bcmp Constant time functions timingsafe_memcmp strings.h strcasecmp *for C locale only strncasecmp *for C localy only Specific list of functions to be implemented to be decided. The locale dependent functions can only be optimised with reasonable effort for locales in which the letter case correspondence is the same as in the C locale (this e.g. excludes Turkish locales). Additional functions that will benefit from these enhancements: strings.h index *legacy, through strchr rindex *legacy, through strrchr bcmp *legacy, through memcmp bcopy *legacy, through memmove bzero *legacy, through memset explicit_bzero through memset memset_s through memset mempcpy through memmove memcpy through memmove strdup through memcpy, strlen strtok through strtok_r Qualifications -------------- The submitter is an experienced C and assembly programmer with experience in SIMD-acceleration of text processing routines. One example of his work can be found in the upcoming publication "Transcoding Unicode Characters with AVX-512 Instructions" [5][6]. The submitter is a FreeBSD ports committer. Previous ports work includes maintenance of assorted ports and general QA work to fix broken ports on armv6, armv7, aarch64, and riscv64. He has not previously worked on the libc specifically. See [7][8][9] for an overview over other open source work in which the submitter is involved. A current CV [10] is available. Technical Oversight ------------------- Mateusz Guzik volunteers to oversee the project in technical matters. He is the one to initially propose this project to the submitter. Development Process ------------------- The submitter plans to start development by writing a framework for the dispatch mechanism outlined in "Technical Details" as well as a benchmark harness for performance testing. It is planned that this benchmark harness produces output compatible with "go test", for which tooling is available in port devel/go-perf and with which the submitter is familiar. For each function to be vectorised, unit tests will be written if not already present and added to the FreeBSD test suite. The function will be hooked up into the benchmark harness. Once this is done for a given function, the development of SIMD-enhanced variants is performed. The submitter intends to perform this work by first writing one variant for each of the listed functions, revisiting the functions later for additional variants (corresponding to higher architecture level). This way, as many functions as possible benefit from some acceleration even if the project has to be cut short at a later point. The proposed order is in rising order of architecture level. Development is planned to take place in a public fork of the freebsd-src git repository. Changes are planned to be submitted following testing through the Phabricator platform [11]. Documentation ------------- The presence of SIMD-enhanced functions will be documented in a new manual page simd(7). This page will explain to the user how the libc chooses which implementation to use and how to configure this behaviour. Other manual pages such as environ(7), string(3), and bstring(3) will be enhanced with cross-references and additional information as appropriate. Internal documentation will be produced, explaining the dispatch and function selection mechanisms. As it is not planned to make these mechanisms available to user code, no end-user documentation will be produced. Additional documentation on the benchmark and test setup may be produced as needed. If requested, a final report describing the techniques used and giving the final performance improvements can be produced. Testing ------- The submitter plans to use existing unit tests for the funcions to be SIMD-enhanced where possible. For the other functions, custom unit tests will be written. These tests will be hooked up with the FreeBSD test suite to ensure testability of the SIMD enhancements. Note that by design, SIMD-enhanced implementations can only be tested on processors that support the required architecture level. Running the test suite on other processors will skip unsupported implementations. Deliverables and Timeline ------------------------- The submitter is currently pursuing a doctoral degree in Scientific Computing and intends to pursue this project at a time budget of 10 h/week. Deliverables include: - simd function dispatch framework - benchmark harness - additional unit tests - manual page simd(7) and other man page enhancements - SIMD-enhanced implementations, each as applicable - for baseline - for x86-64-v2 - for x86-64-v3 - for x86-64-v4 The submitter proposes the following project milestones: groundwork - simd function dispatch framework - benchmark harness - simd(7) draft - additional unit tests for strlen(3) - proof of concept SIMD-enhancement for strlen(3) baseline - additional unit tests for all functions - benchmark code for all functions - literature research on existing work - SIMD enhancements for x86-64 baseline and x86-64-v2 avx - SIMD enhancements for x86-64-v3 avx512 - SIMD enhancements for x86-64-v4 Each of the three milestones baseline, avx, avx512 likely encompasses writing a new SIMD implementation for each of the functions listed in section "Functions to be enhanced." Milestone "baseline" includes literature research over existing accelerations for these functions. Where appropriate and legal, such existing implementations are planned to be adapted for FreeBSD. Due to the higher amount of work, the baseline milestone can be split into two or three smaller milestones, each half or a third of the functions. It is expected that future work on other architectures will be much less involved as it can draw from the same algorithmic ideas developed in the baseline milestone. Costing ------- The submitter is based in Berlin, Germany. All costs involved in this proposal are for programmer time. The submitter will provide his own computing facilities for completing the project. If microarchitectural optimisation for a specific microarchitecture is desired, the submitter must be provided with a machine of that microarchitecture. Due to scholarship constraints, it is advantageous if the grant declares the work to be of scientific nature. Due to high uncertainty in the amount of time needed to complete this work, the submitter proposes that the grant pays 10 hours of work per week at EUR XX.XX/h until the work is completed or it is decided not to continue the grant. The submitter plans to submit monthly progress reports on work covered by the grant. References ---------- [1]: https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build [2]: https://sourceware.org/glibc/wiki/GNU_IFUNC [3]: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html [4]: https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf [5]: https://arxiv.org/abs/2212.05098 [6]: https://github.com/simdutf/simdutf/tree/clausecker [7]: https://codeberg.org/schilytools/schilytools [8]: http://fuz.su/~fuz [9]: https://github.com/clausecker [10]: http://fuz.su/~fuz/cv/cv.pdf [11]: https://reviews.freebsd.org [12]: https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64 [13]: https://android.googlesource.com/platform/bionic/+/master/libc/arch-x86_64/string/ [14]: https://git.musl-libc.org/cgit/musl/tree/src/string/x86_64