Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identical calls to getindex not optimised out #54281

Open
juliaUser42 opened this issue Apr 27, 2024 · 6 comments
Open

Identical calls to getindex not optimised out #54281

juliaUser42 opened this issue Apr 27, 2024 · 6 comments
Labels
domain:arrays [a, r, r, a, y, s] performance Must go faster

Comments

@juliaUser42
Copy link

juliaUser42 commented Apr 27, 2024

Hi,
thanks for the great job you do on julia. I think I spotted a potential performance issue. Maybe this is know or expected or I am missing something. In any case, I could not find a relevant post that helped me to make sense of this.

  1. The output of versioninfo()

Julia Version 1.10.2
Commit bd47eca (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD EPYC 7313 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
JULIA_DIR = /snap/julia/current

  1. How you installed Julia

snap on Ubuntu

  1. A minimal working example (MWE), also known as a minimum reproducible example
x::Vector{Float64} = rand(10000)
y::Vector{Float64} = rand(10000)
u::Vector{Int32} = rand(1:30000, 30000)
l::Vector{Int32} = rand(1:30000, 30000)
L::Vector{Float64} = rand(30000)
U::Vector{Float64} = rand(30000)

myFunction(x, y, u, l, L, U)
    @inbounds for f in 1:length(L)
        y[l[f]] += U[f] * x[u[f]]
        y[u[f]] += L[f] * x[l[f]]
    end
end

produces

.LBB0_24:                               # %L69.epil
	movsxd	rdx, dword ptr [r11 + 4*rdi]
	movsxd	rbx, dword ptr [rsi + 4*rdi]
	vmovsd	xmm0, qword ptr [rcx + 8*rdx - 8] # xmm0 = mem[0],zero
	vmulsd	xmm0, xmm0, qword ptr [r10 + 8*rdi]
	vaddsd	xmm0, xmm0, qword ptr [rax + 8*rbx - 8]
	vmovsd	qword ptr [rax + 8*rbx - 8], xmm0
	movsxd	rbx, dword ptr [rsi + 4*rdi]
	movsxd	rdx, dword ptr [r11 + 4*rdi]
	vmovsd	xmm0, qword ptr [rcx + 8*rbx - 8] # xmm0 = mem[0],zero
	vmulsd	xmm0, xmm0, qword ptr [r9 + 8*rdi]
	inc	rdi
	vaddsd	xmm0, xmm0, qword ptr [rax + 8*rdx - 8]
	vmovsd	qword ptr [rax + 8*rdx - 8], xmm0
	cmp	r8, rdi
	jne	.LBB0_24

The issue is that the identical calls to Base.getindex(Main.l, f) and Base.getindex(Main.u, f) are not optimised out and consequently executed twice. This is also evident in the output of @code_lowered and @code_typed.

Of course, this can be fixed easily by hand.

On the other hand, I would not expect having to do this because clang, icc and gcc perform this optimisation automatically all producing machine code similar to

myFunction(double*, double*, int*, int*, double*, double*, double*, int):
        mov     r9, qword ptr [rsp + 8]
        xor     r11d, r11d
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movsd   xmm0, qword ptr [r9 + 8*r11]    # xmm0 = mem[0],zero
        movsxd  r10, dword ptr [rdx + 4*r11]
        mulsd   xmm0, qword ptr [rdi + 8*r10]
        movsxd  rax, dword ptr [rcx + 4*r11]
        addsd   xmm0, qword ptr [rsi + 8*rax]
        movsd   qword ptr [rsi + 8*rax], xmm0
        movsd   xmm0, qword ptr [r8 + 8*r11]    # xmm0 = mem[0],zero
        mulsd   xmm0, qword ptr [rdi + 8*rax]
        addsd   xmm0, qword ptr [rsi + 8*r10]
        movsd   qword ptr [rsi + 8*r10], xmm0
        add     r11, 1
        jmp     .LBB0_1

see https://gcc.godbolt.org/z/z8e99c8oj

@oscardssmith oscardssmith added performance Must go faster domain:arrays [a, r, r, a, y, s] labels Apr 27, 2024
@Keno
Copy link
Member

Keno commented Apr 27, 2024

The C version relies on TBAA between double pointers and int pointers to prove noalias. Our current TBAA hierachy isn't fine enough to require this, although the whole ReinterpretArray change years ago was in preparation for making such a change. Still, making the change would probably require some careful evaluation, I'm sure some people are using TBAA violating workarounds.

@juliaUser42
Copy link
Author

Hi Keno,
Thank you for your quick answer. I am an engineer literate in assembly - not a compiler guy. For my purposes, it's enough to know the limits of what the compiler can do or cannot do ;-)

@oscardssmith
Copy link
Member

to remove a little bit of the jargon from Keno's answer, the optimization that "obviously" can be done is only valid if writing to y doesn't write to either u or l. In this case, this should be possible to prove via tbaa (type based alias analysis) since y is a vector of floats, while the other arrays are vectors of Ints.

@sgaure
Copy link

sgaure commented Apr 30, 2024

Perhaps there should be a possibility for the user to tell the compiler there's no aliasing? A @noalias macro applied to a method, or a loop? I can predict disastrous consequences since it's really a call site thing, not a method or instance thing, so it should probably not be used in generic code. It would be nice for the more speed aware users, though. If I recall correctly, I think I've used a fortran compiler many years ago (HP, IBM, PGI?) with a compiler flag -noaliasing with this effect.

@nsajko
Copy link
Contributor

nsajko commented Apr 30, 2024

there should be a possibility for the user to tell the compiler there's no aliasing

That already exists. There's both:

  1. An experimental API in Base: Base.Experimental.Const and Base.Experimental.@aliasscope.

  2. The user package UnsafeAssume.jl, which allows telling the compiler other things, too.

Also see #52851.

@KristofferC
Copy link
Sponsor Member

Still, making the change would probably require some careful evaluation, I'm sure some people are using TBAA violating workarounds.

After ReshapedArrays what (safe) ways are there to get aliasing between two Array of different data type? I know there is still vec so aliasing between matrix and vector of the same type is possible. Would it be hard to make that change and run a PkgEval?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

No branches or pull requests

6 participants