Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8332265: RISC-V: Materialize pointers faster by using a temp register #19246

Closed
wants to merge 12 commits into from

Conversation

robehn
Copy link
Contributor

@robehn robehn commented May 15, 2024

Hi, please consider!

Materializing a 48-bit pointer, using an additional register, we can do with:
lui + lui + slli + add + addi
This 15% faster both on VF2 and in CPU models, compared to movptr().

As we often materialize during calls there is free registers.

I have choose just a few spot to use it, many more can use.
E.g. la() with tmp register can use li48 instead of movptr.

Running tests now (so far so good), as if I screwed up IC calls it should be seen fast.
And benchmarks when hardware is free.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8332265: RISC-V: Materialize pointers faster by using a temp register (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19246/head:pull/19246
$ git checkout pull/19246

Update a local copy of the PR:
$ git checkout pull/19246
$ git pull https://git.openjdk.org/jdk.git pull/19246/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19246

View PR using the GUI difftool:
$ git pr show -t 19246

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19246.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented May 15, 2024

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented May 15, 2024

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8332265: RISC-V: Materialize pointers faster by using a temp register

Reviewed-by: fyang, luhenry, mli

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 15, 2024
@openjdk
Copy link

openjdk bot commented May 15, 2024

@robehn The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label May 15, 2024
@mlbridge
Copy link

mlbridge bot commented May 15, 2024

Webrevs

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, This looks interesting. Could all the movptr callsites be changed? I am asking this as I am a bit worried about the complexity / reward ratio when we have both movptr and li48 which are the same in functionality.

@robehn
Copy link
Contributor Author

robehn commented May 16, 2024

Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places)
All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.

As li48 is faster than li when using more than 32-bits these cases should also use li48.
I.e. mv t0, addr
But mv is fishy partly because of RegisterOrConstant constructor, so we can't tell in mv if this was an address or not.
I have been looking into cleaning that up, so mv with literal and mv with address is two seperate cases.
To keep them apart would be to use e.g. "li reg, literal" and "li48 reg, temp_reg, address".

As there is much work, this PR is intended as the first step with the hardest peices implemented already, i.e. li48 is ready to go.

If we also fix mov_metadata la()->li48 we reduce static call stub size down from 12 to 10 instruction, which is significant.
That one is on my todo list.

@RealFYang
Copy link
Member

RealFYang commented May 20, 2024

Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places) All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.

OK, I guess this might be a good compromise. Inspired by PPC's Assembler::load_const, MacroAssembler::get_const and MacroAssembler::patch_const [1-3], I think we could have a similar design. Adding one extra tmp register param for movptr like void movptr(Register Rd, address addr, int32_t &offset, Register tmp=noreg);, we can factor out li48 then.
The only difference compared with PPC's solution is that that we will have different sizes depending on whether we could find a tmp register for movptr. But I guess that's not a big issue? We can add a reference param (say, size) to existing is_movptr_at/is_movptr, we get the correct size when checking the instruction sequence and return this in size.
I prefer this way as I don't think it's a good idea to have more li variants like li48 and it's really bad to have both movptr and li48 which duplicate the functionality. We should also remove the existing li64 which is not used anywhere. Could you please consider this?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/assembler_ppc.cpp#L323
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L327
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L349

@robehn
Copy link
Contributor Author

robehn commented May 20, 2024

Hey, I did an update, not fully what you are saying.
We are missing a lot of 'abstraction' e.g. take a look at CodeInstaller::pd_next_offset in jvmciCodeInstaller.
I think this code should look like:

jint CodeInstaller::pd_next_offset(NativeInstruction* inst, jint pc_offset, JVMCI_TRAPS) {
  if(inst->is_call() || inst->is_jump() || inst->is_movptr()) {
    return pc_offset + inst->size();
  }
  JVMCI_ERROR_0("unsupported type of instruction for call site");
}

But I need to add a bunch of stuff to unrelated NativeInst, I think that is better suited in another PR.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Taking a more closer look.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Nice work! I only have several minor comments.

src/hotspot/cpu/riscv/nativeInst_riscv.hpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/macroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/nativeInst_riscv.hpp Show resolved Hide resolved
src/hotspot/cpu/riscv/macroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/macroAssembler_riscv.cpp Outdated Show resolved Hide resolved
@robehn
Copy link
Contributor Author

robehn commented May 22, 2024

Thanks @luhenry ! Thanks for the second review pass @RealFYang !

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three more minor comments, looks good otherwise. Thanks.
BTW: You need to merge master and resolve conflicts :-)

src/hotspot/cpu/riscv/nativeInst_riscv.hpp Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/nativeInst_riscv.hpp Outdated Show resolved Hide resolved
@robehn
Copy link
Contributor Author

robehn commented May 22, 2024

Yes, thanks, done!

@robehn
Copy link
Contributor Author

robehn commented May 23, 2024

Thanks again!

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated change looks good. It would be nice to see how much this will benefit performance.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 23, 2024
@Hamlin-Li
Copy link

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr.
I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

@robehn
Copy link
Contributor Author

robehn commented May 24, 2024

Updated change looks good. It would be nice to see how much this will benefit performance.

I tried todo some benchmarks but it seems like the error of them are larger than the benefit.
I'll try todo some longer runs, and minimize the error.

@robehn
Copy link
Contributor Author

robehn commented May 24, 2024

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

I agree, I would prefer having classes for the instruction where all the instruction functionality would be.
As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class.
And then have classes for instruction sequence where we keep all functionality gathered.

@Hamlin-Li
Copy link

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

I agree, I would prefer having classes for the instruction where all the instruction functionality would be. As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class. And then have classes for instruction sequence where we keep all functionality gathered.

OK, let me do some further investigation to see if we can make it more readable and maintainable.

Copy link

@Hamlin-Li Hamlin-Li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's move forward. :)
I create some bugs to track the further work.
https://bugs.openjdk.org/browse/JDK-8332899
https://bugs.openjdk.org/browse/JDK-8332900

Feel free to take them if you're also interested in them.

@robehn
Copy link
Contributor Author

robehn commented May 24, 2024

OK, let's move forward. :) I create some bugs to track the further work. https://bugs.openjdk.org/browse/JDK-8332899 https://bugs.openjdk.org/browse/JDK-8332900

Feel free to take them if you're also interested in them.

Thank you!

@robehn
Copy link
Contributor Author

robehn commented May 28, 2024

Updated change looks good. It would be nice to see how much this will benefit performance.

Here are 'some' number, it still unclear if these actually are significant:

BASELINE                       | movptr2
fop                  3120 msec |  2811 msec =0.900962
h2                  19156 msec | 17600 msec =0.918772
jython              24060 msec | 23343 msec =0.9702
luindex              3222 msec |  3226 msec =1.00124
lusearch             4383 msec |  4380 msec =0.999316
lusearch-fix         4096 msec |  4359 msec =1.06421
pmd                  7417 msec |  7342 msec =0.989888
jython              24060 msec | 23343 msec =0.9702
fop(Xcomp)           3060 msec |  3058 msec =0.999346
h2(Xcomp)           38724 msec | 38717 msec =0.999819
jython(Xcomp)       29999 msec | 29694 msec =0.989833
luindex(Xcomp)       5259 msec |  5195 msec =0.98783
lusearch(Xcomp)      6364 msec |  6269 msec =0.985072
lusearch-fix(Xcomp)  6430 msec |  6534 msec =1.01617
pmd(Xcomp)           7360 msec |  6999 msec =0.950951
jython(Xcomp)       29999 msec | 29694 msec =0.989833
Avg:0.983353

The issue is that reaching steady state AKA varmup takes tremondiusly long time.
As it takes long time it causes compilations differences between runs and it's unclear if we actually have finished varmup. Hence I run with Xcomp to get a more stable result.

Integrating later today!

@robehn
Copy link
Contributor Author

robehn commented May 28, 2024

/integrate

@openjdk
Copy link

openjdk bot commented May 28, 2024

Going to push as commit 7b52d0a.
Since your change was applied there have been 23 commits pushed to the master branch:

  • aa4c83a: 8332505: JEP 457: ClassRemapper forgets to remap bootstrap method references
  • cabe337: 8331921: Hotspot assembler files should use common logic to setup exported functions
  • 2edb6d9: 8330386: Replace Opaque4Node of Initialized Assertion Predicate with new OpaqueInitializedAssertionPredicateNode
  • 1850914: 8332864: Parallel: Merge ParMarkBitMapClosure into MoveAndUpdateClosure
  • 2f2cf38: 8332883: Some simple cleanup in vectornode.cpp
  • b5e1615: 8292955: Collections.checkedMap Map.merge does not properly check key and value
  • 86eb5d9: 8329958: Windows x86 build fails: downcallLinker.cpp(36) redefinition
  • be1d374: 8332825: ubsan: guardedMemory.cpp:35:11: runtime error: null pointer passed as argument 2, which is declared to never be null
  • ed81a47: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic
  • 08face8: 8332890: Module imports don't work inside the same module
  • ... and 13 more: https://git.openjdk.org/jdk/compare/16dba04e8dfa871f8056480a42a9baeb24a2fb24...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 28, 2024
@openjdk openjdk bot closed this May 28, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 28, 2024
@openjdk
Copy link

openjdk bot commented May 28, 2024

@robehn Pushed as commit 7b52d0a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
4 participants