
Create a valid Neoverse N1 target. #623

Open
everton1984 wants to merge 1 commit into master
Conversation

everton1984

This PR adds a valid Arm Neoverse N1 compilation target using the Armv8 kernels. It creates the appropriate registry information and can autodetect an N1 CPU.
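As a rough illustration of what the autodetection relies on: on Linux, a Neoverse N1 core can be recognized from the MIDR "CPU part" field (0xd0c) exposed in /proc/cpuinfo. The standalone sketch below only shows the idea; it is not the detection code added by this PR, which goes through BLIS's cpuid machinery.

/* Minimal sketch: detect a Neoverse N1 by parsing the MIDR "CPU part"
   field (0xd0c) from /proc/cpuinfo. Illustration only; not the code
   added by this PR. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cpu_is_neoverse_n1( void )
{
    FILE* f = fopen( "/proc/cpuinfo", "r" );
    if ( !f ) return 0;

    char line[ 256 ];
    int  found = 0;

    while ( fgets( line, sizeof( line ), f ) )
    {
        /* Lines look like: "CPU part        : 0xd0c" */
        if ( strncmp( line, "CPU part", 8 ) == 0 )
        {
            char* colon = strchr( line, ':' );
            if ( colon && strtol( colon + 1, NULL, 16 ) == 0xd0c )
            {
                found = 1;
                break;
            }
        }
    }

    fclose( f );
    return found;
}

int main( void )
{
    printf( "Neoverse N1 detected: %s\n", cpu_is_neoverse_n1() ? "yes" : "no" );
    return 0;
}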


// Initialize level-3 blocksize objects with architecture-specific values.
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 8, 6, -1, -1 );
egaudry
Hi, that's great to see Neoverse N1 tuning. Can I ask how you came up with these blocksize values?

everton1984 (Author)

Hi! To be honest, I just wanted the compiler to generate tuned neoverse-n1 code with this patch, so the blocksize values were taken from thunderx2. If BLIS has a standard procedure for generating those values I am all for it; please just let me know.
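For reference, a sub-configuration's bli_cntx_init_*() function normally fills in the full set of level-3 blocksizes with one bli_blksz_init_easy() row per blocksize. A sketch of that table is shown below: the MR/NR rows match the 8x12 (single) and 6x8 (double) Armv8 microkernels, while the MC/KC/NC values shown are arbitrary placeholders (self-consistent, but not tuned N1 numbers).

//                                           s      d      c      z
bli_blksz_init_easy( &blkszs[ BLIS_MR ],     8,     6,    -1,    -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ],    12,     8,    -1,    -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ],   120,   192,    -1,    -1 );  // placeholder
bli_blksz_init_easy( &blkszs[ BLIS_KC ],   640,   512,    -1,    -1 );  // placeholder
bli_blksz_init_easy( &blkszs[ BLIS_NC ],  3072,  4080,    -1,    -1 );  // placeholder

Whatever values end up here, each column still has to respect the MC%MR == 0 and NC%NR == 0 constraints mentioned further down in this conversation.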

egaudry

I value what you did; however, I don't have the answer for this.
@devinamatthews, any pointers you could share?

everton1984 (Author)

@egaudry Do you think the fine tuning is essential for this to merge?

egaudry

Having a clear interface and arch detection makes sense indeed; however, without proper tuning, mergers/reviewers might not see this as a priority.
Just guessing.

jeffhammond (Member)

Jeff Diamond has better tuning parameters for N1.

everton1984 (Author)

@jeffhammond Thanks for commenting. Could you point me to Jeff Diamond so I can ask whether he is able to share his parameters?

@devinamatthews (Member)


“The establishment” here. @everton1984 thanks for your work, but @egaudry is pretty much right; it is best to have specifically tuned block sizes and/or kernels, with performance numbers, before creating a new sub-configuration. Otherwise it is just easier to use the thunderx2 subconfig directly. I'll ask Jeff Diamond about the status of the tuned N1 parameters, since that code may still be in the clutches of Oracle's lawyers.

@everton1984 (Author)


@devinamatthews Thanks for answering. No problem, that makes sense. I can generate the parameters; I just wanted to know, before trying something ad hoc, whether there is a defined procedure for obtaining them.

@devinamatthews (Member) commented Apr 7, 2022

The block sizes can, to some extent, be determined analytically; see https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf. A basic non-analytical strategy is:

  1. Run a series of problems with m=MR, n=NR, and increasing k (a minimal sketch of such a sweep follows this comment). Note that you will need to use a row- or column-major C matrix as preferred by the microkernel. Plot performance vs. k; the optimal KC should be:
    a. The peak of the plot if the curve is sharply peaked.
    b. The smallest value such that good performance is achieved if the plot has a large plateau.
  2. Run a series of problems with n=NR, k=KC, and increasing m (you might want to try different transpose options for A as well). As before, the optimal MC is either the peak or the smallest value that gives good performance.
  3. The value of NC doesn't usually affect performance much, but you can also try a similar procedure as for KC and MC. Note that NC should in general be fairly large compared to MC.
  4. Confirm performance for large square matrices and tweak as necessary. Finding the best threading parameters is another challenge, which perhaps I can describe separately if you're interested.

Final note: The block sizes must satisfy MC%MR == 0 and NC%NR == 0. If possible, it doesn't hurt to have all three cache block sizes be multiples of both MR and NR, unless this choice is too restrictive. It may also help to avoid large powers of 2.
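To make step 1 concrete, here is a minimal sketch of such a KC sweep. It assumes BLIS was configured with --enable-cblas (so cblas_dgemm and cblas.h are available) and linked with -lblis, and it hard-codes MR = 6, NR = 8 as assumed double-precision micro-tile dimensions; swap in the actual values and the storage order of C preferred by the microkernel being tuned.

// Step 1 sketch: time dgemm with m = MR, n = NR and increasing k, then
// plot GFLOPS vs. k to choose KC.  MR/NR below are assumptions, and
// column-major C is used as an example; use whatever the microkernel prefers.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "cblas.h"

#define MR 6   // assumed double-precision micro-tile rows
#define NR 8   // assumed double-precision micro-tile columns

int main( void )
{
    const int ntrials = 20;

    for ( int k = 64; k <= 4096; k += 64 )
    {
        double* a = malloc( sizeof( double ) * MR * k  );
        double* b = malloc( sizeof( double ) * k  * NR );
        double* c = malloc( sizeof( double ) * MR * NR );

        for ( int i = 0; i < MR * k;  i++ ) a[ i ] = ( double )rand() / RAND_MAX;
        for ( int i = 0; i < k  * NR; i++ ) b[ i ] = ( double )rand() / RAND_MAX;
        for ( int i = 0; i < MR * NR; i++ ) c[ i ] = 0.0;

        double best = 1e30;

        for ( int t = 0; t < ntrials; t++ )
        {
            struct timespec t0, t1;
            clock_gettime( CLOCK_MONOTONIC, &t0 );

            // C := A*B + C, column-major; lda = MR, ldb = k, ldc = MR.
            cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                         MR, NR, k,
                         1.0, a, MR, b, k, 1.0, c, MR );

            clock_gettime( CLOCK_MONOTONIC, &t1 );
            double dt = ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) * 1e-9;
            if ( dt < best ) best = dt;
        }

        printf( "k = %5d   GFLOPS = %7.2f\n", k, 2.0 * MR * NR * k / best / 1e9 );

        free( a ); free( b ); free( c );
    }

    return 0;
}

The same harness can be reused for steps 2 and 3 by fixing k = KC and sweeping m (or n) instead.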

@everton1984 (Author)


@devinamatthews Thanks a lot! Let me find the correct parameters then.
