
Potential encoder speedup on non-win32 #585

Open
chocolate42 opened this issue Apr 10, 2023 · 32 comments

@chocolate42

This ( https://hydrogenaud.io/index.php/topic,123889.msg1024853.html#msg1024853 ) post prompted me to poke around fwrite and hack in setting the fwrite output buffer size for non-win32 targets: https://github.com/chocolate42/flac/tree/vbuf

Tested by re-encoding a ~150MB 24 bit flac file to -0 to the same drive, on a 2.5" HDD and a slowish SSD on a skylake laptop. Tried a buffer of 10MB to match the win32 path, and a buffer of 10MB+8 based on the recommendations in this post: https://www.enterprisestorageforum.com/hardware/a-trip-down-the-data-path-i-o-and-performance/

These are the average wall times of 10 runs each in seconds on Linux 64 bit:

        HDD    SSD
default 15.156 7.564
10mb    10.081 6.799
10mb+8   9.904 6.752
  • This shows the best case of -0; as compression level increases, the relevance of I/O performance decreases
  • I had to provide a buffer to setvbuf; apparently when NULL is provided the implementation is free to ignore the requested size, and glibc does ignore it.
  • The default on my machine is either 4096 or 8192 (or 8200 if the +8 is important)
  • Potentially something similar could be done for reading, haven't looked into it at all yet
  • Similarly benefits stdout with -c too
  • Haven't (and can't) tested win32 path to confirm if it is actually using a 10MB buffer. It probably is but unconfirmed

Didn't make a proper PR directly because there are some open questions:

  • Only tried on Linux 64 bit
  • Does it work with stdout always? Is there a weird architecture or something an external program could do (like redefining stdout) that might make it fail? Should we care even if an oddball config might fail?
  • Embedded targets can't malloc a large buffer, so embedded should probably be excluded
  • Is setvbuf even available on embedded?
  • Does it matter if setvbuf fails? Surely it'll just revert to the default instead of leaving the stream in some bad state
  • What's the deal with the win32 path checking that it's a "real" file? /dev/null seems to work. Could the paths be unified?
  • Is the +8 actually beneficial or is it just a case of something that used to be true
  • Haven't tried buffer sizes other than ~10MB
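For context, the hack boils down to something like this (a standalone sketch with a hypothetical helper name, not the actual code in the branch):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: give a stream a large, explicitly allocated
 * stdio buffer. Passing NULL to setvbuf lets the implementation pick
 * its own size (glibc ignores the requested size in that case), so the
 * buffer must be supplied by the caller and must stay alive until
 * fclose(). Returns the buffer (to free after fclose) or NULL if the
 * stream keeps its default buffering. */
static void *set_large_buffer(FILE *f, size_t size)
{
    void *buf = malloc(size);
    if (buf == NULL)
        return NULL;                  /* fall back to default buffering */
    if (setvbuf(f, buf, _IOFBF, size) != 0) {
        free(buf);                    /* stream is unchanged on failure */
        return NULL;
    }
    return buf;
}
```

Note that setvbuf has to be called before any other operation on the stream, which is why this would have to happen right after fopen.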
@H2Swine

H2Swine commented Apr 11, 2023

I have made no attempt at finding the relevant code (since I'm incompetent at it), but the behaviour seems to have been introduced with 1.3.1, which is consistent with the changelog entry that reads "I/O buffering improvements on Windows to reduce disk fragmentation when writing files."
Speed might be just a side effect ... ?

An observation that might be relevant: After "first write" (i.e. first size update) Windows reports 10 485 760 bytes and Size on disk: 10 485 760 bytes too. But, after second update when Windows reports file size of 20 971 520, "Size on disk" is bigger. It is not constant across settings. Also as the number of writes (that is, the multiple of 10 MiB) increases, the difference between "Size on disk" and "size" also oscillates. But that difference seems to be pinned to zero at first write.

@chocolate42
Author

Also tried setting a 10MiB+8 buffer in the decoder reader (10mb8r). Same re-encode test, average of 10 runs:

       HDD   SSD
10mb8  9.561 7.244
10mb8r 9.552 7.083

There might be a benefit to a large buffer when reading but it's close to margin of error. Reading will also be harder to measure as OS caching likely obscures things. Not added to the repo.

But the decoder writes too, so tried hacking in a 10MiB+8 buffer into ./flac's decode.c writer. Same file as before decoded to wav. Average of 10 runs:

         HDD    SSD
default  11.788 6.29
10mb8dec 10.244 5.73

Less pronounced but still a definite benefit on my system (writing larger chunks may better hide the drawback of small buffers, which could explain why the benefit on decode is smaller). If Windows also sees a performance benefit from a large buffer this should apply. If someone wants to test potential performance improvements on Windows, they can compile the current state of my repo and the current state of the xiph repo and compare decode speed when a file is generated.

I have made no attempt at finding the relevant code (since I'm incompetent at it), but the behaviour seems to have been introduced with 1.3.1, which is consistent with the changelog entry that reads "I/O buffering improvements on Windows to reduce disk fragmentation when writing files."
Speed might be just a side effect ... ?

That tracks.

@ktmf01
Collaborator

ktmf01 commented Apr 11, 2023

I don't know what to make of this. FLAC calls fflush after every fwrite, so large buffers should not make a difference? Even with setvbuf, fflush is honoured as far as I know. Also, I can't explain the difference between encoding and decoding really. They both seem to write the same way (once for every frame).

Feel free to explore this further; I probably won't spend much time on this until I clear some bugs found by fuzzing that seem rather hard to debug.

@chocolate42
Author

I don't know what to make of this. FLAC calls fflush after every fwrite, so large buffers should not make a difference? Even with setvbuf, fflush is honoured as far as I know.

I don't think fflush is called every frame; fflush is present only when FLAC__VALGRIND_TESTING is defined on encode and decode, presumably only in test and/or debug builds.

Also, I can't explain the difference between encoding and decoding really. They both seem to write the same way (once for every frame).

The only difference I can see is that on decode the data is more consistent and there's more to write. More data should mean more impact if anything but there seems to be less impact.

Feel free to explore this further; I probably won't spend much time on this until I clear some bugs found by fuzzing that seem rather hard to debug.

I'll explore it to a conclusion as much as I can on Linux. The only other hardware I have has a modern fast SSD, which probably won't be of much help, and VMs seem like a bad idea here. Maybe there's a raspberry pi or slow USB stick somewhere that might be useful to make I/O issues more obvious.

@ktmf01
Collaborator

ktmf01 commented Apr 11, 2023

I don't think fflush is called every frame; fflush is present only when FLAC__VALGRIND_TESTING is defined on encode and decode, presumably only in test and/or debug builds.

Yes indeed, I didn't see that right

@chocolate42
Author

Tested decode on 2193 tracks, mostly 16 bit, against the default build and the vbuf10m8dec (renamed 10m8) build. ~10% of tracks were decoded quicker by the default build, but the sum clearly trends towards the 10m8 build; the laptop was being lightly used during the test, which probably didn't help with variance. One build fully ran the tracks followed by the other build fully running the tracks, to eliminate caching as a potential factor.

        HDD wall time sum
default 5918.10
10m8    4508.47

Made a selection of builds with varying buffer sizes and tested decode with the 150MB 24 bit track, 10 runs averaged:

k=KiB, m=MiB, xm8=xMiB+8
        HDD    SSD
default 11.938 6.202
8k      11.614 5.809
8k8     12.117 5.786
64k     10.972 5.723
64k8    11.225 5.738
8m      10.004 5.712
8m8     10.156 5.666
10m     10.064 5.689
10m8    10.161 5.699
12m     10.062 5.688
12m8    10.144 5.669
16m     10.159 5.719
16m8    10.210 5.711

That seems to confirm a benefit but not the reason behind it. Now most +8 buffer sizes show as slightly worse, though probably too close to the margin of error to make a judgement; that being the case, the +8 is probably irrelevant.

Here's some perf stats from a few decode runs. I only have a rudimentary grasp of tools like perf and gdb so this'll have to do for now. The default build has more branch misses for some reason. Fewer page faults on the default build makes sense: less memory used means fewer first-time accesses causing faults.

10m build

u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f

flac git-f061c6b0 20230411
Copyright (C) 2000-2009  Josh Coalson, 2011-2022  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

1159.flac: done         

 Performance counter stats for './vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':

          5,863.31 msec task-clock                #    0.561 CPUs utilized          
            56,581      context-switches          #    9.650 K/sec                  
                28      cpu-migrations            #    4.775 /sec                   
             2,729      page-faults               #  465.436 /sec                   
     6,285,358,258      cycles                    #    1.072 GHz                    
    15,821,032,908      instructions              #    2.52  insn per cycle         
     1,368,116,924      branches                  #  233.335 M/sec                  
         7,874,357      branch-misses             #    0.58% of all branches        

      10.458442088 seconds time elapsed

       5.180040000 seconds user
       0.968928000 seconds sys

u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f

flac git-f061c6b0 20230411
Copyright (C) 2000-2009  Josh Coalson, 2011-2022  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

1159.flac: done         

 Performance counter stats for './vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':

          5,875.65 msec task-clock                #    0.563 CPUs utilized          
            56,560      context-switches          #    9.626 K/sec                  
                19      cpu-migrations            #    3.234 /sec                   
             2,725      page-faults               #  463.779 /sec                   
     6,296,540,251      cycles                    #    1.072 GHz                    
    15,820,642,969      instructions              #    2.51  insn per cycle         
     1,368,024,143      branches                  #  232.830 M/sec                  
         7,776,357      branch-misses             #    0.57% of all branches        

      10.443608953 seconds time elapsed

       5.085768000 seconds user
       1.077939000 seconds sys

default build

u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f

flac git-4b2c33eb 20230410
Copyright (C) 2000-2009  Josh Coalson, 2011-2022  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

1159.flac: done         

 Performance counter stats for './main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':

          6,141.41 msec task-clock                #    0.540 CPUs utilized          
            65,968      context-switches          #   10.742 K/sec                  
                30      cpu-migrations            #    4.885 /sec                   
               167      page-faults               #   27.192 /sec                   
     6,553,218,206      cycles                    #    1.067 GHz                    
    15,961,012,058      instructions              #    2.44  insn per cycle         
     1,399,294,282      branches                  #  227.846 M/sec                  
        14,624,563      branch-misses             #    1.05% of all branches        

      11.373024667 seconds time elapsed

       5.139417000 seconds user
       1.394814000 seconds sys


u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f

flac git-4b2c33eb 20230410
Copyright (C) 2000-2009  Josh Coalson, 2011-2022  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

1159.flac: done         

 Performance counter stats for './main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':

          6,075.46 msec task-clock                #    0.542 CPUs utilized          
            65,971      context-switches          #   10.859 K/sec                  
                43      cpu-migrations            #    7.078 /sec                   
               165      page-faults               #   27.158 /sec                   
     6,483,611,422      cycles                    #    1.067 GHz                    
    15,961,729,133      instructions              #    2.46  insn per cycle         
     1,399,494,698      branches                  #  230.352 M/sec                  
        12,424,063      branch-misses             #    0.89% of all branches        

      11.202770781 seconds time elapsed

       5.112267000 seconds user
       1.340339000 seconds sys

@chocolate42
Author

  • Turns out my default write buffer size is 4096
  • Looking at the below data it seems that the +8 above is irrelevant; what I read indicated that the extra 8 bytes were for bookkeeping, but the entire buffer is used as buffer, at least on Linux.

strace confirms that the default does orders of magnitude more write syscalls

10m:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.93    0.116078         866       134           write
 21.70    0.036026        4503         8           openat
  6.93    0.011505           0     37357           read
  0.74    0.001229           0      4716           clock_gettime
  0.67    0.001115           0      4712           lseek
  0.02    0.000026           1        19           mmap
  0.00    0.000006           0         7           fstat
  0.00    0.000004           1         4           brk
  0.00    0.000000           0         8           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           chmod
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.165989                 46986         2 total

default:

u20@u20:~/Documents/mountain/runtime/flac$ strace -c ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.13    0.055369           5      9540           write
 23.28    0.025717        3214         8           openat
 22.05    0.024356           0     37357           read
  2.38    0.002626           0      4717           clock_gettime
  2.07    0.002289           0      4712           lseek
  0.03    0.000028           1        18           mmap
  0.03    0.000028           7         4           brk
  0.02    0.000024           3         8           fstat
  0.01    0.000013           1         8           close
  0.00    0.000004           4         1           ioctl
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           chmod
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.110454                 56393         2 total

9417 write syscalls vs 23 spent writing the wav file; the rest are for the progress bar. The above only shows the time taken to issue each syscall, not the time spent in it. strace -T outputs the time spent in each syscall; summing that output gives the following, which I believe is the time taken to actually write the wav file (possibly with or without the time to issue the syscall from the first part of this post):

        Seconds
default 2.275357
10m     1.621201

What's curious is that each frame is written in two write calls for the default build; here's the start of the write calls for the default build:

write(3, "RIFF\366\320\312\rWAVEfmt (\0\0\0\376\377\2\0\0\356\2\0\0\224\21\0"..., 4096) = 4096 <0.000080>
write(3, "\377\263\374\377:\0\0|\1\0\0\1\0\373\3\0\241\375\377\244\376\377w\1\0\260\3\0\333\375\377\31"..., 45056) = 45056 <0.000307>
write(3, "\204\377\330\r\0{\215\377z5\0\325\254\377)\3\08\214\377\355\t\0\263\236\377\315\362\377\327o\377"..., 4096) = 4096 <0.000053>
write(3, "\0\362\30\0TR\0\362H\0O\4\0#\377\377i\10\0\366\24\0\266\27\0k4\0\2404\0\211"..., 45056) = 45056 <0.000295>
write(3, "\376\1\377B\377\361\34\2t\35\377L\23\0023O\377\tF\2wX\377\0010\2\345y\377\242/\2"..., 4096) = 4096 <0.000047>
write(3,

Which presumably is because the header partially fills the buffer, and each frame in this example file wants to write 49152 bytes. At each frame the rest of the buffer is filled and written, then the remaining multiple of 4096 is written at once, then the buffer is partially filled again to be written with the next frame.

It looked like a default build that also flushed the header might halve the number of write syscalls when a small buffer aligns neatly (like decoding to a simple format like wav with a typical blocksize of 4096, very common), but it didn't:

write(3, "RIFF\366\320\312\rWAVEfmt (\0\0\0\376\377\2\0\0\356\2\0\0\224\21\0"..., 68) = 68 <0.000071>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\2\0\0\373\377\377\374\377\377\1\0"..., 4096) = 4096 <0.000211>
write(3, "\375\377\231\376\377\356\0\0\254\372\377-\376\377\307\0\0\214\1\09\377\377\207\0\0\271\2\0\252\0\0"..., 45056) = 45056 <0.000466>
write(3, "z\16\0\257\222\377\274\21\0\300s\377v1\0\334\241\377N\363\377.\177\3774\31\0\265\220\377\345!"..., 4096) = 4096 <0.000072>
write(3, "0\0005!\0\233\16\0G0\0\17+\0\324\3\0\330\35\0T\325\377\34'\0_\1\0jq\0"..., 45056) = 45056 <0.000433>

Still doing 4096+45056 instead of a single write of 49152. Double checked and there was nothing else written after the header to misalign the buffer, so it looks like (at least in glibc, which both the gcc and clang builds use) the buffer is filled even if it starts empty, then the remainder (maybe in full, or the largest multiple of the buffer size) is written in a single call. Unfortunate for the edge case, but the solution that works everywhere is to have a buffer that's at least as big as the maximum expected raw data contained in a frame.

The doubling of write syscalls is an inefficiency when the buffer is too small, but it might not be the whole story. It might not even be a major part of why it's slower, but it would be convenient if it were.
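The "buffer at least as big as the maximum raw frame" rule could be sketched like this (a hypothetical sizing helper for illustration, not code from the branch):

```c
#include <stddef.h>

/* Hypothetical helper: pick a stdio buffer size large enough that a
 * whole decoded frame fits in one fwrite, rounding up to a power of
 * two. With blocksize 4096, 2 channels, 24-bit (3-byte) samples, a
 * frame is 24576 bytes, so this returns 32768. */
static size_t choose_buffer_size(size_t max_blocksize,
                                 size_t channels,
                                 size_t bytes_per_sample)
{
    size_t frame_bytes = max_blocksize * channels * bytes_per_sample;
    size_t size = 4096;              /* common stdio default */
    while (size < frame_bytes)
        size *= 2;
    return size;
}
```

This only encodes the "at least one frame per fwrite" idea; as the measurements above suggest, the 10 MiB figure is well past this minimum and the extra size may matter for other reasons (fragmentation, kernel write-back).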

@chocolate42
Author

  • With glibc (used by both the gcc and clang builds) the setvbuf buffer size is ignored if the buffer is NULL. This is unlike Windows, which resizes an internal buffer (which is how flac currently does it).

  • The current state of https://github.com/chocolate42/flac/tree/vbuf implements the external buffer properly instead of hacking it in, and moves _WIN32 builds to using an external buffer instead of internal, so everything uses a 10MiB external buffer.

  • Couldn't implement an external buffer for stdout, because stdout is not closed. A program using stdout will have to call setvbuf on stdout itself if it wants to (which it might; there's a benefit at least when piping to a file).

For _WIN32, other than using an external buffer, the only other difference is that stdout is detected with file != stdout instead of if(GetFileType((HANDLE)_get_osfhandle(_fileno(file))) == FILE_TYPE_DISK). This seems fine? It's good enough elsewhere in the code, and in the case where the file is a non-stdout pipe the code elsewhere probably breaks anyway (freeing a pipe that may be re-used). Maybe I'm misunderstanding, but I don't think arbitrary pipes are currently supported; supporting them would involve overhauling everywhere that checks for stdin/stdout instead of an arbitrary pipe.

What remains to be solved to potentially turn this into a PR:

  • Test a windows build of https://github.com/chocolate42/flac/tree/vbuf to make sure there's no downside to using an external buffer. If there is no downside the diverging paths can be merged, otherwise the temporary #if 0's in the latest commit are changed to #ifdef _WIN32 and the diverging paths remain. Merging paths would be ideal because that's a mess.
  • Determine what targets a large buffer cannot be used with and exclude them

@ktmf01
Collaborator

ktmf01 commented Apr 13, 2023

* Determine what targets a large buffer cannot be used with and exclude them

I haven't looked into your changes in depth, but I would consider moving the flac command line tool from using FLAC__stream_encoder_init_file() to FLAC__stream_encoder_init_FILE(). That way you can implement the whole buffer thing in flac and leave libFLAC untouched, which massively reduces the number of extreme cases you're dealing with. Embedded applications especially tend to use libFLAC but not the command line tools.

@chocolate42
Author

That makes a lot of sense.

@chocolate42
Author

Measuring I/O is apparently hard; the latest results make it questionable when and where the apparent gains are to be had. These are the old results to compare against:

        HDD wall time sum
default 5918.10
10m8    4508.47

After trying some Linux kernel tweaks to disable/flush the cache to get more consistent results, these new results happened (no defrag):

        HDD wall time sum
default 3191.47
10m     3271.49

The default is the exact same build for both tests. Not only are the results suspiciously better and the two builds within margin of error of each other, the results are also much less consistent. As mentioned previously the old test was pretty consistent; this test had wild swings in both directions.

My best guess as to what's happening is that the new measured time is junk. fflush on Linux doesn't guarantee that the data is written to disk, only that the buffer is emptied and that it's up to the kernel to finish things. Whatever stuck from the kernel tweaks I recklessly tried (and can't seem to reverse) seems to have resulted in the kernel returning from writes quicker than it did before, possibly very quickly. The kernel's internal buffer doesn't seem to have enough time to empty across decodes; when it builds up too much it seems to fully empty, at which point whatever test was running gets a severe wall time penalty. If that's what's actually happening then the only writes reflected in the new data are the ones where the kernel fully dumps its buffer.

The next thing to try is fsync on fclose, which should push the data in the kernel buffer to the HDD and should make the results consistent again. Even fsync doesn't guarantee that the data is fully written, only that the disk has received it (so SSDs may still hold it in their RAM), but this is an old HDD with limited cache so it should be good enough to test.
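The fsync-on-fclose idea looks roughly like this (POSIX only; a sketch of the benchmarking tweak with a hypothetical helper name, not shipped code):

```c
#include <stdio.h>
#include <unistd.h>

/* Flush stdio's buffer into the kernel, then ask the kernel to push
 * the file's data to the device, then close. As noted above, fsync
 * only guarantees the device has accepted the data (it may still sit
 * in the drive's cache), but for timing an old HDD with little cache
 * that is close enough. Returns 0 on success, EOF on any failure. */
static int close_with_fsync(FILE *f)
{
    if (fflush(f) != 0)          /* empty the stdio buffer */
        return EOF;
    if (fsync(fileno(f)) != 0)   /* kernel buffer -> device */
        return EOF;
    return fclose(f);
}
```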

@chocolate42
Author

Disregard any prior benchmarks or conclusions drawn; I vastly misunderstood how aggressively and when laptops put themselves in different states, so it's all tainted. The following results try to control as many variables as is feasible (laptop plugged in and fully charged, no UI, no other access to the disk, screen off, no other programs open, cooldown between runs). Fairly confident that these results are as accurate as they can be.

                              default 10m_fsync 10m_nofsync
script time                   3603.91   3109.06   2651.81
flac time summed              3574.27   3081.07   2626.36
Average                       1.62985   1.40496   1.19761
Standard deviation            1.80209   0.92650   1.34400
Normalised standard deviation 1.00000   0.59642   1.01498
  • There is a large benefit to a 10MiB buffer, at least on this particular HDD. Maybe it's fragmented; maybe that exaggerates the benefit
  • The 10m_fsync is more consistent than 10m_nofsync but takes longer, as it flushes the kernel buffer. It confirms that normal operation may not have fully written the file to disk until some time after ./flac has exited, making benchmarks tricky (you need a large corpus; a small corpus would be very misleading)
  • It's interesting that the large buffer is more beneficial than the fsync is a detriment, comparing to the default build
  • The normalised st.dev being similar for default/nofsync is a good sign that my interpretation is accurate

Will eventually test on other hardware and redo the buffer size test to see if there's potentially a more appropriate size.
vbuftest.ods

@chocolate42
Author

Tried the same decode test using a fast M.2 SSD with a Zen2 4700u APU. Had to rebuild everything; these use gcc 12.2 and clang 11.1, while the previous test used gcc 9.4, but the difference is probably minimal.

              default default 10m_fsync 10m_fsync 10m_nofsync 10m_nofsync
              gcc     clang   gcc       clang     gcc         clang
script time   662.01  676.75  719.42    721.72    671.72      689.68
flac time sum 648.46  663.42  705.81    708.03    658.25      676.26
average       0.2957  0.3025  0.3218    0.3229    0.3002      0.3084
st.dev        0.2078  0.2254  0.2200    0.2314    0.2199      0.2247

The disk is fast enough that buffer size is not relevant using a single core. Parallelising with parallel only utilised an estimated ~330% of 8 cores and cut the time to ~210-225 seconds, still not enough for the default build to be noticeably inadequate using these tests. Processing 60GB of flac into 98GB of wav in under 4 minutes is neat.

Slow SATA SSDs are common and sit between HDDs and fast SSDs in specs. I'm guessing the default build is adequate there under normal conditions (stronger compression, single core); maybe slow SSDs feel the pressure when fully threaded.

@chocolate42
Author

Tried a thumb drive with the Zen2; not the slowest, with ~15MB/s writes, but not quick. Used roughly the first third of the corpus, as the full corpus doesn't fit on the drive. Fresh format, no fragmentation.

              default_gcc  10m_fsync_gcc  10m_nofsync_gcc default_clang 64k_nofsync_gcc
script time   3008.33      3014.97        2969.59         2976.72       2949.04
flac time sum 2779.32      2993.86        2723.95         2695.7        2717.38

So it looks like the benefit measured before is from fragmentation and not an OS difference, the same reason init_file sets the 10m buffer on Windows for encode. Don't fancy trying to defrag an NTFS partition on Linux to find out for sure; sounds like a recipe for data loss.

Curiously it looks like the data is fsynced at the end of the script; I don't know what else explains the script times all aligning but not the sums. Nothing in the script explains the large discrepancy between script time and flac sum; it's just a loop that calls flac for every file.

@chocolate42
Author

The current state of the vbuf branch does as suggested: flac uses init_FILE and all changes are contained to encode.c/decode.c. WIN32 also uses an external buffer; it still needs to be tested on Windows to make sure encode speed remains the same and that decode speed has improved in the presence of fragmentation. I can probably check that win32 works, but I can't benchmark from a VM, as the VM would translate to Linux calls and we want to benchmark the Windows calls.

@ktmf01
Collaborator

ktmf01 commented Apr 20, 2023

I can test on Windows.

Something that might be relevant: what filesystem did you test on? You mentioned ntfs once, are all tests on ntfs?

@chocolate42
Author

Thanks. The laptop hdd is ntfs with the OS on a primary SSD. The fast SSD is ext4 with OS on the same partition. The USB stick is FAT32.

@ktmf01
Collaborator

ktmf01 commented Apr 20, 2023

So, could you summarize what you're seeing?

What hardware were you using for the numbers in the post where you said to disregard all earlier measurements? Was that an HDD?

The current state of your vbuf branch, that is 'nofsync' I presume?

@chocolate42
Author

The results from the disregard post were the HDD on laptop with i7-6700hq. The post after that tested the fast ssd in a miniPC host containing Ryzen 4700u. The post after that tested the USB stick on miniPC host. The current state of vbuf is 10m_nofsync, just a 10MiB external output buffer everywhere.

tl;dr I'm seeing a big speedup when writing to a presumably fragmented NTFS HDD, a set of flac that takes an hour to decode to wav on the default build takes 45 minutes with a 10MiB buffer. All other hardware hasn't shown a significant difference.

@ktmf01
Collaborator

ktmf01 commented Apr 21, 2023

I've tried to replicate. Here are results for running on Ubuntu 22.04, Intel Celeron N5105, Seagate Barracuda 2.5 5400. ext4 partition.

$ for I in {1..20}; do for EXE in flac-currentgit flac-vbuf; do echo -n $EXE; rm /home/martijn/bin/temp/*; /usr/bin/time -f ": %e" ~/bin/$EXE -s */*.flac */*/*.flac --output-prefix=/home/martijn/bin/temp/ -0; done; done
flac-currentgit: 1042.88
flac-vbuf: 1109.46
flac-currentgit: 1092.47
flac-vbuf: 1058.09
flac-currentgit: 1101.90
flac-vbuf: 1079.19
flac-currentgit: 1093.96
flac-vbuf: 1102.59
flac-currentgit: 1081.33
flac-vbuf: 1170.21
flac-currentgit: 1190.34
flac-vbuf: 1245.58
flac-currentgit: 1162.68
flac-vbuf: 1418.31
flac-currentgit: 1646.74
flac-vbuf: 1851.07
flac-currentgit: 1091.20
flac-vbuf: 1059.79
flac-currentgit: 1281.78
flac-vbuf: 1207.90

That gives me the following numbers

        currentgit  vbuf
average 1178.53     1230.22
stddev  178.32      244.39
stderr  56.39       77.28

Looking at the averages and standard error, I cannot conclude which one is faster, but I think it is safe to say any possible difference gets drowned in the noise anyway.

I do have a theory as to why adding a buffer might make encoding slower: having a small buffer, the OS can start writing right away while buffering a bit. However, when the program itself implements a buffer, the OS cannot start sending data to the harddisk until that buffer is full.

Results on windows will follow later.

@chocolate42
Author

Thank you for trying to confirm.

I do have a theory as to why adding a buffer might make encoding slower: having a small buffer, the OS can start writing right away while buffering a bit. However, when the program itself implements a buffer, the OS cannot start sending data to the harddisk until that buffer is full.

That makes sense if the OS fully writes one flac file before letting the next flac process run, i.e. there's an implicit fsync, so at the start of every file there's a period with no write in progress.

I've interpreted what I'm seeing as the next file being processed while the previous one is still being written by the kernel: the previous file has left the fwrite buffer thanks to the implicit fflush from fclose, but is still in the kernel buffer being written to disk after the process that made it has exited. The default and 10m_nofsync builds behave wildly variably on the HDD; these are the same results as before, but highlighted for clarity:

vbuftest.ods

I've attributed this to the kernel buffer getting full, so the flac process has to wait during fwrite for the kernel buffer to empty enough. My interpretation may be entirely wrong; it's a pure guess. My OS is nothing special AFAIK, Ubuntu 20.04:

uname -a
Linux u20 5.15.0-67-generic #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Sorry for the mess; whatever the reason, it seems to be peculiar behaviour. I'll run the USB test on the laptop to see if the behaviour matches the USB result from the miniPC or the HDD result from the laptop. It should match the miniPC result, but if it matches the HDD result then there's definitely something going on with the laptop OS.

@chocolate42
Author

The USB result behaves the same on the Ubuntu 20.04 laptop as on the miniPC, mildly supporting the idea that the HDD result isn't caused by OS weirdness on slow media and is likely a result of fragmentation.

Reran the HDD test on the laptop using a live image of Fedora 38 and rebuilt with gcc 13. Did a run of default then 10m, rebooted, then a run of 10m then default to rule out caching:

              default_run0 10m_run0 10m_run1 default_run1
flac sum time 2887.02      2089.43  2141.82  2902.73
script time   2919.42      2114.49  2170.31  2932.13

This rules out any weirdness that may have accumulated on the daily-driver Ubuntu 20.04 install that's been in service for 3 years, or from old toolchains. Admittedly it's the same hardware, but two OSes have confirmed that the 10m build performs better when using a (very likely) fragmented HDD.

fedora38_hdd.ods

@ktmf01
Collaborator

ktmf01 commented Apr 24, 2023

How much space is left on that device when running the tests? If that is more than 20%, fragmentation shouldn't be too much of an issue. On the other hand, the Linux NTFS handling might not be as optimized as on Windows, let alone as the ext4 handling.

@chocolate42
Author

It's 98.1% full, 18.5GB free out of ~1TB. Linux NTFS handling certainly leaves a lot to be desired, or it did at least. My Ubuntu 20.04 uses ntfs-3g, the old way of handling NTFS (feature-incomplete but mature, implemented via FUSE, a userspace solution that incurs extra context-switching overhead at the very least compared to a kernel implementation). A much better replacement (ntfs3) made its way into the kernel last year, but it looks like development stalled due to world events, so it's still a work in progress; Fedora 38 probably hasn't enabled it by default. It's possible the additional ntfs-3g context switching is the bottleneck. I doubt it, but I'll have to figure out whether the ntfs3 driver is the default on the Fedora live image and, if not, redo the test with it to see if there's a difference.

Even ext4 is still seeing improvements ( https://www.phoronix.com/news/Linux-6.3-EXT4 ). Tangentially related, io_uring is an exciting new async IO interface for Linux, as much as IO interfaces can be exciting. It's unlikely to be that much of a benefit to flac's synchronous access pattern, but there are likely at least minor gains from less system-call overhead.

@chocolate42
Author

Checked the live image, and it turns out Fedora 38 does use ntfs3, so the Fedora result above rules out an unoptimised ntfs-3g as the cause, since ntfs3 shows similar gains with the 10m build.

Note the HDD results between OSes are not directly comparable; AFAIK nothing can be read into the Ubuntu runs taking longer than the Fedora runs. To control for laptop power-state shenanigans I had the Ubuntu runs wait 10 minutes to get into a stable (not ramping due to UI) lower power state. The Fedora live image didn't seem to have any adaptive power behaviour; presumably the live image doesn't enable any of that, for a snappier live impression. There are also probably different schedulers in use and all sorts of different settings.

@ktmf01
Collaborator

ktmf01 commented May 16, 2023

I've finally taken some time to test this under Windows, and I must say, under the right conditions the difference is horrific, truly.

I've compared three compiles: one built from the vbuf tree of your repository, one from that same tree rolled back a few commits, and one rolled back with the setvbuf call in stream_encoder.c (which has been there for many years) removed. Respectively they're called flac-extravbuf.exe, flac-nochange.exe and flac-novbuf.exe.

I've taken an old 16GB no-name exFAT USB stick and filled it to 98%. Testing was done in the remaining 2%: first copying a WAV file there, then compressing it to FLAC, then decompressing it back to WAV. These 3 files are then removed, making space for the next round.

Differences are night and day, for both encoding and decoding.

set of 16bit tracks vbuf.pdf

Tested presets are 0, 1, 2, 3 and 4. It seems encoding gets 10x as fast, and decoding gets 4-5x as fast. I think the reason 3 and 4 decode so much faster is the larger blocksize (hence larger writes); 4096 * 4 is probably a much better write size than 1152 * 4.

So, clearly, I'd say this is an improvement. Even if it gives a slight slowdown in situations where fragmentation is low (which I can't say for sure), the advantages far outweigh the (possible) disadvantages.

I'll do some testing on using smaller buffer sizes; perhaps something like 128KiB already helps a lot.

I'm wondering though, should the setvbuf be removed from libFLAC, in effect moving it to the flac command line tool? Or should we only add it to the flac command line tool for non WIN32 encoding? Or should we duplicate it, like the current state of the repo does? Maybe calling setvbuf twice on the same file doesn't really do anything?

@ktmf01
Collaborator

ktmf01 commented May 17, 2023

Here are some more detailed results

set of 16bit tracks vbuf.pdf

I'd say using a 32KiB buffer already gives a nice performance improvement. I'm not sure whether it makes much sense, but I think 10MiB is a bit wasteful. Then again, that is what's been in FLAC for years now on Windows.

I'll do some more testing, trying to get the same result with a heavily fragmented ext4 partition or something like that.

@ktmf01
Collaborator

ktmf01 commented May 17, 2023

Some more results. I took a 32GB NTFS partition on a several-years-old 250GB SSD and filled it to 99%, then ran the tests on Windows 10.
set of 16bit tracks vbuf SSD.pdf

The X-axis is different here, because the logarithmic scale doesn't show any labels. The difference is small but consistent. I have no clue why 32KiB and 10MiB buffers would be slower on decoding here than 128KiB and 512KiB buffers.

@ktmf01
Collaborator

ktmf01 commented May 23, 2023

Even more results. I tried to test something a little more realistic, so I took a 1TB HDD, filled it to 95% and ran tests. The results are the opposite of the previous ones: adding a buffer slows down encoding by about 20% for some reason, while a buffer of 128KiB or 512KiB speeds up decoding by about 40%. These results don't seem to make sense, but they are reproducible; I've run the tests several times.

Also, the binaries are run interleaved: 10 tracks are tested, and each track is tested with each binary before moving on to the next track.

set of 16bit tracks vbuf HDD.pdf

@chocolate42
Author

Sorry I just noticed your latest posts.

These results don't seem to make sense

In a way I'm glad not to be the only one mystified by benchmarking I/O.

I'd say using a 32KiB buffer already gives a nice performance improvement. I'm not sure whether it makes much sense, but I think 10MiB is a bit wasteful. Then again, that is what's been in FLAC for years now on Windows.

I came to the conclusion that 64KiB already showed a benefit and that buffers in the MiB range helped a bit more, but that was during some unreliable benchmarking, so it's inconclusive. Research indicated 8MiB could be beneficial and that there was little benefit beyond 16MiB, but I didn't save the link.

I'm wondering though, should the setvbuf be removed from libFLAC, in effect moving it to the flac command line tool?

IMO control of how a stream is handled should stay with whatever opened the stream, ie libflac when init_file is used or the external program when init_FILE is used. Quoting a manpage:

The setvbuf() function may only be used after opening a stream and before any other operations have been performed on it.

We don't know the history of the stream, so we can't guarantee that no ops have been performed. Similarly, that's why I suggested elsewhere that libflac should only fclose if libflac was the one to fopen (for all we know the stream could be an archive or a chained Ogg file that the external program doesn't necessarily want closed).

Even if it gives a slight slowdown in situations where fragmentation is low (which I can't say for sure) the advantages far outweigh the (possible) disadvantages.

If you mean a slowdown because a larger buffer takes longer for the initial flush, IMO that's not really relevant and not really a slowdown. When I/O pressure is low (kernel+hardware buffer is not saturated, ie fast hardware and/or small data) it just delays the actual write slightly and shouldn't alter ./flac's execution time. When I/O pressure is high (kernel+hardware buffer is saturated so an fwrite has to wait during runtime, ie slow hardware and/or a lot of data like processing an entire collection) there are bigger factors at play than a millisecond delay on first flush.

@ktmf01
Collaborator

ktmf01 commented May 26, 2023

I've repeated the last results, but I've added an executable that not only has a vbuf on the output file but also a vbuf on the input file. That seems to alleviate the problem.

very large set of tracks vbuf read HDD.pdf

@ValeZAA

ValeZAA commented Jun 19, 2023

That reminds me of hacks in onion108/cplusplus-output-benchmark#1
