Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose dirent pointer and added readDirStreamWith #251

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

mmhat
Copy link
Contributor

@mmhat mmhat commented Oct 1, 2022

This PR adds two new functions readDirStreamWith and readDirStreamWithPtr:

  • readDirStreamWith is a version of readDirStreamMaybe that takes a callback that is used to obtain the result from the pointer to the directory entry. This directory pointer is wrapped in a newtype called DirEnt. The function is intended to be used in cases where we know about the features of the installed libc: For example, glibc provides the information about the type of the directory entry directly in the dirent struct. Therefore it can be used directly without calling further functions and additional calls to stat(2).
    Since those extensions to dirent are not covered by POSIX but widely used, we provide only the possibility to make use of those and do not add this functionality to unix.
  • readDirStreamWithPtr takes a pre-allocated pointer in addition to the other arguments of readDirStreamWith. This pointer is used to store the pointer of the dirent struct (if there is any) which can safe some allocations when you want to read a lot of directory entries as this pre-allocated pointer can be shared between the calls to the function.

In addition to that a small CPP bug was fixed which prevented compilation on my system. Also, some tests for the readDirStream functions were added.

@mmhat mmhat force-pushed the expose-dirent branch 2 times, most recently from a2e6261 to 3f5c348 Compare October 1, 2022 14:45
@vdukhovni
Copy link

Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the DT_REG and DT_UNKNOWN entry types. The module would then also expose pattern synonyms for the expected types.

I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.

So I am not entirely enthused by this. Which is not a hard no, but significant scepticism.

tests/ReadDirStream.hs Outdated Show resolved Hide resolved
tests/ReadDirStream.hs Outdated Show resolved Hide resolved
@mmhat
Copy link
Contributor Author

mmhat commented Oct 4, 2022

@vdukhovni

Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the DT_REG and DT_UNKNOWN entry types. The module would then also expose pattern synonyms for the expected types.

It was my understanding that the unix package is about POSIX-compliance and additional features should be implemented in upstream libraries like hpath-path. This patch aims to make it easier for upstream libraries to do that as they don't have to copy/replicate the work done here.
If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those.

I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.

Ok, that's fair. I consider readDirStreamWithPtr as optional and I am willing to remove this function if requested

@mmhat
Copy link
Contributor Author

mmhat commented Oct 4, 2022

@hasufell

Am more in favor of https://hackage.haskell.org/package/hpath-posix-0.13.3/docs/System-Posix-RawFilePath-Directory-Traversals.html#v:readDirEnt

I am aware of hpath-posix (Great library!) and the purpose of this patch is precisely to make libraries like this one easier to implement. For example, I noticed that the function linked by you is essentially a copy of the readDirStream + d_type. My hope is that this contribution will reduce the need for code duplication like this.

@hasufell
Copy link
Member

I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim

👍 from me

@vdukhovni
Copy link

I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim

👍 from me

Which this are you in favour of? As I said, I'd rather not expose the proposed API and instead support the higher-level interface.

@hasufell
Copy link
Member

I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim

👍 from me

Which this are you in favour of? As I said, I'd rather not expose the proposed API and instead support the higher-level interface.

The changes in this PR. I'm fine with these low-level interfaces. Alternatively, we can also put them into Internal module and expose it to signal "you're on your own".

@hasufell
Copy link
Member

hasufell commented Dec 5, 2022

If no other maintainer chimes in, then I'll merge this in 2 weeks.

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 8, 2022

@hs-viktor @vdukhovni just a gentle reminder.

@vdukhovni
Copy link

I still mostly stand by my original comment, and perhaps expose this only as an Internal API with no stability guarantees. Maintainers of dependent packages would have to be willing to adapt to future changes...

Why not just support the higher-level (system-specific) API?

@hasufell
Copy link
Member

hasufell commented Dec 9, 2022

I'm fine with exposing this as Internal.

@hasufell
Copy link
Member

hasufell commented Dec 9, 2022

haskell/ghcup-hs#415 (comment)

One more reason to merge this and add tests.

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 9, 2022

This is way out of my area of expertise, but I'm inclined to agree with Viktor.

It was my understanding that the unix package is about POSIX-compliance and additional features should be implemented in upstream libraries like hpath-path. This patch aims to make it easier for upstream libraries to do that as they don't have to copy/replicate the work done here.

Practically unix is a package for everything which is not Windows. E. g., WASM is not a POSIX-compliant platform, and almost every OS has its own set of quirks violating POSIX.

If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those.

I'd accept such contribution.

@Bodigrim
Copy link
Contributor

@mmhat are you interested to work on a high-level API here?

@hasufell
Copy link
Member

I don't think it's appropriate for this PR to provide the high-level API. We're going to burn out the contributor this way. We should decide whether we merge this PR mostly as-is and expose it as internal API.

This has a 👍 from me.

@Bodigrim
Copy link
Contributor

We're going to burn out the contributor this way.

That's a very sensible concern, but @mmhat originally said that "If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those". It seems a win-win to me.

@mmhat
Copy link
Contributor Author

mmhat commented Dec 19, 2022

We're going to burn out the contributor this way.

That's a very sensible concern, but @mmhat originally said that "If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those". It seems a win-win to me.

@Bodigrim Luckly I have a bit more time to work on this until the end of the year so I could do that.

@mmhat
Copy link
Contributor Author

mmhat commented Dec 20, 2022

Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the DT_REG and DT_UNKNOWN entry types. The module would then also expose pattern synonyms for the expected types.

I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.

So I am not entirely enthused by this. Which is not a hard no, but significant scepticism.

@Bodigrim @hasufell I added some high-level API as outlined in the comment by @vdukhovni. I also moved readDirStreamWith and readDirStreamPtr to the internal API. Can you have a quick look if this first draft matches your expectations?

@hasufell
Copy link
Member

hasufell commented Feb 3, 2023

I just stumbled over this:

So... DirType will not work as expected on some filesystem. How do we handle that? It is as such unsafe non-portable.

@vdukhovni
Copy link

Another reason to not expose the low-level API.

@hasufell
Copy link
Member

hasufell commented Feb 3, 2023

In ghcup I've written a portable version of readDirStream: haskell/ghcup-hs@53e324b#diff-2d9015010bcf8511a8be4190d7203f50fea2ce7f5b04d3f2d44bea2b93914ce1R116

This is similar to what other projects have been doing (falling back to 'stat' on 'DT_UNKNOWN'): ggreer/the_silver_searcher#37

@mmhat
Copy link
Contributor Author

mmhat commented Feb 8, 2023

Another reason to not expose the low-level API.

I kindly disagree: IMHO, when it comes to directory tree traversal, the functions we provide in this package are already low-level. The user needs to understand concepts like a directory stream and we throw a bunch of low-level technical terms at them in the Haddocks (like dirent, d_name, ...). I think it is reasonable to assume that they are familiar with all that stuff in general and that exposing DirType does not add another level of "low-leveliness" that should be hidden.

To resolve this issue with unsupported filesystems I see the following options:

  1. We decide that exposing DirType comes indeed with too much limitations and we do not want to support it in this library. Hence this contribution here is rejected.
  2. We expose the readDirStreamWithType functions and DirType as part of an internal API. That would probably mean to create modules like System.Posix.Directory.Internals.ByteString and move those functions there.
  3. We expose the readDirStreamWithType functions and DirType as part of an public API as it is currently done in this PR.

If we go with (2) or (3) I'd like to add some documentation pointing out what the limitations of DirType are. In addition to that I could add a portable version of readDirStreamWithType like @hasufell did for ghcup-hs or change it to readDirStreamWithType :: Bool -> ... where the Bool indicates whether to use stat on DT_UKNOWN or not. Another option is to return a Maybe DirType where Nothing means DT_UNKNOWN.

Personally, I'd prefer option (3) with readDirStreamWithType as it is. Also, I'd rather add the Bool argument than introduce another (portable) version of readDirStreamWithType.

@hasufell
Copy link
Member

hasufell commented Feb 8, 2023

where the Bool indicates whether to use stat on DT_UKNOWN

When would someone pass False here? I believe we can expose the non-portable version in a module that's clearly marked non-portable. I think otherwise the unix package really strives to be... well, portable.

No need for a Bool option imo.

@mmhat
Copy link
Contributor Author

mmhat commented Feb 8, 2023

where the Bool indicates whether to use stat on DT_UKNOWN

When would someone pass False here?

Good point; I always thought of the readDirStream functions as wrappers around readdir, hence being unopinionated about fallback behavior and leave it up to the user to decide what to do if a filesystem doesn't support certain functionality. So far for the theory, I admit I don't have a use case in mind where somebody would pass False.

There is however a corner case I am not sure how to deal with in a fallback implementation: Suppose the underlying platform does not properly support d_type and some of DT_UNKNOWN, DT_REG, ... are not defined in dirent.h. Then the corresponding constants in this package will be all be -1 due to FP_CHECK_CONSTS in the configure.ac and a fallback implementation like the one linked in #251 (comment) won't save us as the constructed DirType value is meaningless.

@hasufell
Copy link
Member

hasufell commented Feb 9, 2023

There is however a corner case I am not sure how to deal with in a fallback implementation: Suppose the underlying platform does not properly support d_type and some of DT_UNKNOWN, DT_REG, ... are not defined in dirent.h. Then the corresponding constants in this package will be all be -1 due to FP_CHECK_CONSTS in the configure.ac and a fallback implementation like the one linked in #251 (comment) won't save us as the constructed DirType value is meaningless.

We should be able to check for DT_UNKNOWN or -1 in the stat-fallback, no?

@vdukhovni
Copy link

Returning a directory name along with a DT_UNKNOWN or actual file type is the lowest-level interface we should expose in my view. There's little reason to distinguish at that layer between the filesystem not supporting file type metadata in the directory entry and the C-API not making the data available.

Users who want robust/efficient traversal based on that data will for now need to roll their own upper layers. At some point we could offer a suitably robust iterator (making the equivalent of "rm -r" safe against hostile actors manipulating the directory tree during the traversal). That requires some care.

@hasufell
Copy link
Member

hasufell commented Mar 14, 2023

Returning a directory name along with a DT_UNKNOWN or actual file type is the lowest-level interface we should expose in my view.

I'm not sure if my opinion was clear enough: I'm suggesting to only expose C FFI, at most a haskell function with a callback taking the DirEnt as an argument (the initial proposal from the author). Nothing else.

I'm still very clearly -1 on providing a seemingly high-level Haskell function that is non-portable.

@hasufell
Copy link
Member

I propose to unblock this with a vote:

  1. @vdukhovni suggests to expose a function that doesn't abstract over platforms that don't have d_type and let the caller handle it: Expose dirent pointer and added readDirStreamWith #251 (review)

  2. @hasufell proposed to expose FFI functions and a high-level function that abstracts over platforms, with a possible performance penalty on those.

I'm +1 on suggestion 2.

@vdukhovni @Bodigrim @pcapriotti @chessai @igfoo

@hasufell
Copy link
Member

hasufell commented Oct 5, 2023

statx has been added to unix. But it doesn't work on macOS afaiu.

@vdukhovni
Copy link

statx has been added to unix. But it doesn't work on macOS afaiu.

Not surprising, MacOS unix is a *BSD descendent. There is no statx(2) in FreeBSD or NetBSD either, and I expect ditto for OpenBSD.

@vdukhovni
Copy link

I propose to unblock this with a vote:

  1. @vdukhovni suggests to expose a function that doesn't abstract over platforms that don't have d_type and let the caller handle it: Expose dirent pointer and added readDirStreamWith #251 (review)
  2. @hasufell proposed to expose FFI functions and a high-level function that abstracts over platforms, with a possible performance penalty on those.

I'm +1 on suggestion 2.

@vdukhovni @Bodigrim @pcapriotti @chessai @igfoo

My vote is still 1. Not fond of exposing FFI to low-level storable structures with platform-dependent fields.
The statx(2) implementation is the sort of thing that I do support. A high-level API that's supported on a subset of platforms.
In the case of statx(2), did we end up exposing a compile-time constant that makes it easy for users of unix to determine whether their platform has statx(2) without resorting to CPP or try blocks? Perhapst that should be added if not present.

@vdukhovni
Copy link

In the case of statx(2), did we end up exposing a compile-time constant that makes it easy for users of unix to determine whether their platform has statx(2) without resorting to CPP or try blocks? Perhapst that should be added if not present.

A user could approximate this by checking whether StatxType == 0 or some other expected to be non-zero with statx(2) pattern synonym, but we should perhaps provide a more natural indication of support.

@hasufell
Copy link
Member

hasufell commented Oct 5, 2023

@vdukhovni

#include "HsUnix.h"

module Main where

main :: IO ()
main = do
#ifdef HAVE_STATX
  putStrLn "STATX"
#else
  putStrLn "Hello, Haskell!"
#endif

This seems to work.


https://github.com/hasufell/unix-test

@vdukhovni
Copy link

vdukhovni commented Oct 5, 2023

@vdukhovni

...
#ifdef HAVE_STATX
  putStrLn "STATX"
#else
  putStrLn "Hello, Haskell!"
#endif

This seems to work.

Sure, but I was thinking that it be friendly to our users to do that for them, and expose a boolean value they can test without any of CPP guards or #include directives. They can do that by carefully choosing which pattern synonym to compare with 0, but is ad-hoc and non-obvious.

This is not a "must have", just something that could be a useful addition, unless you see reasons not to...

@hasufell
Copy link
Member

I'm a bit embarrassed that this didn't move forward in time.

My suggestion is to move the new API to modules that end in .Internal and just merge.

Any objections?

@mmhat
Copy link
Contributor Author

mmhat commented May 21, 2024

@hasufell Would work for me. But I think it would be good to keep the issue open, I still hope that we expose some of the functionality in the public API in the future...

@hasufell
Copy link
Member

hasufell commented May 25, 2024

Alright. If you adjust the PR to expose via Internal modules, I will merge in 2 weeks, unless there are objections.

To be clear: internal modules are still under PVP, but I see no obligation to do e.g. deprecation periods for them. They may break at random.

mmhat added 14 commits May 30, 2024 18:28
This commit exposes the dirent pointer used in the directory stream
functions wrapped in a newtype called `DirEnt`. It also adds a new
function `readDirStreamWith` that takes a callback that is used to
obtain the result from the pointer to the directory entry.
This version of `readDirStreamWith` takes a pre-allocated pointer as an
additional argument that is used to store the pointer to the dirent. It
is useful when you want to read a lot of entries and want to safe a few
allocations since you can reuse the pointer for each call to
`readDirStreamWithPtr`.
 * Added `readDirStream`/`readDirStream` for ByteString, PosixPath
 * Removed `DirEnt`, `readDirStreamWith`, `readDirStreamWithPtr` from public API
 * Changed the assignment of values for missing DT_* constants. Since
   the 'd_type' field is of type 'unsigned char' we assign a negative
   number for each missing DT_* constant so they are distinguishable but
   do not collide with the values from the libc implementation.
 * Added a new 'DirStreamWithPath' that contains the path of the
   directory the directory stream belongs to.
 * 'readDirStreamWithType' falls back to a 'stat' if the 'd_type' is
   unknown or undetermined.
 * Added some tests for 'readDirStreamWithType'.
@mmhat
Copy link
Contributor Author

mmhat commented May 30, 2024

@hasufell The stuff from the System.Posix.Directory.Common module is now exposed in System.Posix.Directory.Internals. The actual implementation still lives in the Common since it is used in the implementations of the readDirStreamMaybe functions. Is that ok?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants