Introduce pdfcabinet #1643

ginabythebay · 2023-01-01T16:53:59Z

There was a small amount of discussion about this here: https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

At this point, it seems to more or less work. It is pretty much a ripoff of scanning cabinet, but oriented to work with pdfs rather than images. Since a pdf often contains multiple pages and I couldn't think of a reason for a pdf document to contain multiple pdfs, the relationship is 1:1.

I thought this would be a good place to share what I have and see how it looks to folks.

I haven't yet implemented the 'who' functionality but don't think that will be too hard. Pretty much like tags, but with an attribute named something like 'pdfcabinet:who'.

Right now when displaying the pdf document, I'm using an <object> tag to show the associated pdf on the page, and it contains all the perkeep chrome too, which is a bit weird. If someone can point me at a way to embed it without the perkeep chrome inside the object, I would appreciate it.

Another weirdness is that I am currently storing the filename of the pdf in the pdfcabinet:pdf permanode even though it is also in the file permanode. That is because I couldn't figure out how to efficiently grab it (I need it for each un-annotated pdf so I have something to identify the pdf to the user on the main page).

I'll probably also want to provide some facility for bulk uploading of already tagged and dated pdfs, but have not thought hard about this yet.

MichaHoffmann · 2023-01-02T07:10:04Z

Hey,

I only gave it a quick, cursory look, but I wonder: couldn't this be built upon/into scanningcabinet in some way?

ginabythebay · 2023-01-02T12:35:43Z

Probably and that was an idea I raised here: https://groups.google.com/g/perkeep/c/B_Wq3ovph2I Since I didn’t receive much feedback I went ahead with this approach, which is simpler and doesn’t affect existing use cases for scanningcabinet.

ginabythebay · 2023-01-02T19:34:26Z

I'm thinking this larger strategic issue would be best discussed on email. I've updated the thread here https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

MichaHoffmann · 2023-01-02T21:28:05Z

> I'm thinking this larger strategic issue would be best discussed on email. I've updated the thread here https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

I answered on the mailing list, but the message does not show up yet.

ginabythebay · 2023-01-03T02:00:05Z

Yeah, lol, something seems to be borked with google groups maybe? Uh, sorry to ask this, but can you repeat your response here so I can see it?

For any other interested reviewers, this is what I wrote there:

I've since created PR #1643, where I went with the 'creating a separate app' option (in go).

In a PR comment, Micha asked if it would make sense to instead build the pdfcabinet upon/into scanningcabinet. I feel like it is better to try to continue this discussion here rather than in the PR.

I guess I'd like to get a sense of whether anyone actually cares about scanningcabinet. Not sure since it was sitting somewhat broken when I started playing with it recently. #1635

If nobody cares about scanningcabinet, perhaps the best thing to do is to remove it.

If someone does care about scanningcabinet, how would you feel about merging pdf functionality into it? The primary difference, I think, is the 1:1 nature of pdfs and documents, whereas scanningcabinet expects multiple images (pages) per document.

That will show up in the UI, especially in the creation of documents. In scanningcabinet, you are expected to select some images, then click the button to turn that into a document. In pdfcabinet, you click the button associated with the pdf you want. Each flow should ideally be optimized for the user, to make it easy to create documents quickly.

If we were to try to merge pdfcabinet/scanningcabinet together, I lean towards some kind of mode the user can set to decide how they want to create documents (selecting multiple items vs. a single item).

The display of documents will also be affected by this. scanningcabinet lays out multiple images (using img tags I assume) where pdfcabinet uses an object to embed the pdf into the html page. This seems manageable...just thinking out loud, but I lean towards, at document creation time, marking the document permanode with an attribute to indicate which kind it is.

Does anyone else have thoughts related to this?

MichaHoffmann · 2023-01-03T07:04:52Z

Hey,

I wrote that PDFs would be more useful to my use case. If you are dogfooding this for your use case, a separate app would be pretty cool too. If someone is using scancab, we can think down the road about whether it's appropriate to merge the two somehow. I'll give it a test drive in the (central european) evening.

ginabythebay · 2023-01-03T13:11:30Z

I would be fine with treating this as separate for now and merging the two later if/when that makes sense. I look forward to you giving it a try!

MichaHoffmann · 2023-01-03T22:34:55Z

Hey,

I just tested really quickly: I uploaded a multi-page PDF using curl, and it worked and was displayed as expected. IIRC scanningcabinet came with a small tool to upload, but I don't think it matters too much. Uploads could probably also be handled with a nicer UI at some point (using elm to write a UI is still on my bucket list).
I wonder if the data model fits, since most of my pdfs do not have a physical location at all. Also, the name could be deduced from the filename, I think.

What are your plans for this app? Does it fit your use case as is, or do you have some follow-up work planned?

ginabythebay · 2023-01-04T01:37:20Z

Thanks so much for giving it a try and offering feedback!

I agree the upload could be made much easier. Perhaps some web ui that allows the user to choose multiple files at a time. I'm not much of an html expert.

I am not wedded to the idea of having physical location be part of pdf cabinet. It was something that existed in scanning cabinet and I was trying to keep things as much the same as possible. But maybe it is actually better to start more minimally and add as we find reasons to do so.

Defaulting the title from the filename seems reasonable. In my case I suspect they will always be different but maybe not for other people.

The main thing I see myself wanting to do with the app is to add a UI field called 'who', which can be used to annotate a document with, e.g., the sender/receiver. I'm thinking it would be a multi-valued attribute, like tag. In my current pdfs I usually search by who or by tag, depending on what I am doing.
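
As a rough illustration of 'who' as a multi-valued attribute that sits alongside tags, here is a small in-memory Go sketch. It is not Perkeep code: the `Document` type and its methods are hypothetical stand-ins for a document permanode, the plain `tag` attribute name is an assumption, and `pdfcabinet:who` is only the name floated earlier in this thread.

```go
package main

import "fmt"

// Attribute names. attrWho is the name suggested in this thread; attrTag
// is assumed here purely for illustration.
const (
	attrTag = "tag"
	attrWho = "pdfcabinet:who"
)

// Document is a hypothetical in-memory stand-in for a document permanode.
// Attrs maps an attribute name to all of its values.
type Document struct {
	Title string
	Attrs map[string][]string
}

// AddValue appends value to the multi-valued attribute attr, ignoring
// values that are already present.
func (d *Document) AddValue(attr, value string) {
	if d.Attrs == nil {
		d.Attrs = make(map[string][]string)
	}
	for _, v := range d.Attrs[attr] {
		if v == value {
			return
		}
	}
	d.Attrs[attr] = append(d.Attrs[attr], value)
}

// Has reports whether attr contains value.
func (d *Document) Has(attr, value string) bool {
	for _, v := range d.Attrs[attr] {
		if v == value {
			return true
		}
	}
	return false
}

// titlesWith returns, in input order, the titles of the documents whose
// attribute attr contains value.
func titlesWith(docs []*Document, attr, value string) []string {
	var titles []string
	for _, d := range docs {
		if d.Has(attr, value) {
			titles = append(titles, d.Title)
		}
	}
	return titles
}

func main() {
	bill := &Document{Title: "water bill"}
	bill.AddValue(attrTag, "utilities")
	bill.AddValue(attrWho, "City Water Dept")

	letter := &Document{Title: "rate change notice"}
	letter.AddValue(attrWho, "City Water Dept")

	docs := []*Document{bill, letter}
	fmt.Println(titlesWith(docs, attrWho, "City Water Dept"))
	fmt.Println(titlesWith(docs, attrTag, "utilities"))
}
```

Because 'who' and tags live under different attribute names, searching by one does not pick up values from the other, which is the "similar to tags, but orthogonal" behaviour described here.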

Due to this review, I'm starting to second-guess my decision to have a pdfcabinet:pdf permanode. I kept that without thinking too hard about it as an analog to scanningcabinet:image. But maybe pdfcabinet:document should just point at the file permanode of the pdf instead. I'm starting to think of a more streamlined ingestion process where the pdf gets uploaded and a pdfcabinet:document permanode is created, just with no tags. The main page already (like scanning cabinet) shows the user all of their untagged documents so they can go clean them up if they want to.

ginabythebay · 2023-01-04T01:43:29Z

Converted this to draft status for now so I can explore the idea of getting rid of the pdfcabinet:pdf permanode. I had been fine with the idea of adding a 'who' entry in a separate change, but I'd like to get the expected core schema in place in the first change if I can.

I think it is working. There may be more cruft I can cut out

ginabythebay · 2023-01-07T14:20:22Z

OK, I think this is ready for review now. At this point, pdfcabinet:doc nodes point at pdf file permanodes (I eliminated pdfcabinet:pdf nodes). I also removed physical location for now, and the title now defaults to the filename, without any extensions.
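
For readers who want a picture of the title default, here is a minimal Go sketch. It is not the code from this PR: `titleFromFilename` is a hypothetical helper, and it strips only the last extension, which is an assumption, since the thread does not say whether names like `x.tar.gz` lose one extension or all of them.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// titleFromFilename is a hypothetical helper. It drops any directory part
// and the final extension, and falls back to the bare file name when
// trimming would leave nothing (for example a file literally named ".pdf").
func titleFromFilename(name string) string {
	base := filepath.Base(name)
	title := strings.TrimSuffix(base, filepath.Ext(base))
	if title == "" {
		return base
	}
	return title
}

func main() {
	fmt.Println(titleFromFilename("2022-taxes.pdf"))
	fmt.Println(titleFromFilename("scans/statement.tar.gz"))
}
```

Since the title is only a default, the user can still overwrite it on the document page when the filename is not a good description.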

I believe this is good enough to merge and dogfood, but of course there are more things to be done.

Near term:

1. Some kind of UI for uploading files would make this a lot more approachable.
2. I want to introduce the 'who' field(s), similar to tags, but orthogonal.
3. It would be nice to have a bulk upload tool which allows a user to not only upload the pdf, but to set the other attributes (e.g. title, tags etc).

Longer term perhaps this doesn't need to be so pdf-specific but I haven't thought about that too hard.

ginabythebay · 2023-01-19T01:49:46Z

ping

bradfitz · 2023-01-19T02:16:12Z

Sorry, I saw the summary but I'm having trouble imagining the app itself and what the goal is.

With scanning cabinet, the goal was to decouple the task of scanning a stream of pages, feeding them into the scanner, and the task of segmenting the stream of pages into document objects, and then tagging those documents.

If your PDF blobs are 1:1 with permanodes, what's the goal of pdfcabinet over just using the Perkeep web UI?

Admittedly I haven't patched it in or played with it. You don't happen to have a video or screenshots?

ginabythebay · 2023-01-19T02:41:08Z

I’m hoping it could be useful to someone who doesn’t have to understand what a permanode is. I can work on some screenshots this weekend.

ginabythebay · 2023-01-21T20:45:00Z

Here are some screenshots.

This first one is what it looks like after you have uploaded a few files (which I'm currently doing manually through curl; clearly a more user-friendly option would be nice. I'm hoping MichaHoffmann will be interested and have time, but if not I can probably cobble something together).
1_uploaded.pdf

This second one looks a little weird because the screenshot software built into chrome seems to be trying to enforce page breaks (it is also doing something weird with margins in the embedded pdf that doesn't appear on my screen). It shows the title pre-filled based on the file name, along with a place to define some metadata, as well as a preview of the file. The 'PDF' link takes the user directly to the pdf in perkeep.
2_initial_detail.pdf

This last screenshot shows the view after filling in some tags and then searching on one of them.
3_tag_results.pdf

For better or worse, I just used the scancab ui to get this far, removing the parts that don't apply when there is a 1:1 relationship.

I think it is useful to have an app over the perkeep UI as it helps to maintain a namespace (for tags) that is separate from whatever blobs are floating around in perkeep and the user doesn't have to learn the perkeep UI which seems to assume the user has some pretty esoteric knowledge.

This is pretty minimal at this point. Future work could include:
1: a better upload story
2: a reasonable export story
3: the ability to tag documents with information regarding who they are for/from
4: maybe some help for users who already have pdfs floating around in perkeep and want to include them in pdfcabinet

marchon · 2023-01-24T21:23:43Z

ginabythebay, I would enjoy having an extended voice or zoom call with you about pdfs and possible usages for my business case. Would you be willing to reach out to me at marchon@gmail.com, since I can't leave messages in the group mailing lists or web discussions yet?

ginabythebay · 2023-01-25T03:34:18Z

I’m sorry but I cannot make myself available for the extended call. You can certainly use this forum to discuss the pull request. Also, once you have access, you can discuss it in the group mailing list.

marchon · 2023-01-25T05:31:18Z

Thank you.

dev/devcam/server.go

bradfitz

LGTM otherwise

ginabythebay · 2023-02-04T14:05:06Z

What is the next step with this? I see that it is approved but also that merging is blocked because "The base branch restricts merging to authorized users."

ginabythebay added 2 commits January 1, 2023 08:43

lint: cremove unused code pointed out by staticcheck

6d9e5d5

ginabythebay marked this pull request as draft January 4, 2023 01:41

ginabythebay added 3 commits January 5, 2023 16:34

get rid of physical location

d74b7bf

1st draft remove pdf object

2887fec

I think it is working. There may be more cruft I can cut out

trim file extension when using it as title

390ae9d

ginabythebay marked this pull request as ready for review January 7, 2023 14:20

bradfitz reviewed Jan 27, 2023

bradfitz approved these changes Jan 27, 2023

Add scanningcabinet back into the list of targets

a0ba180
