Introduce pdfcabinet #1643

Open
wants to merge 6 commits into
base: master

ginabythebay
Contributor

There was a small amount of discussion about this here: https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

At this point, it seems to more or less work. It is pretty much a ripoff of scanning cabinet, but oriented to work with pdfs rather than images. Since a pdf often contains multiple pages and I couldn't think of a reason for a pdf document to contain multiple pdfs, the relationship is 1:1.

I thought this would be a good place to share what I have and see how it looks to folks.

I haven't yet implemented the 'who' functionality but don't think that will be too hard. Pretty much like tags, but with an attribute named something like 'pdfcabinet:who'.

Right now when displaying the pdf document, I'm using an <object> tag to show the associated pdf on the page and it contains all the perkeep chrome too, which is a bit weird. If someone can point me at a way to embed it without the perkeep chrome inside the object, I would appreciate it.

Another weirdness is that I am currently storing the filename of the pdf in the pdfcabinet:pdf permanode even though it is also in the file permanode. That is because I couldn't figure out how to efficiently grab it (I need it for each un-annotated pdf so I have something to identify the pdf to the user on the main page).

I'll probably also want to provide some facility for bulk uploading of already tagged and dated pdfs, but have not thought hard about this yet.

@MichaHoffmann
Member

Hey,

I only gave it a quick cursory look, but I wonder: couldn't this be built upon/into scanningcabinet in some way?

@ginabythebay
Contributor Author

ginabythebay commented Jan 2, 2023 via email

@ginabythebay
Contributor Author

I'm thinking this larger strategic issue would be best discussed on email. I've updated the thread here https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

@MichaHoffmann
Member

I'm thinking this larger strategic issue would be best discussed on email. I've updated the thread here https://groups.google.com/g/perkeep/c/B_Wq3ovph2I

Answered in the mailing list, message does not yet show up though

@ginabythebay
Contributor Author

ginabythebay commented Jan 3, 2023

Yeah, lol, something seems to be borked with google groups maybe? Uh, sorry to ask this, but can you repeat your response here so I can see it?

For any other interested reviewers, this is what I wrote there:

I've since created PR #1643, where I went with the 'creating a separate app' option (in go).

In a PR comment, Micha asked if it would make sense to instead build the pdfcabinet upon/into scanningcabinet. I feel like it is better to try and continue this discussion here rather than in the PR.

I guess I'd like to get a sense of whether anyone actually cares about scanningcabinet. Not sure since it was sitting somewhat broken when I started playing with it recently. #1635

If nobody cares about scanningcabinet, perhaps the best thing to do is to remove it.

If someone does care about scanningcabinet, how would you feel about merging pdf functionality into it? The primary difference, I think, is the 1:1 nature of pdfs and documents, where scanningcabinet expects multiple images (pages) per document.

That will show up in the UI, especially in the creation of documents. In scanningcabinet, you are expected to select some images, then click the button to turn that into a document. In pdfcabinet, you click the button associated with the pdf you want. Each flow should ideally be optimized for the user, to make it easy to create documents quickly.

If we were to try to merge pdfcabinet/scanningcabinet together, I lean towards some kind of mode the user can set to decide how they want to create documents (selecting multiple items vs. a single item).

The display of documents will also be affected by this. scanningcabinet lays out multiple images (using img tags I assume) where pdfcabinet uses an object to embed the pdf into the html page. This seems manageable...just thinking out loud, but I lean towards, at document creation time, marking the document permanode with an attribute to indicate which kind it is.

Does anyone else have thoughts related to this?
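
The idea of marking each document with its kind at creation time and choosing the display from it could be sketched in Go as follows. The `document` type, the kind values, and the URLs are hypothetical and only illustrate the dispatch; neither app is known to store such an attribute today.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Hypothetical values for a "kind" attribute set when a document is created.
const (
	kindImages = "images" // scanningcabinet style: several page images
	kindPDF    = "pdf"    // pdfcabinet style: exactly one pdf
)

// document is an illustrative stand-in for a document permanode.
type document struct {
	Kind string
	URLs []string // page image URLs, or a single pdf URL
}

// renderBody picks the markup for a document based on its kind.
func renderBody(d document) (string, error) {
	switch d.Kind {
	case kindImages:
		var b strings.Builder
		for _, u := range d.URLs {
			fmt.Fprintf(&b, `<img src="%s">`, html.EscapeString(u))
		}
		return b.String(), nil
	case kindPDF:
		if len(d.URLs) != 1 {
			return "", fmt.Errorf("pdf document needs exactly one pdf, got %d", len(d.URLs))
		}
		return fmt.Sprintf(`<object data="%s" type="application/pdf"></object>`,
			html.EscapeString(d.URLs[0])), nil
	default:
		return "", fmt.Errorf("unknown document kind %q", d.Kind)
	}
}

func main() {
	out, err := renderBody(document{Kind: kindPDF, URLs: []string{"/example/bill.pdf"}})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```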

@MichaHoffmann
Member

Hey,

I wrote that PDFs would be more useful to my use case. If you are dogfooding this for your use case, a separate app would be pretty cool too. If someone is using scancab, we can think down the road about whether it's appropriate to merge the two somehow. I'll give it a test drive in the (central European) evening.

@ginabythebay
Contributor Author

ginabythebay commented Jan 3, 2023 via email

@MichaHoffmann
Member

Hey,

I just tested really quick and uploaded a multi-page PDF using curl; it worked and was displayed as expected. IIRC scanningcabinet came with a small tool to upload, but I don't think it matters too much. Probably uploads could also be handled with a nicer UI at some point too (using elm to write a UI is still on my bucket list).
I wonder if the data model is fitting, since most of my pdfs do not have a physical location at all. Also, the name could be deduced from the filename, I think.

What are your plans for this app? Does it fit your usecase as is or do you have some followup work planned?

@ginabythebay
Contributor Author

Thanks so much for giving it a try and offering feedback!

I agree the upload could be made much easier. Perhaps some web ui that allows the user to choose multiple files at a time. I'm not much of an html expert.

I am not wedded to the idea of having physical location be part of pdf cabinet. It was something that existed in scanning cabinet and I was trying to keep things as much the same as possible. But maybe it is actually better to start more minimally and add as we find reasons to do so.

Defaulting the title from the filename seems reasonable. In my case I suspect they will always be different but maybe not for other people.

The main thing I see wanting to do with the app is to add a UI field called 'who' which can be used to annotate a document with, e.g., the sender/receiver. I'm thinking it would be a multivalue attribute, like tag. In my current pdfs I usually search by who or by tag, depending on what I am doing.

Due to this review, I'm starting to second-guess my decision to have a pdfcabinet:pdf permanode. I kept that without thinking too hard about it as an analog to scanningcabinet:image. But maybe pdfcabinet:document should just point at the file permanode of the pdf instead. I'm starting to think of a more streamlined ingestion process where the pdf gets uploaded and a pdfcabinet:document permanode is created, just with no tags. The main page already (like scanning cabinet) shows the user all of their untagged documents so they can go clean them up if they want to.
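
The streamlined model discussed above can be sketched in Go like this. The `doc` type, its `FileRef` field, and the helper names are hypothetical; only the attribute name `pdfcabinet:who` comes from the PR description, and even there it is given as "something like". The sketch covers a document that references its pdf directly, multi-valued tag and who attributes, and the untagged listing for the main page.

```go
package main

import "fmt"

// whoAttr is the attribute name floated in the PR description.
const whoAttr = "pdfcabinet:who"

// doc is an illustrative stand-in for a pdfcabinet document permanode.
type doc struct {
	Title   string
	FileRef string              // hypothetical reference to the pdf's file permanode
	Attrs   map[string][]string // multi-valued attributes such as "tag"
}

// addAttr appends a value to a multi-valued attribute, skipping duplicates.
func (d *doc) addAttr(name, value string) {
	if d.Attrs == nil {
		d.Attrs = make(map[string][]string)
	}
	for _, v := range d.Attrs[name] {
		if v == value {
			return
		}
	}
	d.Attrs[name] = append(d.Attrs[name], value)
}

// untagged returns the documents that have no "tag" values yet, which is
// what the main page would list for cleanup.
func untagged(docs []*doc) []*doc {
	var out []*doc
	for _, d := range docs {
		if len(d.Attrs["tag"]) == 0 {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	bill := &doc{Title: "gas-bill", FileRef: "example-ref-1"}
	letter := &doc{Title: "letter", FileRef: "example-ref-2"}
	bill.addAttr("tag", "utilities")
	bill.addAttr(whoAttr, "City Utilities")

	for _, d := range untagged([]*doc{bill, letter}) {
		fmt.Println("needs tagging:", d.Title)
	}
}
```

Keeping who as its own attribute rather than a specially prefixed tag would let the two be searched independently.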

@ginabythebay ginabythebay marked this pull request as draft January 4, 2023 01:41
@ginabythebay
Contributor Author

Converted this to draft status for now so I can explore the idea of getting rid of the pdfcabinet:pdf permanode. I had been fine with the idea of adding a 'who' entry in a separate change, but I'd like to get the expected core schema in place in the first change if I can.

@ginabythebay
Contributor Author

OK, I think this is ready for review now. At this point, pdfcabinet:doc nodes point at pdf file permanodes (I eliminated pdfcabinet:pdf nodes). I also removed physical location for now, and the title now defaults to the filename, without any extensions.
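
One way to read "without any extensions" is sketched below in Go. The function name `defaultTitle` is hypothetical, and stripping every trailing extension (so `backup.tar.gz` becomes `backup`) is an assumption about the intended behavior, not a description of the code in this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// defaultTitle derives a document title from an uploaded file name by
// dropping the directory part and then every trailing extension.
// A name that consists only of a dot-prefixed word (".hidden") is kept
// as is so the title never becomes empty.
func defaultTitle(filename string) string {
	title := filepath.Base(filename)
	for {
		ext := filepath.Ext(title)
		if ext == "" || ext == title {
			return title
		}
		title = strings.TrimSuffix(title, ext)
	}
}

func main() {
	fmt.Println(defaultTitle("2022-gas-bill.pdf")) // 2022-gas-bill
}
```

A name with dots in the middle, such as `2023.01.statement.pdf`, would be cut down to `2023` under this reading, so stripping only the final extension may be the friendlier choice.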

I believe this is good enough to merge and dogfood, but of course there are more things to be done.

Near term:

1. some kind of UI for uploading files would make this a lot more approachable.
2. I want to introduce the 'who' field(s), similar to tags, but orthogonal.
3. It would be nice to have a bulk upload tool which allows a user to not only upload the pdf, but to set the other attributes (e.g. title, tags etc).

Longer term perhaps this doesn't need to be so pdf-specific but I haven't thought about that too hard.

@ginabythebay ginabythebay marked this pull request as ready for review January 7, 2023 14:20
@ginabythebay
Contributor Author

ping

@bradfitz
Contributor

Sorry, I saw the summary but I'm having trouble imagining the app itself and what the goal is.

With scanning cabinet, the goal was to decouple the task of scanning a stream of pages, feeding them into the scanner, and the task of segmenting the stream of pages into document objects, and then tagging those documents.

If your PDF blobs are 1:1 with permanodes, what's the goal of pdfcabinet over just using the Perkeep web UI?

Admittedly I haven't patched it in or played with it. You don't happen to have a video or screenshots?

@ginabythebay
Contributor Author

ginabythebay commented Jan 19, 2023 via email

@ginabythebay
Contributor Author

Here are some screenshots.

This first one is what it looks like after you have uploaded a few files (which I'm currently doing manually through curl; clearly a more user-friendly option would be nice. I'm hoping MichaHoffmann will be interested and have time, but if not I can probably cobble something together).
1_uploaded.pdf

This second one looks a little weird because the screenshot software built into Chrome seems to be trying to enforce page breaks (it is also doing something weird with margins in the embedded pdf that doesn't appear on my screen). It shows the title pre-filled based on the file name, along with a place to define some metadata, as well as a preview of the file. The 'PDF' link takes the user directly to the pdf in perkeep.
2_initial_detail.pdf

This last screenshot shows the view after filling in some tags and then searching on one of them.
3_tag_results.pdf

For better or worse, I just used the scancab ui to get this far, removing the parts that don't apply when there is a 1:1 relationship.

I think it is useful to have an app over the perkeep UI as it helps to maintain a namespace (for tags) that is separate from whatever blobs are floating around in perkeep and the user doesn't have to learn the perkeep UI which seems to assume the user has some pretty esoteric knowledge.

This is pretty minimal at this point. Future work could include:
1: a better upload story
2: a reasonable export story
3: the ability to tag documents with information regarding who they are for/from
4: maybe some help for users who already have pdfs floating around in perkeep and want to include them in pdfcabinet

@marchon

marchon commented Jan 24, 2023

ginabythebay, I would enjoy having an extended voice or zoom call with you about pdfs and possible usages for my business case. Would you be willing to reach out to me at marchon@gmail.com, since I can't leave messages in the group mailing lists or web discussions yet?

@ginabythebay
Contributor Author

ginabythebay commented Jan 25, 2023 via email

@marchon

marchon commented Jan 25, 2023 via email

dev/devcam/server.go (outdated, resolved)
Contributor

@bradfitz bradfitz left a comment

LGTM otherwise

@ginabythebay
Contributor Author

What is the next step with this? I see that it is approved but also that merging is blocked because “The base branch restricts merging to authorized users.”

4 participants