0.14.0
BREAKING CHANGES
Turn table extraction for PDFs and images off by default. This reverts the default behavior for table extraction to "off" for PDFs and images. A number of users did not realize the default had changed and were affected by slower processing times caused by the extra model call for table extraction.
Enhancements
Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
Faster evaluation. Support concurrent processing of documents during evaluation.
Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.
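As a rough illustration of the concurrent evaluation item above, the sketch below scores several documents in parallel with a thread pool. The release note does not say whether threads or processes are used, so the pool type is an assumption, and score_document and evaluate_concurrently are hypothetical names, not part of the library.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def score_document(text: str) -> float:
    """Hypothetical metric: fraction of characters that are not whitespace."""
    if not text:
        return 0.0
    return sum(not ch.isspace() for ch in text) / len(text)


def evaluate_concurrently(documents: list[str], max_workers: int = 4) -> list[float]:
    """Score every document in a worker pool (assumed design, not the library's code)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(score_document, documents))
```

Because pool.map preserves input order, each score can be matched back to its document by position.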
Features
Add form extraction basics (document elements and placeholder code in partition). This lays the groundwork for future work. Form extraction models are not currently available in the library, so any attempt to use this functionality raises a NotImplementedError.
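Callers that experiment with the placeholder may want to guard against the NotImplementedError described above. The following is a minimal sketch of that pattern. extract_forms is a hypothetical stand-in for whatever entry point triggers form extraction and is not the library's API.

```python
from __future__ import annotations


def extract_forms(path: str) -> list[dict]:
    """Hypothetical stand-in for a form-extraction call; always unavailable here."""
    raise NotImplementedError("form extraction models are not available")


def try_extract_forms(path: str) -> list[dict] | None:
    """Return extracted forms, or None while the feature is unimplemented."""
    try:
        return extract_forms(path)
    except NotImplementedError:
        # Fall back gracefully instead of failing the whole pipeline.
        return None
```

Returning None keeps the rest of a pipeline running until real form extraction models ship.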
Fixes
Add missing starting_page_num param to partition_image
Make the filename and file params for partition_image and partition_pdf match the other partitioners
Fix include_slide_notes and include_page_breaks params in partition_ppt
Re-apply the skip accuracy calculation feature. It had been overwritten by mistake.
Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.
Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.
Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.
Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
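The two disk-space leaks above share a pattern: a file-like object is written to a temporary file so a path-based tool can read it, and that temporary file must be removed afterwards. The sketch below shows one way to guarantee cleanup with the standard library. It is not the library's actual fix; process_file_like and handler are hypothetical names, and the "input.doc" file name is only illustrative.

```python
import os
import tempfile
from typing import IO, Callable, TypeVar

T = TypeVar("T")


def process_file_like(file: IO[bytes], handler: Callable[[str], T]) -> T:
    """Spill a file-like object to disk, run handler on the path, then clean up."""
    # TemporaryDirectory deletes the directory and its contents on exit,
    # even if handler raises, so no temporary file is left behind.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "input.doc")
        with open(path, "wb") as tmp:
            tmp.write(file.read())
        return handler(path)
```

Scoping the temporary directory to a with block ties the file's lifetime to the call, which is what prevents the leak.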