
fix: avoid misclassifying sparse PDF prose as tables #1847

Open
jaythehardcoder wants to merge 1 commit into microsoft:main from jaythehardcoder:fix/pdf-whitespace-120



@jaythehardcoder

Bug Description

Fixes #120.

Some prose-heavy PDFs were being routed through the form/table path, which collapsed normal spacing and produced output like Leak,Cheat,Repeat instead of plain text.

Root Cause

The PDF converter treated rows as table rows whenever they aligned with at least two detected columns. On wide multi-column prose pages, that produced a large set of tentative columns even though each row only touched a small fraction of them. The page then got rendered as a sparse markdown table instead of falling back to text extraction.
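The failure mode can be illustrated with a small sketch. This is not the converter's actual code: `columns_touched`, `is_table_row_old`, and the `tol` alignment tolerance are hypothetical names chosen for illustration, assuming each row is reduced to the x-positions of its words and the page to a list of detected column x-positions.

```python
from typing import List, Set


def columns_touched(row_xs: List[float], column_xs: List[float], tol: float = 2.0) -> Set[int]:
    """Return the indices of detected columns that a row's word positions align with.

    `tol` is an assumed alignment tolerance in PDF points (hypothetical value).
    """
    touched: Set[int] = set()
    for x in row_xs:
        for index, column_x in enumerate(column_xs):
            if abs(x - column_x) <= tol:
                touched.add(index)
                break
    return touched


def is_table_row_old(row_xs: List[float], column_xs: List[float], tol: float = 2.0) -> bool:
    """The rule described above: aligning with two columns is enough to be a table row."""
    return len(columns_touched(row_xs, column_xs, tol)) >= 2


# A wide page with 12 tentative columns, and a prose line that happens to
# start words at two of those x-positions.
page_columns = [50.0 + 40.0 * i for i in range(12)]
prose_row = [50.0, 130.0]
```

Under this rule `is_table_row_old(prose_row, page_columns)` is true even though the row touches only 2 of the 12 tentative columns, which is the sparse-table situation described above.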

Fix

  • track how many global columns each detected table row actually uses
  • reject very wide layouts where the median table row only fills a small share of the detected columns
  • add a regression test for sparse multi-column prose
  • keep coverage for wide dense tables so real tables still convert through the table path
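As a rough sketch of the guard described in the first two bullets (not the code in this PR), the check below takes the number of global columns each table row uses and rejects wide layouts whose median row fills only a small share of them. The function name `is_sparse_wide_layout` and both thresholds are assumptions made for illustration; the PR's real cut-offs are not stated in this description.

```python
from statistics import median
from typing import List

# Hypothetical thresholds, chosen only for this example.
MIN_WIDE_COLUMNS = 8    # layouts narrower than this are never rejected by the guard
MIN_MEDIAN_FILL = 0.5   # minimum share of columns the median row should fill


def is_sparse_wide_layout(
    columns_used_per_row: List[int],
    total_columns: int,
    min_wide_columns: int = MIN_WIDE_COLUMNS,
    min_median_fill: float = MIN_MEDIAN_FILL,
) -> bool:
    """Return True when a wide detected layout looks like prose rather than a table."""
    if total_columns < min_wide_columns or not columns_used_per_row:
        # Narrow or empty layouts are left to the existing table logic.
        return False
    median_fill = median(columns_used_per_row) / total_columns
    return median_fill < min_median_fill
```

With these assumed thresholds, rows using `[2, 3, 2, 2]` of 12 columns are flagged as sparse and would fall back to text extraction, while rows using `[12, 11, 12, 12]` of 12 columns are kept on the table path. Using the median rather than the mean keeps a single dense header row from masking an otherwise sparse page.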

How to Verify

  1. Run pytest -q packages/markitdown/tests/test_pdf_prose_layout_detection.py.
  2. Run pytest -q packages/markitdown/tests.
  3. Convert the PDF from issue #120 ("Removal of all whitespaces during PDF conversion") and confirm the prose keeps normal spaces instead of being rendered as a fake table.
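The manual check in step 3 could be scripted along these lines. This is a sketch only: `starts_as_markdown_table` and `pdf_converts_as_prose` are hypothetical helpers, the "first non-empty line is a pipe-delimited row" heuristic is an assumption about what a fake table looks like, and the PDF path is whatever local copy of the file from the issue is available.

```python
def starts_as_markdown_table(text: str) -> bool:
    """Return True if the first non-empty line looks like a markdown table row."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.startswith("|") and stripped.endswith("|")
    return False


def pdf_converts_as_prose(path: str) -> bool:
    """Convert a PDF with MarkItDown and report whether the output starts as prose."""
    # Imported lazily so the helper above can be used without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return not starts_as_markdown_table(result.text_content)
```

Calling `pdf_converts_as_prose` on the downloaded PDF should return true once the fix is in place; reading the first few lines of `result.text_content` by eye remains the more reliable confirmation that spacing is preserved.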

Test Plan

  • Added regression test for this bug
  • Existing tests still pass
  • Manual verification of the fix

Risk Assessment

Low — the new guard only applies to very wide detected table layouts and leaves dense table extraction intact.

jaythehardcoder force-pushed the fix/pdf-whitespace-120 branch from 674f432 to e276945 on April 29, 2026 09:59

jaythehardcoder commented Apr 29, 2026 via email

@jaythehardcoder

@microsoft-github-policy-service agree


Development

Successfully merging this pull request may close these issues.

Removal of all whitespaces during PDF conversion
