🔍 Description
When MarkItDown processes hyperlinks whose link text contains square brackets, [ or ] (e.g., [Learn [GPT]]), it does not escape these characters in the output Markdown. This goes against the CommonMark rules for link text and backslash escapes, leading to:
Broken link rendering (Unmatched ']' errors)
Truncated link text (e.g., [Learn [GPT]] → parsed as two separate links)
Corruption of downstream LLM/document processing pipelines
🧪 Steps to Reproduce
Input: Convert a document containing a link with text Example [Text] (e.g., HTML: <a href="https://url">Example [Text]</a>, or a Word hyperlink).
Conversion: Run MarkItDown to generate Markdown.
Output: [Example [Text]](https://url)  # UNSAFE: Unescaped brackets
Observed Result:
GitHub/VSCode preview: Link text truncates to Example [Text (ignores ])
Markdown parsers (e.g., markdown-it): Throw syntax errors
✅ Expected Behavior
Per CommonMark rules, square brackets in link text must be escaped:
[Example \[Text\]](https://url)  # CORRECT: Escaped brackets
Renders as: Example [Text] with functional link.
🌐 Impact
Critical: Breaks all workflows where link texts include [ ] (common in tech/docs).
Affected Components:
Markdown hyperlinks ([text](url))
Reference-style links ([text][id])
Image alt-text (![alt](img.png))
🛠 Suggested Fix
Implement escaping during link serialization:
// Pseudo-code (link renderer logic)
function escapeLinkText(text: string): string {
  // Escapes [ → \[ and ] → \]
  return text.replace(/[\[\]]/g, "\\$&");
}
Standards Compliance:
CommonMark: https://spec.commonmark.org/0.30/#backslash-escapes
GFM: Identical escaping rules
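
A slightly fuller sketch of the suggested fix, kept in the same TypeScript-style pseudo-code as above. The names escapeLinkText, renderLink and renderImage are hypothetical and are not MarkItDown's actual API; the point is only to show where the escaping would sit in a link serializer. It assumes the link text arrives as plain text (not already-escaped Markdown), so it also escapes a literal backslash to keep it from being read as an escape. URL destinations are passed through untouched here and would need their own handling.

```typescript
// Hypothetical helpers for illustration; not MarkItDown's real functions.

/** Backslash-escape characters that could close or nest a link label. */
function escapeLinkText(text: string): string {
  // Matches a backslash, "[" or "]" and prefixes each with a backslash.
  // "$&" in the replacement stands for the matched character.
  return text.replace(/[\\\[\]]/g, "\\$&");
}

/** Serialize an inline link as [text](url) with escaped link text. */
function renderLink(text: string, url: string): string {
  return `[${escapeLinkText(text)}](${url})`;
}

/** Serialize an image as ![alt](src) with escaped alt text. */
function renderImage(alt: string, src: string): string {
  return `![${escapeLinkText(alt)}](${src})`;
}

console.log(renderLink("Example [Text]", "https://url"));
// prints: [Example \[Text\]](https://url)
```

Escaping unconditionally, rather than only when brackets are unbalanced, keeps the serializer simple and gives the same output regardless of how a given parser treats nested brackets.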
🚧 Workarounds
Users currently have to add \ before each bracket by hand after conversion, which does not fit automated pipelines.
Environment: MarkItDown v0.9+, all input formats (PDF/Word/HTML).
Priority: High (blocking tech/docs use cases).
Tags: bug, markdown, links, escaping