Converting HTML to Markdown with Python
In this blog post, we will delve into a Python script designed to convert HTML files into Markdown format. This script not only converts the content but also handles code blocks with language detection, sanitizes filenames, and adds YAML front matter for static site generators like Jekyll or Hugo. We’ll provide an in-depth explanation of each component, so you can understand how it works and adapt it to your needs.
1. Introduction
Converting HTML content to Markdown is a common task, especially when migrating blogs or documentation to static site generators. While there are tools available, they may not handle all nuances, such as code blocks with language annotations or custom front matter requirements. This Python script aims to fill that gap by providing a flexible solution that you can customize.
2. Overview of the Script
The script performs the following key tasks:
- Reads an HTML file specified as a command-line argument.
- Parses the HTML content using BeautifulSoup.
- Extracts the title from the first <h1> tag.
- Converts the HTML content to Markdown using markdownify, with custom handling for code blocks.
- Adds YAML front matter to the Markdown content.
- Sanitizes the output filename to ensure it is filesystem-safe.
- Writes the Markdown content to an output directory, creating it if necessary.
3. The sanitize_filename Function
This function ensures that the output filename is safe to use on various filesystems by removing illegal characters and formatting the string appropriately.
import re

def sanitize_filename(filename):
    # Remove characters that are illegal in file names
    filename = re.sub(r'[<>:"/\\|?*]', '', filename)
    # Replace spaces with hyphens
    filename = filename.replace(' ', '-')
    # Replace underscores with hyphens
    filename = filename.replace('_', '-')
    # Limit the length of the filename
    return filename[:100]
Explanation:
- Remove illegal characters: Characters like <>:"/\|?* are not allowed in filenames on Windows, and / is reserved on Unix-like systems as well. The regular expression r'[<>:"/\\|?*]' matches these characters and removes them.
- Replace spaces and underscores: Spaces and underscores are replaced with hyphens to create URL-friendly filenames.
- Limit filename length: Filenames are truncated to 100 characters, which stays well under the filename length limit of common filesystems (typically around 255 characters).
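The steps above leave a few edge cases open: a title made up only of illegal characters produces an empty name, adjacent spaces and underscores turn into runs of hyphens, and Windows does not accept names that end in a dot. A minimal sketch of a stricter variant follows. The name sanitize_filename_strict, the max_length parameter, and the "untitled" fallback are assumptions made for illustration and are not part of the original script.

```python
import re

def sanitize_filename_strict(filename, max_length=100, fallback="untitled"):
    """Hypothetical stricter variant of sanitize_filename (illustrative only)."""
    # Remove the same illegal characters as the original function
    filename = re.sub(r'[<>:"/\\|?*]', '', filename)
    # Collapse any run of whitespace, underscores, or hyphens into one hyphen
    filename = re.sub(r'[\s_-]+', '-', filename)
    # Truncate first, then trim hyphens and dots left at either end
    filename = filename[:max_length].strip('-.')
    # Fall back to a placeholder if nothing usable remains
    return filename or fallback

print(sanitize_filename_strict('My Post: "Hello_World"?'))  # My-Post-Hello-World
print(sanitize_filename_strict('???'))                      # untitled
```

Truncating before trimming means a cut that lands on a hyphen does not leave a trailing separator in the final name.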
4. The convert_html_to_markdown Function
This function handles the conversion of HTML content to Markdown, including code block processing and YAML front matter generation.
import datetime

import markdownify
from bs4 import BeautifulSoup

def convert_html_to_markdown(html_content):
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove the <title> tag
    if soup.title:
        soup.title.decompose()
    # Extract the title from the <h1> tag
    h1_tag = soup.find('h1')
    if h1_tag:
        # Note: strip=True without a separator joins text from nested tags with no space
        title = h1_tag.get_text(strip=True)
        # Remove the <h1> tag from the content
        h1_tag.decompose()
    else:
        title = 'Untitled'
    # Get the current local date and time; the +1000 offset is a hard-coded suffix,
    # so it is only accurate on a machine whose local time is UTC+10
    current_datetime = datetime.datetime.now()
    current_date_str = current_datetime.strftime('%Y-%m-%d')
    current_date_time_str = current_datetime.strftime('%Y-%m-%d %H:%M:%S') + ' +1000'
    # Build YAML front matter (the title is written unquoted)
    yaml_header = f"""---
title: {title}
date: {current_date_time_str}
categories: []
tags: []
---
"""
    # Remove the YAML front matter if any from the HTML
    if soup.find('yamlfrontmatter'):
        soup.find('yamlfrontmatter').decompose()
    # Function to convert code blocks with language detection
    def code_block_converter(el, text, convert_as_inline):
        if el.name == 'pre':
            # Check if there's a code element inside
            code = el.find('code')
            if code:
                # Try to get the language from the class attribute
                language = None
                if 'class' in code.attrs:
                    classes = code.attrs['class']
                    for cls in classes:
                        if cls.startswith('language-'):
                            language = cls.replace('language-', '')
                            break
                        elif cls in ['bash', 'python', 'javascript', 'html', 'css', 'java', 'c', 'cpp', 'ruby', 'php', 'go', 'rust']:
                            language = cls
                            break
                code_text = code.get_text()
                if language:
                    return f"```{language}\n{code_text}\n```"
                else:
                    return f"```\n{code_text}\n```"
            else:
                # No <code> inside <pre>, treat the entire <pre> as code
                return f"```\n{el.get_text()}\n```"
        return text
    # Convert HTML to Markdown with custom code block converter
    # Caveat: code_handler is not listed among markdownify's documented options,
    # so verify that your installed version honors it; code_language_callback is
    # the documented hook for choosing the fence language
    markdown_content = markdownify.markdownify(
        str(soup),
        heading_style="ATX",
        code_handler=code_block_converter
    )
    # Combine YAML header and Markdown content
    full_markdown = f"{yaml_header}\n{markdown_content}"
    return full_markdown, title, current_date_str
Parsing and Preprocessing
The function starts by parsing the HTML content using BeautifulSoup:
soup = BeautifulSoup(html_content, 'html.parser')
It then removes the <title> tag to prevent duplication in the Markdown output:
if soup.title:
    soup.title.decompose()
Title Extraction
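The title is extracted from the first <h1> tag:
h1_tag = soup.find('h1')
if h1_tag:
    title = h1_tag.get_text(strip=True)
    h1_tag.decompose()
else:
    title = 'Untitled'
If no <h1> tag is found, the title defaults to Untitled. Note that get_text(strip=True) joins the text of nested tags without a separator, so a heading such as <h1>Hello <em>World</em></h1> yields HelloWorld; passing a separator, as in get_text(' ', strip=True), keeps the words apart.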
Generating YAML Front Matter
The script creates YAML front matter, which includes the title, the date, and empty categories and tags lists:
# Get the current local date and time; the +1000 offset is a hard-coded suffix,
# so it is only accurate on a machine whose local time is UTC+10
current_datetime = datetime.datetime.now()
current_date_str = current_datetime.strftime('%Y-%m-%d')
current_date_time_str = current_datetime.strftime('%Y-%m-%d %H:%M:%S') + ' +1000'
# Build YAML front matter
yaml_header = f"""---
title: {title}
date: {current_date_time_str}
categories: []
tags: []
---
"""
Code Block Conversion with Language Detection
The code_block_converter function processes code blocks, adding syntax highlighting support by detecting the programming language:
def code_block_converter(el, text, convert_as_inline):
    if el.name == 'pre':
        # Check if there's a code element inside
        code = el.find('code')
        if code:
            # Try to get the language from the class attribute
            language = None
            if 'class' in code.attrs:
                classes = code.attrs['class']
                for cls in classes:
                    if cls.startswith('language-'):
                        language = cls.replace('language-', '')
                        break
                    elif cls in ['bash', 'python', 'javascript', 'html', 'css', 'java', 'c', 'cpp', 'ruby', 'php', 'go', 'rust']:
                        language = cls
                        break
            code_text = code.get_text()
            if language:
                return f"```{language}\n{code_text}\n```"
            else:
                return f"```\n{code_text}\n```"
        else:
            return f"```\n{el.get_text()}\n```"
    return text
Explanation:
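- Language Detection: The function checks the class attribute of the <code> tag to detect the programming language. A class with the language- prefix (for example language-python) wins; otherwise a class that exactly matches one of a fixed list of language names is used.
- Markdown Formatting: Formats the code block using triple backticks, including the language if detected.
- Fallback: If no language is detected, or the <pre> tag contains no <code> tag, it formats the code block without specifying a language.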
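The detection rules above can be isolated into a small pure function, which makes them easy to test without parsing any HTML. The sketch below is an illustration under assumptions: the names detect_language and KNOWN_LANGUAGES are hypothetical, and the function takes the list of class names directly (for example the value of code.attrs['class']) rather than a BeautifulSoup element.

```python
# Language names accepted without the "language-" prefix (same list as the script)
KNOWN_LANGUAGES = {
    'bash', 'python', 'javascript', 'html', 'css', 'java',
    'c', 'cpp', 'ruby', 'php', 'go', 'rust',
}

def detect_language(classes):
    """Hypothetical helper: return the language named by a list of CSS classes, or None."""
    for cls in classes or []:
        if cls.startswith('language-'):
            # Slice off only the leading prefix, leaving the rest untouched
            return cls[len('language-'):]
        if cls in KNOWN_LANGUAGES:
            return cls
    return None

print(detect_language(['highlight', 'language-python']))  # python
print(detect_language(['bash']))                          # bash
print(detect_language(['highlight']))                     # None
```

markdownify documents a code_language_callback option that receives the <pre> element and returns the language for the fence, so a thin wrapper that looks up the nested <code> tag's classes and calls a helper like this may be one way to wire it in. Treat that as a suggestion to check against your installed version rather than as part of the original script.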
Converting to Markdown
The script then converts the HTML content to Markdown, using the custom code block converter:
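# Caveat: code_handler is not listed among markdownify's documented options,
# so verify that your installed version honors it; code_language_callback is
# the documented hook for choosing the fence language
markdown_content = markdownify.markdownify(
    str(soup),
    heading_style="ATX",
    code_handler=code_block_converter
)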
Combining YAML and Markdown Content
Finally, the YAML front matter and the converted Markdown content are combined:
full_markdown = f"{yaml_header}\n{markdown_content}"
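The front matter prepended here writes the title unquoted and pairs a naive local timestamp with a fixed +1000 suffix. A title containing a colon can therefore produce invalid YAML, and the offset can disagree with the clock on machines outside UTC+10. The following is a minimal sketch of one way to address both, not the script's own method: build_front_matter and its parameters are hypothetical names, and quoting the title with json.dumps relies on YAML parsers accepting JSON-style double-quoted strings.

```python
import json
from datetime import datetime, timedelta, timezone

def build_front_matter(title, now=None, utc_offset_hours=10):
    """Hypothetical helper: YAML front matter with a quoted title and an aware date."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Use an offset-aware datetime so the printed offset matches the printed time
    now = now.astimezone(tz) if now is not None else datetime.now(tz)
    date_str = now.strftime('%Y-%m-%d %H:%M:%S %z')
    # json.dumps produces a double-quoted string with special characters escaped
    quoted_title = json.dumps(title, ensure_ascii=False)
    return (
        "---\n"
        f"title: {quoted_title}\n"
        f"date: {date_str}\n"
        "categories: []\n"
        "tags: []\n"
        "---\n"
    )

fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(build_front_matter("Python: A Guide", now=fixed))
```

With the fixed UTC timestamp above, the date line reads 2024-01-02 13:04:05 +1000, since the time is shifted into the requested offset before formatting.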
5. The main Function
The main function serves as the entry point of the script, handling command-line arguments and file operations.
import os
import sys

def main():
    # Add the output directory variable
    output_directory = r"C:\Users\Bruce Renner\Documents\Projects\newdocs\_posts"
    if len(sys.argv) != 2:
        print("Usage: python html_to_md.py path/to/your/file.html")
        sys.exit(1)
    html_file_path = sys.argv[1]
    if not os.path.isfile(html_file_path):
        print(f"File not found: {html_file_path}")
        sys.exit(1)
    # Read HTML content from file
    with open(html_file_path, 'r', encoding='utf-8') as f:
        html_content = f.read()
    # Convert HTML to Markdown
    markdown_content, title, current_date_str = convert_html_to_markdown(html_content)
    # Sanitize title for use in filename
    sanitized_title = sanitize_filename(title)
    # Define output Markdown file path
    directory = output_directory
    # Ensure the output directory exists (exist_ok avoids a separate existence check)
    os.makedirs(directory, exist_ok=True)
    markdown_file_name = f"{current_date_str}-{sanitized_title}.md"
    markdown_file_path = os.path.join(directory, markdown_file_name)
    # Write Markdown content to file
    with open(markdown_file_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"Conversion complete. Markdown file saved to {markdown_file_path}")

if __name__ == "__main__":
    main()
Key functionalities:
- Argument Parsing: The script expects a single command-line argument specifying the path to the HTML file.
- File Validation: Checks if the provided file exists.
- Reading HTML Content: Opens and reads the HTML file content.
- Conversion: Calls convert_html_to_markdown to process the content.
- Filename Sanitization: Uses sanitize_filename to create a safe output filename.
- Directory Handling: Ensures the output directory exists, creating it if necessary.
- Writing Output: Writes the Markdown content to the output file.
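Because the output directory is hard-coded, the script has to be edited before it runs on another machine. One possible alternative is to accept the directory as an optional command-line flag. The sketch below uses argparse from the standard library; the parse_args name, the -o/--output-dir flag, and the _posts default are assumptions for illustration and do not appear in the original script.

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Hypothetical CLI: an HTML file path plus an optional output directory."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to Markdown.")
    parser.add_argument("html_file", type=Path, help="path to the input HTML file")
    parser.add_argument(
        "-o", "--output-dir", type=Path, default=Path("_posts"),
        help="directory for the generated Markdown file (default: _posts)",
    )
    # argv=None makes argparse read sys.argv[1:], as a real script run would
    return parser.parse_args(argv)

args = parse_args(["post.html", "--output-dir", "out"])
print(args.html_file, args.output_dir)
```

argparse also generates the usage message and exits with an error on missing arguments, which would replace the manual len(sys.argv) check.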
6. How to Use the Script
Follow these steps to use the script:
Install Required Libraries
The script depends on BeautifulSoup and markdownify. Install them using pip:
pip install beautifulsoup4 markdownify
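To confirm the install worked in the interpreter you plan to use, a quick check like the following may help. It is a small sketch using only the standard library; missing_modules is a hypothetical helper name. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# beautifulsoup4 is imported as bs4; markdownify keeps its package name
missing = missing_modules(["bs4", "markdownify"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```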
Save the Script
Copy the script into a file named html_to_md.py.
Modify the Output Directory
Update the output_directory variable in the script to your desired output path:
# Add the output directory variable
output_directory = r"C:\path\to\your\output\directory"
Run the Script
Execute the script from the command line, providing the path to your HTML file:
python html_to_md.py path/to/your/file.html
The script will output the converted Markdown file to the specified directory, naming it with the current date and sanitized title.
7. Conclusion
This Python script offers a customizable solution for converting HTML files to Markdown, complete with YAML front matter and code block handling. By dissecting each function, we’ve provided insights into how the script operates, allowing you to modify and extend it according to your requirements. Whether you’re migrating content for a static site generator or need a reliable HTML to Markdown converter, this script serves as a solid foundation.