Converting HTML to Markdown with Python

This post walks through a Python script that converts HTML files to Markdown. Beyond the conversion itself, the script handles code blocks with language detection, sanitizes filenames, and adds YAML front matter for static site generators such as Jekyll or Hugo. Each component is explained in turn, so you can understand how it works and adapt it to your needs.

1. Introduction

Converting HTML content to Markdown is a common task, especially when migrating blogs or documentation to static site generators. While there are tools available, they may not handle all nuances, such as code blocks with language annotations or custom front matter requirements. This Python script aims to fill that gap by providing a flexible solution that you can customize.

2. Overview of the Script

The script performs the following key tasks:

  • Reads an HTML file specified as a command-line argument.
  • Parses the HTML content using BeautifulSoup.
  • Extracts the title from the first <h1> tag.
  • Converts the HTML content to Markdown using markdownify, with custom handling for code blocks.
  • Adds YAML front matter to the Markdown content.
  • Sanitizes the output filename to ensure it is filesystem-safe.
  • Writes the Markdown content to an output directory, creating it if necessary.

3. The sanitize_filename Function

This function ensures that the output filename is safe to use on various filesystems by removing illegal characters and formatting the string appropriately.

import re


def sanitize_filename(filename):
    # Remove illegal characters for file names
    filename = re.sub(r'[<>:"/\\|?*]', '', filename)
    # Replace spaces with hyphens
    filename = filename.replace(' ', '-')
    # Replace underscores with hyphens
    filename = filename.replace('_', '-')
    # Limit the length of the filename
    return filename[:100]

Explanation:

  • Remove illegal characters: The characters <>:"/\|?* are reserved in Windows filenames, and / is the path separator on Unix-like systems. The regular expression r'[<>:"/\\|?*]' matches these characters and removes them.
  • Replace spaces and underscores: Spaces and underscores are replaced with hyphens to create URL-friendly filenames.
  • Limit filename length: Filenames are truncated to 100 characters, which keeps them well under the typical 255-character (or 255-byte) name limit of most filesystems, even after the date prefix and .md extension are added.
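As a quick check of the behaviour described above, the sketch below runs the function on a few made-up titles (the sample strings are illustrative and not part of the original script). It also shows one thing the function does not do: when a removed character sits between two spaces, the result contains a double hyphen.

```python
import re


def sanitize_filename(filename):
    # Same logic as the function above, repeated so this snippet runs on its own
    filename = re.sub(r'[<>:"/\\|?*]', '', filename)
    filename = filename.replace(' ', '-')
    filename = filename.replace('_', '-')
    return filename[:100]


# Hypothetical sample titles, chosen only to exercise each rule
print(sanitize_filename('What is "HTML"? A Guide'))  # What-is-HTML-A-Guide
print(sanitize_filename('snake_case notes'))         # snake-case-notes
print(sanitize_filename('Input / Output'))           # Input--Output
print(len(sanitize_filename('x' * 250)))             # 100
```

If double hyphens matter for your URLs, collapsing them with a further re.sub call is one possible extension.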

4. The convert_html_to_markdown Function

This function handles the conversion of HTML content to Markdown, including code block processing and YAML front matter generation.

import datetime

import markdownify
from bs4 import BeautifulSoup


def convert_html_to_markdown(html_content):
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove the <title> tag
    if soup.title:
        soup.title.decompose()

    # Extract the title from the <h1> tag
    h1_tag = soup.find('h1')
    if h1_tag:
        title = h1_tag.get_text(strip=True)
        # Remove the <h1> tag from the content
        h1_tag.decompose()
    else:
        title = 'Untitled'

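    # Get the current local date and time. The ' +1000' suffix appended below is
    # hardcoded, so it is only accurate on a machine whose local time is UTC+10.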
    current_datetime = datetime.datetime.now()
    current_date_str = current_datetime.strftime('%Y-%m-%d')
    current_date_time_str = current_datetime.strftime('%Y-%m-%d %H:%M:%S') + ' +1000'

    # Build YAML front matter
    yaml_header = f"""---
title: {title}
date: {current_date_time_str}
categories: []
tags: []
---
"""

    # Remove the YAML front matter if any from the HTML
    if soup.find('yamlfrontmatter'):
        soup.find('yamlfrontmatter').decompose()

    # Function to convert code blocks with language detection
    def code_block_converter(el, text, convert_as_inline):
        if el.name == 'pre':
            # Check if there's a code element inside
            code = el.find('code')
            if code:
                # Try to get the language from the class attribute
                language = None
                if 'class' in code.attrs:
                    classes = code.attrs['class']
                    for cls in classes:
                        if cls.startswith('language-'):
                            language = cls.replace('language-', '')
                            break
                        elif cls in ['bash', 'python', 'javascript', 'html', 'css', 'java', 'c', 'cpp', 'ruby', 'php', 'go', 'rust']:
                            language = cls
                            break
                code_text = code.get_text()
                if language:
                    return f"```{language}\n{code_text}\n```"
                else:
                    return f"```\n{code_text}\n```"
            else:
                # No <code> inside <pre>, treat the entire <pre> as code
                return f"```\n{el.get_text()}\n```"
        return text

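    # Convert HTML to Markdown with custom code block converter.
    # NOTE: code_handler is not listed among markdownify's documented options;
    # its documented hook for fenced-code languages is code_language_callback.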
    markdown_content = markdownify.markdownify(
        str(soup),
        heading_style="ATX",
        code_handler=code_block_converter
    )

    # Combine YAML header and Markdown content
    full_markdown = f"{yaml_header}\n{markdown_content}"

    return full_markdown, title, current_date_str

Parsing and Preprocessing

The function starts by parsing the HTML content using BeautifulSoup:

soup = BeautifulSoup(html_content, 'html.parser')

It then removes the <title> tag to prevent duplication in the Markdown output:

if soup.title:
    soup.title.decompose()

Title Extraction

The title is extracted from the first <h1> tag:

h1_tag = soup.find('h1')
if h1_tag:
    title = h1_tag.get_text(strip=True)
    h1_tag.decompose()
else:
    title = 'Untitled'

If no <h1> tag is found, the title defaults to Untitled.
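One detail worth checking is how get_text(strip=True) treats a heading that contains inline tags. BeautifulSoup strips each text fragment and, by default, joins the fragments with an empty separator, so words on either side of a tag boundary can run together. The sketch below uses a made-up heading to compare that call with a variant that passes a space as the separator. Whether you want the variant depends on your input HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical input: an <h1> that contains an inline <em> element
html = '<h1>Converting <em>HTML</em> to Markdown</h1><p>Body</p>'
soup = BeautifulSoup(html, 'html.parser')
h1_tag = soup.find('h1')

# As in the script: fragments are stripped, then joined with no separator
title_joined = h1_tag.get_text(strip=True)

# Variant: join the stripped fragments with a single space
title_spaced = h1_tag.get_text(' ', strip=True)

print(title_joined)  # ConvertingHTMLto Markdown
print(title_spaced)  # Converting HTML to Markdown
```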

Generating YAML Front Matter

The script creates YAML front matter, which includes the title and date:

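# Get the current local date and time. The ' +1000' suffix appended below is
# hardcoded, so it is only accurate on a machine whose local time is UTC+10.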
current_datetime = datetime.datetime.now()
current_date_str = current_datetime.strftime('%Y-%m-%d')
current_date_time_str = current_datetime.strftime('%Y-%m-%d %H:%M:%S') + ' +1000'

# Build YAML front matter
yaml_header = f"""---
title: {title}
date: {current_date_time_str}
categories: []
tags: []
---
"""

Code Block Conversion with Language Detection

The code_block_converter function processes code blocks, adding syntax highlighting support by detecting the programming language:

def code_block_converter(el, text, convert_as_inline):
    if el.name == 'pre':
        # Check if there's a code element inside
        code = el.find('code')
        if code:
            # Try to get the language from the class attribute
            language = None
            if 'class' in code.attrs:
                classes = code.attrs['class']
                for cls in classes:
                    if cls.startswith('language-'):
                        language = cls.replace('language-', '')
                        break
                    elif cls in ['bash', 'python', 'javascript', 'html', 'css', 'java', 'c', 'cpp', 'ruby', 'php', 'go', 'rust']:
                        language = cls
                        break
            code_text = code.get_text()
            if language:
                return f"```{language}\n{code_text}\n```"
            else:
                return f"```\n{code_text}\n```"
        else:
            return f"```\n{el.get_text()}\n```"
    return text

Explanation:

  • Language Detection: The function checks the class attributes of the <code> tag to detect the programming language.
  • Markdown Formatting: Formats the code block using triple backticks, including the language if detected.
  • Fallback: If no language is detected, it formats the code block without specifying a language.
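The detection rules above can be tried in isolation, without any HTML parsing. The sketch below is a standalone restatement under a few assumptions: detect_language, fence_code, and KNOWN_LANGUAGES are hypothetical names, and the prefix is removed by slicing rather than with str.replace, so only a leading language- is stripped.

```python
# Bare class names the script treats as languages, copied from the list above
KNOWN_LANGUAGES = {'bash', 'python', 'javascript', 'html', 'css', 'java',
                   'c', 'cpp', 'ruby', 'php', 'go', 'rust'}

FENCE = '`' * 3  # three backticks, built this way to keep the snippet easy to embed


def detect_language(classes):
    """Return the first language implied by a list of CSS class names, or None."""
    for cls in classes:
        if cls.startswith('language-'):
            return cls[len('language-'):]
        if cls in KNOWN_LANGUAGES:
            return cls
    return None


def fence_code(code_text, language=None):
    """Wrap code_text in a fenced block, adding the language when one is known."""
    return f"{FENCE}{language or ''}\n{code_text}\n{FENCE}"


print(detect_language(['highlight', 'language-python']))  # python
print(detect_language(['bash']))                          # bash
print(detect_language(['highlight']))                     # None
print(fence_code('ls -la', detect_language(['bash'])))
```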

Converting to Markdown

The script then converts the HTML content to Markdown, using the custom code block converter:

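# NOTE: code_handler is not listed among markdownify's documented options;
# its documented hook for fenced-code languages is code_language_callback.
markdown_content = markdownify.markdownify(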
    str(soup),
    heading_style="ATX",
    code_handler=code_block_converter
)

Combining YAML and Markdown Content

Finally, the YAML front matter and the converted Markdown content are combined:

full_markdown = f"{yaml_header}\n{markdown_content}"

5. The main Function

The main function serves as the entry point of the script, handling command-line arguments and file operations.

import os
import sys


def main():
    # Add the output directory variable
    output_directory = r"C:\Users\Bruce Renner\Documents\Projects\newdocs\_posts"

    if len(sys.argv) != 2:
        print("Usage: python html_to_md.py path/to/your/file.html")
        sys.exit(1)

    html_file_path = sys.argv[1]

    if not os.path.isfile(html_file_path):
        print(f"File not found: {html_file_path}")
        sys.exit(1)

    # Read HTML content from file
    with open(html_file_path, 'r', encoding='utf-8') as f:
        html_content = f.read()

    # Convert HTML to Markdown
    markdown_content, title, current_date_str = convert_html_to_markdown(html_content)

    # Sanitize title for use in filename
    sanitized_title = sanitize_filename(title)

    # Define output Markdown file path
    directory = output_directory

    # Ensure the output directory exists
    os.makedirs(directory, exist_ok=True)

    markdown_file_name = f"{current_date_str}-{sanitized_title}.md"
    markdown_file_path = os.path.join(directory, markdown_file_name)

    # Write Markdown content to file
    with open(markdown_file_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)

    print(f"Conversion complete. Markdown file saved to {markdown_file_path}")

if __name__ == "__main__":
    main()

Key functionalities:

  • Argument Parsing: The script expects a single command-line argument specifying the path to the HTML file.
  • File Validation: Checks if the provided file exists.
  • Reading HTML Content: Opens and reads the HTML file content.
  • Conversion: Calls convert_html_to_markdown to process the content.
  • Filename Sanitization: Uses sanitize_filename to create a safe output filename.
  • Directory Handling: Ensures the output directory exists, creating it if necessary.
  • Writing Output: Writes the Markdown content to the output file.
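If editing the script to change the output directory is inconvenient, the argument parsing and file writing steps listed above could be sketched with argparse and pathlib instead. This is one possible variation, not the original script's behaviour: the --output-dir flag, its _posts default, and the helper names parse_args and write_post are all hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input path plus a hypothetical --output-dir option."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to Markdown.")
    parser.add_argument("html_file", type=Path, help="path to the input HTML file")
    parser.add_argument("--output-dir", type=Path, default=Path("_posts"),
                        help="directory for the generated Markdown file")
    return parser.parse_args(argv)


def write_post(output_dir, date_str, sanitized_title, markdown_content):
    """Create output_dir if needed, write the post, and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{date_str}-{sanitized_title}.md"
    path.write_text(markdown_content, encoding="utf-8")
    return path


# Parsing an explicit argument list, so the example does not depend on sys.argv
args = parse_args(["post.html", "--output-dir", "site/_posts"])
print(args.html_file, args.output_dir)
```

argparse also prints a usage message and exits when the positional argument is missing, which covers the manual len(sys.argv) check.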

6. How to Use the Script

Follow these steps to use the script:

Install Required Libraries

The script depends on BeautifulSoup and markdownify. Install them using pip:

pip install beautifulsoup4 markdownify

Save the Script

Copy the script into a file named html_to_md.py.

Modify the Output Directory

Update the output_directory variable in the script to your desired output path:

# Add the output directory variable
output_directory = r"C:\path\to\your\output\directory"

Run the Script

Execute the script from the command line, providing the path to your HTML file:

python html_to_md.py path/to/your/file.html

The script will output the converted Markdown file to the specified directory, naming it with the current date and sanitized title.

7. Conclusion

This Python script offers a customizable way to convert HTML files to Markdown, complete with YAML front matter and code block handling. Having walked through each function, you should be able to modify and extend it to suit your requirements. Whether you’re migrating content to a static site generator or just need an HTML-to-Markdown converter, the script is a solid starting point.

This post is licensed under CC BY 4.0 by the author.