exteraGram

Markdown Parser

This module provides the ability to parse markdown-formatted text and convert formatting entities to TLRPC objects suitable for the Telegram API.

The markdown_utils.py module converts text with common Markdown V2-style formatting into a plain text string and a tuple of entities, each of which can be converted into a TLRPC.MessageEntity object. These entities can then be used with client_utils.send_message or other API methods that accept formatted text.

Core Components

The parser returns a ParsedMessage object, which has two main attributes:

  • text: str: The plain text content with all Markdown markers removed.
  • entities: Tuple[RawEntity, ...]: A tuple of RawEntity objects, each representing a formatting instruction.

Each RawEntity object contains:

  • type: TLEntityType: The type of the entity (e.g., bold, italic, code).
  • offset: int: The starting position of the entity in the text (UTF-16 code units).
  • length: int: The length of the formatted segment in the text (UTF-16 code units).
  • language: Optional[str]: For pre (code block) entities, the specified language.
  • url: Optional[str]: For text_link entities, the URL.
  • document_id: Optional[int]: For custom_emoji entities, the ID of the custom emoji document.
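
Because offset and length are counted in UTF-16 code units rather than Python characters, they can differ from what len() and string indexing report whenever the text contains characters outside the Basic Multilingual Plane, such as most emoji. The parser computes these values itself; the sketch below only illustrates the unit. The helpers utf16_len and utf16_offset are hypothetical and are not part of markdown_utils.

```python
def utf16_len(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    # "utf-16-le" emits exactly two bytes per code unit and no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


def utf16_offset(text: str, index: int) -> int:
    """Convert a Python string index into a UTF-16 code unit offset."""
    return utf16_len(text[:index])


sample = "Hi 😎 there"
print(len(sample))                                   # 10 Python code points
print(utf16_len(sample))                             # 11 UTF-16 code units
print(utf16_offset(sample, sample.index("there")))   # 6, not 5
```

The emoji occupies one Python character but two UTF-16 code units, so every entity that starts after it is shifted by one relative to the Python index.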

To convert RawEntity objects into TLRPC.MessageEntity objects suitable for the Telegram API, call the to_tlrpc_object() method on each RawEntity.

Supported Entity Types (TLEntityType)

The parser supports the following entity types:

  • BOLD (*bold*)
  • ITALIC (_italic_)
  • UNDERLINE (__underline__)
  • STRIKETHROUGH (~strikethrough~)
  • SPOILER (||spoiler||)
  • CODE (`inline code`)
  • PRE (```code block```) - can include an optional language specifier.
  • TEXT_LINK ([link text](http://example.com))
  • CUSTOM_EMOJI ([alt text](document_id)) - alt text becomes the content of the entity, document_id is the emoji's ID.

Usage Example

This example demonstrates how to parse a Markdown string and send it as a formatted message.

from client_utils import send_message
from markdown_utils import parse_markdown
from android_utils import log
 
params = {
    "peer": 12345678,
    "entities": []
}
 
markdown_input_string = (
    "Markdown entities parsing test:\n\n"
    "~strike~ *bold* __underlined__ _italic_ ||spoiler|| [textlink](https://google.com)\n"
    "This is an inline `code` example.\n"
    "Custom emoji: [😎](5373141891321699086)\n" # Example document_id for a custom emoji
    "\n"
    "Code block 1 (no language specified):\n"
    "```\n"
    "print('Hello, Python!')\n"
    "def greet(name):\n"
    "    return f'Hi, {name}'\n"
    "```\n"
    "\n"
    "Code block 2 (language specified as 'java'):\n"
    "```java\n"
    "public class HelloWorld {\n"
    "    public static void main(String[] args) {\n"
    "        System.out.println(\"Hello world!\");\n"
    "    }\n"
    "}\n"
    "```\n"
    "Nested *bold and _italic_ inside bold*."
)
 
try:
    parsed_message_object = parse_markdown(markdown_input_string)
 
    # Plain text with the Markdown markers stripped.
    params["message"] = parsed_message_object.text

    # Convert each RawEntity into its TLRPC.MessageEntity counterpart.
    params["entities"] = [
        raw_entity.to_tlrpc_object()
        for raw_entity in parsed_message_object.entities
    ]
 
    log(f"Sending message: '{params['message']}' with {len(params['entities'])} entities.")
    send_message(params)
 
except SyntaxError as e:
    log(f"Markdown parsing error: {e}")
except Exception as e:
    log(f"An unexpected error occurred: {e}")

Important Notes

  • UTF-16 Offsets & Lengths: The offset and length in RawEntity (and the resulting TLRPC.MessageEntity) are calculated based on UTF-16 code units, as required by the Telegram API. The parser handles this conversion automatically.
  • Error Handling: If the Markdown syntax is incorrect (e.g., unclosed tags), parse_markdown will raise a SyntaxError. It's good practice to wrap the call in a try-except block.
  • Nesting: Basic nesting of styles (e.g., bold inside italic) is generally supported, but complex or ambiguous nesting might lead to unexpected results.
  • Escaping: Special Markdown characters (*, _, ~, |, `, [, ], \) can be escaped with a backslash (\) if you want them to appear as literal characters. For example, \*not bold\* will render as *not bold*.
  • Code Blocks:
    • Inline code is surrounded by single backticks (`).
    • Fenced code blocks are surrounded by triple backticks (```).
    • An optional language identifier can be placed immediately after the opening triple backticks (e.g., ```python).
  • Custom Emoji: The syntax [alt text](document_id) is used. The alt text (e.g., the emoji character itself) becomes the text segment covered by the TLRPC.TL_messageEntityCustomEmoji entity, and document_id is the ID of the custom emoji. You can obtain the emoji ID by sending the emoji to @AdsMarkdownBot on Telegram.
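
When a message embeds text that should never be interpreted as formatting (for example, user-supplied input), the escaping rule above can be applied programmatically before the string is passed to parse_markdown. The following is a minimal sketch, assuming the characters listed in the Escaping note are the complete set of special characters; escape_markdown is a hypothetical helper and is not part of markdown_utils.

```python
import re

# Special characters as listed in the Escaping note: * _ ~ | ` [ ] \
# Assumption: this list is complete for the parser.
_SPECIAL = re.compile(r"([*_~|`\[\]\\])")


def escape_markdown(text: str) -> str:
    """Prefix every special Markdown character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


user_input = "2 * 3 [draft]"
print(escape_markdown(user_input))  # 2 \* 3 \[draft\]
```

Only the untrusted fragment should be escaped; escaping the whole message would also neutralize the formatting markers you added on purpose.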

This parser provides a robust way to include rich text formatting in messages sent by your plugins.
