CML

Context Mark Language (CML) is a new markup architecture designed for the AI era, addressing the needs of contextual markup with a structure that aligns with natural semantics.

The core feature of the Context Mark Language syntax is its use of a single string to represent multi-dimensional, composable, and implicitly contextual semantic structures. It offers a simpler experience for writing, embedding, transmitting, storing, and computing.

Paper

CML elaborates on the Context Structure Expression model. For details, refer to the paper:

DOI 10.5281/zenodo.15255534.
English version: CSE: A Unified Modeling Framework for Semantic Contextual Markup
Chinese version: CSE模型：上下文语义标记的统一建模框架

Semantic Architecture
Semantic Structure
Encoding Rules
Minimal Completeness

Context Mark Language can express semantic contextual structures in an extremely concise way.

A CML string, which marks a knowledge context, is a sequence built from two kinds of semantic primitives: semantic tokens and relation separators.

Semantic Token: The basic unit for describing contextual semantics in CML. It can be any form of text fragment. It corresponds to the concept of a token in the LLM domain, but it is not lexical in nature—instead, it carries abstract dimensions of natural semantics.
Relation Separator (:, ., @, +, space): Used to express the semantic relationships of contextual structure between semantic tokens. It implicitly declares the weighted prioritization of these relationships.

By freely combining these semantic primitives and encoding them, a complete CML string is formed.

Expression Rules

CML uses a linear structure to describe semantic relationships (which is easy to embed).

Assuming we use <semantic_token> to represent a semantic token and <separator> to represent a relation separator, the rules can be roughly defined as follows:

Every pair of adjacent semantic tokens (<semantic_token>) must be connected by one and only one relation separator (<separator>).
Semantic tokens and relation separators always appear in an alternating pattern: <semantic_token><separator><semantic_token><separator>...<semantic_token>.

CML revolves around three core semantic needs—priority expression of tokens, structural extensibility, and logical orthogonality. From these, five highly expressive and composable semantic structure primitives are abstracted, forming the fundamental units for constructing all logical semantic relationships.

These five structural primitives are categorized into Basic Structures, Composite Structures, and Combinatorial Structures. Basic structures describe simple relationships among semantic tokens and can serve as components of composite structures. Combinatorial structures are higher-level containers that can connect any semantic tokens, basic structures, or composite structures.

a. Supplementary Relationship

Basic Structure. The semantic object on the right supplements, explains, or restricts the one on the left. It usually acts as a semantic addition without altering the original structure.

f:A∋B∋C

Expressed using the @ symbol. The further left, the higher the priority, indicating greater weight.

For example:

name@identity@organization
name@company@position

Both highlight "name" first, but declare different priority emphases overall.

b. Linear Progression Relationship

Basic Structure. An ordered, stepwise semantic relationship. It can express directional flows like progression, transformation chains, causal chains, sequences, type refinement, or lifecycles.

f:A→B→C

Expressed with the . symbol. Whether the left side has slightly higher weight depends on LLM interpretation based on actual semantic tokens.

For example:

Organism.Animal.Human — focuses on the subclass Human.
Product Design.Development.Operations — the semantic tokens clearly describe a lifecycle.

c. Parallel Set Relationship

Basic Structure. Multiple semantic elements in parallel, like a set, property group, or multi-branch description. No priority or order; elements can be rearranged freely.

f:{A, B, C}

Expressed with the + symbol. No priority difference in weight.

For example:

Man+Woman — both equally describe the concept of Human.
Name+Age+Username — registration attributes of an account.

d. Mapping Relationship

A composite correspondence between one semantic structure and another, in key-value form. Both sides can use any combination of the three basic relationships to support two-dimensional semantic expression.

f(A,B) ↦ f(C,D,E)

Expressed with the : symbol. It's like a key-value pair, but more flexible—both key and value can be basic structures, not just semantic tokens.

<key-context-struct>:<value-context-struct>

Examples of valid mapping structures:

Website:doc-war.com
Website@doc-war.com:Document Battlefield@Contribute Judgment Value
AI+LLM:ChatGPT+Claude@v3.7
ask.answer: What is CML?.CML is a semantic structure language aligned with natural semantics

Special Constraint

To avoid increasing parsing complexity, CML does not support nested mappings. For example, expressions like User:ZhangSan:Delete+Query are syntactically invalid.

e. Combination Relationship

Multiple semantic structures combine into a new semantic whole without losing their original meaning. Conversely, they can also be decomposed. Essentially, this acts as a "semantic structure container" capable of computation.


      f(A)+f(B) = f(A+B)
      f(A) = f(A+B) - f(B)

Expressed using a space, which overlays two structures semantically.

Because space has the lowest semantic precedence and is not order-sensitive, it's also used as the overall operator in plaintext CML strings. Concatenating two CML strings with a space produces a valid CML string that retains original semantics.

plaintext(A)+space+plaintext(B) = plaintext(A+space+B)

This feature of lossless restoration—free combination and decomposition—gives CML semantic computation properties, not just semantic expression. It lays a solid foundation for collaborative annotation workflows.
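The combination and decomposition identities above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the backtick-wrapped plain-text format, treats every space outside a token as a combination separator, and uses the hypothetical helper names `split_segments`, `combine`, and `decompose`.

```python
def split_segments(cml: str) -> list[str]:
    """Split a CML string on spaces that sit outside backtick-wrapped tokens."""
    segments, current, inside_token = [], [], False
    for ch in cml:
        if ch == "`":
            inside_token = not inside_token
        if ch == " " and not inside_token:
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments


def combine(a: str, b: str) -> str:
    """f(A) + f(B) = f(A+B): join two CML strings with a single space."""
    return " ".join(split_segments(a) + split_segments(b))


def decompose(whole: str, part: str) -> str:
    """f(A) = f(A+B) - f(B): remove the segments of `part` from `whole`."""
    remaining = split_segments(whole)
    for segment in split_segments(part):
        remaining.remove(segment)  # raises ValueError if the segment is absent
    return " ".join(remaining)


a = "`Website`:`doc-war.com`"
b = "`AI`+`LLM`:`ChatGPT`+`Claude`@`v3.7`"
ab = combine(a, b)
print(decompose(ab, b) == a)  # the original string is recovered
```

Because the sketch compares segments as plain strings, it only recognizes a segment that is spelled identically in both inputs.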

Special Constraint

Because the space is responsible for lossless computation, an expression like "If User:ZhangSan how about" is syntactically valid but violates the principle of lossless combination: its meaning depends on the position of each segment, so splitting and recombining the segments disrupts the original meaning.

Operator Precedence

Relationship operations work similarly to expression parsing in programming languages: a left-to-right lexical scan determines the semantic operation order based on the precedence of relation separators.

CML defines a clear precedence to ensure consistency in semantic interpretation during annotation and inference.

Supplementary Relationship (a, @) > Linear Progression Relationship (b, .) > Parallel Set Relationship (c, +) > Mapping Relationship (d, :) > Combination Relationship (e, space)
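One way to read this precedence order is as binding strength: the separator with the lowest precedence is split first, and the one with the highest precedence is grouped last. The sketch below follows that reading, which is an assumption rather than something the text prescribes. The input is assumed to be an already-extracted alternating list of tokens and separators; `parse`, `LOOSEST_FIRST`, and `NAMES` are hypothetical names.

```python
# Separators ordered from lowest precedence (split first) to highest (split last).
LOOSEST_FIRST = [" ", ":", "+", ".", "@"]
NAMES = {
    " ": "combination",
    ":": "mapping",
    "+": "parallel",
    ".": "progression",
    "@": "supplement",
}


def parse(primitives, level=0):
    """Turn [token, sep, token, ...] into nested (relationship, children) tuples."""
    if len(primitives) % 2 == 0:
        raise ValueError("expected token, separator, token, ..., token")
    if len(primitives) == 1:
        return primitives[0]
    if level >= len(LOOSEST_FIRST):
        raise ValueError("unknown separator in input")
    sep = LOOSEST_FIRST[level]
    groups, current = [], []
    for i, item in enumerate(primitives):
        # Odd positions hold separators; even positions hold tokens.
        if i % 2 == 1 and item == sep:
            groups.append(current)
            current = []
        else:
            current.append(item)
    groups.append(current)
    if len(groups) == 1:
        return parse(primitives, level + 1)
    if sep == ":" and len(groups) > 2:
        raise ValueError("nested mappings are not supported")
    return (NAMES[sep], [parse(group, level + 1) for group in groups])


tree = parse(["token1", ".", "token2", "@", "token3", "+", "token4",
              " ", "token5", ":", "token6"])
print(tree)
```

For this input the outermost node is the combination, its first child is a parallel set whose first element is the progression from token1 to token2@token3, and its second child is the mapping token5:token6. The check on `:` rejects nested mappings such as User:ZhangSan:Delete+Query.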

Building on this semantic expression, CML defines two standard string formats. They differ in the stage at which the semantic token (semantic_token) itself is marked and in the form used to mark it.

Encoding Rules Diagram

Natural Language Format

The natural language format gives document engineers a plain-text editing experience and suits human-readable scenarios.

It uses the backtick marker (inline code) from Markdown syntax[15] to wrap semantic tokens (semantic_token). Document engineers can use WYSIWYG Markdown editors as a plain text editing environment for semantic structures, which is highly efficient and intuitive.

For example, writing a plain text string in Markdown:

`token1`.`token2`@`token3`+`token4` `token5`:`token6`

A Markdown editor can render this in real time as the following natural semantic form, which is clear at a glance:

token1.token2@token3+token4 token5:token6
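The backtick-wrapped form is also straightforward to split back into primitives. The following sketch assumes that a token never contains a backtick and that exactly one separator character sits between two tokens; `tokenize` and `check_alternation` are hypothetical helper names, not part of CML.

```python
import re

# Assumed grammar: a backtick-wrapped token, or one of the five separator characters.
_PRIMITIVE = re.compile(r"`([^`]*)`|([:.@+ ])")


def tokenize(cml: str) -> list[tuple[str, str]]:
    """Split a string into ("token", text) and ("sep", char) pairs."""
    primitives = []
    pos = 0
    while pos < len(cml):
        match = _PRIMITIVE.match(cml, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {cml[pos]!r}")
        if match.group(1) is not None:
            primitives.append(("token", match.group(1)))
        else:
            primitives.append(("sep", match.group(2)))
        pos = match.end()
    return primitives


def check_alternation(primitives: list[tuple[str, str]]) -> bool:
    """Enforce the pattern token, separator, token, ..., token."""
    if len(primitives) % 2 == 0:
        raise ValueError("a CML string must start and end with a semantic token")
    for i, (kind, _) in enumerate(primitives):
        expected = "token" if i % 2 == 0 else "sep"
        if kind != expected:
            raise ValueError(f"expected a {expected} at position {i}")
    return True


prims = tokenize("`token1`.`token2`@`token3`+`token4` `token5`:`token6`")
print(check_alternation(prims), len(prims))
```

Wrapping tokens in backticks is what lets a token contain separator characters such as a space or a colon without confusing the scan.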

Encoding Format

Backtick-wrapped CML strings, separators included, can introduce unexpected parsing boundaries and escaping requirements in certain special scenarios.

Therefore, for embedding, storage, parsing, and computation scenarios, CML defines a safer and more consistent encoded output format.

Encoding Format Diagram

Example:

`token1`.`token2`@`token3`+`token4` `token5`:`token6`

The plain-text CML string above is encoded in the following steps:

1. Sequentially extract the semantic tokens and relation separators from the original CML string.
2. Encode each semantic token (excluding the backticks) into a byte stream using UTF-8, then apply Base58 encoding[16] to the byte stream to produce a Base58 string.

              token1 → encode → zyvFCwFv
              token2 → encode → zyvFCwFw
              token3 → encode → zyvFCwFx
              token4 → encode → zyvFCwFy
              token5 → encode → zyvFCwFz
              token6 → encode → zyvFCwG1

3. Reassemble the string using the original relation separators (in effect, this replaces the backtick wrapping with Base58).

zyvFCwFv.zyvFCwFw@zyvFCwFx+zyvFCwFy zyvFCwFz:zyvFCwG1

4. Finally, apply UTF-8 + Base58 encoding again to the entire string to eliminate all special characters.

3EkzyE8r5SqnU6KSbLS98LVLJxFoNvskzaazkuEEryWminqaGwJz13YoatvfoRWoDyrofwUCQ
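The encoding steps can be sketched as follows. The text does not say which Base58 alphabet is used; this sketch assumes the Bitcoin-style alphabet, which yields the per-token values listed above. The function names `b58encode`, `encode_tokens`, and `encode_cml` are hypothetical.

```python
import re

# Assumption: Bitcoin-style Base58 alphabet (no 0, O, I, or l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Base58-encode a byte string."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by the first alphabet character.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def encode_tokens(cml: str) -> str:
    """Replace every backtick-wrapped token with its Base58 form; keep separators."""
    return re.sub(
        r"`([^`]*)`",
        lambda match: b58encode(match.group(1).encode("utf-8")),
        cml,
    )


def encode_cml(cml: str) -> str:
    """Encode tokens first, then Base58-encode the reassembled string as a whole."""
    return b58encode(encode_tokens(cml).encode("utf-8"))


example = "`token1`.`token2`@`token3`+`token4` `token5`:`token6`"
print(encode_tokens(example))
print(encode_cml(example))
```

The two-pass design means the intermediate string still exposes the separators for structural processing, while the final string contains only Base58 characters and can be embedded without escaping.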

Line Breaks and Multiple Spaces

In natural editing, especially in long-context tagging scenarios, humans may prefer to split long strings with line breaks for readability rather than strictly using spaces. Therefore, the CML editor should support symbol equivalence, automatically converting various line break symbols like \n, \r\n, and \r to a space symbol during parsing and storage of plain text format.

Similarly, in human editing scenarios, users may want to insert multiple spaces in a single line to enhance the visual clarity of structural separation. The CML editor should also support this equivalence.

Before Base58 encoding, every run of line breaks and spaces should be explicitly collapsed into a single space symbol.
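A normalization pass of this kind might look like the sketch below. It assumes the backtick-wrapped format and, as a further assumption not stated in the text, leaves whitespace inside a token untouched; `normalize_whitespace` is a hypothetical name.

```python
import re


def normalize_whitespace(cml: str) -> str:
    """Collapse runs of spaces and line breaks outside tokens into one space."""
    # The capturing group keeps the tokens in the result:
    # even indices hold text outside tokens, odd indices hold tokens.
    parts = re.split(r"(`[^`]*`)", cml)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[ \r\n]+", " ", parts[i])
    return "".join(parts).strip(" ")


raw = "`a`.`b`\r\n`c`:`d`   `e`+`f`\n"
print(normalize_whitespace(raw))
```

The sketch does not remove whitespace placed next to another separator (for example around a +), so such input would still need to be rejected or cleaned separately.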

The five relation separators are an abstraction of the semantic expression structures found in natural language, chosen with careful attention to expression ambiguity and structural controllability.

Natural Language Reference

The five relationship delimiters defined by CML are based on the most fundamental semantic structures implicit in natural language (modification, succession, parallel, contrast, combination). These five semantic structures are independent of the grammatical expression styles of specific natural languages such as Chinese, English, French, or Japanese.


            Natural Language Examples:
            Example: The man^A who is tall^B and wearing a hat^C    (A←B←C)
            Example: He gets up^A, dresses^B, and goes out^C    (A→B→C)
            Example: I like apples^A, bananas^B, and oranges^C    {A, B, C}
            Example: ChatGPT^A, Claude^B, and DeepSeek^C are all AI^D, also called LLM^E    (A+B+C ↦ D+E)
            Example: This flower^A + is beautiful^B = This flower is beautiful^A+B    (A+B=AB)

In theory, the combination of these five primitive relationships can express the vast majority of logical relationships, offering considerable completeness.

Semantics Within Tokens

CML does not provide native support for logical relationships such as exclusion, quantifier ranges, or nested mappings. Instead, these are delegated to the token layer, where they can be expressed flexibly in conjunction with supplementary relationships. This avoids structural pollution, which would complicate precedence handling and make strings harder for humans to read.

This is because tokens themselves can also express structure.


            Natural Language Example: "His age is approximately 18~25 years old."
            Annotation:
            ❌ `age`:`range`:`18-25`    (Illegal nesting and unnecessary, can be merged forward or backward)
            ❌ `age`:`>18`+`<25`    (No need to split so finely)
            ✅ `age`:`range:18-25`    (Nesting within the token aligns with natural semantics)
            ✅ `age`:`18-25`

This reflects the principle of minimalism in structure:

The CSE paradigm does not attempt to explicitly resolve all structural semantic ambiguities in advance. Instead, during actual expression, the semantic tokens joined by a relation separator serve as the context for interpreting that relationship, allowing the LLM to infer the most reasonable semantic relationships and corresponding priority weights.

Core Value of Explicit Structure

Advocating the use of semantic structure expression within tokens also reflects the core value of CML in explicit encoding.

CML is close to natural language but not equivalent to it, because important semantics need to be declared explicitly and expressed in a controlled form. Token segmentation in natural language inference is entirely uncontrollable. Explicit segmentation, by contrast, breaks semantics down into vertical hierarchies and horizontal relationships, which eliminates relational ambiguities and ultimately enhances controllability and interpretability.