From XML to JSON to CBOR

Jul 30, 2025 - 13:00


A Lingua Franca for Data?

In modern computing, data exchange is foundational to everything from web browsing to microservices and IoT devices. The ability for different systems to represent, share, and interpret structured information drives our digital world. Yet no single perfect format has emerged to meet all needs. Instead, we've seen an evolution of data interchange formats, each addressing the specific challenges and technical requirements of its time.

This narrative traces three pivotal data formats: Extensible Markup Language (XML), JavaScript Object Notation (JSON), and Concise Binary Object Representation (CBOR). We explore their origins and motivations, examine their core design principles and inherent trade-offs, and follow their adoption trajectories within the evolving digital landscape. The journey begins with XML's focus on robust document structure, shifts to JSON's web-centric simplicity and performance, and advances to CBOR's binary efficiency for constrained devices. Understanding this evolution reveals not just technical specifications, but the underlying pressures driving innovation in data interchange formats.

The Age of Structure: XML's Rise from Publishing Roots

Modern data interchange formats trace back not to the web, but to challenges in electronic publishing decades earlier. SGML provided the complex foundation that XML would later refine and adapt for the internet age.

The SGML Inheritance: Laying the Foundation

In the 1960s-70s, IBM researchers Charles Goldfarb, Ed Mosher, and Ray Lorie created Generalized Markup Language (GML) to overcome proprietary typesetting limitations. Their approach prioritized content structure over presentation. GML later evolved into Standard Generalized Markup Language (SGML), formalized as ISO 8879 in 1986.

SGML innovated through its meta-language approach, providing rules for creating custom markup languages. It allowed developers to define specific vocabularies (tag sets) and grammars (Document Type Definitions or DTDs) for different document types, creating machine-readable documents with exceptional longevity independent of processing technologies.

SGML gained traction in sectors managing complex documentation: government, military (CALS DTD), aerospace, legal publishing, and heavy industry. However, its 150+ page specification with numerous special cases complicated parser implementation, limiting broader adoption.

The web's emergence proved pivotal for markup languages. Tim Berners-Lee selected SGML as HTML's foundation due to its text-based, flexible, non-proprietary nature. Dan Connolly created the first HTML DTD in 1992. While HTML became ubiquitous, it drifted toward presentation over structure, with proliferating browser-specific extensions. SGML remained too complex for widespread web use, creating demand for a format that could bring SGML's structural capabilities to the internet in a more accessible form.

W3C and the Birth of XML: Taming SGML for the Web

By the mid-1990s, the web needed more structured data exchange beyond HTML's presentational focus. In 1996, the W3C established an XML Working Group, chaired by Jon Bosak of Sun Microsystems, to create a simplified SGML subset suitable for internet use while maintaining extensibility and structure.

The W3C XML Working Group developed XML with clear design goals, formalized in the XML 1.0 specification (W3C Recommendation, February 1998):

  1. Internet Usability: Straightforward use over the internet
  2. Broad Applicability: Support for diverse applications beyond browsers
  3. SGML Compatibility: XML documents should be conforming SGML documents
  4. Ease of Processing: Simple program development for XML processing
  5. Minimal Optional Features: Few or no optional features
  6. Human Readability: Legible and clear documents
  7. Rapid Design: Quick design process
  8. Formal and Concise Design: Formal specification amenable to standard parsing
  9. Ease of Creation: Simple document creation with basic tools
  10. Terseness is Minimally Important: Conciseness was not prioritized over clarity

SGML compatibility was strategically crucial. By defining XML as a valid SGML subset, existing SGML parsers and tools could immediately process XML documents when the standard was released in 1998. This lowered adoption barriers for organizations already using SGML and provided an instant software ecosystem. The constraint also helped the working group achieve rapid development by limiting design choices, demonstrating an effective strategy for launching the new standard.

Designing XML: Tags, Attributes, Namespaces, and Schemas

XML's structure uses nested elements marked by tags. An element consists of a start tag (<person>), an end tag (</person>), and content between them, which can be text or other nested elements. Start tags can contain attributes for metadata (<person id="123">). Empty elements use the shorthand <retired/> or the equivalent pair <retired></retired>. This hierarchical structure makes data organization explicit and human-readable.
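
To make this concrete, here is a minimal sketch using Python's standard-library ElementTree parser; the person document and its field names are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<person id="123">
    <name>Alice</name>
    <retired/>
</person>
"""

root = ET.fromstring(doc)
print(root.tag)                   # person
print(root.attrib["id"])          # 123 -- attribute metadata on the start tag
print(root.find("name").text)     # Alice -- text content of a nested element
print(root.find("retired").text)  # None -- an empty element has no content
```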

As XML usage expanded, combining elements from different vocabularies created naming conflicts. The "Namespaces in XML" Recommendation (January 1999) addressed this by qualifying element and attribute names with globally unique identifiers, typically URIs. This uses the xmlns attribute, often with a prefix (xmlns:addr="http://www.example.com/addresses"), creating uniquely identified elements (<addr:street>). Default namespaces can be declared (xmlns="URI") for un-prefixed elements, but don't apply to attributes. Though URIs ensure uniqueness, they needn't point to actual online resources.
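
A short sketch of how a namespace-aware parser sees such a document, again using Python's ElementTree; the address vocabulary mirrors the example above. Note that lookups are keyed on the URI, not on the author's arbitrary prefix.

```python
import xml.etree.ElementTree as ET

doc = """
<addr:address xmlns:addr="http://www.example.com/addresses">
    <addr:street>123 Main St</addr:street>
</addr:address>
"""

root = ET.fromstring(doc)
# ElementTree expands prefixes into {namespace-URI}local-name form:
print(root.tag)  # {http://www.example.com/addresses}address

# The lookup prefix ("a") deliberately differs from the document's
# prefix ("addr") -- only the URI binding matters.
ns = {"a": "http://www.example.com/addresses"}
print(root.find("a:street", ns).text)  # 123 Main St
```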

XML documents are validated using schema languages. XML initially used Document Type Definitions (DTDs) from SGML, which define allowed elements, attributes, and nesting rules. To overcome DTD limitations (non-XML syntax, poor type support), the W3C developed XML Schema Definition (XSD), standardized in 2001. XSD offers powerful structure definition, rich data typing, and rules for cardinality and uniqueness. XSD schemas are themselves written in XML.
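
As an illustration of XSD's rich typing, the following sketch validates documents against an inline schema using the third-party lxml library (pip install lxml); the schema and documents are invented.

```python
from lxml import etree

schema_doc = etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:nonNegativeInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""")
schema = etree.XMLSchema(schema_doc)  # the schema is itself XML

good = etree.fromstring(b"<person><name>Alice</name><age>30</age></person>")
bad  = etree.fromstring(b"<person><name>Alice</name><age>-1</age></person>")

print(schema.validate(good))  # True
print(schema.validate(bad))   # False -- -1 fails the nonNegativeInteger type
```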

XML's structure enabled supporting technologies: XPath for node selection, XSL Transformations (XSLT) for document transformation, and APIs like Document Object Model (DOM) for in-memory representation or Simple API for XML (SAX) for event-based streaming.
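
A brief sketch of two of these APIs from Python's standard library, using an invented book catalog: XPath-style selection (ElementTree implements a limited XPath subset) and SAX's event-based streaming.

```python
import xml.etree.ElementTree as ET
import xml.sax

doc = """<catalog>
    <book lang="en"><title>SGML Handbook</title></book>
    <book lang="fr"><title>XML en action</title></book>
</catalog>"""

# XPath-style selection: titles of English-language books only.
root = ET.fromstring(doc)
for title in root.findall("./book[@lang='en']/title"):
    print(title.text)  # SGML Handbook

# SAX delivers parse events as it streams through the input, so
# documents larger than memory can be processed (unlike DOM, which
# builds the whole tree in memory first).
class TitleCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "title":
            self.count += 1

handler = TitleCounter()
xml.sax.parseString(doc.encode("utf-8"), handler)
print(handler.count)  # 2
```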

While XML effectively modeled complex data structures with extensibility and validation, its power introduced complexity. Creating robust XSD schemas was challenging, leading some to prefer simpler alternatives like RELAX NG or Schematron. Namespaces solved naming collisions but complicated both document authoring and parser development. XML's flexibility allowed multiple valid representations of the same data, potentially hindering interoperability without strict conventions. This inherent complexity, combined with verbosity, eventually drove demand for simpler formats, especially where ease of use and performance outweighed validation and expressiveness. The tension between richness and simplicity significantly influenced subsequent data format evolution.

XML's Reign and Ripples: Adoption and Impact

Following its 1998 standardization, XML quickly became dominant across computing domains throughout the early 2000s, offering a standard, platform-independent approach for structured data exchange.

XML formed the foundation of Web Services through SOAP (Simple Object Access Protocol), an XML-based messaging framework operating over HTTP. Supporting technologies like WSDL (Web Services Description Language) and UDDI (Universal Description, Discovery and Integration) completed the "WS-*" stack for enterprise integration.
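
On the wire, a SOAP 1.1 call was simply an XML envelope POSTed over HTTP. A hedged sketch follows: the envelope namespace is the standard one, but the stock-quote service, endpoint URL, and SOAPAction value are hypothetical.

```python
import urllib.request

envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetPrice xmlns="http://www.example.com/stock">
      <Symbol>ACME</Symbol>
    </GetPrice>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://www.example.com/stockquote",  # hypothetical endpoint
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": '"http://www.example.com/GetPrice"',
    },
)
# urllib.request.urlopen(request) would return another SOAP envelope,
# which the client must parse as XML to extract the result.
```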

Configuration Files widely adopted XML due to its structure and readability. Examples include Java's Log4j, Microsoft .NET configurations (web.config, app.config), Apache Ant build scripts, and numerous system parameters.

In Document Formats and Publishing, XML fulfilled its original promise by powering XHTML, RSS and Atom feeds, KML geographic data, and specialized formats like DocBook. Its content-presentation separation proved valuable for multi-channel publishing and content management.

As a general-purpose Data Interchange format, XML facilitated cross-system communication while avoiding vendor lock-in and supporting long-term data preservation.

This widespread adoption fostered a rich ecosystem of XML parsers, editors, validation tools, transformation engines (XSLT), data binding utilities, and dedicated conferences, building a strong technical community.

The Seeds of Change: XML's Verbosity Challenge

Despite its success, XML carried the seeds of its own partial decline. A key design principle—"Terseness in XML markup is of minimal importance"—prioritized clarity over compactness, requiring explicit start and end tags for every element.

While enhancing readability, this structure created inherent verbosity. Simple data structures required significantly more characters in XML than in more compact formats. For example, {"name": "Alice"} in JSON versus <name>Alice</name> in XML: every element name is written twice, and the overhead compounds across large datasets with many small elements.
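
A quick way to see the overhead is to serialize the same record both ways and count bytes; this sketch uses Python, and the exact numbers depend on formatting choices such as whitespace and wrapper elements.

```python
import json

record = {"name": "Alice"}

as_json = json.dumps(record, separators=(",", ":"))
as_xml  = "<person><name>Alice</name></person>"

print(as_json, len(as_json.encode()))  # {"name":"Alice"} 16
print(as_xml,  len(as_xml.encode()))   # 35 -- each element name appears twice
```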

This verbosity became problematic as the web evolved. The rise of AJAX in the mid-2000s emphasized frequent, small data exchanges between browsers and servers for dynamic interfaces. In this context, minimizing bandwidth usage and parsing time became critical. XML's larger payloads and complex parsing requirements created performance bottlenecks.

The XML community recognized these efficiency concerns, leading to initiatives like the W3C's Efficient XML Interchange (EXI) Working Group, which developed a standardized binary XML format. While EXI offered significant compaction, it highlighted the challenge of retrofitting efficiency onto XML's tag-oriented foundation without adding complexity.

The decision to deprioritize terseness, while distinguishing XML from SGML, had unintended consequences. As the web shifted toward dynamic applications prioritizing speed and efficiency, XML's verbose structure became a liability. This created an opportunity for a format that would optimize for precisely what XML had considered minimal: conciseness and ease of parsing within web browsers and JavaScript.

The Quest for Simplicity: JSON's Emergence in the Web 2.0 Era

As XML's verbosity and complexity became problematic in web development, particularly with AJAX's rise, a simpler alternative emerged directly from JavaScript.

JavaScript's Offspring: Douglas Crockford and the "Discovery" of JSON

JSON (JavaScript Object Notation) originated with Douglas Crockford, an American programmer known for his JavaScript work. In 2001, Crockford and colleagues at State Software needed a lightweight format for data exchange between Java servers and JavaScript browsers without plugins like Flash or Java applets.

Crockford realized JavaScript's object literal syntax (e.g., { key: value }) could serve this purpose. Data could be sent from servers embedded in JavaScript snippets for browsers to parse, initially using the eval() function. Crockford describes this as a "discovery" rather than invention, noting similar techniques at Netscape as early as 1996.
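
The trick worked because JSON text is (almost) valid source code in the host language. Python's literals overlap with JSON in much the same way as JavaScript's, so the technique and its core flaw can be sketched here; a dedicated parser is what both ecosystems ultimately adopted.

```python
import json

payload = '{"name": "Alice", "admin": false}'

# eval()-style parsing treats the payload as source code. It requires
# papering over literal differences (false vs False) and, crucially,
# executes whatever the sender embedded -- the security flaw that made
# eval()-based parsing dangerous.
parsed = eval(payload.replace("false", "False"))
print(parsed["name"])  # Alice

# The safe replacement: a real parser that accepts only the JSON grammar.
print(json.loads(payload)["admin"])  # False
```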

The initial implementation sent HTML documents containing <script> tags, so the browser's own JavaScript interpreter delivered the data to the receiving page.