Show HN: I modeled the Voynich Manuscript with SBERT to test for structure


📜 Voynich Manuscript Structural Analysis

🔍 Overview

This started as a personal challenge to figure out what modern NLP could tell us about the Voynich Manuscript — without falling into translation speculation or pattern hallucination. I'm not a linguist or cryptographer. I just wanted to see if something as strange as Voynichese would hold up under real language modeling: clustering, POS inference, Markov transitions, and section-specific patterns.

Spoiler: it kinda did.

This repo walks through everything — from suffix stripping to SBERT embeddings to building a lexicon hypothesis. No magic, no GPT guessing. Just a skeptical test of whether the manuscript has structure that behaves like language, even if we don’t know what it’s saying.


🧠 Why This Matters

The Voynich Manuscript remains undeciphered, with no agreed linguistic or cryptographic solution. Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork. This project offers a middle path — using computational linguistics to assess whether the manuscript encodes real, structured language-like behavior.


📁 Project Structure

/data/
  AB.docx                         # Full transliteration with folio/line tags
  voynichese/                     # Root word .txt files
  stripped_cluster_lookup.json    # Cluster ID per stripped root
  unique_stripped_words.json      # All stripped root forms
  voynich_line_clusters.csv       # Cluster sequences per line

/scripts/
  cluster_roots.py                # SBERT clustering + suffix stripping
  map_lines_to_clusters.py        # Maps manuscript lines to cluster IDs
  pos_model.py                    # Infers grammatical roles from cluster behavior
  transition_matrix.py            # Builds and visualizes cluster transitions
  lexicon_builder.py              # Creates a candidate lexicon by section and role
  cluster_language_similarity.py  # (Optional) Compares clusters to real-world languages

/results/
  Figure_1.png                    # SBERT clusters (PCA reduced)
  transition_matrix_heatmap.png   # Markov transition matrix
  cluster_role_summary.csv
  cluster_transition_matrix.csv
  lexicon_candidates.csv

✅ Key Contributions

  • Clustering of stripped root words using multilingual SBERT (see the sketch after this list)
  • Identification of function-word-like vs. content-word-like clusters
  • Markov-style transition modeling of cluster sequences
  • Folio-based syntactic structure mapping (Botanical, Biological, etc.)
  • Generation of a data-driven lexicon hypothesis table
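
To make the clustering step concrete, here's a minimal sketch. The model name, cluster count, and toy root list are illustrative stand-ins, not the actual parameters, which live in scripts/cluster_roots.py:

# Minimal sketch of the SBERT clustering step. Model name, k, and the
# root list below are illustrative stand-ins; the real values are in
# scripts/cluster_roots.py.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

roots = ["qok", "chol", "shed", "dain", "okal"]  # hypothetical stripped roots

# A multilingual SBERT model assigns a dense vector to any character
# string, so even untranslatable Voynichese roots land in the space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(roots)

# Cluster the embeddings; the cluster count is a free parameter
# chosen by inspection.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for root, label in zip(roots, labels):
    print(root, "-> cluster", label)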

🔧 Preprocessing Choices

One of the most important assumptions I made was about how to handle the Voynich words before clustering. Specifically: I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants. The goal was to isolate what looked like root forms that repeated with variation, under the assumption that these suffixes might be:

  • Phonetic padding
  • Grammatical particles
  • Chant-like or mnemonic repetition
  • Or… just noise

This definitely improved the clustering behavior — similar stems grouped more tightly, and the transition matrix showed cleaner structural patterns. But it's also a strong preprocessing decision that may have:

  • Removed actual morphological information
  • Disguised meaningful inflectional variants
  • Introduced a bias toward function over content

So it’s not neutral — it helped, but it also shaped the results. If someone wants to fork this repo and re-run the pipeline without suffix stripping — or treat suffixes as their own token class — I’d be genuinely interested in the comparison.
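
For reference, here's a minimal sketch of the kind of stripping heuristic described above. The suffix list echoes the examples in the text, and the tokens are made up; the actual list and tie-breaking rules in scripts/cluster_roots.py may differ:

# Sketch of the suffix-stripping heuristic. The suffix list and the
# sample tokens are illustrative; the real rules may differ.
SUFFIXES = ["aiin", "dy", "chy"]
SUFFIXES.sort(key=len, reverse=True)  # longest match wins

def strip_suffix(word: str, min_root: int = 2) -> str:
    """Remove the first matching suffix, keeping at least min_root chars."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word

# Hypothetical Voynichese tokens, for illustration only.
for token in ["qokaiin", "chedy", "otchy", "dal"]:
    print(token, "->", strip_suffix(token))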


📈 Key Findings

  • Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
  • Cluster 3 has high diversity and flexible positioning — likely a root content class
  • Transition matrix shows strong internal structure, far from random (see the sketch after this list)
  • Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs. Botanical)
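
To show where statistics like these come from, here's a self-contained sketch of the line-start and transition computations, run over invented cluster sequences rather than real manuscript data. The actual versions live in scripts/pos_model.py and scripts/transition_matrix.py:

# Sketch of the statistics behind these findings, over invented data.
from collections import Counter, defaultdict

# Each inner list is the cluster-ID sequence of one manuscript line.
lines = [[8, 3, 5, 3], [8, 5, 3], [3, 8, 5], [8, 3, 3, 5]]

# Line-start frequency: how often each cluster opens a line.
starts = Counter(line[0] for line in lines)
print("line starts:", dict(starts))

# First-order (Markov) transition counts between consecutive clusters.
transitions = defaultdict(Counter)
for line in lines:
    for a, b in zip(line, line[1:]):
        transitions[a][b] += 1

# Row-normalize counts into P(next cluster | current cluster).
for a, row in sorted(transitions.items()):
    total = sum(row.values())
    print(f"from {a}:", {b: round(n / total, 2) for b, n in row.items()})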

🧬 Hypothesis

The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.


▶️ How to Reproduce

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run each stage of the pipeline
python scripts/cluster_roots.py
python scripts/map_lines_to_clusters.py
python scripts/pos_model.py
python scripts/transition_matrix.py
python scripts/lexicon_builder.py
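
If you want to poke at the outputs without re-running anything, something like this should re-plot the transition heatmap (assuming the CSV is stored as a square matrix with cluster IDs as both index and columns; adjust if the layout differs):

# Re-plot the transition heatmap from the pipeline output. Assumes
# cluster_transition_matrix.csv is a square matrix with cluster IDs
# as both index and columns.
import pandas as pd
import matplotlib.pyplot as plt

matrix = pd.read_csv("results/cluster_transition_matrix.csv", index_col=0)

plt.imshow(matrix.values, cmap="viridis")
plt.colorbar(label="transition probability")
plt.xticks(range(len(matrix.columns)), matrix.columns)
plt.yticks(range(len(matrix.index)), matrix.index)
plt.xlabel("next cluster")
plt.ylabel("current cluster")
plt.title("Cluster transition matrix")
plt.show()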

📊 Example Visualizations

📌 Figure 1: SBERT cluster embeddings (PCA-reduced)

📌 Figure 2: Transition Matrix Heatmap


📌 Limitations

  • Cluster-to-word mappings are indirect — many distinct words share a cluster, so per-word frequency estimates can overlap
  • Suffix stripping is heuristic and may remove meaningful endings
  • No semantic translation attempted — only structural modeling

✍️ Author’s Note

This project was built as a way to learn — about AI, NLP, and how far structured analysis can get you without assuming what you're looking at. I’m not here to crack the Voynich. But I do believe that modeling its structure with modern tools is a better path than either wishful translation or academic dismissal.

So if you're here for a Rosetta Stone, you're out of luck.

If you're here to model a language that may not want to be modeled — welcome.


🤝 Contributions Welcome

This project is open to extensions, critiques, and collaboration — especially from linguists, cryptographers, conlang enthusiasts, and computational language researchers.
