Semantic Integrity: Enforcing "Ground Truth" in Probabilistic Models

The Core Mission

The Problem: The "Fluency Mask"

Neural Machine Translation (NMT) models are excellent at mimicry: they produce sentences that sound perfect but can be factually wrong. In technical domains, this "Fluency Mask" is dangerous.

Example: A generic NMT model translates a semiconductor patent claim:

Source (English):
"...wherein the current flows through the substrate layer..."

❌ NMT Output (Fluent but Wrong):
"...le courant circulant à travers la couche de substrat..."

Problem: "current" → "courant" (fluent rendering, wrong domain)
Context confusion — the model reads "current" as electrical current,
when the patent actually describes water flow in a cooling system.

The model settled on the electrical reading of "courant" because electrical contexts appear far more frequently in its training data than fluidic-flow contexts. It optimized for statistical probability, not technical accuracy.
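This frequency bias can be illustrated with a toy decoder step. The candidate renderings and scores below are invented for illustration only; a real NMT model scores thousands of subword tokens, but the failure mode is the same: the highest-probability candidate wins regardless of domain.

```python
# Toy decoder step: scores are invented for illustration only.
# A generic decoder picks the most probable rendering, which tracks
# corpus frequency, not the domain of the invention.
candidates = {"courant": 0.62, "écoulement": 0.21, "flux": 0.17}
picked = max(candidates, key=candidates.get)
print(picked)  # → courant (the high-frequency electrical reading wins)
```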

The Solution: Data-Driven Calibration

Semantic Integrity is the systematic process of aligning the model's vector space with the physical reality of the invention. This is not a one-time fix — it is an ongoing curation process fed by expert Human-in-the-Loop (HITL) validation.

The Goal: A model that doesn't just predict likely words, but understands domain-specific constraints that override statistical frequency.


The Three Levels of Semantic Fidelity

We align the model across three critical layers, each addressing a distinct failure mode:

Level 1: Anti-Hallucination (The Safety Layer)

Objective: Elimination of "Critical Failures" — instances where the model invents concepts not present in the source text.

The Challenge: NMT models generate translations token-by-token based on probability distributions. When faced with ambiguous or rare terms, they "hallucinate" by selecting high-frequency alternatives that fit grammatically but distort meaning.

Common Hallucination Patterns in Patent Translation

Pattern 1: Polysemy Misresolution

Source: "The chip performs a U-turn during signal routing..."

❌ Generic NMT:
"La puce effectue un demi-tour pendant le routage du signal..."
(Treats "U-turn" as literal vehicular maneuver)

✓ Domain-Aligned Model:
"La puce effectue un retour en U pendant le routage du signal..."
(Recognizes "U-turn" as chip routing terminology)

Pattern 2: Context-Free Substitution

Source: "...the package comprises a leadframe..."

❌ Generic NMT:
"...l'emballage comprend un cadre de plomb..."
("leadframe" → "cadre de plomb", literally a frame made of the metal lead,
suggesting material composition rather than a semiconductor component)

✓ Domain-Aligned Model:
"...le boîtier comprend un cadre de connexion..."
(Recognizes "leadframe" as standard semiconductor packaging term)

Why It Matters: Hallucinations can broaden or narrow the scope of protection in ways that contradict the inventor's intent. A "lead frame" made of lead is not the same invention as a "leadframe" (connection structure).
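A first line of defense against these patterns can be sketched as a simple post-translation check: a hand-curated list of banned literal renderings (a deliberately tiny, illustrative glossary below, not a production resource) that flags a segment for HITL review whenever a known-hallucination rendering appears in the target.

```python
# Sketch of a glossary-based hallucination flag. BANNED_RENDERINGS is
# a tiny illustrative table drawn from the patterns above, not a
# production glossary: if the source term appears, any banned literal
# rendering in the target marks the segment for HITL review.
BANNED_RENDERINGS = {
    "leadframe": ["cadre de plomb"],
    "u-turn": ["demi-tour"],
}

def flag_hallucinations(source: str, target: str) -> list[str]:
    """Return the banned renderings found in this segment pair."""
    src, tgt = source.lower(), target.lower()
    flags = []
    for term, banned in BANNED_RENDERINGS.items():
        if term in src:
            flags.extend(b for b in banned if b in tgt)
    return flags
```

For the leadframe example above, the pair ("...comprises a leadframe...", "...comprend un cadre de plomb...") returns ["cadre de plomb"] and is routed to expert review, while the domain-aligned output passes cleanly.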

→ View Level 1 Case Catalog


Level 2: Terminological Precision (The Accuracy Layer)

Objective: Override generic synonyms with specific, client-approved nomenclature.

The Challenge: NMT models see "plastic," "resin," "polymer," and "thermoplastic" as interchangeable because they sit close together in semantic vector space. But in patent prosecution, these terms are legally distinct.

Source: "The housing is formed from a thermoplastic resin..."

❌ Generic NMT (Random Synonym Selection):
  Translation 1: "Le boîtier est formé d'une résine thermoplastique..."
  Translation 2: "Le boîtier est formé d'un plastique thermoplastique..."
  Translation 3: "Le boîtier est formé d'un polymère thermoplastique..."

Problem: All three are linguistically correct, but only ONE matches
the client's approved terminology and prior art landscape.

Legal Impact: If a competitor's patent uses "polymer" and your client's patent uses "thermoplastic resin," the terminological distinction may be the basis for establishing product differentiation.
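A minimal sketch of how such a constraint might be enforced at review time, assuming a client glossary that maps each source term to exactly one approved rendering (both tables below reuse the thermoplastic-resin example and are illustrative, not a real client glossary):

```python
# Sketch of a client-glossary check. APPROVED maps each source term to
# its single client-approved rendering; SYNONYMS lists the fluent but
# unapproved alternatives. Both tables are illustrative.
APPROVED = {"thermoplastic resin": "résine thermoplastique"}
SYNONYMS = {
    "thermoplastic resin": [
        "plastique thermoplastique",
        "polymère thermoplastique",
    ],
}

def check_terminology(source: str, target: str) -> list[str]:
    """Return terminology issues for one segment pair."""
    src, tgt = source.lower(), target.lower()
    issues = []
    for term, approved in APPROVED.items():
        if term not in src:
            continue
        if approved not in tgt:
            issues.append(f"missing approved term: {approved!r}")
        issues += [f"unapproved synonym: {syn!r}"
                   for syn in SYNONYMS.get(term, []) if syn in tgt]
    return issues
```

Translation 1 above passes with no issues; Translations 2 and 3 are each flagged twice: once for the missing approved term and once for the unapproved synonym.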

→ View Level 2 Case Catalog


Level 3: In-Context Consistency (The Coherence Layer)

Objective: Ensure that terminology remains stable across the entire document.

The Challenge: Generic NMT models have no long-term memory. Each sentence is translated semi-independently, leading to catastrophic inconsistency in multi-claim patent documents.

The Consistency Failure Pattern

Claim 1: "A device comprising a guide member..."
→ "Dispositif comprenant un élément de guidage..."  

Claim 5: "The device of claim 1, wherein the guide member..."
→ ❌ "Le dispositif selon la revendication 1, le guide..."
  (Switches from "élément de guidage" → "guide")

Claim 9: "The device of claim 1, wherein the guide member..."
→ ❌ "Le dispositif selon la revendication 1, le membre directeur..."
  (Switches to entirely different term "membre directeur")

Why It Matters: Patent examiners and courts interpret term variation as intentional claim differentiation. If "guide member" becomes three different French terms, the claims may be rejected for indefiniteness.
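One way to surface this drift automatically is a document-level consistency pass. The sketch below assumes the reviewer supplies the source term and its plausible target renderings; more than one observed rendering signals drift and routes the document to HITL review.

```python
from collections import defaultdict

def check_consistency(segments, source_term, target_terms):
    """Map each observed target rendering of source_term to the
    1-based segment indices where it appears. More than one key
    in the result signals terminology drift."""
    seen = defaultdict(set)
    for i, (src, tgt) in enumerate(segments, start=1):
        if source_term in src.lower():
            for rendering in target_terms:
                if rendering in tgt.lower():
                    seen[rendering].add(i)
    return dict(seen)
```

Run over the three claims above with source term "guide member" and candidate renderings ["élément de guidage", "guide", "membre directeur"], the result has three keys — three distinct renderings of a single claim element, i.e. exactly the drift an examiner would object to.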

→ View Level 3 Case Catalog


The Process: The "Living" Dataset

This section represents the most dynamic aspect of the alignment methodology — a continuous feedback loop where every correction becomes future training signal.

Patent Translation Request
    ↓
NMT Initial Output
    ↓
HITL Expert Review ← [Semantic Errors Identified]
    ↓
Manual Correction
    ↓
Error Annotation (Label Studio)
    ↓
Gold Set Corpus Addition
    ↓
Improved Model (next iteration)

Data Harvesting from TMPE

Every time a human expert corrects an NMT output:

  • The correction is logged with metadata (error type, domain, client)
  • Patterns are extracted (e.g., "model consistently mistranslates 'leadframe'")
  • The model is periodically retrained with updated Gold Set
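A minimal sketch of what one harvested correction might look like as a JSON-lines entry in the Gold Set corpus. The field names here are illustrative assumptions, not the actual TMPE schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CorrectionRecord:
    # Field names are illustrative assumptions, not the TMPE schema.
    source: str
    nmt_output: str
    corrected: str
    error_type: str  # e.g. "hallucination", "terminology", "consistency"
    domain: str
    client: str
    timestamp: str

def log_correction(record: CorrectionRecord, gold_set_path: str) -> None:
    """Append one correction as a JSON line to the Gold Set corpus."""
    with open(gold_set_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
```

An append-only JSON-lines file keeps every correction queryable by metadata (error type, domain, client), which is what makes pattern extraction and periodic retraining straightforward.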

Semantic Integrity vs. Structural Compliance

Dimension       | Semantic Integrity                          | Structural Compliance
----------------|---------------------------------------------|--------------------------------------------
What It Governs | What is being said                          | How it must be said
Failure Mode    | Factual incorrectness                       | Legal non-compliance
Example Error   | "Current" (water) → "courant" (electricity) | Infinitive verb instead of nominalized form
Consequence     | Wrong scope of protection                   | Invalid claim structure
Detection       | Requires domain expertise                   | Can be rule-validated
Training Method | HITL + NER annotation                       | Dependency parsing + regex
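The "can be rule-validated" distinction can be made concrete with one such rule. The sketch below checks the canonical French dependent-claim formula "selon la revendication N"; it is a simplified illustration of regex-based structural validation, not a full compliance checker, and semantic errors stay out of its reach by design.

```python
import re

# One rule-validated structural check: a French dependent claim must
# cite its antecedent with the canonical "selon la revendication N"
# formula. This is a simplified illustration; semantic errors such as
# "courant" for a water flow are invisible to rules like this one and
# still require expert HITL review.
DEPENDENT_CITATION = re.compile(r"selon la revendication \d+")

def cites_antecedent(fr_claim: str) -> bool:
    return bool(DEPENDENT_CITATION.search(fr_claim))
```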