4. Schemas & Governance

4.1 RulePack JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "RulePack",
  "type": "object",
  "required": ["metadata", "rules"],
  "properties": {
    "metadata": {
      "type": "object",
      "required": ["pack_id", "version", "compliance"],
      "properties": {
        "pack_id": {"type": "string"},
        "version": {"type": "string", "pattern": "^\\d+\\.\\d+\\.\\d+$"},
        "compliance": {
          "type": "array",
          "items": {"type": "string"}
        }
      }
    },
    "rules": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["rule_id", "type", "field", "operator", "error_message"],
        "properties": {
          "rule_id": {"type": "string"},
          "type": {"enum": ["FATAL", "WARNING", "INFO"]},
          "field": {"type": "string"},
          "operator": {"enum": ["==", "!=", "<", "<=", ">", ">=", "in", "not_in", "contains", "matches"]},
          "value": {},
          "error_message": {"type": "string"}
        }
      }
    }
  }
}
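
The schema names the rule operators but does not define how they are evaluated. The following is a minimal sketch of one possible evaluator, using only the Python standard library. It assumes that each rule states a condition the record must satisfy, that `error_message` is reported when the condition fails, and that a missing field counts as a failure. The operator semantics, the `evaluate_rule` helper, and the sample pack are illustrative assumptions, not part of the CORTX specification.

```python
import re
from typing import Any, Dict, Optional

# Assumed operator semantics: the schema lists the operator names only.
OPERATORS = {
    "==": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
    "<": lambda actual, expected: actual < expected,
    "<=": lambda actual, expected: actual <= expected,
    ">": lambda actual, expected: actual > expected,
    ">=": lambda actual, expected: actual >= expected,
    "in": lambda actual, expected: actual in expected,
    "not_in": lambda actual, expected: actual not in expected,
    "contains": lambda actual, expected: expected in actual,
    "matches": lambda actual, expected: re.search(expected, actual) is not None,
}


def evaluate_rule(rule: Dict[str, Any], record: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return a finding if the record fails the rule, otherwise None."""
    field = rule["field"]
    if field in record:
        passed = OPERATORS[rule["operator"]](record[field], rule.get("value"))
    else:
        passed = False  # assumption: a missing field fails the rule
    if passed:
        return None
    return {
        "rule_id": rule["rule_id"],
        "type": rule["type"],
        "message": rule["error_message"],
    }


# Hypothetical pack that conforms to the schema above.
pack = {
    "metadata": {"pack_id": "example-pack", "version": "1.0.0", "compliance": ["GTAS"]},
    "rules": [
        {"rule_id": "R001", "type": "FATAL", "field": "amount", "operator": ">=",
         "value": 0, "error_message": "Amount must not be negative"},
        {"rule_id": "R002", "type": "WARNING", "field": "account", "operator": "matches",
         "value": r"^\d{6}$", "error_message": "Account should be six digits"},
    ],
}

record = {"amount": -5, "account": "101000"}
findings = []
for rule in pack["rules"]:
    finding = evaluate_rule(rule, record)
    if finding is not None:
        findings.append(finding)
```

With the sample record, only `R001` produces a finding, because the amount is negative while the account value satisfies the pattern.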

4.2 WorkflowPack YAML Schema

$schema: http://json-schema.org/draft-07/schema#
title: WorkflowPack
type: object
required: [workflow_id, version, steps]
properties:
  workflow_id:
    type: string
  version:
    type: string
    pattern: '^\d+\.\d+\.\d+$'
  steps:
    type: array
    items:
      type: object
      required: [id, type]
      properties:
        id: {type: string}
        type:
          enum: [data-source, validation, calculation, decision, approval, ai-inference, data-sink]
        config:
          type: object
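
As an illustration of what the schema above accepts, the sketch below checks a workflow definition (already parsed into a Python dictionary) against the same constraints: required properties, the version pattern, and the step type enumeration. It is a hand-written approximation using only the standard library, not a full JSON Schema validator. The `check_workflow_pack` helper and the example workflow are hypothetical.

```python
import re
from typing import Any, Dict, List

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
STEP_TYPES = (
    "data-source", "validation", "calculation", "decision",
    "approval", "ai-inference", "data-sink",
)


def check_workflow_pack(pack: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for key in ("workflow_id", "version", "steps"):
        if key not in pack:
            errors.append(f"missing required property: {key}")

    version = pack.get("version")
    # fullmatch rejects a trailing newline, which a bare `$` would tolerate
    if isinstance(version, str) and not SEMVER.fullmatch(version):
        errors.append(f"version: {version!r} is not MAJOR.MINOR.PATCH")

    steps = pack.get("steps", [])
    if not isinstance(steps, list):
        return errors + ["steps: expected an array"]

    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"steps[{index}]: expected an object")
            continue
        for key in ("id", "type"):
            if key not in step:
                errors.append(f"steps[{index}]: missing required property: {key}")
        if "type" in step and step["type"] not in STEP_TYPES:
            errors.append(
                f"steps[{index}].type: {step['type']!r} is not an allowed step type"
            )
    return errors


# Hypothetical workflow; the last step uses a type outside the enumeration.
example = {
    "workflow_id": "example-workflow",
    "version": "1.2.0",
    "steps": [
        {"id": "load", "type": "data-source", "config": {"path": "input.csv"}},
        {"id": "check", "type": "validation"},
        {"id": "notify", "type": "email"},
    ],
}
errors = check_workflow_pack(example)
```

Here `errors` holds a single entry reporting that `email` is not an allowed step type.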

4.3 Pack Governance

Approval Workflow:

Draft: Author creates pack in designer
Review: Business analyst reviews functional correctness
Compliance: Compliance officer certifies regulatory alignment
Testing: QA validates with test data
Approval: Admin approves for deployment
Deployment: Pack pushed to production registry
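
The stages above form a strictly ordered sequence, which can be sketched as a small state machine. This is a minimal illustration that models only forward movement through the listed stages. How rejections or rework are handled is not specified here, so the sketch leaves that out. The function names are hypothetical.

```python
from typing import Optional

# Stage names taken from the approval workflow above, in order.
STAGES = ("Draft", "Review", "Compliance", "Testing", "Approval", "Deployment")


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after Deployment."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    if index + 1 < len(STAGES):
        return STAGES[index + 1]
    return None


def can_transition(current: str, target: str) -> bool:
    """Allow a move only to the immediately following stage (no skipping)."""
    return next_stage(current) == target
```

For example, `can_transition("Review", "Compliance")` is true, while `can_transition("Review", "Testing")` is false because it would skip compliance certification.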

Versioning:

Semantic versioning: MAJOR.MINOR.PATCH
Breaking changes: MAJOR increment, requires re-certification
Backward-compatible features: MINOR increment
Bug fixes: PATCH increment
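
The versioning rules above can be sketched as a small helper that computes the next version and flags when re-certification is needed. Resetting the lower components to zero follows the usual semantic versioning convention. The change labels (`breaking`, `feature`, `fix`) and the function names are illustrative assumptions.

```python
import re

SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump(version: str, change: str) -> str:
    """Return the next version for a 'breaking', 'feature', or 'fix' change."""
    match = SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def requires_recertification(old: str, new: str) -> bool:
    """A MAJOR increment signals a breaking change and triggers re-certification."""
    return int(new.split(".")[0]) > int(old.split(".")[0])
```

For instance, `bump("1.4.2", "breaking")` yields `2.0.0`, and `requires_recertification("1.4.2", "2.0.0")` is true.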

Marketplace Tiers:

Official: Treasury/IRS verified, free (government funded)
Certified: Community tested, 70/30 revenue split
Community: User contributed, as-is
Private: Agency-specific, enterprise licensing

4.4 RAG Knowledge Base Governance

The CORTX RAG Knowledge Base enhances AI capabilities by grounding responses in a curated, governed repository of documents. The governance practices in this section are intended to keep that repository accurate, compliant, and relevant.

4.4.1 Knowledge Base Content

The knowledge base is structured across four hierarchical levels (Platform, Suite, Module, Entity) and contains three main categories of information:

Compliance & Regulatory Documents:
- Federal Financial: OMB Circulars, GTAS Validation Rules, Treasury Financial Manual.
- Healthcare: HIPAA Security, Privacy, and Breach Notification Rules.
- Cybersecurity: NIST 800-53, FedRAMP Guides, NIST Cybersecurity Framework.
Platform Documentation:
- APIs: OpenAPI specifications for all CORTX microservices.
- Schemas: The RulePack and WorkflowPack schemas (JSON Schema draft-07).
- Examples: Curated examples of valid RulePack and WorkflowPack definitions.
Domain Knowledge:
- FedSuite: Trial balance reconciliation, GTAS submission processes.
- MedSuite: Claims verification, HIPAA audit procedures.
- CorpSuite: Title verification, property search procedures.

4.4.2 Document Management Lifecycle

A strict lifecycle process ensures the integrity and currency of the knowledge base.

Addition: New documents undergo source verification, license checks, format conversion, chunking (512-token chunks), embedding, metadata tagging, and a quality review before being indexed.
Update: When a document is updated, the old version is marked as "deprecated" and the new version is added. The deprecated version is archived after a 30-day grace period to ensure a smooth transition.
Removal: Documents are removed if they are superseded, their license expires, or they are no longer relevant. The process involves archiving, soft deletion, impact monitoring, and finally hard deletion.
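
Two of the lifecycle steps above lend themselves to a short sketch: splitting a document into 512-token chunks, and deciding when a deprecated version has passed its 30-day grace period. The sketch assumes tokens are already available as a list (a whitespace split stands in for the real tokenizer, which is not specified here) and that chunks do not overlap. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List

CHUNK_SIZE = 512                   # tokens per chunk, from the Addition step
GRACE_PERIOD = timedelta(days=30)  # from the Update step


def chunk_tokens(tokens: List[str], size: int = CHUNK_SIZE) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[start:start + size] for start in range(0, len(tokens), size)]


def due_for_archive(deprecated_at: datetime, now: datetime) -> bool:
    """True once the grace period since deprecation has fully elapsed."""
    return now - deprecated_at >= GRACE_PERIOD


# Whitespace split as a stand-in tokenizer (assumption).
tokens = ("word " * 1100).split()
chunk_lengths = [len(chunk) for chunk in chunk_tokens(tokens)]

deprecated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
archive_now = due_for_archive(deprecated_at, datetime(2024, 1, 31, tzinfo=timezone.utc))
```

A 1100-token document produces two full chunks and one partial chunk, and a version deprecated on 1 January becomes eligible for archiving on 31 January.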

4.4.3 Retrieval and Quality Assurance

Four retrieval strategies and three quality checks are used to keep retrieved information relevant and accurate.

Retrieval Strategies:
Standard: Top-k similarity search.
Keyword-Boosted: Increases score for documents containing specific keywords.
Category-Filtered: Narrows search to a specific domain (e.g., hipaa_compliance).
Multi-Query: Generates query variations to retrieve a more diverse set of results for complex questions.
Quality Assurance:
Accuracy Testing: A curated set of 100+ query-answer pairs is run to verify that retrieval accuracy remains above 90%.
Embedding Drift: Cosine similarity distributions are monitored over time to detect and alert on potential embedding drift.
Coverage Analysis: The percentage of user queries that return at least one relevant result is tracked, with a target of >95%.
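
The standard, keyword-boosted, and category-filtered strategies, together with the accuracy metric, can be sketched in a few lines of pure Python. This is a toy in-memory illustration under several assumptions: documents are dictionaries with `id`, `category`, `text`, and `embedding` keys, the keyword boost is a fixed additive bonus, and the `federal_financial` category name is made up for the example. None of this reflects the actual CORTX retrieval code.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float], docs: List[dict], k: int = 3,
             category: Optional[str] = None, keywords: Sequence[str] = (),
             boost: float = 0.1) -> List[Tuple[str, float]]:
    """Top-k search with an optional category filter and keyword boost."""
    scored = []
    for doc in docs:
        if category is not None and doc["category"] != category:
            continue  # category-filtered: skip documents outside the domain
        score = cosine(query_vec, doc["embedding"])
        text = doc["text"].lower()
        # keyword-boosted: add a fixed bonus per keyword found in the text
        score += boost * sum(1 for kw in keywords if kw.lower() in text)
        scored.append((doc["id"], score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def retrieval_accuracy(results: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of test queries whose expected document id was retrieved."""
    hits = sum(1 for query, doc_id in expected.items() if doc_id in results.get(query, []))
    return hits / len(expected)


# Toy corpus with 2-dimensional embeddings (illustrative values only).
docs = [
    {"id": "hipaa-1", "category": "hipaa_compliance",
     "text": "Breach notification timelines", "embedding": [1.0, 0.0]},
    {"id": "gtas-1", "category": "federal_financial",
     "text": "GTAS validation rules", "embedding": [0.8, 0.6]},
    {"id": "hipaa-2", "category": "hipaa_compliance",
     "text": "Security rule safeguards", "embedding": [0.6, 0.8]},
]
query = [0.6, 0.8]

standard = [doc_id for doc_id, _ in retrieve(query, docs, k=2)]
filtered = [doc_id for doc_id, _ in retrieve(query, docs, k=2, category="hipaa_compliance")]
boosted = [doc_id for doc_id, _ in retrieve(query, docs, k=1, keywords=["breach"], boost=0.5)]
```

In this toy corpus the standard search ranks `hipaa-2` first, the category filter drops `gtas-1`, and a large enough keyword boost lifts `hipaa-1` to the top. A weekly accuracy run would compare `retrieval_accuracy(...)` against the 0.90 threshold.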

4.4.4 Security and Performance

Security:
PII Redaction: All documents are scanned for PII (SSNs, email addresses, etc.), and any matches are redacted before the embedding process.
Access Control: Retrieval is filtered based on user permissions, enforcing tenant isolation and access rights for internal vs. public documents.
Performance:
Caching: A Redis cache layer (1-hour TTL) stores results for frequent queries to reduce latency.
Index Optimization: The PostgreSQL vector store uses an HNSW (Hierarchical Navigable Small World) index, optimized for fast and accurate similarity searches.
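
As a rough illustration of the redaction and caching points above, the sketch below redacts SSN-like and email-like strings with regular expressions and implements a small in-process cache with a one-hour TTL. Both parts are simplified stand-ins: real PII detection needs broader patterns than these two, and the `TTLCache` class only mimics the expiry behavior of the Redis layer rather than using Redis. All names are hypothetical.

```python
import re
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Simplified patterns (assumption): US-style SSNs and common email shapes only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_pii(text: str) -> str:
    """Replace SSN-like and email-like substrings with placeholder tags."""
    text = SSN.sub("[REDACTED_SSN]", text)
    return EMAIL.sub("[REDACTED_EMAIL]", text)


class TTLCache:
    """In-process stand-in for a cache whose entries expire after a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop the entry and report a miss
            return None
        return value


redacted = redact_pii("Contact jane.doe@example.gov, SSN 123-45-6789.")
```

Redaction runs before embedding so that the placeholder tags, not the original values, end up in the vector store. The injectable `clock` parameter exists only to make expiry testable without waiting an hour.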

4.4.5 Maintenance

A regular maintenance schedule is in place:

Daily: Metric and log review.
Weekly: Accuracy and coverage testing.
Monthly: Embedding drift analysis and index review.
Quarterly: Full knowledge base audit and test case updates.
