🔬

CopernicusAI

Knowledge Engine for Scientific Discovery

A collaborative research platform that transforms cutting-edge scientific research into accessible, multi-format tools for collective knowledge exploration. These are research instruments—like microscopes for observing the collective knowledge of humanity—enabling hypothesis formation, testing, and discovery across scientific disciplines.

📋 Summary

CopernicusAI is an operational research platform that synthesizes scientific literature, drawn from databases covering 250+ million papers, into AI-generated podcasts. It integrates with a knowledge graph of 23,246 indexed papers and provides collaborative tools for research discovery. The system demonstrates production-ready multi-source research synthesis with full citation tracking and evidence-based content generation, requiring a minimum of 3 research sources per episode.

The platform includes a fully operational Research Tools Dashboard (deployed December 2025) with interactive knowledge graph visualization, vector search, and RAG capabilities, enabling researchers to explore, query, and synthesize scientific knowledge across disciplines.

🏗️ Knowledge Engine Architecture

The CopernicusAI Knowledge Engine transforms information into knowledge through integrated capabilities. In general, a knowledge engine is any system, biological or artificial, that performs this transformation systematically: it does work by converting raw material (information) into useful outputs (knowledge, understanding, insights).

The system architecture demonstrates the integration of data ingestion, processing, storage, and query capabilities across multiple modalities—research papers, process descriptions, and media content—enabling comprehensive knowledge discovery and synthesis.

Figure: Knowledge Engine Architecture - Data flow from ingestion through processing and storage to query interfaces

📥 Data Ingestion

Multi-source acquisition from academic databases (PubMed, arXiv, NASA ADS), literature sources (textbooks, reviews), and educational content (videos, transcripts), with quality assessment and type classification.

⚙️ Processing & Storage

LLM-powered entity extraction and process logic extraction, structured data storage (JSON metadata, Mermaid flowcharts, transcripts), and specialized databases for papers, processes, and media.

🔍 Query & Output

Multiple access interfaces including RAG queries, vector search, knowledge graph visualization, API endpoints, and web interfaces, converging to unified knowledge output.
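
The vector search interface mentioned above can be pictured as nearest-neighbour ranking over embedding vectors. The following is a minimal sketch under stated assumptions, not the platform's implementation: the three-dimensional vectors, the paper IDs, and the function names `cosine_similarity` and `top_k` are all hypothetical, and a real deployment would use model-generated embeddings and a dedicated vector index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k document IDs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy index: hypothetical paper IDs mapped to made-up embeddings.
index = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

In a RAG query, the text of the top-ranked papers would then be passed to an LLM as context; that step is omitted here.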

Prior Work & Current Status

Prior Work (2024-2025)

CopernicusAI is an active research prototype exploring AI-generated audio briefings as an interface for assisted scientific research.

The system allows any user to generate, refine, and share AI-generated science podcasts based on structured prompts, enabling rapid orientation to a topic, iterative deepening, and personalized research briefings.

Rather than functioning as a static content platform, CopernicusAI supports collectively generated and shared research artifacts, analogous to community-driven knowledge platforms (e.g., discussion forums), but grounded in scientific sources and metadata-aware workflows.

This work demonstrates technical feasibility for:

  • AI-assisted research briefing and orientation
  • Iterative question refinement via conversational interfaces
  • Integration of text, audio, and metadata in research workflows

Current Implementation (December 2025)

The Research Tools Dashboard is fully operational and deployed to Google Cloud Run, providing unified access to all components with interactive knowledge graph visualization, vector search, RAG queries, and content browsing.

See the "Knowledge Engine Components" section below for details.

🎯 Mission & Vision

Inspired by Nicolaus Copernicus, who challenged accepted knowledge with evidence and rigorous analysis, CopernicusAI creates collaborative research tools that enable collective participation in scientific discovery. These platforms are instruments for exploring humanity's collective knowledge: tools for hypothesis formation, testing, and collaborative research, not just educational content.

Just as a microscope enables observation of the microscopic world, CopernicusAI tools enable observation and exploration of humanity's collective knowledge. Subscribers collaborate to prompt, generate, and refine research content—sharing discoveries publicly or keeping them private. As large language models (LLMs) and AI systems gain unprecedented knowledge, CopernicusAI provides the infrastructure for human-AI collaborative knowledge exploration, with evidence-based truth-seeking as our guiding principle.

🧩 CopernicusAI Knowledge Engine

An integrated ecosystem of research and collaboration tools designed to assist scientists in their workflow, from research discovery through knowledge synthesis to multi-format content generation.

🎙️

CopernicusAI Podcast Generation

Synthesis & distribution platform for AI-powered research briefing podcast generation

🛠️

Programming Framework

Foundational meta-tool for universal process analysis across disciplines

🧬

Genome Logic Modeling Project

Mermaid-format flowcharts modeling 100+ biochemical processes in yeast and E. coli

📚

Research Paper Database

Core data infrastructure for research paper metadata and citation networks

🎬

Science Video Database

Multi-modal content with transcript-based search for scientific videos

🗺️

Research Tools Dashboard

✅ Prototype web interface for testing knowledge graph, vector search, RAG queries, and content browsing

Live System →

  • 23,246 Research Papers - Indexed in Knowledge Engine (as of January 2025)
  • 314 Processes - Visualized across 6 databases (as of January 2025)
  • 753 Videos - Science videos indexed (as of January 2025)
  • 79 Podcasts - Generated across 5 disciplines (as of January 2025)

🌟 Core Platform Capabilities

🎙️

AI-Powered Podcast Generation

Collaborative research platform where subscribers prompt and generate multi-voice AI podcasts (5-10 minutes) that synthesize research from multiple academic sources. Subscribers can share their podcasts publicly or keep them private. Content generation is evidence-based and requires a minimum of 3 research sources per episode.

Key Features:

  • ✓ Comprehensive research integration (8+ databases)
  • ✓ Professional multi-speaker dialogue
  • ✓ AI-generated scientific visualizations
  • ✓ RSS feed distribution
  • ✓ Quality scoring & relevance ranking
  • ✓ Paradigm shift identification

Research Integration:

  • ✓ Real-time discovery from 8+ APIs
  • ✓ Parallel search across databases
  • ✓ Automatic citation extraction
  • ✓ Source validation & verification
  • ✓ Interdisciplinary connection analysis
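
The "parallel search across databases" item above can be sketched with `asyncio`. This is an illustration under stated assumptions rather than the platform's code: `search_source` is a hypothetical stand-in that sleeps instead of calling a real API, and the source names and delays are placeholders.

```python
import asyncio

async def search_source(name, query, delay):
    """Stand-in for one database API call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return [{"source": name, "title": f"{query} result from {name}"}]

async def parallel_search(query):
    """Query every source concurrently and merge the results."""
    sources = {"pubmed": 0.02, "arxiv": 0.01, "nasa_ads": 0.03}
    tasks = [search_source(name, query, delay) for name, delay in sources.items()]
    # return_exceptions=True hands back a failure as a value, so one failing
    # source does not discard the results from the others.
    batches = await asyncio.gather(*tasks, return_exceptions=True)
    merged = []
    for batch in batches:
        if isinstance(batch, Exception):
            continue  # skip sources that failed
        merged.extend(batch)
    return merged

papers = asyncio.run(parallel_search("DNA replication"))
```

Because `asyncio.gather` returns results in the order the tasks were passed in, the merged list is deterministic even though the slowest source finishes last.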

🤖

Advanced LLM Integration

Multi-model architecture with intelligent model selection:

Primary Models:

  • Google Gemini 3 - Latest research analysis and content generation
  • OpenAI GPT-4/GPT-3.5 - Content synthesis and quality validation
  • Anthropic Claude 3 (Sonnet, Haiku) - Alternative reasoning paths
  • ElevenLabs TTS - Multi-voice text-to-speech synthesis

Capabilities:

  • Multi-paper analysis & synthesis
  • Paradigm shift detection
  • Entity extraction (genes, proteins, compounds)
  • Citation tracking & cross-references
  • Content quality scoring

📊

Research Resource Access

Comprehensive academic database coverage with 250+ million research papers accessible through integrated APIs.

Academic Databases:

  • PubMed/NCBI (30+ million papers)
  • arXiv (2+ million preprints)
  • NASA ADS (15+ million papers)
  • Zenodo (100K+ datasets)
  • bioRxiv/medRxiv (preprints)
  • CORE (200+ million papers)
  • Google Scholar (comprehensive)
  • News API (current events)
  • YouTube Data API (academic videos)

🎙️

Audio and Video Podcast Production

Operating Audio Podcast System: Full production and distribution platform for subscriber-generated podcasts. Users can prompt, generate, publish, and distribute audio podcasts with RSS feed support for Spotify, Apple Podcasts, and Google Podcasts.

Current Audio Capabilities (Operational):

  • ✓ Multi-voice AI podcast generation
  • ✓ Research-driven content creation
  • ✓ RSS feed distribution
  • ✓ Public and private podcast options
  • ✓ Professional audio quality
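
RSS distribution, listed above, amounts to publishing an XML feed in which each episode carries an `enclosure` pointing at its audio file. The sketch below builds such a feed with the standard library. It is a minimal illustration: the function name `build_feed`, the dictionary keys, and the example.com URLs are hypothetical, and podcast directories generally expect further metadata that this sketch omits.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, description, episodes):
    """Serialize a channel and its episodes as an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        # The enclosure element is how RSS 2.0 attaches the audio file.
        ET.SubElement(item, "enclosure", url=ep["audio_url"],
                      length=str(ep["size_bytes"]), type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")

feed_xml = build_feed(
    "Example Research Briefings",            # placeholder channel title
    "https://example.com/podcast",           # placeholder URL
    "Placeholder channel for illustration",
    [{"title": "Episode 1", "audio_url": "https://example.com/ep1.mp3", "size_bytes": 1234567}],
)
```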

Video Production (Future - Phase 2+):

Advanced video features planned for future development:

  • Visual Content Integration: Automated extraction from papers, web scraping, JSON database integration
  • Dynamic Visualizations: Scientific animations, real-time charts, LaTeX rendering
  • External Video Quoting: YouTube segment extraction with attribution & fair use compliance
  • Advanced Composition: Multi-layer video, auto subtitles, text overlays, professional transitions

See: Science Video Database - Companion project for research video content management.

📚

Research Papers Metadata Database (Phase 2)

A centralized metadata repository (not a file archive) providing structured JSON objects with AI-powered preprocessing.

Structured JSON Objects:

  • DOI, arXiv ID, publication info
  • Abstracts & key findings
  • Extracted entities (genes, proteins, compounds, equations)
  • Citation networks & cross-references
  • Paradigm shift indicators
  • Quality scores & relevance metrics
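
To make the list above concrete, here is one way such a metadata record might be shaped and serialized to JSON. This is a hypothetical sketch: the class name `PaperRecord` and every field name are assumptions, not the platform's schema, and the entity and score values are illustrative only. The title and DOI reuse the placeholder values from the example API response later in this document.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class PaperRecord:
    """Hypothetical shape of one metadata record; field names are assumptions."""
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    abstract: str = ""
    entities: dict = field(default_factory=dict)    # e.g. {"genes": [...], "proteins": [...]}
    cited_dois: list = field(default_factory=list)  # outgoing citation links
    paradigm_shift: bool = False
    quality_score: float = 0.0

record = PaperRecord(
    title="Mechanisms of DNA Replication...",
    doi="10.1038/s41586-023-01234",
    entities={"proteins": ["DNA polymerase"]},  # illustrative value
    quality_score=0.8,                          # illustrative value
)
payload = json.dumps(asdict(record), indent=2)
```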

AI-Powered Preprocessing:

  • LLM-based entity extraction
  • Automatic categorization
  • Keyword extraction & semantic tagging
  • Citation tracking & mapping
  • Quality assessment
  • RESTful API access

🔬 Methodology & System Design

Multi-Source Validation Process

The system requires a minimum of 3 research sources per podcast episode. Each source is:

  • Retrieved from authoritative academic databases (PubMed, arXiv, NASA ADS, etc.)
  • Validated for authenticity and publication status
  • Scored for quality and relevance to the research topic
  • Cross-referenced to verify consistency and eliminate conflicting information
  • Processed through parallel API queries for comprehensive coverage
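
The minimum-source rule and the scoring step above can be sketched as a simple filter. This is an assumption-laden illustration, not the platform's logic: the function name `select_sources`, the `validated` and `relevance` keys, and the 0.5 relevance threshold are all hypothetical. Only the minimum of 3 sources comes from the text.

```python
MIN_SOURCES = 3  # the minimum stated in the text above

def select_sources(candidates, min_relevance=0.5, min_sources=MIN_SOURCES):
    """Keep validated, relevant sources, best first; fail if too few remain."""
    usable = [c for c in candidates if c["validated"] and c["relevance"] >= min_relevance]
    usable.sort(key=lambda c: c["relevance"], reverse=True)
    if len(usable) < min_sources:
        raise ValueError(f"need at least {min_sources} sources, found {len(usable)}")
    return usable

# Made-up candidates: s2 is below the threshold and s3 failed validation.
candidates = [
    {"id": "s1", "validated": True, "relevance": 0.9},
    {"id": "s2", "validated": True, "relevance": 0.4},
    {"id": "s3", "validated": False, "relevance": 0.95},
    {"id": "s4", "validated": True, "relevance": 0.7},
    {"id": "s5", "validated": True, "relevance": 0.6},
]
selected = select_sources(candidates)
```

Raising an error rather than returning a short list makes the rule hard to bypass: an episode cannot be generated from fewer than three usable sources.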

Quality Assurance Mechanisms

  • Source Verification: Automated checking of DOI, arXiv IDs, and publication metadata
  • Relevance Scoring: LLM-based assessment of paper relevance to query
  • Paradigm Shift Detection: Identification of revolutionary vs. incremental research
  • Citation Extraction: Automatic extraction and formatting of citations
  • Content Validation: Multi-model verification (Gemini, GPT-4, Claude) for accuracy

Citation Extraction & Verification

The system automatically extracts and formats citations from research papers:

  • DOI resolution and metadata enrichment
  • arXiv ID parsing and preprint identification
  • Author, title, and publication information extraction
  • Cross-reference linking between related papers
  • Citation network analysis for relationship mapping
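
Identifier parsing, the first two items above, can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: the patterns and the function name `extract_identifiers` are hypothetical, only new-style arXiv IDs (such as `2301.01234`) are handled, and the arXiv ID in the sample text is made up. Resolving a DOI to metadata would need a separate lookup that is not shown.

```python
import re

# Simplified patterns (assumptions): DOIs follow the "10.<registrant>/<suffix>"
# shape, and only new-style arXiv IDs prefixed with "arXiv:" are recognized.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"\barXiv:(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)

def extract_identifiers(text):
    """Return DOIs and arXiv IDs found in free text, without trailing punctuation."""
    dois = [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]
    arxiv_ids = [m.group(1) for m in ARXIV_RE.finditer(text)]  # version suffix dropped
    return {"dois": dois, "arxiv_ids": arxiv_ids}

sample = "See doi:10.1038/s41586-023-01234, and the preprint arXiv:2301.01234v2."
found = extract_identifiers(sample)
```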

Paradigm Shift Detection Implementation

The system uses LLM analysis to identify paradigm-shifting research by:

  • Analyzing citation patterns and impact metrics
  • Detecting novel methodologies or breakthrough discoveries
  • Comparing against established knowledge frameworks
  • Identifying interdisciplinary connections and cross-domain insights
  • Flagging research that challenges existing paradigms

⚙️ Technology Stack

AI & Machine Learning

  • Google Gemini 3
  • Google Vertex AI (model orchestration)
  • OpenAI GPT-4/GPT-3.5
  • Anthropic Claude 3
  • ElevenLabs TTS
  • DALL-E 3
  • Cloud Vision API
  • Video Intelligence API

Backend Infrastructure

  • FastAPI (Python)
  • Google Cloud Run
  • Firestore (NoSQL)
  • Cloud Storage
  • Cloud Functions
  • Cloud Tasks
  • Secret Manager

Frontend

  • Next.js 15.5.7
  • Alpine.js
  • Tailwind CSS
  • Vercel

🔍 Limitations & Future Directions

Current Limitations

  • Discipline Coverage: Currently indexing 23,246 papers across multiple disciplines; expansion to additional disciplines in progress
  • Source Bias: Coverage depends on database API availability and open access policies
  • LLM Accuracy: Content generation relies on LLM accuracy; multi-source validation mitigates but doesn't eliminate errors
  • Real-Time Updates: Knowledge graph updates require manual or scheduled processing cycles
  • Language: Currently optimized for English-language research papers

Future Development

  • Multi-Discipline Expansion: Expanding knowledge graph to Biology, Chemistry, Physics, Computer Science
  • Process Databases: Creating comprehensive flowchart databases for all 5 disciplines (~50 processes each)
  • Advanced Video Features: Dynamic visualizations, animations, and multi-layer composition
  • Multi-Language Support: Extending to non-English research papers
  • Enhanced Validation: Peer review mechanisms and user feedback integration
  • Real-Time Updates: Automated continuous knowledge graph updates

🔬 Collaborative Research Tools

These platforms enable collective participation and collaboration across diverse user communities:

  • Researchers - Tools for hypothesis formation and testing, cross-disciplinary synthesis
  • Collaborators - Collective knowledge exploration and refinement
  • Subscribers - Prompt, generate, and share podcasts (public or private)
  • Community - User suggestions, comments, and collaborative flowchart improvement (GLMP)

Just as a microscope enables observation of the microscopic world, these tools enable observation and exploration of humanity's collective knowledge.

Key Innovations

  • Multi-source validation (min 3 sources)
  • Evidence-based generation
  • Paradigm shift detection
  • Interdisciplinary connections
  • Multiple expertise levels
  • Full citation tracking

📚 Prior Work & Research Contributions

Overview

This platform represents prior work that demonstrates foundational research and development achievements in AI-powered scientific knowledge synthesis, collaborative research tools, and multi-modal content generation. These contributions establish the technical foundation and proof-of-concept for the broader CopernicusAI Knowledge Engine initiative.

🔬 Research Contributions

  • AI-Powered Research Synthesis: Production system for multi-source research synthesis using LLMs
  • Multi-Model Architecture: Intelligent model selection with Gemini 3, GPT-4, Claude 3
  • Collaborative Platform: Subscriber-driven content generation with public/private sharing
  • Knowledge Engine Integration: Architecture for Research Papers DB, Video DB, GLMP, Framework

⚙️ Technical Achievements

  • 250+ Million Papers: Accessible via 8+ integrated academic databases
  • 79 Episodes: Generated across 5 scientific disciplines
  • Production Deployment: Live platform with operational API and RSS distribution
  • Scalable Architecture: Serverless microservices on Google Cloud

🎯 Position Within CopernicusAI Knowledge Engine

This platform serves as the core synthesis and distribution component of the CopernicusAI Knowledge Engine. The Knowledge Engine is an integrated ecosystem of research and collaboration tools that work together to assist scientists in their workflow, from research discovery through knowledge synthesis to multi-format content generation.

Current Components:

  1. CopernicusAI (This platform) - Core synthesis & distribution
  2. Programming Framework - Foundational meta-tool
  3. GLMP - Biological process visualization
  4. Research Paper Metadata Database - Data infrastructure
  5. Science Video Database - Multi-modal content

Future Development:

The Knowledge Engine is designed to grow and evolve. Additional tools, databases, and collaboration components will be added as the project develops, expanding capabilities for AI-assisted scientific research and knowledge discovery.

📖 Citation Information

For Grant Proposals (NSF/DOE):

Welz, G. (2025). CopernicusAI: Knowledge Engine for Scientific Discovery.

Hugging Face Space. https://huggingface.co/spaces/garywelz/copernicusai

Live Platform: https://www.copernicusai.fyi

BibTeX Format:

@misc{welz2025copernicusai,
  title={CopernicusAI: Knowledge Engine for Scientific Discovery},
  author={Welz, Gary},
  year={2025},
  url={https://huggingface.co/spaces/garywelz/copernicusai},
  note={Hugging Face Space, Live Platform: https://www.copernicusai.fyi}
}

📊 Data Availability Statement

Platform Access

Data & Code Availability

  • Hugging Face Spaces: All components accessible at https://huggingface.co/garywelz
  • Process Flowcharts (GLMP): JSON files stored in Google Cloud Storage, accessible via GLMP Database Table
  • Research Paper Metadata: 23,246 indexed papers with metadata accessible through Research Tools Dashboard
  • API Documentation: RESTful API endpoints available for programmatic access (see API Documentation section)

Reproducibility Information

  • Technology Stack: All technologies and versions documented in Technology Stack section
  • LLM Models: Google Gemini 3, OpenAI GPT-4/GPT-3.5, Anthropic Claude 3 (versions specified in documentation)
  • Source Citations: All podcast episodes include full citations to source papers
  • Metadata: Complete metadata for all generated content available through API
  • License: MIT License - see license information in space metadata

How to Cite This Work

Welz, G. (2024–2025). CopernicusAI: AI-Generated Audio Briefings as a Research Interface.
Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai

BibTeX Format:

@misc{welz2025copernicusai,
  title={CopernicusAI: AI-Generated Audio Briefings as a Research Interface},
  author={Welz, Gary},
  year={2024--2025},
  url={https://huggingface.co/spaces/garywelz/copernicusai},
  note={Hugging Face Space}
}

🌐 Grant Support & Collaboration

Grant Applications Supported

This platform is designed to support grant applications to:

NSF

National Science Foundation - Science education and research infrastructure

DOE

Department of Energy - Scientific computing and data science

SAIR Foundation

AI research and development initiatives

Collaboration Opportunities

  • Integration with academic institutions
  • Partnership with research organizations
  • Open data initiatives
  • Educational program development

🔗 Live Platform & Resources

🧩 Knowledge Engine Components

The CopernicusAI Knowledge Engine is an integrated ecosystem of research and collaboration tools. The Research Tools Dashboard is now fully operational (December 2025) with a working web interface providing unified access to all components.

✅ Research Tools Dashboard (Implemented)

Fully operational web interface with knowledge graph visualization (23,246 papers), vector search, RAG queries, and content browsing.

Public Project Interface →
Research Tools Dashboard →

🔌 API Documentation

Base URL: https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app

Podcast Generation

  • POST /generate-podcast-with-subscriber
  • GET /api/subscribers/podcasts/{id}
  • POST /api/subscribers/podcasts/submit-to-rss

Research Endpoints

  • POST /api/papers/upload
  • GET /api/papers/{paper_id}
  • POST /api/papers/query
  • POST /api/papers/{id}/link-podcast/{id}

Admin Endpoints

  • GET /api/admin/subscribers
  • POST /api/admin/podcasts/fix-missing-titles
  • GET /api/admin/podcasts/catalog

📝 Example Request

POST /api/papers/query

{
  "discipline": "biology",
  "keywords": ["DNA replication", "cell cycle"],
  "date_range": {
    "start": "2020-01-01",
    "end": "2025-01-01"
  },
  "limit": 10
}

📤 Example Response

{
  "status": "success",
  "count": 10,
  "papers": [
    {
      "id": "pmid_12345678",
      "title": "Mechanisms of DNA Replication...",
      "authors": ["Smith, J.", "Doe, A."],
      "journal": "Nature",
      "year": 2023,
      "doi": "10.1038/s41586-023-01234",
      "abstract": "..."
    }
  ]
}

🔐 Authentication

API uses Bearer token authentication. Include in request headers:

Authorization: Bearer YOUR_API_TOKEN
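
Putting the example request and the authentication header together, a client call might be assembled as follows. This is a hedged sketch using only the Python standard library: the helper name `build_query_request` is hypothetical, the `Content-Type` header is an assumption (the source specifies only the `Authorization` header), and the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app"

def build_query_request(token, discipline, keywords, limit=10):
    """Build (but do not send) a POST /api/papers/query request."""
    body = {"discipline": discipline, "keywords": keywords, "limit": limit}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/papers/query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed; not stated in the source
        },
        method="POST",
    )

request = build_query_request("YOUR_API_TOKEN", "biology", ["DNA replication", "cell cycle"])
# To send it, pass `request` to urllib.request.urlopen and parse the JSON response body.
```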

⚡ Rate Limits

Standard rate limits apply: 100 requests/minute per API key. Contact the project maintainer to request higher limits.
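
A client can stay under the stated limit by throttling itself before sending requests. The sketch below is one possible approach, a sliding-window counter. The class name `SlidingWindowLimiter` is hypothetical, and how the server actually measures its window is not documented here, so treat the window semantics as an assumption.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls=100, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.calls = deque()  # timestamps of recent calls, oldest first

    def try_acquire(self):
        """Record a call and return True if under the limit, else return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

limiter = SlidingWindowLimiter()
allowed = [limiter.try_acquire() for _ in range(101)]
```

A caller would check `try_acquire()` before each request and wait or back off when it returns False.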

📚 API Version

Current version: v1.0. API is stable and backward-compatible.