Releases: hashangit/Extract2MD

v2.0.0

23 May 23:46
6a71d62

Extract2MD v2.0.0 - Major Release

🚀 Full Redesign & Complete API Overhaul

Release Date: 24-05-2025
Version: 2.0.0 (Breaking Changes)
Migration Support: Legacy API maintained for transition period


📋 Release Overview

Extract2MD v2.0.0 is a ground-up redesign of the library, focused on developer experience, intuitive usage patterns, and modern architecture. This major release introduces a scenario-based API that replaces the instance-based approach and its complex configuration with clear, purpose-driven static methods.

Core Philosophy: Instead of configuring complex options, developers now choose from 5 distinct conversion scenarios that match their specific use cases.


⚠️ Breaking Changes

Complete API Redesign

  • Old: Instance-based API with complex configuration options
  • New: Static methods with scenario-based approach
  • Impact: All existing integrations require updates
  • Migration: Legacy API available as LegacyExtract2MDConverter during transition

Configuration Changes

  • Old: Loose configuration object with numerous optional parameters
  • New: Structured configuration with validation and default merging
  • Impact: Configuration structure has changed significantly
  • Migration: Use ConfigValidator to validate configs and merge in defaults

Import/Export Changes

  • Old: Single converter class export
  • New: Modular exports with main converter and utilities
  • Impact: Import statements need updating
  • Migration: Update imports and follow new module structure

New Features

🎯 Scenario-Based API

Five distinct conversion methods designed for specific use cases:

1. Quick Only - Extract2MDConverter.quickOnly()

  • Purpose: Fast PDF.js-based text extraction
  • Best For: Clean PDFs with selectable text
  • Performance: Fastest option, minimal processing
  • Use Case: Documentation, reports, digital-native PDFs

2. High Accuracy OCR Only - Extract2MDConverter.highAccuracyOCROnly()

  • Purpose: Tesseract OCR with canvas rendering
  • Best For: Scanned documents, images, complex layouts
  • Performance: Slower but highly accurate
  • Use Case: Scanned books, historical documents, printed materials

3. Quick + LLM - Extract2MDConverter.quickPlusLLM()

  • Purpose: Fast extraction enhanced with AI processing
  • Best For: PDFs needing structure improvement
  • Performance: Moderate, WebGPU accelerated
  • Use Case: Business documents, formatted reports

4. High Accuracy + LLM - Extract2MDConverter.highAccuracyPlusLLM()

  • Purpose: OCR processing with AI enhancement
  • Best For: Complex documents requiring both OCR and AI
  • Performance: Comprehensive, highest quality
  • Use Case: Academic papers, technical documents

5. Combined + LLM - Extract2MDConverter.combinedPlusLLM()

  • Purpose: All extraction methods with AI post-processing
  • Best For: Maximum accuracy and formatting
  • Performance: Most thorough, longest processing time
  • Use Case: Critical documents, archival processing
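
One way to read the five scenarios is as a decision over three questions: does the PDF have selectable text, is LLM post-processing wanted, and is maximum accuracy worth the longest processing time? The sketch below encodes that reading. `chooseScenario` and its option names are hypothetical and not part of Extract2MD; only the returned method names come from the list above.

```javascript
// Hypothetical helper (not part of the library): maps document traits to
// one of the five scenario method names listed in these release notes.
function chooseScenario({ hasSelectableText, useLLM = false, maxAccuracy = false }) {
  // Combined + LLM runs all extraction methods, so it is only chosen
  // when the caller explicitly asks for maximum accuracy.
  if (useLLM && maxAccuracy) return 'combinedPlusLLM';

  // Without selectable text, PDF.js extraction has nothing to read,
  // so fall back to the OCR-based scenarios.
  if (!hasSelectableText) {
    return useLLM ? 'highAccuracyPlusLLM' : 'highAccuracyOCROnly';
  }

  return useLLM ? 'quickPlusLLM' : 'quickOnly';
}

// Example: a scanned document that should be cleaned up by the LLM.
const scannedChoice = chooseScenario({ hasSelectableText: false, useLLM: true });
```

Because the notes describe the scenarios as static methods, the returned name could be used to look up the method on `Extract2MDConverter`. The arguments each method takes are not specified here; README.md and MIGRATION.md are the places to check.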

🧩 Modular Architecture

Complete internal refactoring into specialized modules:

  • Extract2MDConverter.js - Main converter with scenario methods
  • WebLLMEngine.js - Encapsulated LLM integration
  • ConfigValidator.js - Configuration validation and defaults
  • OutputParser.js - LLM output cleaning and formatting
  • SystemPrompts.js - Centralized prompt management

📚 Comprehensive Documentation Suite

New Documentation Files:

  • MIGRATION.md - Step-by-step migration guide with code examples
  • DEPLOYMENT.md - Complete deployment guide for all environments
  • config.example.json - Full configuration example
  • Updated README.md - Rewritten for new API

Interactive Examples:

  • demo.html - Live interactive demo showcasing all 5 scenarios
  • usage-examples.js - Updated code examples for new API
  • SSL certificates - Demo server setup for local testing

⚙️ Enhanced Configuration System

  • Structured Configuration Object with clear hierarchy
  • Built-in Validation with ConfigValidator utility
  • JSON Configuration Support for external config files
  • Default Value Merging for simplified setup
  • Type Safety with comprehensive TypeScript definitions
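
To illustrate what default value merging over a structured, hierarchical config can look like, here is a minimal sketch. The keys in `EXAMPLE_DEFAULTS` and the `mergeDefaults` helper are assumptions made for illustration; the real keys are documented in config.example.json, and the library's own merging is done by ConfigValidator.

```javascript
// Hypothetical defaults; the actual keys are listed in config.example.json.
const EXAMPLE_DEFAULTS = {
  ocr: { language: 'eng' },
  llm: { temperature: 0.7, maxTokens: 4096 },
};

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively overlay user values on defaults without mutating either input.
function mergeDefaults(defaults, user = {}) {
  const merged = { ...defaults };
  for (const [key, value] of Object.entries(user)) {
    merged[key] = isPlainObject(value) && isPlainObject(defaults[key])
      ? mergeDefaults(defaults[key], value)
      : value;
  }
  return merged;
}

// The text of a JSON config file can be parsed and merged the same way.
const mergedExample = mergeDefaults(
  EXAMPLE_DEFAULTS,
  JSON.parse('{"llm": {"temperature": 0.2}}')
);
```

Only the overridden leaf changes; sibling keys such as `llm.maxTokens` keep their default values.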

🧪 Robust Testing Framework

New comprehensive test suite:

  • scenarios.test.js - Tests for all 5 scenario methods
  • simple.test.js - Basic structure validation
  • newline-optimization.test.js - Markdown formatting tests
  • simple-newline.test.js - Standalone newline processing tests
  • validate-deployment.js - Deployment readiness validation

🔧 Technical Improvements

Build System Enhancements

  • Dual Bundle Generation: UMD and ESM formats
  • Optimized Distribution: Essential workers and definitions copied to dist
  • Updated Entry Points: Proper main, module, and types configuration
  • Enhanced Packaging: Improved file inclusion/exclusion

TypeScript Integration

  • Complete Type Definitions in src/types/index.d.ts
  • Scenario Method Types with proper return types and parameters
  • Configuration Interfaces for type-safe config handling
  • Legacy Compatibility Types for migration support

Performance Optimizations

  • WebGPU Capability Detection for LLM scenarios
  • Modular Loading reduces initial bundle size
  • Optimized Canvas Rendering for OCR processing
  • Streaming LLM Support for better user experience
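
WebGPU capability detection can be sketched with the standard `navigator.gpu.requestAdapter()` browser API. This is a minimal illustration, not the library's actual check; the function name `hasWebGPU` and the injectable `nav` parameter are assumptions.

```javascript
// Resolves to true only if the WebGPU API is exposed and an adapter is available.
// `nav` defaults to the global navigator when one exists; passing it in
// makes the check easy to exercise outside a browser.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // API not exposed at all
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false; // treat any failure as "not available"
  }
}
```

A caller might run this check before picking one of the LLM scenarios and fall back to a non-LLM scenario when it resolves to `false`.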

Developer Experience

  • Clear Error Messages with improved error handling
  • Progress Tracking across all conversion scenarios
  • Intuitive Method Names that clearly indicate functionality
  • Consistent Return Formats across all scenarios

🛤️ Migration Guide

Immediate Steps

  1. Install v2.0.0: npm install extract2md@2.0.0
  2. Use Legacy API: Replace Extract2MDConverter with LegacyExtract2MDConverter
  3. Test Functionality: Ensure existing code works with legacy API
  4. Plan Migration: Review MIGRATION.md for upgrade path

Recommended Migration Process

  1. Identify Usage Patterns: Determine which scenarios match your current usage
  2. Update Configuration: Migrate to new structured config format
  3. Replace Method Calls: Switch to appropriate scenario-based methods
  4. Update Error Handling: Adapt to new error formats
  5. Test Thoroughly: Validate output quality and performance

Timeline

  • v2.0.0 - v2.x.x: Legacy API available alongside new API
  • v3.0.0: Legacy API will be removed (future major release)
  • Recommended: Migrate within 1 month for best support

📦 Installation & Deployment

NPM Installation

npm install extract2md@2.0.0

Import Examples

// New API (recommended)
import { Extract2MDConverter } from 'extract2md';

// Legacy API (for migration)
import { LegacyExtract2MDConverter } from 'extract2md';

// Utilities
import { ConfigValidator, OutputParser } from 'extract2md';

Deployment Options

  • Node.js Applications: Full feature support
  • Web Applications: Browser-compatible with WebWorkers
  • CDN Distribution: Direct browser usage
  • Static Sites: Pre-built bundle integration

🌟 What's New in Detail

WebLLM Engine Integration

  • Standalone Engine Class for better modularity
  • Streaming Support for real-time processing feedback
  • Model Loading Management with error handling
  • WebGPU Optimization for enhanced performance

Output Processing Pipeline

  • Thinking Tag Removal from LLM outputs
  • Markdown Normalization for consistent formatting
  • Newline Optimization for better readability
  • Post-processing Hooks for custom transformations
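
The pipeline steps above can be sketched as a single cleanup function. This is an illustration under stated assumptions, not OutputParser's actual code: the `<think>` tag name, the function name `cleanLlmOutput`, and the `hooks` parameter are all hypothetical.

```javascript
// Sketch of an LLM-output cleanup pass. The <think> tag name is an assumption.
function cleanLlmOutput(raw, hooks = []) {
  const cleaned = raw
    .replace(/<think>[\s\S]*?<\/think>/gi, '') // remove thinking blocks
    .replace(/\r\n/g, '\n')                    // normalize line endings
    .replace(/\n{3,}/g, '\n\n')                // collapse runs of blank lines
    .trim();

  // Apply caller-supplied transformations in order.
  return hooks.reduce((text, hook) => hook(text), cleaned);
}

const cleanedExample = cleanLlmOutput(
  '<think>plan</think>\r\n# Title\r\n\r\n\r\n\r\nBody\n'
);
```

Note that this simple version also collapses blank lines inside fenced code blocks in the Markdown, which a production parser would likely want to leave untouched.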

Configuration Validation

  • Schema-based Validation with clear error messages
  • Default Value Injection for missing configuration
  • Type Coercion for flexible config input
  • JSON File Support for external configuration
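
As a rough picture of schema-based validation with default injection, type coercion, and field-level error messages, consider the sketch below. `EXAMPLE_SCHEMA`, `coerce`, and `validateConfig` are hypothetical names and fields chosen for illustration; they do not describe ConfigValidator's real schema or API.

```javascript
// Hypothetical schema: field name -> expected type and default value.
const EXAMPLE_SCHEMA = {
  language: { type: 'string', default: 'eng' },
  temperature: { type: 'number', default: 0.7 },
  streaming: { type: 'boolean', default: false },
};

// Coerce common string forms; return undefined when coercion is not possible.
function coerce(value, type) {
  if (typeof value === type) return value;
  if (type === 'number' && typeof value === 'string' &&
      value.trim() !== '' && !Number.isNaN(Number(value))) {
    return Number(value);
  }
  if (type === 'boolean' && (value === 'true' || value === 'false')) {
    return value === 'true';
  }
  return undefined;
}

function validateConfig(input = {}) {
  const result = {};
  const errors = [];
  for (const [field, rule] of Object.entries(EXAMPLE_SCHEMA)) {
    if (input[field] === undefined) {
      result[field] = rule.default; // default injection
      continue;
    }
    const value = coerce(input[field], rule.type);
    if (value === undefined) {
      errors.push(`${field}: expected ${rule.type}, got ${JSON.stringify(input[field])}`);
    } else {
      result[field] = value;
    }
  }
  for (const field of Object.keys(input)) {
    if (!(field in EXAMPLE_SCHEMA)) errors.push(`${field}: unknown field`);
  }
  return { config: result, errors };
}
```

Returning the errors as a list, rather than throwing on the first one, lets a caller report every invalid field in a single pass.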

Enhanced Error Handling

  • Scenario-specific Errors with context information
  • Validation Errors with field-level details
  • Processing Errors with progress context
  • Recovery Suggestions for common issues
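
An error that carries scenario, progress, and recovery information might look like the following sketch. The `ScenarioError` class, its fields, and `formatError` are hypothetical; the library's actual error classes and properties may differ, so treat this only as an illustration of the idea.

```javascript
// Hypothetical error shape; not the library's actual error class.
class ScenarioError extends Error {
  constructor(message, { scenario, stage, progress, suggestion } = {}) {
    super(message);
    this.name = 'ScenarioError';
    this.scenario = scenario;     // e.g. 'quickPlusLLM'
    this.stage = stage;           // which step failed
    this.progress = progress;     // fraction completed, 0..1
    this.suggestion = suggestion; // recovery hint for the caller
  }
}

// Render the context fields into one log-friendly line.
function formatError(err) {
  const parts = [`[${err.scenario}] ${err.message}`];
  if (err.stage) parts.push(`stage: ${err.stage}`);
  if (typeof err.progress === 'number') {
    parts.push(`progress: ${Math.round(err.progress * 100)}%`);
  }
  if (err.suggestion) parts.push(`try: ${err.suggestion}`);
  return parts.join(' | ');
}

const exampleError = new ScenarioError('No WebGPU adapter', {
  scenario: 'quickPlusLLM',
  stage: 'llm',
  progress: 0.5,
  suggestion: 'use quickOnly',
});
```

Keeping the context as structured fields, rather than baking it into the message string, lets UI code show the suggestion separately from the log line.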

🔮 Looking Forward

Planned Enhancements

  • Additional Scenarios based on user feedback
  • Performance Optimizations for large document processing
  • Enhanced LLM Model Support and configuration
  • Advanced Output Formats beyond Markdown

Community & Support

  • Migration Support: Comprehensive documentation and examples
  • Community Feedback: Open to suggestions for new scenarios
  • Regular Updates: Incremental improvements and bug fixes
  • Long-term Support: Commitment to stable API evolution

📞 Support & Resources

  • Migration Guide: MIGRATION.md - Complete migration instructions
  • Deployment Guide: DEPLOYMENT.md - Production deployment best practices
  • Interactive Demo: examples/demo.html - Try all scenarios
  • Configuration Example: config.example.json - Complete config reference
  • Type Definitions: Full TypeScript support included

🙏 Acknowledgments

This major release represents months of development focused on creating the most intuitive and powerful PDF-to-Markdown conversion experience. Thank you to all contributors and early adopters who provided feedbac...
