Add import-mbox and import-emlx commands for local email import (#103)
## Summary
- Add `import-mbox` command for importing MBOX files (including zipped
archives from HEY.com, Google Takeout, etc.)
- Add `import-emlx` command for importing Apple Mail `.emlx` mailbox
directory trees
- Both importers share a common ingestion path (`IngestRawMessage`) with
checkpoint/resume, attachment extraction, and FTS5 indexing
- Fix invalid UTF-8 in Message-IDs, email addresses, and domains during
ingestion so DuckDB Parquet cache builds succeed
- Extend `repair-encoding` to cover
`conversations.source_conversation_id`, `participants.email_address`,
and `participants.domain`
## What's in this PR
### `import-mbox`
- New CLI: `msgvault import-mbox <identifier> <export-file>`
- Accepts `.mbox`/`.mbx` files or a `.zip` containing `.mbox` files
- Records imported mail under a configurable `--source-type` (example:
`--source-type hey`)
- Optional `--label` to apply a label to newly imported messages
- Interrupt-safe imports with checkpoints and resume (default), with
`--no-resume` to start fresh
- Deterministic ZIP extraction to `dataDir/imports/mbox/<sha256(zip)>`
with a `.done` sentinel and stable ordering
- Streaming MBOX reader (`internal/mbox`) with mboxrd-style unescaping
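
To illustrate the mboxrd-style unescaping mentioned above: in the mboxrd convention, a body line matching `^>*From ` is stored with one extra leading `>`, so a reader strips exactly one `>` from lines matching `^>+From `. The helper below is a minimal sketch of that rule; `unescapeMboxrd` is a hypothetical name and is not taken from `internal/mbox`.

```go
package main

import "bytes"

// unescapeMboxrd reverses mboxrd "From " quoting for a single body line.
// Hypothetical helper for illustration; not the actual internal/mbox API.
//
// A line is unescaped only if it consists of one or more '>' characters
// immediately followed by "From " (note the trailing space). Exactly one
// '>' is removed, so ">>From x" becomes ">From x".
func unescapeMboxrd(line []byte) []byte {
	// Count the leading '>' characters.
	i := 0
	for i < len(line) && line[i] == '>' {
		i++
	}
	// Strip one quote level only when the rest starts with "From ".
	if i > 0 && bytes.HasPrefix(line[i:], []byte("From ")) {
		return line[1:]
	}
	return line
}
```

The function leaves an unquoted `From ` line untouched, since in a real reader that line would already have been consumed as a message separator before body lines are unescaped.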
### `import-emlx`
- New CLI: `msgvault import-emlx <identifier> <path>`
- Discovers `.mbox`/`.imapmbox` directories in Apple Mail directory
trees
- Parses `.emlx` format (byte-count prefix + MIME + XML plist metadata)
- Derives labels from directory paths, stripping `.mbox`/`.imapmbox`
suffixes
- Deduplicates identical messages across mailboxes (one message,
multiple labels)
- Checkpoint/resume with the same patterns as `import-mbox`
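
The `.emlx` layout described above (byte-count prefix, then MIME, then an XML plist) can be split with a few lines of Go. This is a rough sketch under the assumption that the first line holds a decimal byte count for the MIME portion and that everything after those bytes is the plist; `splitEmlx` is a hypothetical name, not the parser added in this PR.

```go
package main

import (
	"bytes"
	"errors"
	"strconv"
)

// splitEmlx separates raw .emlx bytes into the MIME message and the
// trailing XML plist metadata. Hypothetical helper for illustration.
//
// Assumed layout:
//   <decimal byte count>\n
//   <that many bytes of MIME message>
//   <XML plist>
func splitEmlx(data []byte) (mime, plist []byte, err error) {
	// The first line carries the length of the MIME portion.
	nl := bytes.IndexByte(data, '\n')
	if nl < 0 {
		return nil, nil, errors.New("emlx: missing byte-count line")
	}
	// Tolerate whitespace around the number.
	n, err := strconv.Atoi(string(bytes.TrimSpace(data[:nl])))
	if err != nil || n < 0 {
		return nil, nil, errors.New("emlx: invalid byte count")
	}
	rest := data[nl+1:]
	if n > len(rest) {
		return nil, nil, errors.New("emlx: byte count exceeds file size")
	}
	// The first n bytes are the message; the remainder is the plist.
	return rest[:n], rest[n:], nil
}
```

The returned MIME slice could then be handed to a shared ingestion function such as `IngestRawMessage`, with the plist consulted separately for flags or other metadata.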
### Shared ingestion (`internal/importer`)
- Extracted `IngestRawMessage` so both importers share the same parsing,
recipient storage, attachment, and FTS5 indexing path
- UTF-8 sanitization on all address fields (Message-ID, email, domain,
display name) to prevent DuckDB errors during Parquet cache builds
- `repair-encoding` extended with three new fields for fixing existing
bad data
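
As a sketch of the kind of sanitization described above, Go's standard library can replace invalid byte sequences before a value is stored. The function name `sanitizeUTF8` and the choice of U+FFFD as the replacement are assumptions for illustration; the PR text does not say which replacement the importer actually uses.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 returns s unchanged when it is already valid UTF-8.
// Otherwise each run of invalid bytes is replaced with a single U+FFFD
// replacement character. Hypothetical helper; the replacement character
// is an assumption, not confirmed by the PR description.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: nothing to repair
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}
```

Applying a helper like this to Message-IDs, email addresses, domains, and display names at ingestion time keeps invalid bytes out of the database, which is what the Parquet cache build needs.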
### Bug fixes and refactoring
- Fix unbounded attachment cache growth (key by path+size+hash)
- Fix `sourceMsgID` stability (content-based instead of
file-discriminator-based)
- Fix `MarkMessageDeletedByGmailID` to respect permanent flag
- Extract mbox zip handling into `internal/importer/mboxzip` package
- Various refactoring in `main.go`, `export_eml.go`,
`store_attachment.go`, `messages.go`
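
The ZIP handling moved into `mboxzip` extracts to `dataDir/imports/mbox/<sha256(zip)>` and marks completion with a `.done` sentinel. The sketch below shows one way such a content-addressed path and sentinel check might look. Both function names are hypothetical, the sentinel is assumed to be a file named `.done` inside the extraction directory, and hashing an in-memory byte slice is a simplification (a real importer would more likely stream the archive through `sha256.New()`).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// extractionDir returns dataDir/imports/mbox/<sha256(zip)>.
// Hypothetical helper: the same archive bytes always map to the same
// directory, so a re-run can find and reuse an earlier extraction.
func extractionDir(dataDir string, zipBytes []byte) string {
	sum := sha256.Sum256(zipBytes)
	return filepath.Join(dataDir, "imports", "mbox", hex.EncodeToString(sum[:]))
}

// alreadyExtracted reports whether dir contains a ".done" sentinel.
// Assumption: the sentinel is a file written only after extraction has
// finished, so its presence means the directory contents are complete.
func alreadyExtracted(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, ".done"))
	return err == nil
}
```

Keying the directory by content hash rather than by file name means renaming or moving the export does not trigger a second extraction, while an interrupted extraction (no sentinel) is detected and can be redone.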
## Test plan
- [x] `make test` passes
- [x] `go vet ./...` passes
- [x] End-to-end CLI tests for both `import-mbox` and `import-emlx`
- [x] Unit tests for MBOX reader, EMLX parser, mailbox discovery,
ingestion UTF-8 sanitization
- [ ] Manual test: `./msgvault import-mbox you@hey.com
/path/to/hey-export.zip --source-type hey --label hey`
- [ ] Manual test: `./msgvault import-emlx me@icloud.com
~/Library/Mail/V10`
- [ ] Manual test: `./msgvault repair-encoding` followed by `./msgvault
build-cache --full-rebuild`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
README.md: 12 additions & 1 deletion

@@ -15,11 +15,12 @@ Archive a lifetime of email. Analytics and search in milliseconds, entirely offl

 Your messages are yours. Decades of correspondence, attachments, and history shouldn't be locked behind a web interface or an API. msgvault downloads a complete local copy and then everything runs offline. Search, analytics, and the MCP server all work against local data with no network access required.

-Currently supports Gmail, with WhatsApp and other messaging platforms planned.
+Currently supports Gmail sync, plus offline imports from MBOX exports (including HEY.com).

 ## Features

 - **Full Gmail backup**: raw MIME, attachments, labels, and metadata
+- **MBOX import**: import email exports from providers without IMAP/POP access (e.g. HEY.com)
 - **Interactive TUI**: drill-down analytics over your entire message history, powered by DuckDB over Parquet
 - **Full-text search**: FTS5 with Gmail-like query syntax (`from:`, `has:attachment`, date ranges)
 - **MCP server**: access your full archive at the speed of thought in Claude Desktop and other MCP-capable AI agents
@@ -86,12 +87,22 @@ msgvault tui
 | `stats` | Show archive statistics |
 | `verify EMAIL` | Verify archive integrity against Gmail |
 | `export-eml` | Export a message as `.eml` |
+| `import-mbox` | Import email from an MBOX export (HEY has no IMAP/POP) |
 | `build-cache` | Rebuild the Parquet analytics cache |