
Commit ffdfeb7

ruphy, claude, and wesm authored
Add import-mbox and import-emlx commands for local email import (#103)
## Summary

- Add `import-mbox` command for importing MBOX files (including zipped archives from HEY.com, Google Takeout, etc.)
- Add `import-emlx` command for importing Apple Mail `.emlx` mailbox directory trees
- Both importers share a common ingestion path (`IngestRawMessage`) with checkpoint/resume, attachment extraction, and FTS5 indexing
- Fix invalid UTF-8 in Message-IDs, email addresses, and domains during ingestion so DuckDB Parquet cache builds succeed
- Extend `repair-encoding` to cover `conversations.source_conversation_id`, `participants.email_address`, and `participants.domain`

## What's in this PR

### `import-mbox`

- New CLI: `msgvault import-mbox <identifier> <export-file>`
- Accepts `.mbox/.mbx` or a `.zip` containing `.mbox` files
- Records imported mail under a configurable `--source-type` (example: `--source-type hey`)
- Optional `--label` to apply a label to newly imported messages
- Interrupt-safe imports with checkpoints and resume (default), with `--no-resume` to start fresh
- Deterministic ZIP extraction to `dataDir/imports/mbox/<sha256(zip)>` with a `.done` sentinel and stable ordering
- Streaming MBOX reader (`internal/mbox`) with mboxrd-style unescaping

### `import-emlx`

- New CLI: `msgvault import-emlx <identifier> <path>`
- Discovers `.mbox`/`.imapmbox` directories in Apple Mail directory trees
- Parses `.emlx` format (byte-count prefix + MIME + XML plist metadata)
- Derives labels from directory paths, stripping `.mbox`/`.imapmbox` suffixes
- Deduplicates identical messages across mailboxes (one message, multiple labels)
- Checkpoint/resume with the same patterns as `import-mbox`

### Shared ingestion (`internal/importer`)

- Extracted `IngestRawMessage` so both importers share the same parsing, recipient storage, attachment, and FTS5 indexing path
- UTF-8 sanitization on all address fields (Message-ID, email, domain, display name) to prevent DuckDB errors during Parquet cache builds
- `repair-encoding` extended with three new fields for fixing existing bad data

### Bug fixes and refactoring

- Fix attachment cache unbounded growth (key by path+size+hash)
- Fix `sourceMsgID` stability (content-based instead of file-discriminator-based)
- Fix `MarkMessageDeletedByGmailID` to respect permanent flag
- Extract mbox zip handling into `internal/importer/mboxzip` package
- Various refactoring in `main.go`, `export_eml.go`, `store_attachment.go`, `messages.go`

## Test plan

- [x] `make test` passes
- [x] `go vet ./...` passes
- [x] End-to-end CLI tests for both `import-mbox` and `import-emlx`
- [x] Unit tests for MBOX reader, EMLX parser, mailbox discovery, ingestion UTF-8 sanitization
- [ ] Manual test: `./msgvault import-mbox you@hey.com /path/to/hey-export.zip --source-type hey --label hey`
- [ ] Manual test: `./msgvault import-emlx me@icloud.com ~/Library/Mail/V10`
- [ ] Manual test: `./msgvault repair-encoding` followed by `./msgvault build-cache --full-rebuild`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
1 parent 307e4c5 commit ffdfeb7

39 files changed

Lines changed: 9203 additions & 148 deletions

.githooks/pre-commit

Lines changed: 2 additions & 2 deletions
@@ -18,9 +18,9 @@ if [ -n "$UNFORMATTED" ]; then
     exit 1
 fi
 
-# Run linter
+# Run linter (only check new issues vs HEAD)
 echo "Running linter..."
-if ! golangci-lint run ./... 2>&1; then
+if ! golangci-lint run --new-from-rev=HEAD ./... 2>&1; then
     echo ""
     echo "Lint failed. Fix errors before committing."
     exit 1

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ oauth_client*.json
 # Local development state
 .beads/
 .githooks/post-commit
+.githooks/post-rewrite
 .mcp.json
 
 # Python

README.md

Lines changed: 12 additions & 1 deletion
@@ -15,11 +15,12 @@ Archive a lifetime of email. Analytics and search in milliseconds, entirely offl
 
 Your messages are yours. Decades of correspondence, attachments, and history shouldn't be locked behind a web interface or an API. msgvault downloads a complete local copy and then everything runs offline. Search, analytics, and the MCP server all work against local data with no network access required.
 
-Currently supports Gmail, with WhatsApp and other messaging platforms planned.
+Currently supports Gmail sync, plus offline imports from MBOX exports (including HEY.com).
 
 ## Features
 
 - **Full Gmail backup**: raw MIME, attachments, labels, and metadata
+- **MBOX import**: import email exports from providers without IMAP/POP access (e.g. HEY.com)
 - **Interactive TUI**: drill-down analytics over your entire message history, powered by DuckDB over Parquet
 - **Full-text search**: FTS5 with Gmail-like query syntax (`from:`, `has:attachment`, date ranges)
 - **MCP server**: access your full archive at the speed of thought in Claude Desktop and other MCP-capable AI agents
@@ -86,12 +87,22 @@ msgvault tui
 | `stats` | Show archive statistics |
 | `verify EMAIL` | Verify archive integrity against Gmail |
 | `export-eml` | Export a message as `.eml` |
+| `import-mbox` | Import email from an MBOX export (HEY has no IMAP/POP) |
 | `build-cache` | Rebuild the Parquet analytics cache |
 | `repair-encoding` | Fix UTF-8 encoding issues |
 | `list-senders` / `list-domains` / `list-labels` | Explore metadata |
 
 See the [CLI Reference](https://msgvault.io/cli-reference/) for full details.
 
+## Importing HEY.com
+
+HEY exports mail as MBOX (often delivered as a `.zip`). Import it using `import-mbox`:
+
+```bash
+msgvault init-db
+msgvault import-mbox you@hey.com /path/to/hey-export.zip --source-type hey --label hey
+```
+
 ## Configuration
 
 All data lives in `~/.msgvault/` by default (override with `MSGVAULT_HOME`).

SECURITY.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ If you discover a security vulnerability in msgvault, please report it responsib
 **File permissions:**
 - OAuth token files created with 0600 permissions (owner read/write only)
 - Config directory (`~/.msgvault/`) should be 0700
+- Attachment storage directory (`~/.msgvault/attachments/`) is created with 0700; attachment files are 0600
 - Cross-platform support including Windows DACL
 
 **SQL injection prevention:**

cmd/msgvault/cmd/export_eml.go

Lines changed: 76 additions & 58 deletions
@@ -2,15 +2,20 @@ package cmd
 
 import (
 	"fmt"
-	"os"
 	"strconv"
+	"strings"
 
 	"github.com/spf13/cobra"
 	"github.com/wesm/msgvault/internal/fileutil"
 	"github.com/wesm/msgvault/internal/query"
 	"github.com/wesm/msgvault/internal/store"
 )
 
+const (
+	stdoutSentinel = "-"
+	emlFileMode    = 0o600
+)
+
 var (
 	exportEMLOutput string
 )
@@ -29,75 +34,88 @@ Examples:
 msgvault export-eml 18f0abc123def -o important.eml`,
 	Args: cobra.ExactArgs(1),
 	RunE: func(cmd *cobra.Command, args []string) error {
-		idStr := args[0]
+		return runExportEML(cmd, args[0], exportEMLOutput)
+	},
+}
+
+type resolvedMessage struct {
+	ID              int64
+	SourceMessageID string
+}
 
-		// Open database
-		dbPath := cfg.DatabaseDSN()
-		s, err := store.Open(dbPath)
+func resolveMessage(engine *query.SQLiteEngine, cmd *cobra.Command, messageRef string) (resolvedMessage, error) {
+	if id, err := strconv.ParseInt(messageRef, 10, 64); err == nil {
+		msg, err := engine.GetMessage(cmd.Context(), id)
 		if err != nil {
-			return fmt.Errorf("open database: %w", err)
+			return resolvedMessage{}, fmt.Errorf("get message: %w", err)
 		}
-		defer s.Close()
-
-		// Create query engine to look up the message
-		engine := query.NewSQLiteEngine(s.DB())
-
-		// Try to parse as numeric ID first
-		var msgID int64
-		var sourceMessageID string
-		if id, err := strconv.ParseInt(idStr, 10, 64); err == nil {
-			msgID = id
-			// Get source message ID for output filename
-			msg, err := engine.GetMessage(cmd.Context(), id)
-			if err != nil {
-				return fmt.Errorf("get message: %w", err)
-			}
-			if msg == nil {
-				return fmt.Errorf("message not found: %s", idStr)
-			}
-			sourceMessageID = msg.SourceMessageID
-		} else {
-			// It's a source message ID (Gmail ID)
-			sourceMessageID = idStr
-			msg, err := engine.GetMessageBySourceID(cmd.Context(), idStr)
-			if err != nil {
-				return fmt.Errorf("get message: %w", err)
-			}
-			if msg == nil {
-				return fmt.Errorf("message not found: %s", idStr)
-			}
-			msgID = msg.ID
+		if msg == nil {
+			return resolvedMessage{}, fmt.Errorf("message not found: %s", messageRef)
 		}
+		return resolvedMessage{ID: id, SourceMessageID: msg.SourceMessageID}, nil
+	}
 
-		// Get the raw MIME data
-		rawData, err := s.GetMessageRaw(msgID)
-		if err != nil {
-			return fmt.Errorf("get raw message data: %w (message may not have raw data stored)", err)
-		}
+	msg, err := engine.GetMessageBySourceID(cmd.Context(), messageRef)
+	if err != nil {
+		return resolvedMessage{}, fmt.Errorf("get message: %w", err)
+	}
+	if msg == nil {
+		return resolvedMessage{}, fmt.Errorf("message not found: %s", messageRef)
+	}
+	return resolvedMessage{ID: msg.ID, SourceMessageID: msg.SourceMessageID}, nil
+}
 
-		// Determine output filename
-		outputPath := exportEMLOutput
-		if outputPath == "" {
-			outputPath = fmt.Sprintf("%s.eml", sourceMessageID)
+func sanitizeEMLFilename(sourceMessageID string) string {
+	safe := strings.Map(func(r rune) rune {
+		if r == '/' || r == '\\' || r == '\x00' {
+			return '_'
 		}
+		return r
+	}, sourceMessageID)
+	if safe == "" {
+		safe = "message"
+	}
+	return safe + ".eml"
+}
 
-		// Write to file or stdout
-		if outputPath == "-" {
-			_, err = os.Stdout.Write(rawData)
-			return err
-		}
+func runExportEML(cmd *cobra.Command, messageRef, outputPath string) error {
+	dbPath := cfg.DatabaseDSN()
+	s, err := store.Open(dbPath)
+	if err != nil {
+		return fmt.Errorf("open database: %w", err)
+	}
+	defer s.Close()
 
-		err = fileutil.SecureWriteFile(outputPath, rawData, 0600) // Restricted permissions for email content
-		if err != nil {
-			return fmt.Errorf("write file: %w", err)
-		}
+	engine := query.NewSQLiteEngine(s.DB())
 
-		fmt.Printf("Exported message to: %s (%d bytes)\n", outputPath, len(rawData))
-		return nil
-	},
+	resolved, err := resolveMessage(engine, cmd, messageRef)
+	if err != nil {
+		return err
+	}
+
+	rawData, err := s.GetMessageRaw(resolved.ID)
+	if err != nil {
+		return fmt.Errorf("get raw message data: %w (message may not have raw data stored)", err)
+	}
+
+	if outputPath == "" {
+		outputPath = sanitizeEMLFilename(resolved.SourceMessageID)
+	}
+
+	if outputPath == stdoutSentinel {
+		_, err = cmd.OutOrStdout().Write(rawData)
+		return err
+	}
+
+	if err := fileutil.SecureWriteFile(outputPath, rawData, emlFileMode); err != nil {
+		return fmt.Errorf("write file: %w", err)
+	}
+
+	cmd.Printf("Exported message to: %s (%d bytes)\n", outputPath, len(rawData))
+	return nil
 }
 
 func init() {
 	rootCmd.AddCommand(exportEMLCmd)
-	exportEMLCmd.Flags().StringVarP(&exportEMLOutput, "output", "o", "", "Output file path (default: <gmail_id>.eml, use - for stdout)")
+	exportEMLCmd.Flags().StringVarP(&exportEMLOutput, "output", "o", "", "Output file path (default: <source_message_id>.eml, use - for stdout)")
 }
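With imported mail, the default output filename is now derived from an arbitrary source message ID rather than a Gmail ID, so it may contain path separators. The standalone sketch below copies `sanitizeEMLFilename` from the diff above and runs it on a few inputs. The sample IDs are made up for illustration, apart from `18f0abc123def`, which appears in the command's help text.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeEMLFilename is copied from the diff above: it replaces path
// separators and NUL bytes with '_' and falls back to "message" when the
// ID is empty.
func sanitizeEMLFilename(sourceMessageID string) string {
	safe := strings.Map(func(r rune) rune {
		if r == '/' || r == '\\' || r == '\x00' {
			return '_'
		}
		return r
	}, sourceMessageID)
	if safe == "" {
		safe = "message"
	}
	return safe + ".eml"
}

func main() {
	// Hypothetical source message IDs, chosen to exercise each branch.
	ids := []string{
		"18f0abc123def",
		"<abc/def@example.com>",
		`dir\name`,
		"",
	}
	for _, id := range ids {
		fmt.Printf("%q -> %q\n", id, sanitizeEMLFilename(id))
	}
}
```

Only `/`, `\`, and NUL are replaced. Other characters, such as the angle brackets common in Message-IDs, pass through unchanged.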
