kitschpatrol/metascope

metascope

A CLI tool and TypeScript library to easily extract metadata from all kinds of software repositories.

Note

Metascope is under development. Expect breaking changes until a 1.0 release.

Overview

Metascope aggregates metadata from a local code repository into a single monolithic JSON object. Given a project directory, it checks multiple sources in parallel — local git history, package manifests, the GitHub API, the NPM registry, lines of code analysis, and more — and returns a JSON object containing everything it could find.

From there, an (optional) template system lets you refine and transform the output to reflect exactly which fields you need, useful for archival purposes, populating dashboards, or feeding data into other tools. The template system also provides a spec-compliant implementation of the CodeMeta vocabulary, allowing easy generation of codemeta.json files for a semantically normalized view of a variety of project types.

Highlights:

  • A wide net
    Metascope pulls project metadata from many available sources: package.json, pyproject.toml, NPM, PyPI, GitHub, git, filesystem stats, and more.

  • Graceful degradation
    Each source checks its own availability before extraction. Missing tools, unavailable APIs, or absent credentials are silently skipped — you always get back whatever data is available within the constraints of the calling context.

  • Parallel extraction
    After an initial codemeta pass for discovery hints (package name, repository URL, keywords), all remaining sources are checked and extracted concurrently.

  • Typed templates
    The defineTemplate() helper provides full autocomplete on available fields. TypeScript infers the return type from your template function, so getMetadata() returns exactly the shape you need.

  • CLI and library
    Use it as a command-line tool for quick inspection or pipe-friendly JSON output, or import it as a library for programmatic access with full type safety.
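
The graceful-degradation and parallel-extraction behavior described above can be sketched roughly as follows. The `Source` shape and the `extractAll` function are hypothetical names chosen for illustration, not Metascope's actual internals; the sketch only assumes that each source can report its own availability and that a failing source should be skipped rather than abort the run.

```typescript
// Hypothetical source shape for illustration; not Metascope's real internals.
type Source<T> = {
  key: string
  isAvailable: () => Promise<boolean>
  extract: () => Promise<T>
}

// Run every source concurrently; skip any that is unavailable or throws.
async function extractAll(sources: Source<unknown>[]): Promise<Record<string, unknown>> {
  const entries = await Promise.all(
    sources.map(async (source): Promise<[string, unknown] | undefined> => {
      try {
        if (!(await source.isAvailable())) return undefined
        return [source.key, await source.extract()]
      } catch {
        // A failing source is dropped instead of failing the whole run
        return undefined
      }
    }),
  )

  // Keep only the sources that actually produced data
  const result: Record<string, unknown> = {}
  for (const entry of entries) {
    if (entry !== undefined) result[entry[0]] = entry[1]
  }
  return result
}
```

Wrapping each task in its own try/catch means one slow or broken source never prevents the others from contributing to the result.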

Getting started

Dependencies

Metascope requires Node.js 22.17+. It is implemented in TypeScript, ships as ESM, and bundles complete type definitions.

Metascope also requires a recent version of git on your path for quickly identifying ignored files and aggregating repository statistics.

Optional external tools:

  • GitHub CLI
    Used as a fallback for GitHub API authentication if no token is provided via --github-token or $GITHUB_TOKEN. It's trivially installed from Homebrew: brew install gh.

Installation

Invoke directly on the current directory:

npx metascope

...or install locally:

npm install metascope

...or install globally:

npm install --global metascope

If you're using PNPM, you can safely ignore the build scripts for the tree-sitter dependencies, since we're only interested in their bundled WASM implementations.

In your pnpm-workspace.yaml:

ignoredBuiltDependencies:
  - tree-sitter-python
  - tree-sitter-ruby

Usage

CLI

Command: metascope

Extract metadata from a code repository.

Usage:

metascope [path]
| Positional Argument | Description | Type | Default |
| --- | --- | --- | --- |
| path | Project directory path | string | "." |

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| --template, -t | Built-in template name (codemeta, codemetaJson, frontmatter, metadata, project) or path to a custom template file | string | |
| --github-token | GitHub API token (or set $GITHUB_TOKEN) | string | |
| --author-name | Optional author name(s) for ownership checks in templates | array | |
| --github-account | Optional GitHub account name(s) for ownership checks in templates | array | |
| --absolute | Output absolute paths. Use --no-absolute for relative paths. | boolean | true |
| --offline | Skip sources requiring network requests | boolean | false |
| --sources, -s | Only run specific metadata sources (arduino-library-properties, cinder-cinderblock-xml, codemeta-json, git-config, go-go-mod, go-goreleaser-yaml, java-pom-xml, license-file, metadata-file, metascope, node-package-json, obsidian-plugin-manifest-json, openframeworks-addon-config-mk, openframeworks-install-xml, processing-library-properties, processing-sketch-properties, publiccode-yaml, python-pkg-info, python-pyproject-toml, python-setup-cfg, python-setup-py, readme-file, ruby-gemspec, rust-cargo-toml, xcode-info-plist, xcode-project-pbxproj, github-actions, code-stats, dependency-updates, file-stats, git-stats, github, node-npm-registry, obsidian-plugin-registry, python-pypi-registry); defaults to all | array | |
| --no-ignore | Include files ignored by .gitignore in the file tree | boolean | false |
| --recursive, -r | Search for metadata files recursively in subdirectories | boolean | false |
| --workspaces, -w | Include workspace-specific metadata in monorepos; pass a boolean to enable or disable auto-detection, or pass one or more strings to explicitly define workspace paths | | true |
| --verbose | Run with verbose logging | boolean | false |
| --help, -h | Show help | boolean | |
| --version, -v | Show version number | boolean | |

Examples

Basic metadata extraction

Extract all available metadata from the current directory:

metascope

Output is pretty-printed JSON when writing to a terminal, compact JSON when piped.
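
As a rough illustration of that terminal-versus-pipe behavior, a formatter might switch on whether the output stream is a TTY. The function name `formatOutput` and the two-space indent are assumptions made for this sketch; they are not taken from Metascope's source.

```typescript
// Pretty-print for interactive terminals, emit compact JSON otherwise.
// The two-space indent is an assumption for illustration.
function formatOutput(value: unknown, isTerminal: boolean): string {
  return isTerminal ? JSON.stringify(value, undefined, 2) : JSON.stringify(value)
}
```

In a Node.js program, `process.stdout.isTTY` is the usual signal to pass as `isTerminal`; it is truthy for an interactive terminal and undefined when output is piped or redirected.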

Scan a specific directory

metascope /path/to/project

Use a built-in template

metascope --template project

Pass template data for ownership checks

Some preset templates return information based on the (relative) ownership status of a repo. This requires additional context data, which can be passed in via additional CLI flags:

metascope --template project --author-name "Jane Doe" --github-account janedoe

Multiple values are supported:

metascope --template project --author-name "Jane Doe" "John Doe" --github-account janedoe johndoe

Use a custom template file

metascope --template ./my-template.ts

Where my-template.ts might look like:

import { defineTemplate, helpers } from 'metascope'

export default defineTemplate(({ codemetaJson, github, gitStats }) => {
  const codemeta = helpers.firstOf(codemetaJson)
  const git = helpers.firstOf(gitStats)
  const gh = helpers.firstOf(github)
  return {
    commits: git?.data.commitCount,
    name: codemeta?.data.name,
    stars: gh?.data.stargazerCount,
    version: codemeta?.data.version,
  }
})

Run only specific sources

Extract metadata from only the sources you need, skipping everything else for faster results:

metascope --sources node-package-json git-stats

Pipe compact JSON to another tool

metascope | jq '.github.stargazerCount'

Provide a GitHub token

An optional GitHub token can allow access to metadata about private repositories, and raises the request limit if you're operating on a large collection of repositories:

metascope --github-token ghp_xxxxxxxxxxxx

Or set the GITHUB_TOKEN environment variable, or authenticate via gh auth login. Metascope will attempt to find a credential without bothering you.

Verbose logging

metascope --verbose

Logs source availability checks, extraction durations, and other diagnostics to stderr.

API

The metascope library exports getMetadata as its primary function, defineTemplate for type-safe template authoring, and a helpers namespace with utility functions for working with metadata in templates.

getMetadata

// Without a template — returns full MetadataContext
function getMetadata(options: GetMetadataOptions): Promise<MetadataContext>

// With a template — returns the template's return type
function getMetadata<T>(options: GetMetadataTemplateOptions<T>): Promise<T>

The function accepts a project directory path, optional credentials, and an optional template (a built-in name or a template function). It returns a promise resolving to either the full MetadataContext or the shaped output of your template.

All undefined values and empty source objects are deep-stripped from the output before returning.
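
A minimal sketch of what such deep-stripping could look like for plain JSON-like data is shown below. `deepStrip` is a hypothetical helper written for this example, and it removes every object left empty after stripping, which may be broader than what Metascope actually does for source objects.

```typescript
// Recursively drop undefined values and any objects left empty afterwards.
// Intended for plain JSON-like data only (no Dates, Maps, or class instances).
function deepStrip(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => deepStrip(item)).filter((item) => item !== undefined)
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value)) {
      const stripped = deepStrip(child)
      if (stripped !== undefined) result[key] = stripped
    }
    // An object with no surviving keys is itself treated as absent
    return Object.keys(result).length > 0 ? result : undefined
  }
  return value
}
```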

To run only a subset of sources, pass a sources array with the desired source key names. When omitted, all sources run (the default). This is useful for faster extraction when you only need specific data:

const result = await getMetadata({
  path: '.',
  sources: ['nodePackageJson', 'gitStats'],
})
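
The CLI examples name sources in kebab-case (node-package-json), while the API examples use camelCase keys (nodePackageJson). Judging from the names listed in this document, the two appear to be related by a plain kebab-to-camel conversion. The helper below is a hypothetical sketch of that mapping and is not a function exported by metascope.

```typescript
// Convert a CLI-style source name (kebab-case) to an API-style key (camelCase).
// Hypothetical helper; the mapping is inferred from the names in this README.
function toSourceKey(cliName: string): string {
  return cliName.replace(/-([a-z0-9])/g, (_match, char: string) => char.toUpperCase())
}
```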

Templates can be combined with the sources option, but note that some of the built-in templates might suffer missing data if they rely on specific sources.

defineTemplate

function defineTemplate<T>(
  transform: (context: MetadataContext, templateData: TemplateData) => T,
): Template<T>

An identity wrapper that provides autocomplete and type inference when authoring templates. The optional second templateData argument provides user-supplied values (like author names or GitHub accounts) for parameterized ownership checks. Templates that don't need it can simply ignore the argument. Template developers can pass additional values as needed.
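
To show why an identity wrapper is enough to get inference, here is a self-contained sketch using deliberately simplified stand-in types. `DemoContext`, `DemoTemplateData`, and `defineDemoTemplate` are hypothetical names; Metascope's real `MetadataContext` and `TemplateData` are much richer.

```typescript
// Simplified stand-ins for illustration; not Metascope's real types.
type DemoContext = { gitStats?: { commitCount: number } }
type DemoTemplateData = { authorName?: string }
type DemoTemplate<T> = (context: DemoContext, templateData: DemoTemplateData) => T

// Returns its argument unchanged; its only job is to give the callback contextual types.
function defineDemoTemplate<T>(transform: DemoTemplate<T>): DemoTemplate<T> {
  return transform
}

// `gitStats` and `authorName` are typed without annotations, and T is inferred
// as { commits: number; owner: string } from the returned object literal.
const demoTemplate = defineDemoTemplate(({ gitStats }, { authorName }) => ({
  commits: gitStats?.commitCount ?? 0,
  owner: authorName ?? 'unknown',
}))

const shaped = demoTemplate({ gitStats: { commitCount: 42 } }, {})
```

Because the wrapper adds no runtime behavior, a template defined this way is still just a plain function that can be called or tested directly.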

Examples

Get all metadata

import { getMetadata, helpers } from 'metascope'

const metadata = await getMetadata({ path: '.' })
console.log(helpers.firstOf(metadata.codemetaJson)?.data.name)
console.log(helpers.firstOf(metadata.github)?.data.stargazerCount)
console.log(helpers.firstOf(metadata.gitStats)?.data.commitCount)

See output sample for this repository.

Get metadata from specific sources only

import { getMetadata, helpers } from 'metascope'

const metadata = await getMetadata({
  path: '.',
  sources: ['nodePackageJson', 'licenseFile'],
})

// Only the requested sources are populated
console.log(helpers.firstOf(metadata.nodePackageJson)?.data.name)
console.log(helpers.firstOf(metadata.licenseFile)?.data.spdxId)
// Other sources are undefined
console.log(metadata.github) // undefined

Get shaped metadata via a template

import { defineTemplate, getMetadata, helpers } from 'metascope'

const template = defineTemplate(({ codemetaJson, github }) => ({
  name: helpers.firstOf(codemetaJson)?.data.name,
  stars: helpers.firstOf(github)?.data.stargazerCount,
}))

// Result is typed as { name: ..., stars: ... }
const result = await getMetadata({ path: '.', template })

Provide credentials

import { getMetadata } from 'metascope'

const metadata = await getMetadata({
  credentials: { githubToken: 'ghp_xxxxxxxxxxxx' },
  path: '.',
})

Credential resolution follows a precedence chain: explicit options > environment variables > CLI tool fallbacks (e.g. gh auth token). This makes metascope work in both CI environments and local development without configuration.
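
The precedence chain can be sketched as a small resolver. `TokenSources` and `resolveToken` are hypothetical names for this example, and the CLI fallback is injected as a function so that the sketch stays self-contained; in practice it might wrap a call to `gh auth token`.

```typescript
// Hypothetical resolver illustrating the precedence chain; names are assumptions.
type TokenSources = {
  explicit?: string
  environment?: string
  cliFallback?: () => string | undefined
}

function resolveToken({ explicit, environment, cliFallback }: TokenSources): string | undefined {
  // 1. An explicitly passed option always wins
  if (explicit) return explicit
  // 2. Then an environment variable such as GITHUB_TOKEN
  if (environment) return environment
  // 3. Finally a CLI fallback, invoked lazily and allowed to fail quietly
  try {
    return cliFallback?.() || undefined
  } catch {
    return undefined
  }
}
```

Calling the fallback only when the first two options are empty avoids spawning an external process when a token is already known.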

Pass template data

import { defineTemplate, getMetadata, helpers } from 'metascope'

const template = defineTemplate(({ codemetaJson }, { authorName }) => {
  const codemeta = helpers.firstOf(codemetaJson)
  return {
    isAuthoredByMe: codemeta?.data.author?.some((a) => a.name === authorName),
    name: codemeta?.data.name,
  }
})

const result = await getMetadata({
  path: '.',
  template,
  templateData: { authorName: 'Jane Doe' },
})

Use a built-in template

import { getMetadata } from 'metascope'

const result = await getMetadata({ path: '.', template: 'frontmatter' })

Sources

Metascope extracts data from a wide range of data sources:

Local Files

Ecosystem Organization Metascope Key Source Specifications
Agnostic readmeFile README.md (and variants)
Agnostic CodeMeta (v1) codemetaJson codemeta.json
Agnostic CodeMeta (v2) codemetaJson codemeta.json
Agnostic CodeMeta (v3.1) codemetaJson codemeta.json
Agnostic CodeMeta (v3) codemetaJson codemeta.json
Agnostic Documented below metadataFile metadata.json (and .yaml / .yml variants)
Agnostic Git gitConfig .git/config
Agnostic Public Code publiccodeYaml publiccode.yml (Also matches .yaml)
Agnostic SPDX licenseFile LICENSE, LICENCE, COPYING, UNLICENSE (and .md/.txt variants)
Apple Apple Info.plist xcodeInfoPlist Info.plist
Apple Xcode Project xcodeProjectPbxproj *.xcodeproj/project.pbxproj
C++ Arduino Library arduinoLibraryProperties library.properties
C++ Cinder CinderBlock cinderCinderblockXml cinderblock.xml
C++ openFrameworks Addon (Legacy) openframeworksInstallXml install.xml (Legacy format, replaced by addon_config.mk)
C++ openFrameworks Addon openframeworksAddonConfigMk addon_config.mk
Go Go Modules goGoMod go.mod
Go GoReleaser goGoreleaserYaml .goreleaser.yaml (Also matches .yml)
Java Maven javaPomXml pom.xml
Java Processing Library processingLibraryProperties library.properties
Java Processing Sketch processingSketchProperties sketch.properties (Not really specified...)
JavaScript NPM nodePackageJson package.json
Obsidian Obsidian obsidianPluginManifestJson manifest.json
Python PyPI (Distutils) pythonSetupCfg setup.cfg
Python PyPI (Distutils) pythonSetupPy setup.py
Python PyPI (PEP 621) pythonPyprojectToml pyproject.toml
Python PyPI (PKG-INFO) pythonPkgInfo .egg-info/PKG-INFO
Ruby Ruby Gems rubyGemspec *.gemspec
Rust Crates rustCargoToml Cargo.toml

Local Tools

Ecosystem Organization Metascope Key Source Specifications
Agnostic dependencyUpdates Dependency freshness (outdated packages, libyears)
Agnostic fileStats Filesystem metadata (file counts, directory counts, total size)
Agnostic Git gitStats Git CLI statistics (commits, branches, tags, contributors)
Agnostic None codeStats Lines of code analysis from tokei via bundled native bindings

Remote Sources

You can skip network calls by passing --offline to the CLI.

Ecosystem Organization Metascope Key Source Specifications
Agnostic GitHub Repository Metadata github GitHub GraphQL metadata
JavaScript NPM Registry nodeNpmRegistry NPM registry API (download counts, publish dates, latest version)
Obsidian Obsidian Community Plugins obsidianPluginRegistry Obsidian community plugin stats (download counts)
Python PyPI Registry pythonPypiRegistry PyPI registry API (download counts, publish dates, latest version)

About metadata.json

Metascope supports a minimalist metadata.json (or .yaml) file, which captures the minimal metadata required to populate a GitHub repository page's description, homepage, and topics.

This is a non-standard format that exists primarily for use in combination with github-action-repo-sync.

| Key | Key Aliases | CodeMeta Property | Notes |
| --- | --- | --- | --- |
| description | None | description | String description of project |
| homepage | url, repository, website | url | For repository values, git+ prefix and .git suffix are automatically stripped |
| keywords | tags, topics | keywords | Array of strings, or a single comma-delimited string |

If multiple key aliases are present in the object, priority for populating the associated codemeta.json goes to the key, then falls through to key aliases in the order shown above. (E.g. homepage takes priority over url.)
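
The alias priority and normalization rules above could be implemented along these lines. All function names here are hypothetical and the logic is a sketch of the documented behavior, not Metascope's actual parser.

```typescript
type MetadataFile = Record<string, unknown>

// Return the first non-empty string found, checking the key before its aliases.
function firstMatch(data: MetadataFile, keys: string[]): { key: string; value: string } | undefined {
  for (const key of keys) {
    const value = data[key]
    if (typeof value === 'string' && value.trim() !== '') return { key, value: value.trim() }
  }
  return undefined
}

// Strip the git+ prefix and .git suffix from repository-style URLs.
function cleanRepositoryUrl(url: string): string {
  return url.replace(/^git\+/, '').replace(/\.git$/, '')
}

// Accept an array of strings or a single comma-delimited string.
function normalizeKeywords(value: unknown): string[] {
  const items: unknown[] = Array.isArray(value) ? value : typeof value === 'string' ? value.split(',') : []
  return items
    .filter((item): item is string => typeof item === 'string')
    .map((item) => item.trim())
    .filter((item) => item !== '')
}

function resolveMetadata(data: MetadataFile) {
  // Priority follows the table: the key first, then aliases in the listed order
  const home = firstMatch(data, ['homepage', 'url', 'repository', 'website'])
  const keywordKey = ['keywords', 'tags', 'topics'].find((key) => data[key] !== undefined)
  return {
    description: firstMatch(data, ['description'])?.value,
    url: home?.key === 'repository' ? cleanRepositoryUrl(home.value) : home?.value,
    keywords: keywordKey === undefined ? [] : normalizeKeywords(data[keywordKey]),
  }
}
```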

If you have more metadata to define but your project lacks a canonical package specification format, then creating a codemeta.json file is recommended over the non-standard metadata.json.

Templates

Metascope provides basic templating and output-transformation functionality to compose its output into more compact and focused representations.

Built-in templates

Five built-in templates are available by name. Pass the name as the template option on the CLI or in the API.

codemeta

The CodeMeta template provides a standard way to describe software using JSON-LD and schema.org terms. Most software projects already have rich metadata in manifests and other files (e.g. package.json, Cargo.toml, pyproject.toml, LICENSE, etc.), but the name and structure of semantically equivalent metadata is often inconsistent across ecosystems.

It leverages the crosswalk data generously compiled by CodeMeta contributors to assist in automating the mapping of various metadata formats to the CodeMeta standard. Where crosswalk data is unavailable or incomplete, heuristics are used instead.

This tool always outputs CodeMeta v3.1 files. When ingesting codemeta.json files defined in the older CodeMeta v1 and v2 contexts, all simple key re-mappings as defined in the crosswalk table are applied. However, some more nuanced conditional transformations (like the reassignment of copyright holding agents in v1) are not implemented.

More mature Python-based tools like codemetapy and codemeta-harvester perform a similar task, and either of these is recommended if you need codemeta.json output and aren't limited to a Node.js runtime.

Note that Metascope and its author are not affiliated with the CodeMeta project or its governing bodies.

metascope --template codemeta

See an output sample from the codemeta template run against this repository.

codemetaJson

A JSON-friendly derivation of the codemeta template. Produces the same aggregated metadata but parses it through a strict schema, stripping JSON-LD artifacts (like @context and @type) to yield plain JSON suitable for consumption by tools that don't understand JSON-LD.
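
As a loose illustration of what stripping JSON-LD artifacts means, the sketch below recursively removes every key that begins with `@`. This is a simplification: the actual template is described as parsing through a strict schema, and `stripJsonLdKeys` is a hypothetical name that does not come from Metascope.

```typescript
// Recursively remove JSON-LD keywords (keys starting with "@") from a value.
// A naive filter for illustration; it does not validate against any schema.
function stripJsonLdKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => stripJsonLdKeys(item))
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value)) {
      if (!key.startsWith('@')) result[key] = stripJsonLdKeys(child)
    }
    return result
  }
  return value
}
```

Note that a node consisting only of JSON-LD keywords (for example, an object holding just an `@id`) would collapse to an empty object under this approach, which is one reason a schema-driven transformation is the more robust choice.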

metascope --template codemetaJson

See an output sample from the codemetaJson template run against this repository.

frontmatter

A compact, non-nested, polyglot overview of the project. Designed for Obsidian frontmatter — flat keys with natural language names, blending all available sources into a single trackable snapshot. Uses null for missing values to ensure stable keys.

metascope --template frontmatter

See an output sample from the frontmatter template run against this repository.

metadata

A minimal template that outputs the three fields used by metadata.json / metadata.yaml: description, homepage, and topics. Designed for use with github-action-repo-sync to populate a GitHub repository's description, homepage, and topics. Values from a metadata.json source file override what the codemeta template would otherwise produce.

metascope --template metadata

See an output sample from the metadata template run against this repository.

project

I needed this one for a legacy internal dashboard application. Includes ownership checks via authorName and githubAccount template data.

metascope --template project --author-name "Jane Doe" --github-account janedoe

See an output sample from the project template run against this repository.

Defining a custom template

Templates are pure functions that receive the full MetadataContext and an optional TemplateData object, and return whatever shape you like. They are applied after all sources have been extracted, so all available data is accessible.

Yes, you can just pipe output to jq and filter / transform as you please, but for complex templates with a lot of logic, TypeScript can be nicer to work with.

Use defineTemplate() for type inference and autocomplete.

Many helper functions for working with template data are also available under the helpers namespace:

// In e.g. "metascope-template.ts":
import { defineTemplate, helpers } from 'metascope'

export default defineTemplate(({ codemetaJson, codeStats, github, gitStats }) => {
  const codemeta = helpers.firstOf(codemetaJson)
  const git = helpers.firstOf(gitStats)
  const gh = helpers.firstOf(github)
  const loc = helpers.firstOf(codeStats)
  return {
    commits: git?.data.commitCount,
    forks: gh?.data.forkCount,
    linesOfCode: loc?.data.total?.code,
    name: codemeta?.data.name,
    stars: gh?.data.stargazerCount,
    version: codemeta?.data.version,
  }
})

Passing template data

The second argument to a template function is a TemplateData object with optional authorName and githubAccount fields. This lets templates parameterize ownership checks instead of hardcoding author names:

import { defineTemplate, helpers } from 'metascope'

export default defineTemplate(({ codemetaJson }, { authorName, githubAccount }) => {
  const codemeta = helpers.firstOf(codemetaJson)
  const authors = codemeta?.data.author?.map((a) => a.name) ?? []
  const repo = codemeta?.data.codeRepository?.toLowerCase() ?? ''
  return {
    isMyProject: authors.includes(authorName),
    isOnMyGitHub: typeof githubAccount === 'string' && repo.includes(`/${githubAccount}/`),
    name: codemeta?.data.name,
  }
})

Values for the built-in templates are provided via the --author-name and --github-account CLI flags, or via the templateData option in the API. Templates that don't need this data can simply omit the second argument.

Using a custom template via the CLI

metascope --template ./metascope-template.ts

Template files are loaded via jiti, so TypeScript works out of the box without a build step.

Background

Metascope was built to support automated generation of project dashboards, badges, and documentation where a single source of truth for project metadata is useful. Rather than querying each API individually, metascope handles the discovery, authentication, and aggregation in one pass for a wide variety of project types.

Related projects

  • codemeta
    Standard shared metadata vocabulary (JSON-LD)
  • codemetapy
    Translate software metadata into the CodeMeta vocabulary (Python)
  • codemeta-harvester
    Aggregate software metadata into the CodeMeta vocabulary from source repositories and service endpoints (Python)
  • bibliothecary
    Manifest discovery and parsing for libraries.io (Ruby)
  • diggity
    Generates SBOMs for container images, filesystems, archives, and more (Go)
  • SOMEF
    Software Metadata Extraction Framework (Python)
  • Upstream Ontologist
    A common interface for finding metadata about upstream software projects (Rust)
  • GrimoireLab
    Platform for software development analytics and insights (Python)
  • OSS Review Toolkit
    A suite of CLI tools to automate software compliance checks (Kotlin)
  • Git Truck
    Repository visualization (TypeScript)
  • Onefetch
    Offline command-line Git information tool (Rust)
  • Sokrates
    Polyglot source code examination tool (Java)

Slop factor

Medium.

The architecture and non-boilerplate parts of the documentation were human-driven, but sizable chunks of the implementation were mostly Claude Code's doing and have been subject to only moderate post-facto human scrutiny.

Maintainers

@kitschpatrol

Acknowledgments

Thank you to the CodeMeta Project Management Committee and contributors for their development and stewardship of the standard.

Jacob Peddicord's askalono project inspired the Dice-Sørensen scoring strategy used for classifying arbitrary license text.

Contributing

Issues and pull requests are welcome.

License

MIT © Eric Mika
