Merged
Changes from 7 commits
Commits
20 commits
bb1845a
new properties (license, version) from citation.cff. Fixes #935
juanjemdIos Apr 7, 2026
f76adad
Improve extraction and structure of citation and reference publications
juanjemdIos Apr 10, 2026
da0c518
Revert reference_publication category to keep all citations in CAT_CI…
juanjemdIos Apr 10, 2026
d23cae9
upperCase typo error writting "Widoco" in a test
juanjemdIos Apr 10, 2026
78055b1
Standardize license fields and consolidate results. Fixes #955
juanjemdIos Apr 13, 2026
3d84e8f
fix documentation consistency and order tables. Fixes #954
juanjemdIos Apr 14, 2026
4ddc072
Apply suggestion from @dgarijo
dgarijo Apr 14, 2026
c9e0564
Update docs/index.md
juanjemdIos Apr 14, 2026
5686a9e
Update docs/index.md
juanjemdIos Apr 14, 2026
cce43bf
Update docs/index.md
juanjemdIos Apr 14, 2026
c5a74c5
resolve category inconsistencies and standardize properties and types…
juanjemdIos Apr 15, 2026
ef50754
Merge branch 'master' of https://github.com/juanjemdIos/somef
juanjemdIos Apr 15, 2026
20e1ef0
Funding and funder in codemeta export. Fixes #960
juanjemdIos Apr 15, 2026
b5e04da
Fixes mkdocs and ensures file names. CSS. New citation.cff and bibtex…
juanjemdIos Apr 16, 2026
5e46dfb
update supported dependency types in output.md
juanjemdIos Apr 16, 2026
7bf29b7
Apply suggestion from @dgarijo
dgarijo Apr 16, 2026
6489482
Apply suggestion from @dgarijo
dgarijo Apr 16, 2026
7479ea4
Apply suggestion from @dgarijo
dgarijo Apr 16, 2026
9867098
Apply suggestion from @dgarijo
dgarijo Apr 16, 2026
e8f3a2a
Apply suggestion from @dgarijo
dgarijo Apr 16, 2026
10 changes: 8 additions & 2 deletions README.md
@@ -24,12 +24,18 @@ Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the follo
- Email: email of author
- URL: website or ORCID associated with the author
- **Build file**: Build file(s) of the project. For example, files used to create a Docker image for the target software, package files, etc.
- **Citation**: Preferred citation as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation). We aim to recognize the following properties:
- **Citation**: Preferred citation as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation).
Comment thread
dgarijo marked this conversation as resolved.
Outdated
For CITATION.cff files, SOMEF now generates two separate entries: one for the software (is_preferred_citation: False) and another for the preferred citation (is_preferred_citation: True). This ensures metadata like DOI or version is correctly assigned to each entity.
We aim to recognize the following properties:
- Title: Title of the publication
- Author: list of author names in the publication
- URL: URL of the publication
- DOI: Digital object identifier of the publication
- Date published
- Version: Software version (if applicable)
- Journal: Journal name where the paper was published
- Year: Year of publication
- Pages: Page range in the journal
- **Code of conduct**: Link to the code of conduct of the project
- **Code repository**: Link to the GitHub/GitLab repository used for the extraction
- **Contact**: Contact person responsible for maintaining a software component
@@ -54,7 +60,7 @@ Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the follo
- **Invocation**: Execution command(s) needed to run a scientific software component
- **Issue tracker**: Link where to open issues for the target repository
- **Keywords**: set of terms used to commonly identify a software component
- **License**: License and usage terms of a software component
- **License**: License and usage terms of a software component. Now we also extract license from citation.cff.
- **Logo**: Main logo used to represent the target software component
- **Maintainer**: Individuals or teams responsible for maintaining the software component, extracted from the CODEOWNERS file
- **Name**: Name identifying a software component
12 changes: 9 additions & 3 deletions docs/index.md
@@ -31,12 +31,18 @@ Given a readme file (or a GitHub repository) SOMEF will extract the following ca
- URL: website or ORCID associated with the author
- Affiliation: name of organization or affiliation
- **Build file**: Build file(s) of the project. For example, files used to create a Docker image for the target software, package files, etc.
- **Citation**: Preferred citation as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation). We aim to recognize the following properties:
- **Citation**: Preferred citation(s) as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation).
For CITATION.cff files, SOMEF now generates two separate entries: one for the software (is_preferred_citation: False) and another for the preferred citation (is_preferred_citation: True). This ensures metadata like DOI or version is correctly assigned to each entity.
Collaborator

but if there is no preferred citation, then the software one is not false.
Instead of generating False, I would only leave is_preferred_citation: True when we find a preferred citation. And do not add anything otherwise.

We aim to recognize the following properties:
Comment thread
juanjemdIos marked this conversation as resolved.
- Title: Title of the publication
- Author: list of author names in the publication
- URL: URL of the publication
- DOI: Digital object identifier of the publication
- Date published:
- Date published
- Version: Software version (if applicable)
Comment thread
juanjemdIos marked this conversation as resolved.
Outdated
- Journal: Journal name where the paper was published
- Year: Year of publication
- Pages: Page range in the journal
- **Code of conduct**: Link to the code of conduct of the project
- **Code repository**: Link to the GitHub/GitLab repository used for the extraction
- **Contact**: Contact person responsible for maintaining a software component
@@ -62,7 +68,7 @@ Given a readme file (or a GitHub repository) SOMEF will extract the following ca
- **Invocation**: Execution command(s) needed to run a scientific software component
- **Issue tracker**: Link where to open issues for the target repository
- **Keywords**: set of terms used to commonly identify a software component
- **License**: License and usage terms of a software component
- **License**: License and usage terms of a software component. Now we also extract license from citation.cff.
Comment thread
juanjemdIos marked this conversation as resolved.
Outdated
- **Logo**: Main logo used to represent the target software component
- **Name**: Name identifying a software component
- **Ontologies**: URL and path to the ontology files present in the repository
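
The review comment on the Citation entry suggests setting `is_preferred_citation` only when a preferred citation is actually found. A minimal Python sketch of that suggestion follows. It is not SOMEF's parser: the function name `split_cff_citations`, the shape of the returned entries, and the sample values are assumptions made for illustration. The input keys (`preferred-citation`, `given-names`, `family-names`) are those of the Citation File Format.

```python
from typing import Any


def split_cff_citations(cff: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: one entry for the software, plus one for the
    preferred citation when the CFF data declares it."""

    def authors_of(block: dict[str, Any]) -> list[str]:
        # Person authors carry given-names/family-names; entities carry name.
        return [
            " ".join(p for p in (a.get("given-names"), a.get("family-names")) if p)
            or a.get("name", "")
            for a in block.get("authors", [])
        ]

    software = {
        "title": cff.get("title"),
        "authors": authors_of(cff),
        "doi": cff.get("doi"),
        "version": cff.get("version"),
        "license": cff.get("license"),
    }
    # Drop empty values; the software entry carries no flag at all.
    entries = [{k: v for k, v in software.items() if v}]

    preferred = cff.get("preferred-citation")
    if preferred:
        article = {
            "title": preferred.get("title"),
            "authors": authors_of(preferred),
            "doi": preferred.get("doi"),
            "journal": preferred.get("journal"),
            "year": preferred.get("year"),
            "is_preferred_citation": True,  # set only when a preferred citation exists
        }
        entries.append({k: v for k, v in article.items() if v})
    return entries


# Illustrative input only; the DOIs and names are placeholders, not real records.
sample = {
    "title": "Example Tool",
    "version": "1.2.0",
    "license": "Apache-2.0",
    "doi": "10.0000/software",
    "authors": [{"given-names": "Ada", "family-names": "Lovelace"}],
    "preferred-citation": {
        "type": "article",
        "title": "Example Tool: a paper",
        "doi": "10.0000/paper",
        "journal": "Journal of Examples",
        "year": 2024,
        "authors": [{"given-names": "Ada", "family-names": "Lovelace"}],
    },
}
entries = split_cff_citations(sample)
```

Keeping the two entries separate means the software DOI and version stay on the software entry, while the paper's DOI, journal, and year stay on the preferred citation.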
68 changes: 38 additions & 30 deletions docs/output.md
@@ -68,7 +68,7 @@ SOMEF aims to recognize the following categories (in alphabetical order):
- `acknowledgement`: Any text that the authors have prepared to acknowledge the contribution from others, or project funding.
- `application_domain`: The application domain of the repository. This may be related to the research area of a software component (e.g., Astrophysics) or the general domain/functionality of the tool (i.e., machine learning projects). See all current recognized application domains [here](https://somef.readthedocs.io/en/latest/#myfootnote1).
- `authors`: Person or organization responsible for the project. This property is also used to indicate the responsible entities of a publication associated with the code repository.
- `citation`: Software citation (usually in `.bib` form) as the authors have stated in their readme file, or through a `CFF` file.
- `citation`: Software citation (usually in .bib or .cff format). SOMEF extracts and structures the metadata from these files (including authors, titles, and DOIs) instead of just returning a raw string.
- `code_of_conduct`: Link to the code of conduct file of the project
- `code_repository`: Link to the source code (typically the repository where the readme can be found)
- `contact`: Contact person responsible for maintaining a software component.
@@ -88,7 +88,9 @@ SOMEF aims to recognize the following categories (in alphabetical order):
- `forks_url`: Links to forks made of the project (GitHub only)
- `full_name`: Name + owner (owner/name) (if available)
- `full_title`: If the repository has a short name, we will attempt to extract the longer version of the repository name. For example, a repository may be called "Widoco", but the longer title is "Wizard for documenting ontologies".
- `funding`: Funding code for the related project.
Collaborator

wait, we don't do this

Collaborator

Please double check

- `has_build_file`: Build file to create a Docker image for the target software
- `has_package_file`: Specifies what package file is present in the code repository.
- `has_script_file`: Snippets of code contained in the repository.
- `homepage`: URL of the item.
- `identifier`: Identifiers detected within a repository (e.g., Digital Object Identifier).
@@ -105,8 +107,10 @@ SOMEF aims to recognize the following categories (in alphabetical order):
- `owner`: Name of the user or organization in charge of the repository
- `package_distribution`: Link to official package repositories where the software can be downloaded from (e.g., `pypi`).
- `package_file`: Link to a package file used in the repository (e.g., `pyproject.toml`, `setup.py`).
- `package_id`: Identifier extracted from packages. (e.g., `packages.json`)
- `programming_languages`: Languages used in the repository.
- `readme_url`: URL to the main README file in the repository.
- `reference_publication`: URL to the paper associated with the code repository.
- `related_papers`: URL to possible related papers within the repository stated within the readme file.
- `releases`: Pointer to the available versions of a software component.
- `repository_status`: Repository status as it is described in [repostatus.org](https://www.repostatus.org/).
@@ -118,11 +122,7 @@ SOMEF aims to recognize the following categories (in alphabetical order):
- `type`: Software type: Commandline Application, Notebook Application, Ontology, Scientific Workflow. Non-Software types: Static Website, Uncategorized
- `usage`: Usage examples and considerations of a code repository.
- `workflows`: URL and path to the computational workflow files present in the repository.
- `homepage`: URL to the homepage of the software or organization.
- `reference_publication`: URL to the paper associated with the code repository.
- `package_id`: Identifier extracted from packages. (e.g., `packages.json`)
- `funding`: Funding code for the related project.
- `has_package_file`: Specifies what package file is present in the code repository.


The following table summarizes the properties used to describe a `category`:

@@ -211,22 +211,24 @@ The table below summarizes all types and their corresponding properties:
| **zipball_url** | Release | Url | URL to the zip file where to download a software release | -->


The tables below summarizes all types and their corresponding properties-
The tables below summarize all types and their corresponding properties.
The following object types are currently supported (aligned with Schema.org and CodeMeta vocabularies):

An AGENT has the following properties:
An Agent has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **affiliation** | String | name of organization or affiliation |
| **email** | String | Email of an author |
| **family_name** | String | Last name of an author |
| **given_name** | String | First name of an author |
| **identifier** | String | id of an agent |
| **name** | String | Name used to designate the person or organization|
| **role** | String | The role of the agent in the development or maintenance of this software component |
| **url** | Url | Uniform resource locator of the resource |
| **affiliation** | String | name of organization or affiliation |
| **identifier** | String | id of an agent |
| **role** | String | role of agent |

An ASSET has the following properties:

An Asset has the following properties:

| Property | Expected value | Definition |
|---|---|---|
@@ -239,25 +241,25 @@ An ASSET has the following properties:
| **url** | Url | Uniform resource locator of the resource |



A LICENSE has the following properties:
A License has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **identifier** | String | id of licence |
| **name** | String | Title or name of the license |
| **spdx_id** | String | Spdx id corresponding to this license |
| **url** | Url | Uniform resource locator of the license |
| **identifier** | String | id of licence |

A PROGRAMMING_LANGUAGE has the following properties:

A Programming_language has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **name** | String | Name of the language |
| **size** | Integer | File size content (bytes) of a code repository using a given programming language |


A PUBLICATION has the following properties:
A Publication has the following properties:

| Property | Expected value | Definition |
|---|---|---|
@@ -267,7 +269,7 @@ A PUBLICATION has the following properties:
| **url** | Url | Uniform resource locator of the resource |


A RELEASE has the following properties:
A Release has the following properties:

| Property | Expected value | Definition |
|---|---|---|
@@ -285,45 +287,51 @@ A RELEASE has the following properties:
| **zipball_url** | Url | URL to the zip file where to download a software release |


A REQUIREMENT has the following properties:
A Requirement has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **name** | String | Name of the requirement |
| **version** | String | named version of a requirement |
| **dependency_type** | String | type: dev, runtime... Indicates whether the dependency is required at runtime or only for development/testing |
Collaborator

List all dependency types we support.
We support 3: development, runtime and documentation

| **dependency_resolver** | String | Identifies the ecosystem or package manager that resolves the dependency (e.g., npm, bower, pip, python, poetry, pdm, cargo, julia, maven, publicode).|
| **name** | String | Name of the requirement |
| **version** | String | named version of a requirement |
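
To make the Requirement properties concrete, here is a small Python sketch of how one such entry could be modelled and checked. It is an illustration under stated assumptions, not SOMEF code: the `Requirement` dataclass and the `SUPPORTED_DEPENDENCY_TYPES` set are hypothetical names, and the three accepted strings follow the review comment (development, runtime, documentation), whereas the table row abbreviates the first as "dev".

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings, taken from the review comment rather than from SOMEF's source.
SUPPORTED_DEPENDENCY_TYPES = {"development", "runtime", "documentation"}


@dataclass
class Requirement:
    """Hypothetical container mirroring the four documented properties."""

    name: str
    version: Optional[str] = None
    dependency_type: Optional[str] = None
    dependency_resolver: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject dependency types outside the assumed supported set.
        if (
            self.dependency_type is not None
            and self.dependency_type not in SUPPORTED_DEPENDENCY_TYPES
        ):
            raise ValueError(f"unsupported dependency_type: {self.dependency_type!r}")


# Illustrative values only.
req = Requirement(
    name="numpy",
    version=">=1.24",
    dependency_type="runtime",
    dependency_resolver="pip",
)
```

Validating at construction time keeps an unrecognized `dependency_type` from silently reaching the exported metadata.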

A RUNTIME_PLATFORM has the following properties:

A Runtime_platform has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **name** | String | Name of the runtime platform (e.g., Java) |
**version** | String | version of the runtime platform |
| **value** | String | name and version of the runtime platform |
| **version** | String | version of the runtime platform |


A SCHOLARLY_ARTICLE has the following properties:
A Scholarly_article has the following properties:

| Property | Expected value | Definition |
|---|---|---|
| **authors** | List of Agent| List of authors responsible for the publication, providing structured metadata for each |
| **date_published** | String | Date when the article or citation was officially published. |
| **doi** | String | Digital Object Identifier (DOI) of the reference, usually returned as a full URL.|
| **journal** | String | Journal where the publication appeared |
| **pages** | String | Page range of the publication |
| **title** | String | Title of reference or citation |
| **value** | String | Title of reference or citation |
| **url** | String | Link to reference or citation |
| **date_published** | String | date of publication reference or citation |
| **doi** | String | Identifier of reference|
| **value** | String | Title of reference or citation |
| **year** | Number | Year of publication |


A SOFTWARE_APPLICATION has the following properties:
A Software_application has the following properties:
Collaborator

This one does not make a lot of sense. What is the difference between this and Requirement?


| Property | Expected value | Definition |
|---|---|---|
| **development_type** | String | runtime or dev |
| **name** | String | Name of the software |
| **value** | String | Name and version of the software |
| **version** | String | version of software |
| **development_type** | String | runtime or dev |

A TEXT_EXCERPT has the following properties:

A Text_excerpt has the following properties:

| Property | Expected value | Definition |
|---|---|---|