Add support for SQ-AN in SAMSequenceDictionary #956 #1474

Merged
yfarjoun merged 13 commits into master from yf_dgs_sq_an_tag
Sep 2, 2020

Contributor

@yfarjoun yfarjoun commented Apr 23, 2020

This is a replacement for PR #956; see the discussion and comments there.

One major difference from that PR:

  • In commit 30181fb, I removed a potential problem that arose when a list of SAMSequenceRecords was provided (the code didn't make a private copy of the list).

@yfarjoun yfarjoun requested a review from lbergelson April 23, 2020 21:09
Member

@lbergelson lbergelson left a comment

A few minor comments.

It would be good to have a test that shows we can actually execute bam queries against alternate contig names successfully.

public Set<String> getAlternativeSequenceNames() {
final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG);
return (anTag == null) ? Collections.emptySet()
: Collections.unmodifiableSet(new HashSet<>(Arrays.asList(anTag.split(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR))));
Member

This shouldn't be coerced through a HashSet; it will shuffle the order. Use a LinkedHashSet instead.
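
The ordering point can be illustrated with a minimal, self-contained sketch. It is not the htsjdk implementation: the class and method names are hypothetical, and the comma separator is an assumption taken from the SAM specification's description of the AN tag. A `LinkedHashSet` iterates in insertion order, so the names come back in the order they appear in the tag value.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class AltNameOrderSketch {
    // Assumed separator; the PR refers to it as ALTERNATIVE_SEQUENCE_NAME_SEPARATOR.
    static final String SEPARATOR = ",";

    /** Splits an AN-style tag value into a read-only set that keeps the order of first appearance. */
    static Set<String> parseAlternativeNames(final String anTag) {
        if (anTag == null) {
            return Collections.emptySet();
        }
        // LinkedHashSet preserves insertion order; a plain HashSet gives no ordering guarantee.
        return Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(anTag.split(SEPARATOR))));
    }

    public static void main(final String[] args) {
        // Iteration follows the order in the tag value.
        System.out.println(parseAlternativeNames("chr1,1,NC_000001.11"));
    }
}
```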

}
}

private void encodeAltSequences(final Collection<String> alternativeSequences) {
Member

I think you could probably save a bunch of code/ potential error paths if the error checking / deduplication is all done here.

Something like:

encoded = alternativeSequences.isEmpty() ? null : alternativeSequences.stream()
.sorted()
.distinct()
.peek(SAMSequenceRecord::validateAltRegExp)
.collect(Collectors.joining(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR));

Contributor Author

I did it...but I'm not sure what code you think I can remove now....

Member

I thought you could remove the various guards in the code that call into this and let this be the single source of error checking.
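
A rough sketch of that "single source of error checking" idea follows. The class, the method names, and above all the name pattern are hypothetical stand-ins, not htsjdk's actual code or the regular expression from the SAM specification. Every caller would go through `encode`, so no caller needs its own validation guard. Unlike the suggestion above, this sketch does not sort, so input order is kept.

```java
import java.util.Collection;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AltNameEncodingSketch {
    static final String SEPARATOR = ",";

    // Simplified stand-in pattern, NOT the one htsjdk or the SAM specification uses.
    static final Pattern LEGAL_NAME = Pattern.compile("[0-9A-Za-z][0-9A-Za-z_.|-]*");

    /** Throws if the name is null or does not match the (assumed) legal-name pattern. */
    static String validated(final String name) {
        if (name == null || !LEGAL_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("Invalid alternative sequence name: '" + name + "'");
        }
        return name;
    }

    /** Validates, de-duplicates, and joins the names; returns null when there is nothing to encode. */
    static String encode(final Collection<String> names) {
        if (names == null || names.isEmpty()) {
            return null;
        }
        return names.stream()
                .map(AltNameEncodingSketch::validated) // the only place validation happens
                .distinct()                            // keeps the first occurrence of each name
                .collect(Collectors.joining(SEPARATOR));
    }
}
```

With this shape, a setter or an "add one name" method can pass its collection straight to `encode` and rely on the exception it throws for bad input.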

*/
public Set<String> getAlternativeSequenceNames() {
final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG);
return (anTag == null) ? Collections.emptySet()
Member

Should we worry about empty string here?

Contributor Author

empty string shouldn't be valid....

Contributor Author

according to the regexp. I added a test to that effect
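
For context on why the empty-string question matters, here is a small sketch of how plain `String.split` behaves. The class and method are hypothetical and deliberately skip validation: splitting an empty string yields a one-element array holding the empty string, so a parse that does not validate would report one empty alternative name rather than none.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class EmptyTagSketch {
    /** Naive parse with no validation, to show what an empty tag value would produce. */
    static Set<String> naiveParse(final String anTag) {
        return new LinkedHashSet<>(Arrays.asList(anTag.split(",")));
    }

    public static void main(final String[] args) {
        // "".split(",") is a one-element array containing "", so the set has size 1.
        System.out.println(naiveParse("").size());
    }
}
```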

* @return dictionary consisting of the same sequences as the two inputs with the merged values of tags.
*/
- static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1,
+ public static SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1,
Member

Hmm, do we have to worry about alias name collisions when merging dictionaries?

Contributor Author

yes. :-)

Contributor Author

As written, it will only merge dictionaries if all the tags agree (when present). This implicitly includes the AN tag, so the two dictionaries must have the same alternative names. So, as written, it is safe.

If we want the method to merge the following contigs:

dict1:
@sn ID=name1 AN=n1

and dict2:
@sn ID=n1 AN=name1

or even with dict3:
@sn ID=nana1 AN=name1

then some additional smarts would have to be written....

Member

That's probably a good idea provided the MD5s are the same, but it can be a different PR...
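
One way the "additional smarts" discussed in this thread might look is sketched below, purely as an illustration and not as anything in this PR or in htsjdk. The `Contig` class is a hypothetical stand-in for a sequence record, and the rule (same MD5 and at least one shared name or alias) is an assumption drawn from the two comments above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class AliasOverlapSketch {

    /** Hypothetical, minimal stand-in for a sequence record: primary name, alternative names, MD5. */
    static final class Contig {
        final String name;
        final Set<String> alternativeNames;
        final String md5;

        Contig(final String name, final String md5, final String... alternativeNames) {
            this.name = name;
            this.md5 = md5;
            this.alternativeNames = new LinkedHashSet<>(Arrays.asList(alternativeNames));
        }

        /** The primary name together with every alternative name. */
        Set<String> allNames() {
            final Set<String> all = new LinkedHashSet<>(alternativeNames);
            all.add(name);
            return all;
        }
    }

    /** True when both records carry the same non-null MD5 and share at least one name or alias. */
    static boolean looksLikeSameContig(final Contig a, final Contig b) {
        if (a.md5 == null || !a.md5.equals(b.md5)) {
            return false;
        }
        return !Collections.disjoint(a.allNames(), b.allNames());
    }
}
```

Under this rule the three example contigs above (name1/n1, n1/name1, nana1/name1) would all be treated as the same sequence when their MD5s agree, and as distinct otherwise.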

magicDGS and others added 11 commits June 11, 2020 16:59
- responding to review comments
- whitespace
- unmodifiable sets
- unify tests for sequence names and alternative names.
- responding to review comments
- whitespace
- unmodifiable sets
- unify tests for sequence names and alternative names.
- YIKES: removed potential problem when providing list of SAMSequenceRecords (didn't make private copy of list)

codecov-commenter commented Jun 12, 2020

Codecov Report

Merging #1474 into master will increase coverage by 0.329%.
The diff coverage is 76.471%.

@@               Coverage Diff               @@
##              master     #1474       +/-   ##
===============================================
+ Coverage     69.235%   69.563%   +0.329%     
- Complexity      8721      9012      +291     
===============================================
  Files            588       601       +13     
  Lines          34620     35983     +1363     
  Branches        5787      6033      +246     
===============================================
+ Hits           23969     25031     +1062     
- Misses          8367      8603      +236     
- Partials        2284      2349       +65     
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| .../main/java/htsjdk/samtools/SAMValidationError.java | 96.154% <ø> (ø) | 9.000 <0.000> (ø) | |
| ...c/main/java/htsjdk/samtools/SAMSequenceRecord.java | 78.333% <75.000%> (-3.000%) | 50.000 <36.000> (+11.000) | ⬇️ |
| ...in/java/htsjdk/samtools/SAMSequenceDictionary.java | 75.000% <88.889%> (+1.761%) | 46.000 <3.000> (+1.000) | |
| ...c/main/java/htsjdk/tribble/gff/SequenceRegion.java | 80.556% <0.000%> (-1.797%) | 14.000% <0.000%> (+7.000%) | ⬇️ |
| ...ava/htsjdk/samtools/util/htsget/HtsgetRequest.java | 61.905% <0.000%> (ø) | 34.000% <0.000%> (?%) | |
| .../util/htsget/HtsgetMalformedResponseException.java | 28.571% <0.000%> (ø) | 1.000% <0.000%> (?%) | |
| .../java/htsjdk/samtools/util/htsget/HtsgetClass.java | 100.000% <0.000%> (ø) | 1.000% <0.000%> (?%) | |
| ...va/htsjdk/samtools/util/htsget/HtsgetResponse.java | 91.176% <0.000%> (ø) | 8.000% <0.000%> (?%) | |
| src/main/java/htsjdk/io/HtsPath.java | 51.316% <0.000%> (ø) | 15.000% <0.000%> (?%) | |
| ...sjdk/samtools/util/htsget/HtsgetErrorResponse.java | 80.000% <0.000%> (ø) | 4.000% <0.000%> (?%) | |
| ... and 26 more | | | |
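
The headline percentages in the report appear to follow from dividing hits by total lines (hits, misses, and partials sum to the line count in both columns). A small sketch of that arithmetic, using only the numbers quoted in the report and a hypothetical helper name:

```java
final class CoverageArithmeticSketch {
    /** Coverage as a percentage: covered lines (hits) divided by total lines. */
    static double coveragePercent(final long hits, final long lines) {
        return 100.0 * hits / lines;
    }

    public static void main(final String[] args) {
        final double before = coveragePercent(23969, 34620); // master
        final double after = coveragePercent(25031, 35983);  // with #1474 merged
        // Prints both percentages and their difference to three decimal places.
        System.out.printf("%.3f%% -> %.3f%% (%+.3f%%)%n", before, after, after - before);
    }
}
```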


@Test(dataProvider = "testAccessFileWithAlternateContigNameData")
public void testAccessFileWithAlternateContigName(final String contigName, final int expectedRecords) throws IOException {
try(SamReader bamReader = SamReaderFactory.make().open(AccessFileWithAlternateContigName);
Contributor Author

This tests querying a BAM file with alternative names.
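
As a rough picture of what querying by an alternative name requires, the sketch below maps both primary and alternative names to the same sequence index. It is a hypothetical, self-contained illustration and does not use or mirror the htsjdk API; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class NameIndexSketch {
    private final Map<String, Integer> indexByName = new HashMap<>();

    /** Registers a sequence under its primary name and all of its alternative names. */
    void add(final int index, final String name, final String... alternativeNames) {
        register(name, index);
        for (final String alt : alternativeNames) {
            register(alt, index);
        }
    }

    private void register(final String name, final int index) {
        final Integer previous = indexByName.putIfAbsent(name, index);
        if (previous != null && previous != index) {
            // The same name or alias already points at a different sequence.
            throw new IllegalArgumentException("Name '" + name + "' is already used by sequence " + previous);
        }
    }

    /** Returns the sequence index for a primary or alternative name, or -1 if unknown. */
    int indexOf(final String name) {
        return indexByName.getOrDefault(name, -1);
    }
}
```

A query for "1" and a query for "chr1" would then resolve to the same index, provided "1" was registered as an alternative name of "chr1".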

  {"lbracket{"},
- {"rbracket}"}};
+ {"rbracket}"},
+ {""}
Contributor Author

this shows that an empty name is illegal

@yfarjoun
Contributor Author

back to you @lbergelson

setAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG, null);
} else {
// validate all the names and encode afterwards
alternativeSequences.forEach(SAMSequenceRecord::validateAltRegExp);
Member

We can remove this validation since it's done in encodeAltSequences now.

Contributor Author

👍

* Adds an alternative sequence name if it is not the same as the sequence name or it is not present already.
*/
public void addAlternativeSequenceName(final String name) {
validateAltRegExp(name);
Member

We can remove this validation since it's done in encodeAltSequences now.

Contributor Author

👍

if (mSequenceIndex != that.mSequenceIndex) {
return false;
}
// PIC-439. Allow undefined length.
Member

Do we know if this is still an issue?

Contributor Author

I'm not sure what the numbering scheme refers to, and I can't find it online either. However, it does not pertain to this PR.

Member

@lbergelson lbergelson left a comment

@yfarjoun I had a minor comment about removing another few lines of unnecessary code. Good to merge when you're ready 👍

Member

@lbergelson lbergelson left a comment

👍

@yfarjoun yfarjoun merged commit 5240c40 into master Sep 2, 2020
4 participants