Add support for SQ-AN in SAMSequenceDictionary #956#1474
Conversation
lbergelson
left a comment
There was a problem hiding this comment.
A few minor comments.
It would be good to have a test that shows we can actually execute bam queries against alternate contig names successfully.
| public Set<String> getAlternativeSequenceNames() { | ||
| final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG); | ||
| return (anTag == null) ? Collections.emptySet() | ||
| : Collections.unmodifiableSet(new HashSet<>(Arrays.asList(anTag.split(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR)))); |
There was a problem hiding this comment.
This shouldn't be coerced through a HashSet, it will shuffle the order. Use a LinkedHashSet instead.
| } | ||
| } | ||
|
|
||
| private void encodeAltSequences(final Collection<String> alternativeSequences) { |
There was a problem hiding this comment.
I think you could probably save a bunch of code/ potential error paths if the error checking / deduplication is all done here.
Something like:
encoded == alternativeSequences.isEmpty() ? null : alternativeSequences.stream()
.sorted()
.distinct()
.peek(SAMSequenceRecord::validateAltRegExp)
.collect(Collectors.joining(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR));
There was a problem hiding this comment.
I did it...but I'm not sure what code you think I can remove now....
There was a problem hiding this comment.
I thought you could remove the various guards in the code that call into this and let this be the single source of error checking.
| */ | ||
| public Set<String> getAlternativeSequenceNames() { | ||
| final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG); | ||
| return (anTag == null) ? Collections.emptySet() |
There was a problem hiding this comment.
Should we worry about empty string here?
There was a problem hiding this comment.
empty string shouldn't be valid....
There was a problem hiding this comment.
according to the regexp. I added a test to that effect
| * @return dictionary consisting of the same sequences as the two inputs with the merged values of tags. | ||
| */ | ||
| static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1, | ||
| public static SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1, |
There was a problem hiding this comment.
Hmn, do we have to worry about alias name collisions when merging dicionaries?
There was a problem hiding this comment.
as written, it will only merge dictionaries if all the tags agree (when present). this (implicitly) includes the AN tag, so the two dictionaries will have to have the same alternative names. So as written it is safe.
If we want the method to merge the following contigs:
dict1:
@sn ID=name1 AN=n1
and dict2:
@sn ID=n1 AN=name1
or even with dict3:
@sn ID=nana1 AN=name1
then some additional smarts would have to be written....
There was a problem hiding this comment.
That's probably a good idea provided the md5's are the same, but it can be a different pr...
- responding to review comments - whitespace - unmodifiable sets - unify tests for sequence names and alternative names.
- responding to review comments - whitespace - unmodifiable sets - unify tests for sequence names and alternative names. - YIKES: removed potential problem when providing list of SAMSequenceRecords (didn't make private copy of list)
92ee55a to
aee6964
Compare
Codecov Report
@@ Coverage Diff @@
## master #1474 +/- ##
===============================================
+ Coverage 69.235% 69.563% +0.329%
- Complexity 8721 9012 +291
===============================================
Files 588 601 +13
Lines 34620 35983 +1363
Branches 5787 6033 +246
===============================================
+ Hits 23969 25031 +1062
- Misses 8367 8603 +236
- Partials 2284 2349 +65 |
|
|
||
| @Test(dataProvider = "testAccessFileWithAlternateContigNameData") | ||
| public void testAccessFileWithAlternateContigName(final String contigName, final int expectedRecords) throws IOException { | ||
| try(SamReader bamReader = SamReaderFactory.make().open(AccessFileWithAlternateContigName); |
There was a problem hiding this comment.
this tests querying a bam file with alternative names
| {"lbracket{"}, | ||
| {"rbracket}"}}; | ||
| {"rbracket}"}, | ||
| {""} |
There was a problem hiding this comment.
this shows that an empty name is illegal
|
back to you @lbergelson |
| setAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG, null); | ||
| } else { | ||
| // validate all the names and encode afterwards | ||
| alternativeSequences.forEach(SAMSequenceRecord::validateAltRegExp); |
There was a problem hiding this comment.
we can remove this validation since it's done encodeAltSequences now
| * Adds an alternative sequence name if it is not the same as the sequence name or it is not present already. | ||
| */ | ||
| public void addAlternativeSequenceName(final String name) { | ||
| validateAltRegExp(name); |
There was a problem hiding this comment.
we can remove this validation since it's done encodeAltSequences now
| if (mSequenceIndex != that.mSequenceIndex) { | ||
| return false; | ||
| } | ||
| // PIC-439. Allow undefined length. |
There was a problem hiding this comment.
Do we know if this is still an issue?
There was a problem hiding this comment.
Not sure what the numbering scheme refers to….and I can't find it online either….however, it does not pertain to this PR…
lbergelson
left a comment
There was a problem hiding this comment.
@yfarjoun I had a minor comment about removing another few lines of unnecessary code. Good to merge when you're ready 👍
This is a replacement of PR #956
see discussion and comments there....
one major difference from that PR: