CreateSequenceDictionary support for alternative names (@SQ-AN)#1127
Conversation
|
Maybe we can wait until samtools/htsjdk#956 is in, but it's a good idea. I was looking forward that change for a while! |
pshapiro4broad
left a comment
There was a problem hiding this comment.
Many minor nits but overall code seems OK, please address review comments.
| " already exists. Delete this file and try again, or specify a different output file."); | ||
| } | ||
|
|
||
| // map for aliases mapping a contig to its' aliases |
There was a problem hiding this comment.
By its' I think you really mean its
| } | ||
|
|
||
| // map for aliases mapping a contig to its' aliases | ||
| final Map<String, Set<String>> contig2aliases = new HashMap<>(); |
There was a problem hiding this comment.
In general I wouldn't use numbers in variable names like this, in the place of the word "to". Looking at other Maps in picard, such maps are typically named by the value, not the key, such as aliases. In cases where both the key and value are mentioned, one convention is to use by, e.g. aliasesByContig.
|
|
||
| // fill the alias map | ||
| if(this.ALT_NAMES != null) { | ||
| try { |
There was a problem hiding this comment.
This is a lot of code to insert directly into a method. If the output of this block is the Map<> above, it shouldn't be hard to move it into a method that returns the Map.
| final Map<String, Set<String>> contig2aliases = new HashMap<>(); | ||
|
|
||
| // fill the alias map | ||
| if(this.ALT_NAMES != null) { |
There was a problem hiding this comment.
There should be a space between if and ( here and below
| if(this.ALT_NAMES != null) { | ||
| try { | ||
| // regex defined in the sam spec | ||
| final Pattern altNameRegex = Pattern.compile("[0-9A-Za-z][0-9A-Za-z\\*\\+@\\|\\-]*"); |
There was a problem hiding this comment.
Compiled patterns are thread safe so the usual convention is to store them in a static final. Also since this regex is somewhat confusing, either a better explanation would be helpful, or a comment that links to the place in the spec where this regex is specified.
| if(contig2aliases.containsKey(samSequenceRecord.getSequenceName())) { | ||
| final Set<String> aliases = contig2aliases.get(samSequenceRecord.getSequenceName()); | ||
| // "Alternative names is a comma separated list of alternative names" | ||
| samSequenceRecord.setAttribute("AN",String.join(",",aliases)); //TODO replace "AN" with constants/methods: https://github.com/samtools/htsjdk/pull/956/files |
There was a problem hiding this comment.
In general, end-of-line comments are hard to read/maintain, it's better to put them on their own line. If you had a static final for the magic string "AN" then that would be a good place for a comment like this.
| if(contig2aliases.containsKey(samSequenceRecord.getSequenceName())) { | ||
| final Set<String> aliases = contig2aliases.get(samSequenceRecord.getSequenceName()); | ||
| // "Alternative names is a comma separated list of alternative names" | ||
| samSequenceRecord.setAttribute("AN",String.join(",",aliases)); //TODO replace "AN" with constants/methods: https://github.com/samtools/htsjdk/pull/956/files |
There was a problem hiding this comment.
Need to have spaces after ,s here
|
|
||
| @Test | ||
| public void testAltNames() throws Exception { | ||
| final File altFile = File.createTempFile("CreateSequenceDictionaryTest.", ".alt"); |
There was a problem hiding this comment.
first arg to createTempFile() shouldn't end in a .
|
|
||
| // check chrM | ||
| ssr = dict.getSequence("chrM"); | ||
| Assert.assertNull(ssr, "chrM presnt in dictionary"); |
| if(contig2aliases.keySet().stream(). | ||
| filter(K->!K.equals(contigName)). // not an error is defined twice for same contig | ||
| flatMap(K->contig2aliases.get(K).stream()). | ||
| anyMatch(S->S.contains(contigName))) { |
There was a problem hiding this comment.
I think you could avoid the flatMap here by using contains on each set and then using findFirst. I can write something up if this doesn't make sense.
|
@pshapiro4broad thank you for the review ! :-) #1127 (comment) fixed typo |
|
@yfarjoun look ok? |
| */ | ||
| private Map<String, Set<String>> loadContigAliasesMap() throws PicardException { | ||
| // return an empty map if no mapping file was provided | ||
| if (this.ALT_NAMES == null) return Collections.emptyMap(); |
There was a problem hiding this comment.
Please make the formatting changes from my first set of comments:
ifstatements should be multi-line and have{}around the bodyfor(should be written asfor (catchshould be wrapped up to the line before it, as} catch (...) {//comments should have a space between the//and the first letter of the comment
| if (this.ALT_NAMES == null) return Collections.emptyMap(); | ||
| // the map returned by the function | ||
| final Map<String, Set<String>> aliasesByContig = new HashMap<>(); | ||
| try { |
There was a problem hiding this comment.
Spacing looks odd here; try should match up with the line before it, and the body of the try appears to have the wrong indent level.
| for (ReferenceSequence refSeq = refSeqFile.nextSequence(); refSeq != null; refSeq = refSeqFile.nextSequence()) { | ||
| final SAMSequenceRecord samSequenceRecord = makeSequenceRecord(refSeq); | ||
| //add aliases | ||
| if (aliasesByContig.containsKey(samSequenceRecord.getSequenceName())) { |
There was a problem hiding this comment.
As in my previous comment, you can avoid hashing twice by calling .get() and then checking for null instead of using containsKey(). You only need to use containsKey() if your map might contain null as a value, and you wish to detect that case.
|
@pshapiro4broad thanks again. |
pshapiro4broad
left a comment
There was a problem hiding this comment.
Another batch of formatting and minor code comments.
If you use an IDE like intellij you can use it to format your Java code for you. I think all the comments I made about formatting follow the default intellij formatting settings for Java.
| final Map<String, Set<String>> aliasesByContig = new HashMap<>(); | ||
| try { | ||
| for (final String line :IOUtil.slurpLines(this.ALT_NAMES)) { | ||
| if (StringUtil.isBlank(line)) { |
There was a problem hiding this comment.
should be indented four spaces from the for
| // the map returned by the function | ||
| final Map<String, Set<String>> aliasesByContig = new HashMap<>(); | ||
| try { | ||
| for (final String line :IOUtil.slurpLines(this.ALT_NAMES)) { |
There was a problem hiding this comment.
missing a space after the :
| // check alias not previously defined as contig | ||
| if (aliasesByContig.containsKey(altName)) { | ||
| throw new IOException("alternate name " + altName + | ||
| "previously defined as a contig in " + line); |
There was a problem hiding this comment.
missing space after the "
| filter(K->!K.equals(contigName)). | ||
| anyMatch(K->aliasesByContig.get(K).contains(contigName))) { | ||
| throw new IOException("contig " + contigName + | ||
| "previously defined as an alternate name in " + line); |
There was a problem hiding this comment.
missing space after the "
| continue; | ||
| } | ||
| final int tab = line.indexOf('\t'); | ||
| if (tab == -1 ) { |
There was a problem hiding this comment.
remove extra space between -1 and )
| if (tab == -1 ) { | ||
| throw new IOException("tabulation missing in " + line); | ||
| } | ||
| final String contigName = line.substring(0,tab); |
| // not an error if defined twice for same contig | ||
| filter(K->!K.equals(contigName)). | ||
| anyMatch(K->aliasesByContig.get(K).contains(contigName))) { | ||
| throw new IOException("contig " + contigName + |
There was a problem hiding this comment.
Indent is incorrect here. The throw should be indented four spaces from the if that contains it.
| // check contig not previously defined as alias | ||
| if (aliasesByContig.keySet().stream(). | ||
| // not an error if defined twice for same contig | ||
| filter(K->!K.equals(contigName)). |
There was a problem hiding this comment.
need spaces around -> operator, also on the next line.
|
|
||
| /** | ||
| * Regular expression defined in the sam spec. Any alternative contig should match this regular expression | ||
| * TODO: replace the pattern with a constant : see https://github.com/samtools/htsjdk/pull/956/files |
There was a problem hiding this comment.
Is this supposed to be the same as SAMSequenceRecord.ALTERNATIVE_SEQUENCE_NAME_REGEXP? It's not the same, it has a | character in it.
If it is supposed to be the same, you should make the one in SamSequenceRecord public and use it instead of defining another copy here.
| * Regular expression defined in the sam spec. Any alternative contig should match this regular expression | ||
| * TODO: replace the pattern with a constant : see https://github.com/samtools/htsjdk/pull/956/files | ||
| */ | ||
| private static final Pattern ALTERNATIVE_CONTIG_NAME_PATTERN = Pattern.compile("[0-9A-Za-z][0-9A-Za-z\\*\\+@\\|\\-]*"); |
There was a problem hiding this comment.
The \\ characters in this string aren't necessary since the characters * + and | do not require quoting. - would require quoting but doesn't since it occurs last in the character class.
| @Argument(doc = "Stop after writing this many sequences. For testing.") | ||
| public int NUM_SEQUENCES = Integer.MAX_VALUE; | ||
|
|
||
| @Argument(shortName = "AN", doc = "Optional file containing the alternative names for the contigs. " |
There was a problem hiding this comment.
Explain how this input will be used.
There was a problem hiding this comment.
To be clear, I mean in the documentation...something like: "The alternative names will be put into the appropriate @AN annotation for each contig". Also, please state how the "columns" need to be separated ("tab", I presume) and whether the columns should have a "header" (I presume that they do not)
|
@lindenb are you going to respond to the comments? I know it may seem like an arduous process, but we would like to include your contribution! |
|
@yfarjoun sorry, I forgot about that one, I'll try to answer tomorrow. |
|
Thank you again for your reviews and sorry for my late answer I fixed the following points, Back to you. #1127 (comment) indented four spaces from the for
I just checked: there is a '|' in the SAM spec. https://samtools.github.io/hts-specs/SAMv1.pdf
they' required by the java compiler because + is treated as #1127 (comment) Explain how this input will be used. |
|
Regarding ALTERNATIVE_SEQUENCE_NAME_REGEXP: indeed, Seems like the regexp is missing |
|
test failed for a travis-related problem IMHO |
|
I updated samtools/htsjdk#956 to reflect the regexp changes too. You can use that one, @lindenb |
|
@magicDGS 👍 thanks ! I suppose I've to wait for your PR to be merged |
|
I agree that the failures in travis are not due to the PR. The master branch is currently failing...there's a PR to fix that ( #1178 ) |
|
👍 thanks @lindenb |
Description
this PR add a new option AN in CreateSequenceDictionary to support the alternative name in a dictionary, as defined in the sam spec https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L212
In the code, I added a
Map<String,Set<String>>mapping the contigs to a set of aliases. This map is then used when a SamSequenceRecord is serialized.a test was added in CreateSequenceDictionaryTest
Related PR in htsjdk : https://github.com/samtools/htsjdk/pull/956/files
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests