Skip to content

CreateSequenceDictionary support for alternative names (@SQ-AN)#1127

Merged
yfarjoun merged 5 commits into
broadinstitute:masterfrom
lindenb:pl_dict_aliases
Sep 5, 2018
Merged

CreateSequenceDictionary support for alternative names (@SQ-AN)#1127
yfarjoun merged 5 commits into
broadinstitute:masterfrom
lindenb:pl_dict_aliases

Conversation

@lindenb
Copy link
Copy Markdown
Contributor

@lindenb lindenb commented Mar 1, 2018

Description

this PR add a new option AN in CreateSequenceDictionary to support the alternative name in a dictionary, as defined in the sam spec https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L212

ALT_NAMES=File
AN=File                       Optional file containing the alternative names for the contigs. First column is the
                              original name, the second column is an alternative name. One contig may have more than one
                              alternative name.   Default value: null. 

In the code, I added a Map<String,Set<String>> mapping the contigs to a set of aliases. This map is then used when a SamSequenceRecord is serialized.

a test was added in CreateSequenceDictionaryTest

Related PR in htsjdk : https://github.com/samtools/htsjdk/pull/956/files


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@coveralls
Copy link
Copy Markdown

coveralls commented Mar 1, 2018

Coverage Status

Coverage increased (+0.05%) to 80.099% when pulling 82fcb0b on lindenb:pl_dict_aliases into 3984f99 on broadinstitute:master.

@magicDGS
Copy link
Copy Markdown
Contributor

magicDGS commented Mar 1, 2018

Maybe we can wait until samtools/htsjdk#956 is in, but it's a good idea. I was looking forward that change for a while!

@gbggrant gbggrant requested a review from pshapiro4broad March 7, 2018 18:07
Copy link
Copy Markdown
Member

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many minor nits but overall code seems OK, please address review comments.

" already exists. Delete this file and try again, or specify a different output file.");
}

// map for aliases mapping a contig to its' aliases
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By its' I think you really mean its

}

// map for aliases mapping a contig to its' aliases
final Map<String, Set<String>> contig2aliases = new HashMap<>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I wouldn't use numbers in variable names like this, in the place of the word "to". Looking at other Maps in picard, such maps are typically named by the value, not the key, such as aliases. In cases where both the key and value are mentioned, one convention is to use by, e.g. aliasesByContig.


// fill the alias map
if(this.ALT_NAMES != null) {
try {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of code to insert directly into a method. If the output of this block is the Map<> above, it shouldn't be hard to move it into a method that returns the Map.

final Map<String, Set<String>> contig2aliases = new HashMap<>();

// fill the alias map
if(this.ALT_NAMES != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a space between if and ( here and below

if(this.ALT_NAMES != null) {
try {
// regex defined in the sam spec
final Pattern altNameRegex = Pattern.compile("[0-9A-Za-z][0-9A-Za-z\\*\\+@\\|\\-]*");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiled patterns are thread safe so the usual convention is to store them in a static final. Also since this regex is somewhat confusing, either a better explanation would be helpful, or a comment that links to the place in the spec where this regex is specified.

if(contig2aliases.containsKey(samSequenceRecord.getSequenceName())) {
final Set<String> aliases = contig2aliases.get(samSequenceRecord.getSequenceName());
// "Alternative names is a comma separated list of alternative names"
samSequenceRecord.setAttribute("AN",String.join(",",aliases)); //TODO replace "AN" with constants/methods: https://github.com/samtools/htsjdk/pull/956/files
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, end-of-line comments are hard to read/maintain, it's better to put them on their own line. If you had a static final for the magic string "AN" then that would be a good place for a comment like this.

if(contig2aliases.containsKey(samSequenceRecord.getSequenceName())) {
final Set<String> aliases = contig2aliases.get(samSequenceRecord.getSequenceName());
// "Alternative names is a comma separated list of alternative names"
samSequenceRecord.setAttribute("AN",String.join(",",aliases)); //TODO replace "AN" with constants/methods: https://github.com/samtools/htsjdk/pull/956/files
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to have spaces after ,s here


@Test
public void testAltNames() throws Exception {
final File altFile = File.createTempFile("CreateSequenceDictionaryTest.", ".alt");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first arg to createTempFile() shouldn't end in a .


// check chrM
ssr = dict.getSequence("chrM");
Assert.assertNull(ssr, "chrM presnt in dictionary");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo presnt

if(contig2aliases.keySet().stream().
filter(K->!K.equals(contigName)). // not an error is defined twice for same contig
flatMap(K->contig2aliases.get(K).stream()).
anyMatch(S->S.contains(contigName))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid the flatMap here by using contains on each set and then using findFirst. I can write something up if this doesn't make sense.

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented Mar 8, 2018

@pshapiro4broad thank you for the review ! :-)
back to you.

#1127 (comment) fixed typo
#1127 (comment) replace contig2aliases by aliasesByContig
#1127 (comment) moved code to its own function
#1127 (comment) moved regex to static final
#1127 (comment) moved the 'if(' to 'if ('
#1127 (comment) moved "AN" to a static final String with comment about Todo
#1127 (comment) add spaces
#1127 (comment) remove dot in tmp name
#1127 (comment) fixed typo
#1127 (comment) remove flatMap

@gbggrant gbggrant assigned gbggrant and yfarjoun and unassigned gbggrant Mar 8, 2018
@gbggrant gbggrant requested a review from yfarjoun March 8, 2018 12:14
@gbggrant
Copy link
Copy Markdown
Contributor

gbggrant commented Mar 8, 2018

@yfarjoun look ok?

*/
private Map<String, Set<String>> loadContigAliasesMap() throws PicardException {
// return an empty map if no mapping file was provided
if (this.ALT_NAMES == null) return Collections.emptyMap();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the formatting changes from my first set of comments:

  • if statements should be multi-line and have { } around the body
  • for( should be written as for (
  • catch should be wrapped up to the line before it, as } catch (...) {
  • // comments should have a space between the // and the first letter of the comment

if (this.ALT_NAMES == null) return Collections.emptyMap();
// the map returned by the function
final Map<String, Set<String>> aliasesByContig = new HashMap<>();
try {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spacing looks odd here; try should match up with the line before it, and the body of the try appears to have the wrong indent level.

for (ReferenceSequence refSeq = refSeqFile.nextSequence(); refSeq != null; refSeq = refSeqFile.nextSequence()) {
final SAMSequenceRecord samSequenceRecord = makeSequenceRecord(refSeq);
//add aliases
if (aliasesByContig.containsKey(samSequenceRecord.getSequenceName())) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in my previous comment, you can avoid hashing twice by calling .get() and then checking for null instead of using containsKey(). You only need to use containsKey() if your map might contain null as a value, and you wish to detect that case.

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented Mar 9, 2018

@pshapiro4broad thanks again.
back to you.
I hope I've fixed all your points.

Copy link
Copy Markdown
Member

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another batch of formatting and minor code comments.

If you use an IDE like intellij you can use it to format your Java code for you. I think all the comments I made about formatting follow the default intellij formatting settings for Java.

final Map<String, Set<String>> aliasesByContig = new HashMap<>();
try {
for (final String line :IOUtil.slurpLines(this.ALT_NAMES)) {
if (StringUtil.isBlank(line)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be indented four spaces from the for

// the map returned by the function
final Map<String, Set<String>> aliasesByContig = new HashMap<>();
try {
for (final String line :IOUtil.slurpLines(this.ALT_NAMES)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a space after the :

// check alias not previously defined as contig
if (aliasesByContig.containsKey(altName)) {
throw new IOException("alternate name " + altName +
"previously defined as a contig in " + line);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after the "

filter(K->!K.equals(contigName)).
anyMatch(K->aliasesByContig.get(K).contains(contigName))) {
throw new IOException("contig " + contigName +
"previously defined as an alternate name in " + line);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after the "

continue;
}
final int tab = line.indexOf('\t');
if (tab == -1 ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space between -1 and )

if (tab == -1 ) {
throw new IOException("tabulation missing in " + line);
}
final String contigName = line.substring(0,tab);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space after the ,

// not an error if defined twice for same contig
filter(K->!K.equals(contigName)).
anyMatch(K->aliasesByContig.get(K).contains(contigName))) {
throw new IOException("contig " + contigName +
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent is incorrect here. The throw should be indented four spaces from the if that contains it.

// check contig not previously defined as alias
if (aliasesByContig.keySet().stream().
// not an error if defined twice for same contig
filter(K->!K.equals(contigName)).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need spaces around -> operator, also on the next line.


/**
* Regular expression defined in the sam spec. Any alternative contig should match this regular expression
* TODO: replace the pattern with a constant : see https://github.com/samtools/htsjdk/pull/956/files
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be the same as SAMSequenceRecord.ALTERNATIVE_SEQUENCE_NAME_REGEXP? It's not the same, it has a | character in it.

If it is supposed to be the same, you should make the one in SamSequenceRecord public and use it instead of defining another copy here.

* Regular expression defined in the sam spec. Any alternative contig should match this regular expression
* TODO: replace the pattern with a constant : see https://github.com/samtools/htsjdk/pull/956/files
*/
private static final Pattern ALTERNATIVE_CONTIG_NAME_PATTERN = Pattern.compile("[0-9A-Za-z][0-9A-Za-z\\*\\+@\\|\\-]*");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \\ characters in this string aren't necessary since the characters * + and | do not require quoting. - would require quoting but doesn't since it occurs last in the character class.

@Argument(doc = "Stop after writing this many sequences. For testing.")
public int NUM_SEQUENCES = Integer.MAX_VALUE;

@Argument(shortName = "AN", doc = "Optional file containing the alternative names for the contigs. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain how this input will be used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean in the documentation...something like: "The alternative names will be put into the appropriate @AN annotation for each contig". Also, please state how the "columns" need to be separated ("tab", I presume) and whether the columns should have a "header" (I presume that they do not)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun I added the sentences.

@yfarjoun
Copy link
Copy Markdown
Contributor

@lindenb are you going to respond to the comments? I know it may seem like an arduous process, but we would like to include your contribution!

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented May 27, 2018

@yfarjoun sorry, I forgot about that one, I'll try to answer tomorrow.

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented May 28, 2018

@yfarjoun @pshapiro4broad

Thank you again for your reviews and sorry for my late answer

I fixed the following points, Back to you.

#1127 (comment) indented four spaces from the for
#1127 (comment) missing space ':'
#1127 (comment) missing space
#1127 (comment) missing space
#1127 (comment) remove extra space
#1127 (comment) need space after the ,
#1127 (comment) fix indent
#1127 (comment) need space

#1127 (comment)

Is this supposed to be the same as SAMSequenceRecord.ALTERNATIVE_SEQUENCE_NAME_REGEXP? It's not the same, it has a | character in it.

I just checked: there is a '|' in the SAM spec. https://samtools.github.io/hts-specs/SAMv1.pdf

#1127 (comment)

The \ characters in this string aren't necessary since the characters * +

they' required by the java compiler because + is treated as invalid escape sequence

#1127 (comment) Explain how this input will be used.

@jmarshall
Copy link
Copy Markdown

Regarding ALTERNATIVE_SEQUENCE_NAME_REGEXP: indeed, | is listed in the spec, very intentionally so that NCBI-style pipe-containing contig names can be relegated to being aliases.

Seems like the regexp is missing . and _ compared to the one in the SAM spec.

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented May 28, 2018

test failed for a travis-related problem IMHO

@magicDGS
Copy link
Copy Markdown
Contributor

I updated samtools/htsjdk#956 to reflect the regexp changes too. You can use that one, @lindenb

@lindenb
Copy link
Copy Markdown
Contributor Author

lindenb commented May 28, 2018

@magicDGS 👍 thanks ! I suppose I've to wait for your PR to be merged

@yfarjoun
Copy link
Copy Markdown
Contributor

I agree that the failures in travis are not due to the PR. The master branch is currently failing...there's a PR to fix that ( #1178 )

@yfarjoun
Copy link
Copy Markdown
Contributor

yfarjoun commented Sep 5, 2018

👍 thanks @lindenb

Approved with PullApprove

@yfarjoun yfarjoun merged commit c827f0c into broadinstitute:master Sep 5, 2018
@lindenb lindenb deleted the pl_dict_aliases branch September 5, 2018 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants