Skip to content

New OA tag for storing Orignal Alignment information when modifying records#193

Closed
yfarjoun wants to merge 10 commits intomasterfrom
yf_add_previous_alignment_tag_PA
Closed

New OA tag for storing Orignal Alignment information when modifying records#193
yfarjoun wants to merge 10 commits intomasterfrom
yf_add_previous_alignment_tag_PA

Conversation

@yfarjoun
Copy link
Copy Markdown
Contributor

@yfarjoun yfarjoun commented Mar 2, 2017

This could be used in GATK's IndelRealignment, but the main use I have for it is in Picard's MergeBamAlignment where it now unmaps reads that look like bacterial contamination that got mapped due to a match of 20 or so bases randomly. Since this unmaping is lossy (as one can't keep original alignment information in place) I would like to store the alignment information in a tag. I used the same format as the OA tag.

@nh13
Copy link
Copy Markdown
Member

nh13 commented Mar 2, 2017

I'd be bit hesitant to include this tag in the spec, as it seems very new, and while potentially useful, may be better suited to be in a user-defined tag. I am not sure it has "general interest" at the moment I'd welcome the opinions of the spec maintainers and the community.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

@nh13 I'm confused about your reaction. You have a 👍 on the description and a "hesitation" in a comment.

In Picard's MergeBamAlignment the alignment information is removed when it suspects a read as arising from cross-species contamination (by looking at the number of mapped bases and the fact that it is soft-clipped on both ends

I can foresee other post alignment modifications to happen in pipelines (for example in 10X pipeline there are multiple alignment events and the PA tag could store the original alignment information, When looking for SV's one can reach conclusion about the alignment that are different from the original)

We've been asked to add this tag so that the information about the previous alignment isn't lost when the data passes through our pipeline. I'm happy to put it in a user-defined tag for now, but we are generating thousands of WGS genomes that will become available for others to use, and the cost of re-tagging might be onerous....

@magicDGS
Copy link
Copy Markdown
Member

I'd like to have this tag for evaluate performance of realignment algorithms, and I think that it will be something useful also for other purposes.

@nh13
Copy link
Copy Markdown
Member

nh13 commented Apr 23, 2017

@yfarjoun hesitation is not a "no", so I am board given I can see multiple uses now. While one could store the "previous alignment" in a secondary record, and so the use case here could be to store the same information as the "SA" tag for secondary alignments rather than supplementary alignments". If you would agree, then could we call it "SR" or "S2" for secondary alignment? This would allow it to be used more broadly than something that is a "previous" alignment. What do you think?

@yfarjoun
Copy link
Copy Markdown
Contributor Author

I have no problem with making it a more general tag. I think that the secondary vs. supplementary nomenclature is bad enough already...so I'd like to avoid using these terms in the tag name and description

So....how about:

AA (Alternative/Additional Alignment)
XA (eXtra Alignment) bonus points for using X?
DA (Different Alignment)

@d-cameron
Copy link
Copy Markdown
Contributor

I was under the impression that the proposed "previous alignment" meant "alignment that this record had at an earlier stage in the pipeline". What's wrong with the existing OC and OP tags? Adding OR ("original reference") for completeness would make more sense than adding a new field with almost the same meaning.

@d-cameron
Copy link
Copy Markdown
Contributor

As for AA/XA/DA on secondary/supplmentary, isn't that already covered by the CC, CP, HI, and IH tags?

@d-cameron
Copy link
Copy Markdown
Contributor

As it stands, the difference between secondary and supplementary records is sane if you assume that each record is part of at most one potential alignment. That is, if you have a 100bp read with the first 50 bases aligning to a single location, and the next 50 aligning to two possible locations, you would need to write 4 records (not 3): a chimeric alignment record pair for the first possible alignment, and a chimeric alignment record pair for the second possible alignment (even though the first record in those pairs differ only by their SA,HI,CC,CP tags).

@yfarjoun
Copy link
Copy Markdown
Contributor Author

yfarjoun commented Apr 27, 2017 via email

@d-cameron
Copy link
Copy Markdown
Contributor

In which case, wouldn't it make more sense to deprecate CC, CP, OC, OP and replace them with something like CA/OA?

At least that will partially resolve the outstanding issue of CC/CP secondary alignment ambiguity cause by record with the same starting position but different CIGARs.

@jkbonfield
Copy link
Copy Markdown
Contributor

@d-cameron points are good here regarding either deprecating OC/OP (CC/CP are a different beast I think) or adding OR and maybe OF (flags? which includes the strand).

I like the suggested syntax of PA better than a raft of new tags, but am unsure if it is easier to do that than deprecate existing ones.

Regarding Alternative Alignments (AA) - AA is much better than XA which is a private tag (and used!). Making this general purpose seems nice, but I'll raise a query. What happens if an aligner emits the best alignment with alternatives in AA, and then we want to remove this alignment via MergeBamAlignment? I assume we add the current alignment to the front of the list of AAs with some notion that undoing this implicitly means picking the first of the alternatives provided?

If we use AA then I can see it being used instead of secondary alignments in some situations. We then have 3 choices: emit best only, emit best only but report others, emit all (one primary and many secondary). The latter allows random access, but maybe we don't really need that and switching from AA to real records and back again is possible albeit requiring a sort stage.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

I'm happy to deprecate OC and OP, if that's the preferred option.

I think that our tag space is sufficiently small that having (rname ,pos ,strand ,CIGAR ,mapQ ,NM) encoded as OR, OP, OS, OC, OM, and ON seems like an excessive amount of tags to spend on this usecase....

@yfarjoun
Copy link
Copy Markdown
Contributor Author

I'm going to change the name to OA (OriginalAlignment), allow missing elements as dots . and deprecate OC and OP since they can be encoded using OA.

Does that satisfy folks?

@jkbonfield
Copy link
Copy Markdown
Contributor

+0.99 :)

I'm happy with the idea of deprecating tags. We should keep the original text, but add a statement like "deprecated in favour of OA" so over time people can switch. We'll just have to support both though.

As far as missing elements as dots, why not just have them absent as we already have a comma separated list. eg OA:Z:foo,100,+,*,,,.

Also note technically this expands on an (already existing) limitation on reference names - that they should not contain commas (we should have used space in SA or fixed @SQ lines back then, but this ship has sailed). Right now we only exclude * = for the first character and accept any non-whitespace printable for subsequent characters, but this has been overly liberal for far too long and already breaks other tags. We should perhaps update the main spec to disallow comma, or at very least strongly recommend against it with an indication of what will break if it is used.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

Don't get me started on punctuation marks in reference names... 😄

You are right. no need for dots. Just missing since we have commas.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

Anyone care to chime in? I've responded to the concerns and would like this to be merged...

@jkbonfield
@d-cameron
@magicDGS

@jkbonfield
Copy link
Copy Markdown
Contributor

+1 for the text and substantive change.

I think there is a minor oddity in the formatting, but it's still valid. The SA:Z tag uses

\tagregex{{\tt (}\emph{rname}{\tt ,}\emph{pos}{\tt ,}\emph{strand}{\tt ,}\emph{CIGAR}{\tt ,}\emph{mapQ}{\tt ,}\emph{NM}{\tt ;)}+}

while your PA:Z tag uses:

\tagregex{({\tt}\emph{rname}{\tt,}\emph{pos}{\tt,}\emph{strand}{\tt,}\emph{CIGAR}{\tt,}\emph{mapQ}{\tt,}\emph{NM}{\tt;})+}

Note the ( and ) are in \tt blocks for SA and not in PA. I think this is an accident given the {\tt} nop. Also minor differences in spacing before the commas. We should be consistent between both for formatting.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

Thanks @jkbonfield. fixed.

@d-cameron
Copy link
Copy Markdown
Contributor

d-cameron commented Aug 18, 2017

Looks sane to me and I'm happy with the change although some clarification regarding the following would be nice:

  • Original record was unmapped. Can be resolved by explicitly stating that unmapped and unknown should both be treated as comma padding.

    • This ambiguity is potentially an issue. Does anyone actually ever revert realignments?
  • What should be done to hard clipping in the original alignment

    • I am strongly in favour of realignment tools writing hard clipping CIGARs consistent with the original record (not exactly the same as the ends swap if the alignment strand is different).
    • GATK indel realignment breaks GRIDSS because it strips the hard clips from split read alignments and I am unable to reconstruct the full sequence from the CIGAR because I don't know which subsequence of the full read this alignment is for.

The above is really a special case of a larger issue:

  • What should be done to other SAM records that refer to this record? It would be nice if the specifications required tools that rewrote alignments to also updated the following fields in record from the same template:
    • FLAG 0x2
    • FLAG 0x8
    • FLAG 0x20
    • RNEXT field
    • TNEXT field
    • TLEN field
    • CC tag
    • CP tag
    • MC tag
    • MD tag
    • MQ tag
    • NM tag
    • SA tag
    • NM tag

Ones that that should be rewritten but you could make an argument for them not be to are:

  • AS tag
  • H0 tag
  • H1 tag
  • H2 tag
  • NH tag
  • PG tag
  • PT tag
  • SM tag
  • U2 tag
  • UQ tag

Inconsistent data in the SAM inputs files have been the most common type of issue raised with GRIDSS. Maybe I should refresh my proposal for a "SAM strict" sub-specifications subset that requires the file to actually make sense and not just be syntactically valid.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Aug 18, 2017

I think really we need some form of SAM header field to indicate which fields may be invalid, if "invalid" is even the correct term in some cases.

Eg if we realign a read then the RNEXT/PNEXT/TLEN field may become false. How do we correct this? The other end may be many GBs away! We'd have to keep track while streaming the entire file, saving the modified read names to disk and then have a second pass to correct it (as the other end may have already been written out). The upshot is that you turn a relatively simple task into a nightmarishly complex one, plus it adds a large time penalty where typically no one cares.

A similar issue is what happens after filtering. If we remove a record with, eg, samtools view -F, or perform a sub-region query, then it may still be listed in the pnext/rnext daisy chain. Technically the other fields are still correct as we're looking at a subset of the data, but if we attempt to use them we may become stuck.

Most tags are a bit easier. If we realign something then we can just discard MD/NM (faster than recomputing, and we may not even have been using the reference). However some of them are still "next hit" and similar which also break on sub-setting.

The right solution here I think is to indicate that these fields maybe incorrect rather than to correct them. If you actually care then you'll need to name collate, run fixmates, and sort back again (etc).

@d-cameron
Copy link
Copy Markdown
Contributor

Eg if we realign a read then the RNEXT/PNEXT/TLEN field may become false

I've collated the full list of possibly invalidated spec-defined fields and realignment possibly breaks over half the fields in the sam tags spec.

If you actually care then you'll need to name collate, run fixmates, and sort back again (etc).

GRIDSS does care and actually fixes most of the regenerable tags but it still dies when record adjustment results in loss of information that you can't just fix (such loss of hard clipping information, or alignment-related information that the realignment did something different with/to). In some cases, you actually do need the tool changing to make the adjustments to the other fields to prevent loss of information.

In any case, +1 for this PR from me.

@jkbonfield
Copy link
Copy Markdown
Contributor

Completely agree with the issue of hard clipping information being lost. That's the sort of thing that this PR is aiming at solving, although a realigner could infact preserve hard clips even without filling out a PA tag.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

As usual, I'm a little confused as to when I am allowed to hit the "merge" button...I'm hoping it will be clearer here than with other PR.. :-)

@magicDGS
Copy link
Copy Markdown
Member

👍 for me... I think that consistency problems may be addressed in a different issue/PR, because they are clearly important but not only related to this change.

@jkbonfield
Copy link
Copy Markdown
Contributor

@yfarjoun As far as I'm concerned, when there is agreement between spec maintainers then you can go ahead and merge. You fixed the very minor formatting issues and I'd already thumbed up, so it's an OK from me.

The subsequent comments I think are best dealt with in a new issue if we wish to tackle it, but to be honest I'm not really sure what can be done without fundamentally breaking a lot of algorithms and making them complex. As I said, about the only workable thing I can think of is a way to indicate when fields have been invalidated, but that would be a new PR if someone wished to champion the cause.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

So I'm rebasing and merging...OK?

@yfarjoun yfarjoun force-pushed the yf_add_previous_alignment_tag_PA branch from 50f0415 to 2253e92 Compare August 21, 2017 11:25
@yfarjoun
Copy link
Copy Markdown
Contributor Author

@d-cameron are you still unhappy enough with this PR to maintain the 👎 you placed on it?

@yfarjoun
Copy link
Copy Markdown
Contributor Author

yfarjoun commented Aug 21, 2017 via email

@yfarjoun
Copy link
Copy Markdown
Contributor Author

done.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

sorry about that....removed the erroneous change.

Table entries not capitalised.
Format regexp as per CT/PT and remove misleading whitespace.
Wordsmithing. Fix quote mark typo.
"Deprecated by" gets the actor wrong (it's us, not the tag).
Should be \subsubsection, and it's now October.
@yfarjoun
Copy link
Copy Markdown
Contributor Author

let's merge this!

@yfarjoun
Copy link
Copy Markdown
Contributor Author

👍

@jkbonfield
Copy link
Copy Markdown
Contributor

Agreed in principle. I'd like to tidy up the wording marginally, but frankly it can be done at merge time.

"MAPQ is a numeric string" is a bit meaningless (and we don't say "numeric string" for POS). Probably just "MAPQ as a numeric mapping quality". Numeric helps here as we have other qualities which are ASCII val+33.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

so many wordsmiths!!! :-)

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Dec 12, 2018 via email

@yfarjoun
Copy link
Copy Markdown
Contributor Author

@jkbonfield OK to merge now?

@jmarshall
Copy link
Copy Markdown
Member

As discussed in the last couple of meetings (check the minutes), this one is with me for wording improvements. That's why you assigned it to me…

Use "original alignment entry" to be more specific.
Say "textual SAM representations" (but "generally" because STRAND is not
direct from a SAM field); now there's nothing needed for some subfields
so compress the bullet list into a list in a sentence.

Restore useful text in OP even if we're deprecating it.
@jmarshall
Copy link
Copy Markdown
Member

While agreeing with this feature, I have two problems with the wording:

  • It is true that it's not intuitive which value of bit 0x10 corresponds to forward/reverse, but this is not the place to fix that. The place to find that information is in the description of the FLAG field, and PR Add note that SAM alignments are always represented on the forward strand #371 spells it out there.

  • The bullet list of fields takes a lot of space to say not very much. Saying “in their textual SAM representations” once makes this more succinct and IMHO clearer. I've pushed suggested wording in a new commit for your consideration.

@yfarjoun
Copy link
Copy Markdown
Contributor Author

I accept these changes. Thanks @jmarshall!

@yfarjoun
Copy link
Copy Markdown
Contributor Author

I hope this can be merged now....thumbs?

@jmarshall
Copy link
Copy Markdown
Member

Thanks Yossi. It's 👍 from me, happy to squash and merge along with its new buddy #371. @jkbonfield?

jmarshall pushed a commit that referenced this pull request Jan 30, 2019
)

Deprecate OC and OP as they're superseded by the more general OA.
@jmarshall
Copy link
Copy Markdown
Member

Squashed and merged as 9bdcdbc. 🎉

@jmarshall jmarshall closed this Jan 30, 2019
@jmarshall jmarshall added the Merged For marking PRs that have been merged outwith GitHub and are shown as merely “Closed” label Jan 30, 2019
@jmarshall
Copy link
Copy Markdown
Member

@yfarjoun: Please delete the branch when it is no longer relevant.

@yfarjoun yfarjoun deleted the yf_add_previous_alignment_tag_PA branch February 5, 2019 18:21
@yfarjoun
Copy link
Copy Markdown
Contributor Author

yfarjoun commented Feb 5, 2019

sorry, I usually delete when I merge...but you merged..so I expected you to delete. I'll try to remember next year when I get a PR merged 😅

@jmarshall
Copy link
Copy Markdown
Member

Thanks. I tend to expect the PR author to delete their own branch… and the two expectations aren't so compatible 😄.

Let's endeavour to merge the next one in less than a year!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged For marking PRs that have been merged outwith GitHub and are shown as merely “Closed” sam

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants