New OA tag for storing Orignal Alignment information when modifying records by yfarjoun · Pull Request #193 · samtools/hts-specs

yfarjoun · 2017-03-02T13:53:50Z

This could be used in GATK's IndelRealignment, but the main use I have for it is in Picard's MergeBamAlignment where it now unmaps reads that look like bacterial contamination that got mapped due to a match of 20 or so bases randomly. Since this unmaping is lossy (as one can't keep original alignment information in place) I would like to store the alignment information in a tag. I used the same format as the OA tag.

nh13 · 2017-03-02T14:49:40Z

I'd be bit hesitant to include this tag in the spec, as it seems very new, and while potentially useful, may be better suited to be in a user-defined tag. I am not sure it has "general interest" at the moment I'd welcome the opinions of the spec maintainers and the community.

yfarjoun · 2017-04-22T15:43:07Z

@nh13 I'm confused about your reaction. You have a 👍 on the description and a "hesitation" in a comment.

In Picard's MergeBamAlignment the alignment information is removed when it suspects a read as arising from cross-species contamination (by looking at the number of mapped bases and the fact that it is soft-clipped on both ends

I can foresee other post alignment modifications to happen in pipelines (for example in 10X pipeline there are multiple alignment events and the PA tag could store the original alignment information, When looking for SV's one can reach conclusion about the alignment that are different from the original)

We've been asked to add this tag so that the information about the previous alignment isn't lost when the data passes through our pipeline. I'm happy to put it in a user-defined tag for now, but we are generating thousands of WGS genomes that will become available for others to use, and the cost of re-tagging might be onerous....

magicDGS · 2017-04-23T15:19:05Z

I'd like to have this tag for evaluate performance of realignment algorithms, and I think that it will be something useful also for other purposes.

nh13 · 2017-04-23T18:34:27Z

@yfarjoun hesitation is not a "no", so I am board given I can see multiple uses now. While one could store the "previous alignment" in a secondary record, and so the use case here could be to store the same information as the "SA" tag for secondary alignments rather than supplementary alignments". If you would agree, then could we call it "SR" or "S2" for secondary alignment? This would allow it to be used more broadly than something that is a "previous" alignment. What do you think?

yfarjoun · 2017-04-24T01:23:53Z

I have no problem with making it a more general tag. I think that the secondary vs. supplementary nomenclature is bad enough already...so I'd like to avoid using these terms in the tag name and description

So....how about:

AA (Alternative/Additional Alignment)
XA (eXtra Alignment) bonus points for using X?
DA (Different Alignment)

d-cameron · 2017-04-27T01:35:45Z

I was under the impression that the proposed "previous alignment" meant "alignment that this record had at an earlier stage in the pipeline". What's wrong with the existing OC and OP tags? Adding OR ("original reference") for completeness would make more sense than adding a new field with almost the same meaning.

d-cameron · 2017-04-27T01:37:04Z

As for AA/XA/DA on secondary/supplmentary, isn't that already covered by the CC, CP, HI, and IH tags?

d-cameron · 2017-04-27T01:43:43Z

As it stands, the difference between secondary and supplementary records is sane if you assume that each record is part of at most one potential alignment. That is, if you have a 100bp read with the first 50 bases aligning to a single location, and the next 50 aligning to two possible locations, you would need to write 4 records (not 3): a chimeric alignment record pair for the first possible alignment, and a chimeric alignment record pair for the second possible alignment (even though the first record in those pairs differ only by their SA,HI,CC,CP tags).

yfarjoun · 2017-04-27T02:16:22Z

In the proposed PA/whatever tag I store in the same tag the following data: ref,pos:strand,cigar,NM (same as the SA tag). it's too bad that the OC and OP can't be co-opted for this purpose as I would much rather do that then add another tag...

…

On Wed, Apr 26, 2017 at 9:35 PM, Daniel Cameron ***@***.***> wrote: I was under the impression that the proposed "previous alignment" meant "alignment that this record had at an earlier stage in the pipeline". What's wrong with the existing OC and OP tags? Adding OR ("original reference") for completeness would make more sense than adding a new field with almost the same meaning. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#193 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACnk0rViuEd0Np8qws1cZqi-kQ4QmVvNks5rz_DygaJpZM4MRAyQ> .

d-cameron · 2017-04-27T03:01:49Z

In which case, wouldn't it make more sense to deprecate CC, CP, OC, OP and replace them with something like CA/OA?

At least that will partially resolve the outstanding issue of CC/CP secondary alignment ambiguity cause by record with the same starting position but different CIGARs.

jkbonfield · 2017-05-10T09:34:34Z

@d-cameron points are good here regarding either deprecating OC/OP (CC/CP are a different beast I think) or adding OR and maybe OF (flags? which includes the strand).

I like the suggested syntax of PA better than a raft of new tags, but am unsure if it is easier to do that than deprecate existing ones.

Regarding Alternative Alignments (AA) - AA is much better than XA which is a private tag (and used!). Making this general purpose seems nice, but I'll raise a query. What happens if an aligner emits the best alignment with alternatives in AA, and then we want to remove this alignment via MergeBamAlignment? I assume we add the current alignment to the front of the list of AAs with some notion that undoing this implicitly means picking the first of the alternatives provided?

If we use AA then I can see it being used instead of secondary alignments in some situations. We then have 3 choices: emit best only, emit best only but report others, emit all (one primary and many secondary). The latter allows random access, but maybe we don't really need that and switching from AA to real records and back again is possible albeit requiring a sort stage.

yfarjoun · 2017-05-26T17:36:31Z

I'm happy to deprecate OC and OP, if that's the preferred option.

I think that our tag space is sufficiently small that having (rname ,pos ,strand ,CIGAR ,mapQ ,NM) encoded as OR, OP, OS, OC, OM, and ON seems like an excessive amount of tags to spend on this usecase....

yfarjoun · 2017-05-31T13:19:57Z

I'm going to change the name to OA (OriginalAlignment), allow missing elements as dots . and deprecate OC and OP since they can be encoded using OA.

Does that satisfy folks?

jkbonfield · 2017-05-31T13:47:55Z

+0.99 :)

I'm happy with the idea of deprecating tags. We should keep the original text, but add a statement like "deprecated in favour of OA" so over time people can switch. We'll just have to support both though.

As far as missing elements as dots, why not just have them absent as we already have a comma separated list. eg OA:Z:foo,100,+,*,,,.

Also note technically this expands on an (already existing) limitation on reference names - that they should not contain commas (we should have used space in SA or fixed @SQ lines back then, but this ship has sailed). Right now we only exclude * = for the first character and accept any non-whitespace printable for subsequent characters, but this has been overly liberal for far too long and already breaks other tags. We should perhaps update the main spec to disallow comma, or at very least strongly recommend against it with an indication of what will break if it is used.

yfarjoun · 2017-05-31T13:58:14Z

Don't get me started on punctuation marks in reference names... 😄

You are right. no need for dots. Just missing since we have commas.

yfarjoun · 2017-08-18T08:28:07Z

Anyone care to chime in? I've responded to the concerns and would like this to be merged...

@jkbonfield
@d-cameron
@magicDGS

jkbonfield · 2017-08-18T08:46:22Z

+1 for the text and substantive change.

I think there is a minor oddity in the formatting, but it's still valid. The SA:Z tag uses

\tagregex{{\tt (}\emph{rname}{\tt ,}\emph{pos}{\tt ,}\emph{strand}{\tt ,}\emph{CIGAR}{\tt ,}\emph{mapQ}{\tt ,}\emph{NM}{\tt ;)}+}

while your PA:Z tag uses:

\tagregex{({\tt}\emph{rname}{\tt,}\emph{pos}{\tt,}\emph{strand}{\tt,}\emph{CIGAR}{\tt,}\emph{mapQ}{\tt,}\emph{NM}{\tt;})+}

Note the ( and ) are in \tt blocks for SA and not in PA. I think this is an accident given the {\tt} nop. Also minor differences in spacing before the commas. We should be consistent between both for formatting.

yfarjoun · 2017-08-18T10:39:06Z

Thanks @jkbonfield. fixed.

d-cameron · 2017-08-18T12:06:27Z

Looks sane to me and I'm happy with the change although some clarification regarding the following would be nice:

Original record was unmapped. Can be resolved by explicitly stating that unmapped and unknown should both be treated as comma padding.
- This ambiguity is potentially an issue. Does anyone actually ever revert realignments?
What should be done to hard clipping in the original alignment
- I am strongly in favour of realignment tools writing hard clipping CIGARs consistent with the original record (not exactly the same as the ends swap if the alignment strand is different).
- GATK indel realignment breaks GRIDSS because it strips the hard clips from split read alignments and I am unable to reconstruct the full sequence from the CIGAR because I don't know which subsequence of the full read this alignment is for.

The above is really a special case of a larger issue:

What should be done to other SAM records that refer to this record? It would be nice if the specifications required tools that rewrote alignments to also updated the following fields in record from the same template:
- FLAG 0x2
- FLAG 0x8
- FLAG 0x20
- RNEXT field
- TNEXT field
- TLEN field
- CC tag
- CP tag
- MC tag
- MD tag
- MQ tag
- NM tag
- SA tag
- NM tag

Ones that that should be rewritten but you could make an argument for them not be to are:

AS tag
H0 tag
H1 tag
H2 tag
NH tag
PG tag
PT tag
SM tag
U2 tag
UQ tag

Inconsistent data in the SAM inputs files have been the most common type of issue raised with GRIDSS. Maybe I should refresh my proposal for a "SAM strict" sub-specifications subset that requires the file to actually make sense and not just be syntactically valid.

jkbonfield · 2017-08-18T13:30:45Z

I think really we need some form of SAM header field to indicate which fields may be invalid, if "invalid" is even the correct term in some cases.

Eg if we realign a read then the RNEXT/PNEXT/TLEN field may become false. How do we correct this? The other end may be many GBs away! We'd have to keep track while streaming the entire file, saving the modified read names to disk and then have a second pass to correct it (as the other end may have already been written out). The upshot is that you turn a relatively simple task into a nightmarishly complex one, plus it adds a large time penalty where typically no one cares.

A similar issue is what happens after filtering. If we remove a record with, eg, samtools view -F, or perform a sub-region query, then it may still be listed in the pnext/rnext daisy chain. Technically the other fields are still correct as we're looking at a subset of the data, but if we attempt to use them we may become stuck.

Most tags are a bit easier. If we realign something then we can just discard MD/NM (faster than recomputing, and we may not even have been using the reference). However some of them are still "next hit" and similar which also break on sub-setting.

The right solution here I think is to indicate that these fields maybe incorrect rather than to correct them. If you actually care then you'll need to name collate, run fixmates, and sort back again (etc).

d-cameron · 2017-08-20T11:35:55Z

Eg if we realign a read then the RNEXT/PNEXT/TLEN field may become false

I've collated the full list of possibly invalidated spec-defined fields and realignment possibly breaks over half the fields in the sam tags spec.

If you actually care then you'll need to name collate, run fixmates, and sort back again (etc).

GRIDSS does care and actually fixes most of the regenerable tags but it still dies when record adjustment results in loss of information that you can't just fix (such loss of hard clipping information, or alignment-related information that the realignment did something different with/to). In some cases, you actually do need the tool changing to make the adjustments to the other fields to prevent loss of information.

In any case, +1 for this PR from me.

jkbonfield · 2017-08-21T08:20:58Z

Completely agree with the issue of hard clipping information being lost. That's the sort of thing that this PR is aiming at solving, although a realigner could infact preserve hard clips even without filling out a PA tag.

yfarjoun · 2017-08-21T08:23:40Z

As usual, I'm a little confused as to when I am allowed to hit the "merge" button...I'm hoping it will be clearer here than with other PR.. :-)

magicDGS · 2017-08-21T08:27:37Z

👍 for me... I think that consistency problems may be addressed in a different issue/PR, because they are clearly important but not only related to this change.

jkbonfield · 2017-08-21T08:55:14Z

@yfarjoun As far as I'm concerned, when there is agreement between spec maintainers then you can go ahead and merge. You fixed the very minor formatting issues and I'd already thumbed up, so it's an OK from me.

The subsequent comments I think are best dealt with in a new issue if we wish to tackle it, but to be honest I'm not really sure what can be done without fundamentally breaking a lot of algorithms and making them complex. As I said, about the only workable thing I can think of is a way to indicate when fields have been invalidated, but that would be a new PR if someone wished to champion the cause.

yfarjoun · 2017-08-21T11:20:13Z

So I'm rebasing and merging...OK?

yfarjoun · 2017-08-21T11:26:25Z

@d-cameron are you still unhappy enough with this PR to maintain the 👎 you placed on it?

yfarjoun · 2017-08-21T12:20:47Z

Got it. waiting.

…

On Mon, Aug 21, 2017 at 3:01 PM, John Marshall ***@***.***> wrote: Thanks for squashing, which makes review easier at this point. This is *not ready for merge*, comments from me are incoming. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#193 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACnk0lOrD1pt_asvMfdJ4Tbv1-A_U-lAks5saXGagaJpZM4MRAyQ> .

yfarjoun · 2018-09-20T14:18:50Z

done.

yfarjoun · 2018-09-20T16:03:13Z

sorry about that....removed the erroneous change.

Table entries not capitalised. Format regexp as per CT/PT and remove misleading whitespace. Wordsmithing. Fix quote mark typo. "Deprecated by" gets the actor wrong (it's us, not the tag). Should be \subsubsection, and it's now October.

yfarjoun · 2018-12-12T17:16:08Z

let's merge this!

yfarjoun · 2018-12-12T17:16:16Z

👍

jkbonfield · 2018-12-12T17:28:51Z

Agreed in principle. I'd like to tidy up the wording marginally, but frankly it can be done at merge time.

"MAPQ is a numeric string" is a bit meaningless (and we don't say "numeric string" for POS). Probably just "MAPQ as a numeric mapping quality". Numeric helps here as we have other qualities which are ASCII val+33.

yfarjoun · 2018-12-12T17:33:14Z

so many wordsmiths!!! :-)

jkbonfield · 2018-12-12T17:43:23Z

On Wed, Dec 12, 2018 at 09:33:15AM -0800, Yossi Farjoun wrote: so many wordsmiths!!! :-)

Capitalise that s in "so" please. ;-) I'm not sure I like three exclaimation marks in a row either!!!!

…

-- James Bonfield (jkb@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

yfarjoun · 2018-12-20T14:33:46Z

@jkbonfield OK to merge now?

jmarshall · 2018-12-20T15:17:12Z

As discussed in the last couple of meetings (check the minutes), this one is with me for wording improvements. That's why you assigned it to me…

Use "original alignment entry" to be more specific. Say "textual SAM representations" (but "generally" because STRAND is not direct from a SAM field); now there's nothing needed for some subfields so compress the bullet list into a list in a sentence. Restore useful text in OP even if we're deprecating it.

jmarshall · 2019-01-10T12:01:57Z

While agreeing with this feature, I have two problems with the wording:

It is true that it's not intuitive which value of bit 0x10 corresponds to forward/reverse, but this is not the place to fix that. The place to find that information is in the description of the FLAG field, and PR Add note that SAM alignments are always represented on the forward strand #371 spells it out there.
The bullet list of fields takes a lot of space to say not very much. Saying “in their textual SAM representations” once makes this more succinct and IMHO clearer. I've pushed suggested wording in a new commit for your consideration.

yfarjoun · 2019-01-18T15:21:01Z

I accept these changes. Thanks @jmarshall!

yfarjoun · 2019-01-18T15:21:51Z

I hope this can be merged now....thumbs?

jmarshall · 2019-01-29T15:39:44Z

Thanks Yossi. It's 👍 from me, happy to squash and merge along with its new buddy #371. @jkbonfield?

) Deprecate OC and OP as they're superseded by the more general OA.

jmarshall · 2019-01-30T11:23:26Z

Squashed and merged as 9bdcdbc. 🎉

jmarshall · 2019-02-05T18:06:45Z

@yfarjoun: Please delete the branch when it is no longer relevant.

yfarjoun · 2019-02-05T18:23:05Z

sorry, I usually delete when I merge...but you merged..so I expected you to delete. I'll try to remember next year when I get a PR merged 😅

jmarshall · 2019-02-05T19:43:33Z

Thanks. I tend to expect the PR author to delete their own branch… and the two expectations aren't so compatible 😄.

Let's endeavour to merge the next one in less than a year!

mp15 approved these changes Mar 2, 2017

View reviewed changes

yfarjoun force-pushed the yf_add_previous_alignment_tag_PA branch from 50f0415 to 2253e92 Compare August 21, 2017 11:25

nit

f20e1ae

oops

74c8ae6

[SQUASH ME] Formatting tweaks

00ae3ca

Table entries not capitalised. Format regexp as per CT/PT and remove misleading whitespace. Wordsmithing. Fix quote mark typo. "Deprecated by" gets the actor wrong (it's us, not the tag). Should be \subsubsection, and it's now October.

yfarjoun assigned jmarshall Dec 7, 2018

Yossi Farjoun added 3 commits December 20, 2018 09:29

wordsmithing

d7ddd03

update log

c7e6b16

wordsmithing

5d972b6

jmarshall approved these changes Jan 29, 2019

View reviewed changes

jmarshall pushed a commit that referenced this pull request Jan 30, 2019

New OA tag for storing Original Alignment when modifying record (PR #193

9bdcdbc

) Deprecate OC and OP as they're superseded by the more general OA.

jmarshall closed this Jan 30, 2019

jmarshall added the Merged For marking PRs that have been merged outwith GitHub and are shown as merely “Closed” label Jan 30, 2019

yfarjoun deleted the yf_add_previous_alignment_tag_PA branch February 5, 2019 18:21

Conversation

yfarjoun commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nh13 commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yfarjoun commented Apr 22, 2017

Uh oh!

magicDGS commented Apr 23, 2017

Uh oh!

nh13 commented Apr 23, 2017

Uh oh!

yfarjoun commented Apr 24, 2017

Uh oh!

d-cameron commented Apr 27, 2017

Uh oh!

d-cameron commented Apr 27, 2017

Uh oh!

d-cameron commented Apr 27, 2017

Uh oh!

yfarjoun commented Apr 27, 2017 via email

Uh oh!

d-cameron commented Apr 27, 2017

Uh oh!

jkbonfield commented May 10, 2017

Uh oh!

yfarjoun commented May 26, 2017

Uh oh!

yfarjoun commented May 31, 2017

Uh oh!

jkbonfield commented May 31, 2017

Uh oh!

yfarjoun commented May 31, 2017

Uh oh!

yfarjoun commented Aug 18, 2017

Uh oh!

jkbonfield commented Aug 18, 2017

Uh oh!

yfarjoun commented Aug 18, 2017

Uh oh!

d-cameron commented Aug 18, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkbonfield commented Aug 18, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

d-cameron commented Aug 20, 2017

Uh oh!

jkbonfield commented Aug 21, 2017

Uh oh!

yfarjoun commented Aug 21, 2017

Uh oh!

magicDGS commented Aug 21, 2017

Uh oh!

jkbonfield commented Aug 21, 2017

Uh oh!

yfarjoun commented Aug 21, 2017

Uh oh!

yfarjoun commented Aug 21, 2017

Uh oh!

yfarjoun commented Aug 21, 2017 via email

Uh oh!

yfarjoun commented Sep 20, 2018

Uh oh!

yfarjoun commented Sep 20, 2018

Uh oh!

yfarjoun commented Dec 12, 2018

Uh oh!

yfarjoun commented Dec 12, 2018

Uh oh!

jkbonfield commented Dec 12, 2018

Uh oh!

yfarjoun commented Dec 12, 2018

Uh oh!

jkbonfield commented Dec 12, 2018 via email

Uh oh!

yfarjoun commented Dec 20, 2018

Uh oh!

jmarshall commented Dec 20, 2018

yfarjoun commented Mar 2, 2017 •

edited

Loading

nh13 commented Mar 2, 2017 •

edited

Loading

d-cameron commented Aug 18, 2017 •

edited

Loading

jkbonfield commented Aug 18, 2017 •

edited

Loading