New OA tag for storing Original Alignment information when modifying records #193
|
I'd be a bit hesitant to include this tag in the spec, as it seems very new, and while potentially useful, may be better suited to a user-defined tag. I am not sure it has "general interest" at the moment. I'd welcome the opinions of the spec maintainers and the community. |
|
@nh13 I'm confused about your reaction. You have a 👍 on the description and a "hesitation" in a comment. In Picard's MergeBamAlignment the alignment information is removed when it suspects a read as arising from cross-species contamination (by looking at the number of mapped bases and the fact that it is soft-clipped on both ends). I can foresee other post-alignment modifications happening in pipelines (for example, in the 10X pipeline there are multiple alignment events and the PA tag could store the original alignment information; when looking for SVs one can reach conclusions about the alignment that are different from the original). We've been asked to add this tag so that the information about the previous alignment isn't lost when the data passes through our pipeline. I'm happy to put it in a user-defined tag for now, but we are generating thousands of WGS genomes that will become available for others to use, and the cost of re-tagging might be onerous... |
|
I'd like to have this tag to evaluate the performance of realignment algorithms, and I think that it will also be useful for other purposes. |
|
@yfarjoun hesitation is not a "no", so I am on board given I can see multiple uses now. One could also store the "previous alignment" in a secondary record, so the use case here could be to store the same information as the "SA" tag, but for secondary alignments rather than supplementary alignments. If you would agree, then could we call it "SR" or "S2" for secondary alignment? This would allow it to be used more broadly than something that is a "previous" alignment. What do you think? |
|
I have no problem with making it a more general tag. I think that the secondary vs. supplementary nomenclature is bad enough already... so I'd like to avoid using these terms in the tag name and description. So... how about: AA (Alternative/Additional Alignment) |
|
I was under the impression that the proposed "previous alignment" meant "alignment that this record had at an earlier stage in the pipeline". What's wrong with the existing OC and OP tags? Adding OR ("original reference") for completeness would make more sense than adding a new field with almost the same meaning. |
|
As for AA/XA/DA on secondary/supplementary, isn't that already covered by the CC, CP, HI, and IH tags? |
|
As it stands, the difference between secondary and supplementary records is sane if you assume that each record is part of at most one potential alignment. That is, if you have a 100bp read with the first 50 bases aligning to a single location, and the next 50 aligning to two possible locations, you would need to write 4 records (not 3): a chimeric alignment record pair for the first possible alignment, and a chimeric alignment record pair for the second possible alignment (even though the first records in those pairs differ only by their SA, HI, CC, and CP tags). |
|
In the proposed PA/whatever tag I store the following data in a single tag:
ref,pos:strand,cigar,NM (same as the SA tag). It's too bad that the OC and
OP can't be co-opted for this purpose as I would much rather do that than
add another tag...
…On Wed, Apr 26, 2017 at 9:35 PM, Daniel Cameron ***@***.***> wrote:
I was under the impression that the proposed "previous alignment" meant
"alignment that this record had at an earlier stage in the pipeline".
What's wrong with the existing OC and OP tags? Adding OR ("original
reference") for completeness would make more sense than adding a new field
with almost the same meaning.
|
|
In which case, wouldn't it make more sense to deprecate CC, CP, OC, OP and replace them with something like CA/OA? At least that will partially resolve the outstanding issue of CC/CP secondary alignment ambiguity caused by records with the same starting position but different CIGARs. |
|
@d-cameron points are good here regarding either deprecating OC/OP (CC/CP are a different beast I think) or adding OR and maybe OF (flags? which includes the strand). I like the suggested syntax of PA better than a raft of new tags, but am unsure if it is easier to do that than deprecate existing ones. Regarding Alternative Alignments (AA) - AA is much better than XA which is a private tag (and used!). Making this general purpose seems nice, but I'll raise a query. What happens if an aligner emits the best alignment with alternatives in AA, and then we want to remove this alignment via MergeBamAlignment? I assume we add the current alignment to the front of the list of AAs with some notion that undoing this implicitly means picking the first of the alternatives provided? If we use AA then I can see it being used instead of secondary alignments in some situations. We then have 3 choices: emit best only, emit best only but report others, emit all (one primary and many secondary). The latter allows random access, but maybe we don't really need that and switching from AA to real records and back again is possible albeit requiring a sort stage. |
|
I'm happy to deprecate OC and OP, if that's the preferred option. I think that our tag space is sufficiently small that having (rname, pos, strand, CIGAR, mapQ, NM) encoded as OR, OP, OS, OC, OM, and ON seems like an excessive number of tags to spend on this use case... |
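To make the trade-off concrete, here is a minimal sketch of packing the six values into one string-valued tag rather than spending six separate tags. The field order follows the list in the comment above; the trailing semicolon is an assumption borrowed from the SA tag, and `format_alignment_entry` is a hypothetical helper, not part of any library or of the spec text.

```python
def format_alignment_entry(rname, pos, strand, cigar, mapq, nm):
    """Pack one alignment into a single comma-separated entry.

    Assumption: the field order (rname, pos, strand, CIGAR, mapQ, NM) and
    the trailing ';' mirror the SA tag; this is illustrative only.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    fields = [rname, pos, strand, cigar, mapq, nm]
    # One string value carries what would otherwise need six tags.
    return ",".join(str(f) for f in fields) + ";"


entry = format_alignment_entry("chr1", 1000, "+", "50M", 60, 0)
print("OA:Z:" + entry)
```

Several entries could be concatenated in the same value, which is what the semicolon terminator allows in SA.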
|
I'm going to change the name to OA (OriginalAlignment) and allow missing elements as dots. Does that satisfy folks? |
|
+0.99 :) I'm happy with the idea of deprecating tags. We should keep the original text, but add a statement like "deprecated in favour of OA" so over time people can switch. We'll just have to support both though. As far as missing elements as dots, why not just have them absent as we already have a comma separated list. eg Also note technically this expands on an (already existing) limitation on reference names - that they should not contain commas (we should have used space in SA or fixed |
|
Don't get me started on punctuation marks in reference names... 😄 You are right. No need for dots. Just leave them missing, since we have commas. |
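A rough sketch of what "just missing" could look like to a reader of the tag: empty fields between commas become `None`. The six-field layout and the semicolon terminator are assumptions carried over from the SA tag, and `AlignmentEntry` and `parse_entries` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Optional


class AlignmentEntry(NamedTuple):
    rname: Optional[str]
    pos: Optional[int]
    strand: Optional[str]
    cigar: Optional[str]
    mapq: Optional[int]
    nm: Optional[int]


def parse_entries(value: str) -> List[AlignmentEntry]:
    """Parse 'rname,pos,strand,CIGAR,mapQ,NM;' entries; empty fields -> None."""
    entries = []
    for chunk in value.split(";"):
        if not chunk:
            continue  # skip the empty piece after the final ';'
        parts = chunk.split(",")
        if len(parts) != 6:
            # A comma inside a reference name would also end up here.
            raise ValueError(f"expected 6 fields, got {len(parts)}: {chunk!r}")
        rname, pos, strand, cigar, mapq, nm = (p if p else None for p in parts)
        entries.append(AlignmentEntry(
            rname,
            int(pos) if pos is not None else None,
            strand,
            cigar,
            int(mapq) if mapq is not None else None,
            int(nm) if nm is not None else None,
        ))
    return entries
```

Splitting on commas is also why a comma in a reference name is a problem: the entry no longer has six fields and the sketch rejects it.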
|
Anyone care to chime in? I've responded to the concerns and would like this to be merged... |
|
+1 for the text and substantive change. I think there is a minor oddity in the formatting, but it's still valid. The SA:Z tag uses while your PA:Z tag uses: Note the ( and ) are in \tt blocks for SA and not in PA. I think this is an accident given the |
|
Thanks @jkbonfield. fixed. |
|
Looks sane to me and I'm happy with the change although some clarification regarding the following would be nice:
The above is really a special case of a larger issue:
Ones that should be rewritten, although you could make an argument for them not to be, are:
Inconsistent data in SAM input files has been the most common type of issue raised with GRIDSS. Maybe I should refresh my proposal for a "SAM strict" sub-specification that requires the file to actually make sense and not just be syntactically valid. |
|
I think really we need some form of SAM header field to indicate which fields may be invalid, if "invalid" is even the correct term in some cases. Eg if we realign a read then the RNEXT/PNEXT/TLEN field may become false. How do we correct this? The other end may be many GBs away! We'd have to keep track while streaming the entire file, saving the modified read names to disk and then have a second pass to correct it (as the other end may have already been written out). The upshot is that you turn a relatively simple task into a nightmarishly complex one, plus it adds a large time penalty where typically no one cares. A similar issue is what happens after filtering. If we remove a record with, eg, Most tags are a bit easier. If we realign something then we can just discard MD/NM (faster than recomputing, and we may not even have been using the reference). However some of them are still "next hit" and similar which also break on sub-setting. The right solution here I think is to indicate that these fields may be incorrect rather than to correct them. If you actually care then you'll need to name collate, run fixmates, and sort back again (etc). |
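As an illustration of why correcting mate fields needs both ends of a pair together, here is a toy sketch of the "collate by name, then fix mates" step. It assumes records are plain dicts with hypothetical keys `qname`, `rname`, `pos`, `rnext` and `pnext`, that each name has at most two records, and that everything fits in memory. It does not handle TLEN, flags, or the `=` shorthand for RNEXT, and it is not how any particular tool implements this.

```python
from collections import defaultdict


def fix_mate_fields(records):
    """Refresh RNEXT/PNEXT on each record from its mate's RNAME/POS."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["qname"]].append(rec)  # the "name collate" step
    for mates in by_name.values():
        if len(mates) != 2:
            continue  # unpaired, or mate filtered out: nothing to copy from
        first, second = mates
        first["rnext"], first["pnext"] = second["rname"], second["pos"]
        second["rnext"], second["pnext"] = first["rname"], first["pos"]
    return records


reads = [
    # r1's first end was realigned to chr2:500; its mate still points at chr1:100.
    {"qname": "r1", "rname": "chr2", "pos": 500, "rnext": "chr1", "pnext": 300},
    {"qname": "r1", "rname": "chr1", "pos": 300, "rnext": "chr1", "pnext": 100},
]
fix_mate_fields(reads)
print(reads[1]["rnext"], reads[1]["pnext"])
```

On a coordinate-sorted file the two ends are not adjacent, which is where the extra collate and re-sort passes described above come from.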
I've collated the full list of possibly invalidated spec-defined fields and realignment possibly breaks over half the fields in the sam tags spec.
GRIDSS does care and actually fixes most of the regenerable tags but it still dies when record adjustment results in loss of information that you can't just fix (such as loss of hard clipping information, or alignment-related information that the realignment did something different with/to). In some cases, you actually do need the tool changing to make the adjustments to the other fields to prevent loss of information. In any case, +1 for this PR from me. |
|
Completely agree with the issue of hard clipping information being lost. That's the sort of thing that this PR is aiming at solving, although a realigner could in fact preserve hard clips even without filling out a PA tag. |
|
As usual, I'm a little confused as to when I am allowed to hit the "merge" button... I'm hoping it will be clearer here than with other PRs... :-) |
|
👍 for me... I think that consistency problems may be addressed in a different issue/PR, because they are clearly important but not only related to this change. |
|
@yfarjoun As far as I'm concerned, when there is agreement between spec maintainers then you can go ahead and merge. You fixed the very minor formatting issues and I'd already thumbed up, so it's an OK from me. The subsequent comments I think are best dealt with in a new issue if we wish to tackle it, but to be honest I'm not really sure what can be done without fundamentally breaking a lot of algorithms and making them complex. As I said, about the only workable thing I can think of is a way to indicate when fields have been invalidated, but that would be a new PR if someone wished to champion the cause. |
|
So I'm rebasing and merging...OK? |
Force-pushed from 50f0415 to 2253e92
|
@d-cameron are you still unhappy enough with this PR to maintain the 👎 you placed on it? |
|
Got it. Waiting.
…On Mon, Aug 21, 2017 at 3:01 PM, John Marshall ***@***.***> wrote:
Thanks for squashing, which makes review easier at this point. This is *not
ready for merge*, comments from me are incoming.
|
|
done. |
|
sorry about that....removed the erroneous change. |
Table entries not capitalised. Format regexp as per CT/PT and remove misleading whitespace. Wordsmithing. Fix quote mark typo. "Deprecated by" gets the actor wrong (it's us, not the tag). Should be \subsubsection, and it's now October.
|
let's merge this! |
|
👍 |
|
Agreed in principle. I'd like to tidy up the wording marginally, but frankly it can be done at merge time. "MAPQ is a numeric string" is a bit meaningless (and we don't say "numeric string" for POS). Probably just "MAPQ as a numeric mapping quality". Numeric helps here as we have other qualities which are ASCII val+33. |
|
so many wordsmiths!!! :-) |
|
On Wed, Dec 12, 2018 at 09:33:15AM -0800, Yossi Farjoun wrote:
so many wordsmiths!!! :-)
Capitalise that s in "so" please. ;-)
I'm not sure I like three exclamation marks in a row either!!!!
…--
James Bonfield (jkb@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA
|
|
@jkbonfield OK to merge now? |
|
As discussed in the last couple of meetings (check the minutes), this one is with me for wording improvements. That's why you assigned it to me… |
Use "original alignment entry" to be more specific. Say "textual SAM representations" (but "generally" because STRAND is not direct from a SAM field); now there's nothing needed for some subfields so compress the bullet list into a list in a sentence. Restore useful text in OP even if we're deprecating it.
|
While agreeing with this feature, I have two problems with the wording:
|
|
I accept these changes. Thanks @jmarshall! |
|
I hope this can be merged now....thumbs? |
|
Thanks Yossi. It's 👍 from me, happy to squash and merge along with its new buddy #371. @jkbonfield? |
|
Squashed and merged as 9bdcdbc. 🎉 |
|
@yfarjoun: Please delete the branch when it is no longer relevant. |
|
sorry, I usually delete when I merge...but you merged..so I expected you to delete. I'll try to remember next year when I get a PR merged 😅 |
|
Thanks. I tend to expect the PR author to delete their own branch… and the two expectations aren't so compatible 😄. Let's endeavour to merge the next one in less than a year! |
This could be used in GATK's IndelRealignment, but the main use I have for it is in Picard's MergeBamAlignment, which now unmaps reads that look like bacterial contamination and were mapped only because 20 or so bases matched by chance. Since this unmapping is lossy (one can't keep the original alignment information in place), I would like to store the alignment information in a tag. I used the same format as the SA tag.