-
Notifications
You must be signed in to change notification settings - Fork 8
Consensus string #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consensus string #20
Changes from 2 commits
13a1e62
f4cb9c9
4c87925
fd8f16d
4158491
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,190 @@ | ||
| # Consensus and Profile | ||
|
|
||
| 🤔 [Problem link](https://rosalind.info/problems/cons/) | ||
|
|
||
| !!! warning "The Problem". | ||
|
|
||
| A matrix is a rectangular table of values divided into rows and columns. | ||
| An m×n matrix has m rows and n columns. | ||
| Given a matrix A, we write Ai,j. | ||
| to indicate the value found at the intersection of row i and column j. | ||
|
|
||
| Say that we have a collection of DNA strings, | ||
| all having the same length n. | ||
| Their profile matrix is a 4×n matrix P in which P1, | ||
| j represents the number of times that 'A' occurs in the jth position of one of the strings, | ||
| P2,j represents the number of times that C occurs in the jth position, | ||
| and so on (see below). | ||
|
|
||
| A consensus string c is a string of length n | ||
| formed from our collection by taking the most common symbol at each position; | ||
| the jth symbol of c therefore corresponds to the symbol having the maximum value | ||
| in the j-th column of the profile matrix. | ||
| Of course, there may be more than one most common symbol, | ||
| leading to multiple possible consensus strings. | ||
|
|
||
| ### DNA Strings | ||
| A T C C A G C T | ||
| G G G C A A C T | ||
| A T G G A T C T | ||
| A A G C A A C C | ||
| T T G G A A C T | ||
| A T G C C A T T | ||
| A T G G C A C T | ||
|
|
||
| ### Profile | ||
|
|
||
| A 5 1 0 0 5 5 0 0 | ||
| C 0 0 1 4 2 0 6 1 | ||
| G 1 1 6 3 0 1 0 0 | ||
| T 1 5 0 0 0 1 1 6 | ||
|
|
||
| Consensus A T G C A A C T | ||
|
|
||
| Given: | ||
| A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format. | ||
|
|
||
| Return: | ||
| A consensus string and profile matrix for the collection. | ||
| (If several possible consensus strings exist, | ||
| then you may return any one of them.) | ||
|
|
||
| Sample Dataset | ||
| >Rosalind_1 | ||
| ATCCAGCT | ||
| >Rosalind_2 | ||
| GGGCAACT | ||
| >Rosalind_3 | ||
| ATGGATCT | ||
| >Rosalind_4 | ||
| AAGCAACC | ||
| >Rosalind_5 | ||
| TTGGAACT | ||
| >Rosalind_6 | ||
| ATGCCATT | ||
| >Rosalind_7 | ||
| ATGGCACT | ||
|
|
||
| Sample Output | ||
| ATGCAACT | ||
| A: 5 1 0 0 5 5 0 0 | ||
| C: 0 0 1 4 2 0 6 1 | ||
| G: 1 1 6 3 0 1 0 0 | ||
| T: 1 5 0 0 0 1 1 6 | ||
|
|
||
|
|
||
| The first thing we will need to do is read in the input fasta. | ||
| In this case, we will not be reading in an actual fasta file, | ||
| but a set of strings in fasta format. | ||
| If we were reading in an actual fasta file, | ||
| we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that. | ||
|
|
||
| Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md), | ||
| we can borrow the function from that tutorial. | ||
|
|
||
| ```julia | ||
|
|
||
| fake_file = IOBuffer(""" | ||
| >Rosalind_1 | ||
| ATCCAGCT | ||
| >Rosalind_2 | ||
| GGGCAACT | ||
| >Rosalind_3 | ||
| ATGGATCT | ||
| >Rosalind_4 | ||
| AAGCAACC | ||
| >Rosalind_5 | ||
| TTGGAACT | ||
| >Rosalind_6 | ||
| ATGCCATT | ||
| >Rosalind_7 | ||
| ATGGCACT | ||
| """ | ||
| ) | ||
|
|
||
| function parse_fasta(buffer) | ||
| records = [] # this is a Vector of type `Any` | ||
| record_name = "" | ||
| sequence = "" | ||
| for line in eachline(buffer) | ||
| if startswith(line, ">") | ||
| !isempty(record_name) && push!(records, (record_name, sequence)) | ||
| record_name = lstrip(line, '>') | ||
| sequence = "" | ||
| else | ||
| sequence *= line | ||
| end | ||
| end | ||
| push!(records, (record_name, sequence)) | ||
| return records | ||
| end | ||
|
|
||
| records = parse_fasta(fake_file) | ||
| ``` | ||
|
|
||
| Once the fasta is read in, we can iterate over each read and store its nucleotide sequence in a data matrix. | ||
|
|
||
| From there, we can generate the profile matrix. | ||
| We'll need to sum the number of times each nucleotide appears at a particular row of the data matrix. | ||
|
|
||
| Then, we can identify the most common nucleotide at each column of the data matrix, | ||
| which represents each index of the consensus string. | ||
| After we have done this for all columns of the data matrix, | ||
| we can generate the consensus string. | ||
|
|
||
|
|
||
| ```julia | ||
| using DataFrames | ||
|
|
||
| function consensus(fasta_string) | ||
|
|
||
| # extract strings from fasta | ||
| records = parse_fasta(fasta_string) | ||
|
|
||
| # make a vector of just strings | ||
| data_vector = last.(records) | ||
|
|
||
| # convert data_vector to matrix where each column is a char and each row is a string | ||
| data_matrix = reduce(vcat, permutedims.(collect.(data_vector))) | ||
|
|
||
| # make profile matrix | ||
|
|
||
| ## Is it possible to do this in a more efficient vectorized way? I wanted to see if we could do countmap() for each column in a simple way that would involve looping over each column. I think this ended up being more efficient since we are just looping over each of the nucleotides | ||
|
|
||
| consensus_matrix_list = Vector{Int64}[] | ||
| for nuc in ['A', 'C', 'G', 'T'] | ||
| nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1)) | ||
| push!(consensus_matrix_list, nuc_count) | ||
| end | ||
|
|
||
| consensus_matrix = vcat(consensus_matrix_list) | ||
|
|
||
| # convert matrix to DF and add row names for nucleotides | ||
| consensus_df = DataFrame(consensus_matrix, ["A", "C", "G", "T"]) | ||
|
|
||
|
|
||
| # make column with nucleotide with max value | ||
| # argmax returns the index or key of the first one encountered | ||
| nuc_max_df = transform(consensus_df, AsTable(:) => ByRow(argmax) => :MaxColName) | ||
|
|
||
| # return consensus string | ||
| return join(nuc_max_df.MaxColName) | ||
|
|
||
| end | ||
|
|
||
| consensus(fake_file) | ||
| ``` | ||
|
|
||
| As mentioned in the problem description above, | ||
| it is possible that there can be multiple consensus strings, | ||
| as some nucleotides may appear the same number of times | ||
| in each column of the data matrix. | ||
|
|
||
| If this is the case, | ||
| the function we are using (`argmax`) returns the nucleotide with the most occurences that it first encounters. | ||
|
|
||
| The way our function is written, | ||
| we first scan for 'A', 'C', then 'G' and 'T', | ||
| so the final consensus string will be biased towards more A's, then C's, G's and T's. | ||
| This simply based on which nucleotide counts it will encounter first in the profile matrix. | ||
|
|
||
|
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I may add some examples of multiple consensus strings. I may also add another solution that returns all consensus strings.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added an example, but I will pass on adding another solution. I think time will be better spent on the BioPython recipes. |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to do this in a more efficient vectorized way? I wanted to see if we could do
countmap()for each column in a simple way that would involve looping over each column. I think this ended up being more efficient since we are just looping over each of the nucleotides