-
Notifications
You must be signed in to change notification settings - Fork 14
Scripts for review #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| generate_attributes_table <- function(csv_file_path, | ||
| attributes_file_path) { | ||
| # Check that files exist | ||
| stopifnot(file.exists(csv_file_path)) | ||
| stopifnot(file.exists(attributes_file_path)) | ||
|
|
||
| # Read in files | ||
| data <- read.csv(csv_file_path, stringsAsFactors = FALSE, skip = 2) | ||
| n <- dim(data)[2] | ||
| attributes <- try(read.csv(attributes_file_path, stringsAsFactors = FALSE)) | ||
|
|
||
|
|
||
| # Initialize data frame | ||
| table <- data.frame(attributeName = rep("NA", n), | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a function in the base R package called You could change this something like att_table |
||
| attributeDefinition = rep("NA", n), | ||
| measurementScale = rep("NA", n), | ||
| domain = rep("NA", n), | ||
| formatString = rep("NA", n), | ||
| definition = rep("NA", n), | ||
| unit = rep("NA", n), | ||
| numberType = rep("NA", n), | ||
| missingValueCode = rep("NA", n), | ||
| missingValueCodeExplanation = rep("NA", n), | ||
| stringsAsFactors = F) | ||
|
|
||
| qualifiers<- c("_PI", "_QC", "_F", "_IU", "_H_V_R", "_H_V_A", "_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_SD", "_N") | ||
| num_qualifiers<- c("_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9") | ||
|
|
||
| for (i in seq_len(n)) { | ||
| # add attribute name | ||
| table$attributeName[i] = colnames(data)[i] | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can move colnames(data) outside the for loop so it only runs once rather than n times. Something like col_names <- colnames(data). Then you can reference col_names[i] in the for loop. |
||
|
|
||
| ## check if the name has a qualifier at the end | ||
| if (any(endsWith(colnames(data)[i], suffix = qualifiers))) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same comment here about moving colnames(data) outside the for loop. |
||
| # identify the qualifier | ||
| current_qual <- which(endsWith(colnames(data)[i], suffix = qualifiers)) | ||
| qualifier<- qualifiers[current_qual] | ||
| len<- nchar(qualifier) | ||
| main_label<- substr(colnames(data)[i], 1, nchar(colnames(data)[i])-len) | ||
|
|
||
| # get definition for main label | ||
| main_def <- attributes$uniqueAttributeDefinition[attributes$uniqueAttributeLabel == main_label] | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. at the beginning of your function you could specify The benefit is making your code more readable - also not your fault, the column names from that csv aren't great. |
||
|
|
||
| # get definition for qualifier label, special case if it is a number | ||
| if (qualifier %in% num_qualifiers){ | ||
| qual_def <- attributes$uniqueAttributeDefinition[attributes$uniqueAttributeLabel == "_#"] | ||
| } else{ | ||
| qual_def <- attributes$uniqueAttributeDefinition[attributes$uniqueAttributeLabel == qualifier] | ||
| } | ||
|
|
||
| # concatenate the definitions | ||
| table$attributeDefinition[i] = paste(main_def, ". ", qual_def) | ||
|
|
||
| # check if it is a time variable | ||
| if (grepl("TIME", main_label)){ | ||
| table$measurementScale[i] = "dateTime" | ||
| table$domain[i] = "dateTimeDomain" | ||
| table$formatString[i] = "YYYYMMDDHHMM" | ||
| table$unit[i] = "NA" | ||
| } else { | ||
| table$measurementScale[i] = "ratio" | ||
| table$domain[i] = "numericDomain" | ||
| table$numberType[i] <- "real" | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Most style guides advocate using If you're interested in reading more this blog post is pretty good https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/ . |
||
| table$unit[i] = attributes$SI_unit[attributes$uniqueAttributeLabel == main_label] | ||
| table$missingValueCode[i] = "-9999" | ||
| table$missingValueCodeExplanation[i] = "Missing values are represented as -9999" | ||
| } | ||
|
|
||
| # case if there is no qualifier | ||
| } else { | ||
| table$attributeDefinition[i] = attributes$uniqueAttributeDefinition[attributes$uniqueAttributeLabel == colnames(data)[i]] | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove extra spaces by the |
||
| # check if it is a time variable | ||
| if (grepl("TIME", colnames(data)[i])){ | ||
| table$measurementScale[i] = "dateTime" | ||
| table$domain[i] = "dateTimeDomain" | ||
| table$formatString[i] = "YYYYMMDDHHMM" | ||
| table$unit[i] = "NA" | ||
| } else { | ||
| table$measurementScale[i] = "ratio" | ||
| table$domain[i] = "numericDomain" | ||
| table$numberType[i] <- "real" | ||
| table$unit[i] = attributes$SI_unit[attributes$uniqueAttributeLabel == colnames(data)[i]] | ||
| table$missingValueCode[i] = "-9999" | ||
| table$missingValueCodeExplanation[i] = "Missing values are represented as -9999" | ||
| } | ||
| } | ||
|
|
||
| } | ||
|
|
||
| return(table) | ||
|
|
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,170 @@ | ||
| --- | ||
| title: "Code Chunks" | ||
| output: html_document | ||
| --- | ||
|
|
||
| ```{r setup, include=FALSE} | ||
| knitr::opts_chunk$set(echo = TRUE) | ||
| ``` | ||
|
|
||
| ## Updating Coverage | ||
|
|
||
| If you want to add to the current coverage without having to restate all the other coverage elements using `set_coverage`, these chunks allow you to just add taxonomic coverage and single date temporal coverage. | ||
|
|
||
| ### Adding taxonomic coverage | ||
|
|
||
| ```{r, eval = FALSE} | ||
| # add each new element as a tax object | ||
| tax1 <- new("taxonomicClassification", | ||
| taxonRankName = new("taxonRankName", "Species"), | ||
| taxonRankValue = new("taxonRankValue", "Calamagrostis deschampsioides")) | ||
|
|
||
| tax2 <- new("taxonomicClassification", | ||
| taxonRankName = new("taxonRankName", "Species"), | ||
| taxonRankValue = new("taxonRankValue", "Carex aquatilis")) | ||
|
|
||
| # combine all tax elements into taxonomic coverage object | ||
| taxcov <- new("taxonomicCoverage", | ||
| taxonomicClassification = c(tax1, tax2)) | ||
|
|
||
| eml@dataset@coverage@taxonomicCoverage <- c(taxcov) | ||
|
|
||
| ``` | ||
|
|
||
| ### Adding single date temporal coverage | ||
|
|
||
| ```{r, eval = FALSE} | ||
| date <- new("singleDateTime", | ||
| calendarDate = "2011-09-15") | ||
|
|
||
| tempcov1 <- new("temporalCoverage", | ||
| singleDateTime = date) | ||
|
|
||
| eml@dataset@coverage@temporalCoverage <- c(tempcov1) | ||
| ``` | ||
|
|
||
| ## Data Objects | ||
| ### Adding data tables for a whole folder of files with the same attributes | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This section is so awesome! I just had to do this but forgot you wrote it up until I mostly finished. In any case, interesting to see slightly different ways of doing the same thing! |
||
|
|
||
| ```{r, eval = FALSE} | ||
| # set path to data folder | ||
| data_path <- '/home/visitor/…”' | ||
|
|
||
| # list the files. recursive = TRUE will get all files even if they are in folders in folders | ||
| paths <- list.files(data_path, full.names=TRUE, recursive = TRUE) | ||
|
|
||
| # generate pids. replace format_id with the one that applies | ||
| new_pids1 <- sapply(paths, function(path) {publish_object(mn, path, format_id = "text/plain")}) | ||
|
|
||
| attributes1 <- read.csv("/home/me/file_attributes.csv", stringsAsFactors = FALSE, na.strings = c('')) | ||
| attributeList1 <- set_attributes(attributes1) | ||
|
|
||
| # assign attributes to data tables | ||
| eml <- eml_add_entities(eml, data.frame(type = "dataTable", | ||
| path = paths, | ||
| pid = new_pids1, | ||
| format_id = "text/plain")) | ||
| eml@dataset@dataTable[[1]]@attributeList <- attributeList1 | ||
| eml@dataset@dataTable[[1]]@attributeList@id <- new("xml_attribute", "shared_attributes1") | ||
|
|
||
| #run a loop over all the data pids to add data tables with descriptions | ||
| for (i in 2:length(eml@dataset@dataTable)) { | ||
| eml@dataset@dataTable[[i]]@attributeList@references <- new("references", "shared_attributes1") | ||
| eml@dataset@dataTable[[i]]@entityDescription <- new("entityDescription", .Data = "entity description") | ||
| } | ||
|
|
||
| ``` | ||
|
|
||
| ## System Metadata | ||
| ### Obsolescence chain | ||
|
|
||
| This chunk is to obsolete one dataset. If there are more to add to the chain, more steps can be added, carefully making sure to fill in `obsoletes` and `obsoletedBy` slots for each one. | ||
|
|
||
| ```{r, eval = FALSE} | ||
| # get oldest version of the file you want to be visible. Use get_all_versions and look at the first. | ||
| # urn:uuid:... | ||
|
|
||
| # PID for data set to be hidden: doi:10… | ||
|
|
||
| # adding data set to hide in the slot before the first version of the visible data set | ||
| gsmOG <- getSystemMetadata(mn, "urn:uuid:...") | ||
| gsmOG@obsoletes <- "doi:10…" | ||
| updateSystemMetadata(mn, "urn:uuid:...", gsmOG) | ||
|
|
||
| # adding first version to obsolescence chain after hidden version | ||
| gsmObs1 <- getSystemMetadata(mn, "doi:10…") | ||
| gsmObs1@obsoletedBy <- "urn:uuid:..." | ||
| updateSystemMetadata(mn, "doi:10…", gsmObs1) | ||
|
|
||
| ``` | ||
|
|
||
| ### Set rights and access | ||
|
|
||
| This chunk sets rights and access for metadata, resource map, and all data objects in a package | ||
|
|
||
| ```{r, eval = FALSE} | ||
| ## Fix rights holder and access | ||
| PI_name <- "http://orcid.org/…." | ||
| project_pid <- "resource_map_doi:10…." | ||
| project_package <- get_package(mn, project_pid) | ||
| set_rights_and_access(mn, c(project_package$metadata, project_package$resource_map, project_package$data), PI_name, c("read", "write", "changePermission")) | ||
|
|
||
| ``` | ||
|
|
||
| ## Miscellaneous | ||
|
|
||
| ### Adding sampling info to methods section | ||
|
|
||
| ```{r, eval = FALSE} | ||
| step1 <- new('methodStep', | ||
| description = "methods text") | ||
|
|
||
| stEx <- new("studyExtent", | ||
| description = "study extent description") | ||
|
|
||
| samp <- new("sampling", | ||
| studyExtent = stEx, | ||
| samplingDescription = "sampling description text") | ||
|
|
||
| methods1 <- new("methods", | ||
| methodStep = c(step1), | ||
| sampling = samp) | ||
| eml@dataset@methods <- methods1 | ||
| ``` | ||
|
|
||
| ### Adding a pre generated identifier to the eml | ||
|
|
||
| When you pre generate a UUID or DOI, the change is not automatically reflected in the "Identifier" section of the eml so this makes sure that the eml lines up with the identifier being used. | ||
|
|
||
| ```{r, eval = FALSE} | ||
| ## Generate DOI and add to EML | ||
| doiPid <- generateIdentifier(mn, "DOI") | ||
| eml@packageId <- new("xml_attribute", | ||
| .Data = doiPid) | ||
|
|
||
| ``` | ||
|
|
||
| ### Dealing with netCDFs | ||
|
|
||
| This section is for dealing with netCDF (.nc) files. These files require data tables but since they can not be simply opened on the computer using a text editor or Excel, you can use Panoply to explore them or these R commands: | ||
|
|
||
|
|
||
| ```{r} | ||
| library(arcticdatautils) | ||
| library(ncdf4) | ||
| filepath <- '/home/sharisnochs/Petterson/ICECAPS_precip_product_2010_2015.nc' | ||
|
|
||
| # gets attribute info | ||
| atts <- get_ncdf4_attributes(filepath) | ||
| # preview of View(atts) | ||
| atts[1:5,] | ||
|
|
||
|
|
||
| # returns the actual values for a specified attribute | ||
| t <- nc_open(filepath) | ||
| test <- ncvar_get(t, 't2m') | ||
| # preview of View(test) | ||
| test[1:10] | ||
| ``` | ||
|
|
||
| The `formatId` in the sysmeta will most likely be `netCDF-4`. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you'd like you can document what this function does. Check out the multiline comments (#' is the multiline comment symbol in R) at the top of this function: https://github.com/NCEAS/datamgmt/blob/master/R/guess_member_node.R