Functions to help organize unwieldy data

1987 USDA Ag Census –trying to make it useable by creating search functions

April Merleaux
2022-04-02

The underlying data files are read-in and tidied in another script.
usable output of that script includes

tidy_MJ_Counties[includes 1987 crime data, USDA county typology, USDA rural county econ codes; census regions; eliminates counties with 0 population and 0 MJ manufacture arrests; selects only MJ crimes; filters only counties with code >3 (eg no major metro areas)]

MJ_Counties_joinedUSDA [includes everything in tidy_MJ_Counties + all 1987 Ag Census data, minus the columns called flag]

USDA35206_labels [codebook for Ag Census data]

There are a couple of notable features of the data as I’ve tidied it. MJ_Counties_joinedUSDA has 1186 observations of 2637 variables, so many that it’s not super useable. The variables are identified with codes, and I have a lookup table USDA35206_labels. I first tried manually breaking the variables into useful, thematic subsets.

#sample keyword searches and colnameslookup
farmvalue <- MJ_Counties_joinedUSDA %>%
  select(FIPS5, ITEM01019:ITEM01029) 
head(farmvalue)
# A tibble: 6 x 12
  FIPS5 ITEM01019 ITEM01020 ITEM01021 ITEM01022 ITEM01023 ITEM01024
  <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 01005     19702     39562     10891      8811       138        57
2 01017      4389     12024       529      3860       157        60
3 01019     26593     54606     16356     10237       164        84
4 01021      9454     13429      5528      3927       250       171
5 01023      4146     13865       349      3797       126        77
6 01031     73123     86844     15076     58047       185        98
# ... with 5 more variables: ITEM01025 <dbl>, ITEM01026 <dbl>,
#   ITEM01027 <dbl>, ITEM01028 <dbl>, ITEM01029 <dbl>

I wanted to rename those variables from the lookup table. I wrote a function that would look those up. I will admit that I don’t really understand why it works, but it does.

#create function to lookup variable names, to be used on the results of a search
USDA_names <- function(x){
  names(x)[names(x) %in% USDA35206_labels$code] = USDA35206_labels$name[match(names(x)[names(x) %in% USDA35206_labels$code], USDA35206_labels$code)]
}

farmvalue <- MJ_Counties_joinedUSDA %>%
  select(FIPS5, ITEM01019:ITEM01029) 
colnames(farmvalue) <- USDA_names(farmvalue)
head(farmvalue)
# A tibble: 6 x 12
  FIPS5 `Market value of ag~ `Market value of ag ~ `Crops,incl nurser~
  <chr>                <dbl>                 <dbl>               <dbl>
1 01005                19702                 39562               10891
2 01017                 4389                 12024                 529
3 01019                26593                 54606               16356
4 01021                 9454                 13429                5528
5 01023                 4146                 13865                 349
6 01031                73123                 86844               15076
# ... with 8 more variables:
#   Livestock, poultry&their prod ($1,000), 1987 <dbl>,
#   Farms by value of sales:  Less than $2,500, 1987 <dbl>,
#   Farms by value of sales:  $2,500 to $4,999, 1987 <dbl>,
#   Farms by value of sales:  $5,000 to $9,999, 1987 <dbl>,
#   Farms by value of sales:  $10,000 to $24,999, 1987 <dbl>,
#   Farms by value of sales:  $25,000 to $49,999, 1987 <dbl>,
#   Farms by value of sales:  $50,000 to $99,999, 1987 <dbl>,
#   Farms by value of sales:  $100,000 or more, 1987 <dbl>

That is better. I wanted to write a function that would enable me to do a search for a keyword in the lookup table and get a subset of variables that matched those keywords. The first step was to write out the steps. The first example here is the result, with the keyword “operator.” The result is a table called operator_chars that has all the variables that contain the word operator (USDA-speak for farmer).

#filter variables for those containing keywords, then select from MJ_Counties_joinedUSDA and rename 
operator_filter <- USDA35206_labels %>%
  filter(grepl("operator", name, ignore.case = TRUE))
operator_chars <- MJ_Counties_joinedUSDA %>%
  select(FIPS5, all_of(operator_filter$code))
colnames(operator_chars) <- USDA_names(operator_chars)

Since that worked, I created a function to do that.

#create function to systematize breaking up the data 
#input must be in quotes
searchUSDA <- function(x){
  y <- USDA35206_labels %>%
    filter(grepl(x, name, ignore.case = TRUE)) %>%
    add_row(MJcolnames)
  result <- MJ_Counties_joinedUSDA[, y$code, with = TRUE]
  colnames(result) <- USDA_names(result)
  view(result)
}

#to call the function
keyword <- searchUSDA("keyword")

That works really well for keyword searching. There are a couple of problems. The biggest problem is that I have not figured out how to name the output. I would like the user input (x) to become the name of the output. As is, the name is result until/unless I rename it in calling the function. I have tried many things to make the user input be used as the name of an object, but R never likes those attempts.

The reason it matters is because I want to iterate searchUSDA over a vector of search terms. Ideally the result would be a series of tables, each named after the search term used to create it.

It is easy enough to iterate the function over a string of search terms.

#search strings
USDA_search_strings <- c("operator", "value", "crops", "gov|CCC", "wheat", "soy", "corn", "land", "sales", "SIC", "exp")

lapply(USDA_search_strings, searchUSDA)

The output of this is unsatisfying. The object name “result” gets replaced with each iteration, so the only one that actually remains at the end of process is the last item in the list (in this case, the search for “exp”). It’s clear in the console that it has in fact iterated through each search term, but it discards the results with the next iteration because it writes over the object “result.” I have been trying to figure out how to get it to rename the output with each iteration, ideally by naming it whatever the search term was. I haven’t figured out how to give each iteration a separate name, though.

I have tried assigning the result to x (the user input), which doesn’t work. I have tried {{x}}. I have tried various other things.

Or perhaps there is a way to create a temporary name that gets replaced.

#throws error: Error in colnames("x") <- USDA_names("x") :  target of assignment expands to non-language object
searchUSDA0 <- function(x){
  y <- USDA35206_labels %>%
    filter(grepl(x, name, ignore.case = TRUE)) %>%
    add_row(MJcolnames)
  "x" <- MJ_Counties_joinedUSDA[, y$code, with = TRUE]
 colnames("x") <- USDA_names("x") 
}


#throws error: "Error in { : could not find function "{<-""
searchUSDA1 <- function(x){
  y <- USDA35206_labels %>%
    filter(grepl(x, name, ignore.case = TRUE)) %>%
    add_row(MJcolnames)
 {{x}} <- MJ_Counties_joinedUSDA[, y$code, with = TRUE]
 colnames({{x}}) <- USDA_names({{x}})
}  


#doesn't throw an error, but only returns the variable names (eg it doesn't actually do the second step)
searchUSDA2 <- function(x){
  y <- USDA35206_labels %>%
    filter(grepl(x, name, ignore.case = TRUE)) %>%
    add_row(MJcolnames)
  x <- MJ_Counties_joinedUSDA[, y$code, with = TRUE]
 colnames(x) <- USDA_names(x)
}  

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Merleaux (2022, April 3). Data Analytics and Computational Social Science: Functions to help organize unwieldy data. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomamerleaux885445/

BibTeX citation

@misc{merleaux2022functions,
  author = {Merleaux, April},
  title = {Data Analytics and Computational Social Science: Functions to help organize unwieldy data},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomamerleaux885445/},
  year = {2022}
}