--- title: "Introduction to caRecall" authors: Nathan Smith, Mitch Harris, Ryan Koenig output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{vrd_vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction The `caRecall` package is an API wrapper for the Government of Canada [Vehicle Recalls Database (VRD)](https://tc.api.canada.ca/en/detail?api=VRDB) used by the Defect Investigations and Recalls Division for vehicles, tires, and child car seats. The API wrapper provides access to recall summary information searched using make, model, and year range, as well as detailed recall information searched using recall number. This vignette provides an overview of the functions available in `caRecall` and provides a demonstration of using the package to extract information from the VRD. In addition to `caRecall`, this vignette makes use of the [dplyr](https://dplyr.tidyverse.org/) and [ggplot2](https://ggplot2.tidyverse.org/) packages. ```r library(caRecall) library(dplyr) library(ggplot2) ``` ## API Access Key An API key is required to run the functions in `caRecall` and query the Vehicle Recalls Database. The key can be acquired at https://tc.api.canada.ca/en/detail?api=VRDB. The API key can be set in the environment using `Sys.setenv(VRD_API = 'your_API_key_here')`. Alternatively, instead of setting the API key in the environment, the key can be passed into the `caRecall` functions as an argument. ## `caRecall` Functions The `caRecall` package focuses on querying data from the Vehicle Recalls Database and functions fall into the following categories: - Recall summary information - Recall counts - Detailed recall information ### Recall Summary Information The `recall_by_make()`, `recall_by_model()`, `recall_by_years()`, and `recall_by_number()` functions all retrieve recall summary information from the VRD based on user search terms. An overview of the search terms and function arguments includes: - **make or model**: List of make, manufacturer, or model depending on the function used to search on. Note that the search terms are not case sensitive. - **recall_number**: List of recall numbers to search on (however, this is typically unknown unless previous searches have been conducted). - **start_year and end_year**: The year range of the search can be specified and is based on the manufactured year and not the year a recall occurred. If the argument(s) are left empty, the search will include the start and end of the data available, respectively. Note that a vehicle year is typically a year later than the manufactured year (for example, a 2009 Ford would be entered in the database as being manufactured in 2008). This also means that a recall date may occur prior to the year of a vehicle. - **limit**: The VRD API default number of items to return for a query is 25. There does not a appear to be a maximum limit associated with the API but all functions in `caRecall` also default to a limit of 25. - **partial**: The `caRecall` functions allow the user to specify if partial matches on the search terms are used or not (for example, "Su" would return "Suzuki", and "Subaru" matches). Note that if `partial = FALSE` is used, the number of entries returned by the functions may be less than the count provided by the `count` functions described in the following section which return counts for all partial matches (and is the default behavior of the VRD API). - **api_key**: If an API key is not set in the environment, an access key can be passed directly into a function. An example of using `recall_by_model()` to search for the first five recalls by 'Ford' in 1998 would be as follows: ```r recall_by_make("Ford", start_year = 1998, limit = 5) #> # A tibble: 5 x 6 #> `Recall number` `Manufacturer Name` `Model name` `Make name` Year `Recall date` #> #> 1 1997118 FORD WINDSTAR FORD 1998 1997-06-16 #> 2 1997119 FORD B SERIES FORD 1998 1997-06-17 #> 3 1997124 FORD WINDSTAR FORD 1998 1997-06-23 #> 4 1997165 FORD MUSTANG FORD 1998 1997-09-30 #> 5 1997165 FORD WINDSTAR FORD 1998 1997-09-30 ``` Or if a user wanted to search by manufactured year and return the first 10 entries between 1999 and 2001, `recall_by_years()` could be used: ```r recall_by_years(start_year = 1999, end_year = 2001, limit = 10) #> # A tibble: 10 x 6 #> `Recall number` `Manufacturer Name` `Model name` `Make name` Year `Recall date` #> #> 1 1993076 MERCEDES-BENZ 300 MERCEDES-BENZ 2000 1993-05-31 #> 2 1998048 FORD F-350 SUPER DUTY FORD 1999 1998-03-25 #> 3 1998049 FORD F-350 SUPER DUTY FORD 1999 1998-03-26 #> 4 1998067 MAZDA MIATA MAZDA 1999 1998-04-08 #> 5 1998069 FORD SUPER DUTY FORD 1999 1998-04-08 #> 6 1998104 FORD F-350 SUPER DUTY FORD 1999 1998-05-25 #> 7 1998105 FORD F-250 FORD 1999 1998-05-25 #> 8 1998105 FORD F-350 SUPER DUTY FORD 1999 1998-05-25 #> 9 1998105 FORD F-450 SUPER DUTY FORD 1999 1998-05-25 #> 10 1998105 FORD F-550 SUPER DUTY FORD 1999 1998-05-25 ``` ### Recall Counts The `count_recall_by_make()`, `count_recall_by_model()`, and `count_recall_by_years()` all return the number of recalls available in the VRD based on the search terms. It can be used to quickly get a count of recalls prior to making a query for summary information. For example, if a user is interesting in first determining the total number of recalls by the manufacturer 'GOODBABY' for all products manufactured up to 2008, the `count_recall_by_make()` could be used. ```r count_recall_by_make("GOODBABY", manufacturer = TRUE, end_year = 1998) #> # A tibble: 1 x 1 #> `Result Count` #> #> 1 47 ``` ### Detailed Recall Information The `recall_details()` function provides detailed recall information for a specified recall number. The information includes details of what the recall is for (for example, 'Seats and Restraints') as well as a description on the reason for the recall. This function can only use a recall number search term and therefore likely requires prior searching using the summary recall functions to determine recall numbers of interest. Note that each recall number provided to the function results in a separate API call. Accordingly, the rate of API calls is limited to one call per second to stay within the API limit. For example, providing a list of 60 recall numbers to the function would take approximately 1 minute to complete. For example, if a Ford Windstar with recall number of 1997118 was of interest: ```r recall_windstar <- recall_details(1997118) tibble(t(recall_windstar)) # transpose for readability #> # A tibble: 15 x 1 #> `t(recall_windstar)`[,1] #> #> 1 "1997118" #> 2 "97S69" #> 3 "Light Truck & Van" #> 4 "Camionnette et fourgonnette" #> 5 "WINDSTAR" #> 6 "FORD" #> 7 "27" #> 8 "Seats And Restraints" #> 9 "Sièges et dispositifs de retenue" #> 10 "Safety Mfr" #> 11 "Sécurité - fabricant" #> 12 "NOTE: VEHICLES EQUIPPED WITH SECOND ROW BENCH SEATS AND INTEGRATED CHILD SEAT (ICS). ALL VEHICLES IN THIS RECALL HAVE BEEN CORRECT~ #> 13 "NOTA : Vise les véhicules équipés d’une seconde banquette et du siège intégré pour enfant. Tous les véhicules visés par le présent~ #> 14 "1998" #> 15 "1997-06-16" ``` ## Example In this section, several examples will be shown using `caRecall` to extract and visualize information of interest. ### Setup API key After getting an API access key from https://tc.api.canada.ca/en/detail?api=VRDB, the key needs to either be set in the environment (as `VRD_API`) or passed as an argument into each `caRecall` function. To check if the VRD API key is in the environment, the following `caRecall` function can be used: ```r get_vrd_key() ``` If the key is in the environment as `VRD_API` it will return the key value. If not, the function will return an error that the key is not set. The key can be set in the environment using `Sys.setenv(VRD_API = "YOUR KEY HERE")`. Once it is set, you should be able to leave the `api_key` argument empty in a `caRecall` function call. ### Getting Data from the VRD Lets say we want to look in the VRD for the companies that had the most recalls with the items they manufactured in the year 2000. Fist we could see how many items manufactured in 2000 were recalled in the database. ```r count_recall_by_years(start_year = 2000, end_year = 2000) #> # A tibble: 1 x 1 #> `Result Count` #> #> 1 2422 ``` We could then query the database for the summary information associated with these recalls. ```r recall_summary_2000 <- recall_by_years(start_year = 2000, end_year = 2000, limit = 3000) recall_summary_2000 #> # A tibble: 2,422 x 6 #> `Recall number` `Manufacturer Name` `Model name` `Make name` Year `Recall date` #> #> 1 1993076 MERCEDES-BENZ 300 MERCEDES-BENZ 2000 1993-05-31 #> 2 1999056 FIAT CHRYSLER AUTOMOBILES NEON CHRYSLER 2000 1999-04-14 #> 3 1999108 FIAT CHRYSLER AUTOMOBILES NEON CHRYSLER 2000 1999-06-07 #> 4 1999111 FLEETWOOD TIOGA FLEETWOOD 2000 1999-06-08 #> 5 1999137 POLARIS SNOWMOBILE POLARIS 2000 1999-07-26 #> 6 1999138 MAZDA MPV MAZDA 2000 1999-07-27 #> 7 1999147 MAZDA MPV MAZDA 2000 1999-07-30 #> 8 1999151 GENERAL MOTORS S10 CHEVROLET 2000 1999-08-16 #> 9 1999151 GENERAL MOTORS SONOMA GMC 2000 1999-08-16 #> 10 1999155 GENERAL MOTORS SUNFIRE PONTIAC 2000 1999-08-19 #> # ... with 2,412 more rows ``` Note that we also could have also just set the `limit` argument to a high enough value that all entries in the database would likely be included in the summary information query; however, this could also result in returning a very large dataset if the number of available entries is very high. We can count the number of recalls for each company by grouping on the `Manufacturer name` column and summarizing the count. We'll only look at manufacturers with recall counts above 50. ```r counts_manu_2000 <- recall_summary_2000 %>% group_by(`Manufacturer Name`) %>% tally() %>% filter(n >= 50) counts_manu_2000 #> # A tibble: 11 x 2 #> `Manufacturer Name` n #> #> 1 COACHMEN 70 #> 2 DAIMLERCHRYSLER 66 #> 3 FLEETWOOD 53 #> 4 FORD 151 #> 5 FREIGHTLINER 189 #> 6 GENERAL MOTORS 239 #> 7 GULF STREAM COACH 56 #> 8 INTERNATIONAL 170 #> 9 MONACO COACH 64 #> 10 PACCAR 50 #> 11 WINNEBAGO INDUSTRIES 61 ``` There are 11 manufacturers with greater than 50 recalls for products manufactured in 2000. ```r nb.cols <- 11 mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols) counts_manu_2000 %>% ggplot() + aes(x=`Manufacturer Name`, y=n, fill=`Manufacturer Name`) + scale_fill_manual(values = mycolors) + geom_col() + geom_text(aes(label=n), vjust=-.5, size=3) + labs(y="Recall Count") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line = element_line(color = "black"), legend.title = element_text(size = 12), legend.text = element_text(size = 8), axis.title.x=element_blank()) + scale_y_continuous(expand = c(0,0), limits = c(0,250)) + ggtitle("Manufacturers with Most Recalls in 2000") ``` plot of chunk plot1 We could also look at how many recalls two manufacturers have had compared with each other over the over the last 10 years (2011 to 2020). We would start by querying the API for entries with the two manufacturers. This time, we'll just set the limit to be reasonably high instead of checking the count available in the database first. ```r manu_comparison_list <- c("Ford", "Toyota") recall_comparison <- recall_by_make(manu_comparison_list, start_year = 2011, end_year = 2020, limit = 10000) recall_comparison #> # A tibble: 1,266 x 6 #> `Recall number` `Manufacturer Name` `Model name` `Make name` Year `Recall date` #> #> 1 2009290 TOYOTA CAMRY TOYOTA 2011 2009-10-07 #> 2 2010366 FORD EDGE FORD 2011 2010-10-25 #> 3 2010439 TOYOTA SIENNA TOYOTA 2011 2010-12-13 #> 4 2010463 FORD EDGE FORD 2011 2010-12-23 #> 5 2010463 FORD F-150 FORD 2011 2010-12-23 #> 6 2010463 FORD F-250 SUPER DUTY FORD 2011 2010-12-23 #> 7 2010463 FORD F-350 SUPER DUTY FORD 2011 2010-12-23 #> 8 2010463 FORD F-450 SUPER DUTY FORD 2011 2010-12-23 #> 9 2010463 FORD F-550 SUPER DUTY FORD 2011 2010-12-23 #> 10 2011046 FORD EXPLORER FORD 2011 2011-01-31 #> # ... with 1,256 more rows ``` We can then plot the total amount of recalls for each year grouped by manufacturer. ```r recall_comparison %>% ggplot() + aes(x=`Year`, color = `Make name`) + geom_point(stat='count', size=4) + scale_x_continuous(breaks = seq(2010,2021,1)) + labs(title = 'Recall Count per Year', y = 'Annual Count', color = 'Manufacturer') + theme(axis.title.x=element_blank()) ``` plot of chunk plot2 The plot shows a much lower number of recalls in 2020 for both manufactures; however, this is likely because there has not been sufficient time for recalls to have occurred on more recent products.