class: title-slide, center, middle .top-left[ <img src="images/uo-logo2.png" width="100%" /> ] .top-right[ <img src="images/psy-logo.png" width="100%" /> ] # Importing Data & Project-Oriented Workflows UO R Bootcamp 2025 --- # Importing data Importing data in R has 2 challenging aspects... -- 1. You need to call a function that works with a particular data format (e.g., `.csv`, `.txt`, `.sav`) -- 2. You need to tell R where to look for the data --- # Importing data -- .pull-left[ .center[ ### {readr} <img src="images/hex/readr.png" width="55%" /> `read_csv()`, `read_tsv()`, `read_delim()`, `read_fwf()`, etc... ] ] -- .pull-right[ .center[ ### {rio} <img src="images/hex/rio.png" width="55%" /> `import()` ] ] --- # Importing data .pull-left[ .center[ ### {readr} <img src="images/hex/readr.png" width="55%" /> `read_csv()`, `read_tsv()`, `read_delim()`, `read_fwf()`, etc... ] ] .pull-right[ .center[ ### {rio} <img src="images/hex/rio.png" width="55%" /> `import()` ✔️ ] ] --- # Importing data ### `rio::import()` With `{rio}`, we just call `import()` and under the hood it calls the right read function given the file's extension (`.csv`, `.txt`, `.sav`, `.xlsx`). -- We'll get some practice with this in a few minutes. --- # Project-oriented workflows When R looks for a file, it has a starting point. This is called the **working directory**. -- The working directory that you're currently in is displayed in the console window and the files tab. Let's take a look in RStudio. -- *** If you ever get lost, you can print your working directory with `getwd()` -- If you are working in a `.Rmd` document, R by default will set whatever folder on your computer where that `.Rmd` file lives as your working directory --- # Project-oriented workflows ``` r getwd() ``` ``` ## [1] "/Users/ishryock/Documents/GitHub/summeRbootcamp2025/static/slides" ``` For example, I created these slides in a `.Rmd` document that lives in this folder on my computer ☝️ --- class: split-three # Project-oriented workflows The best way to simplify issues with working directories is to use <u>**RStudio Projects**</u>. -- *** .column[.content[.center[ <br><br><br><br><br><br><br> ### Step 1 <img src="images/create_project1.png" width="90%" /> ]]] .column[.content[.center[ <br><br><br><br><br><br><br> ### Step 2 <img src="images/create_project2.png" width="90%" /> ]]] .column[.content[.center[ <br><br><br><br><br><br><br> ### Step 3 <img src="images/create_project3.png" width="90%" /> ]]] --- # Project-oriented workflows When you create a Project in RStudio, it is associated with a folder somewhere on your computer. -- It will automatically create a `.Rproj` file in that folder, which will keep track of the "top level" of your project. -- *** <img src="images/example_rproj.png" width="2165" /> --- class: yourturn # Your turn 1
−
+
06
:
00
Q1. Load the `{rio}` package. Q2. Run the following code to import the data called `pragmatic_scales_data.csv`. Why do you get an error? Where is this file saved? *Hint*: Look through the folder(s) in the Files pane ``` r ps_data <- import("pragmatic_scales_data.csv") ``` Q3. Fix the error in the code above to import the data. Q4. Remember that `{rio}` is flexible with file types---`rio::import()` will call the right function under the hood to read in the file based on the file extension. Use `{rio}` to import `pragmatic_scales_data.sav` (an SPSS file type) and save it to a new object named `ps_data_2`. --- class: solution # Solution .panelset[ .panel[.panel-name[Q1] ``` r library(rio) ``` ] .panel[.panel-name[Q2] ``` r ps_data <- import("06_pragmatic_scales_data.csv") ``` ``` ## Error: No such file: 06_pragmatic_scales_data.csv ``` *** The file *pragmatic_scales_data.csv* is saved in the *data* folder, so we need to tell R to look in that folder. ] .panel[.panel-name[Q3] ``` r ps_data <- import("data/06_pragmatic_scales_data.csv") ``` ] .panel[.panel-name[Q4] ``` r ps_data_2 <- import("data/06_pragmatic_scales_data.sav") ``` ] ] --- # Exporting data You can also use `{rio}` to export your data using `export()`. -- *** Here are the arguments you will need to use for `export()`: ``` r export(x, file) ``` -- `x` is the `data.frame` object in your RStudio Environment you want to export -- `file` is the path/filename for the resulting file -- *** For example, let's say I want to export `ps_data` as an `.xlsx` file and put it into the `data/` subdirectory. ``` r export(ps_data, "data/ps_data.xlsx") ``` --- class: yourturn # Your turn 2
−
+
05
:
00
1. Look through the Files pane and find the file `another_data_set.csv`. Make note of what subdirectory it is saved in. Import the data and save to an object called `another_df`. 1. Now export the data you just imported and save it into the `data/` directory. Make sure the name of the resulting file is `another_data_set`, and export it as a `.xlsx` file. 1. One of your colleagues insists you send them a `.sav` file so that they can run the analyses in SPSS. Make another copy of `another_data_set` in the `data/` subdirectory that is in the `.sav` format. --- class: solution # Solution .panelset[ .panel[.panel-name[Q1] ``` r another_df <- import("data/more_data/another_data_set.csv") ``` ] .panel[.panel-name[Q2] ``` r export(another_df, "data/another_data_set.xlsx") ``` ] .panel[.panel-name[Q3] ``` r export(another_df, "data/another_data_set.sav") ``` ] ] --- # Viewing data Now that your data is loaded in R, you'll want to take a look at it. There are a few different ways to do that, each offering different information. -- *** ### `View()` One way is to click on the `View` button in the environment pane. -- You should see `ps_data` in the environment pane with a little data table icon at the far right. Click on that icon. -- You'll notice that this ran `View(ps_data)` in the console. We could have instead just typed this directly ourselves`*`---notice the capital `V` in `View()` --- # Viewing data .panelset[ .panel[.panel-name[`head()`] You can also see just the first six rows of a data frame with `head()`, which is especially helpful for large data sets. ``` r head(ps_data) ``` ``` ## subid item correct age condition ## 1 M22 faces 1 2.00 Label ## 2 M22 houses 1 2.00 Label ## 3 M22 pasta 0 2.00 Label ## 4 M22 beds 0 2.00 Label ## 5 T22 beds 0 2.13 Label ## 6 T22 faces 0 2.13 Label ``` ] .panel[.panel-name[`tail()`] `tail()` is the complement to `head()`, displaying just the final six rows from a data frame. ``` r tail(ps_data) ``` ``` ## subid item correct age condition ## 583 MSCH84 pasta 1 2.83 No Label ## 584 MSCH84 beds 0 2.83 No Label ## 585 MSCH85 faces 0 2.69 No Label ## 586 MSCH85 houses 0 2.69 No Label ## 587 MSCH85 pasta 0 2.69 No Label ## 588 MSCH85 beds 0 2.69 No Label ``` ] .panel[.panel-name[`str()`] We saw `str()` when we first introduced data frames. It's worth mentioning it again because it can be so useful when you import data to see how your variables were read in (i.e. their types) ``` r str(ps_data) ``` ``` ## 'data.frame': 588 obs. of 5 variables: ## $ subid : chr "M22" "M22" "M22" "M22" ... ## $ item : chr "faces" "houses" "pasta" "beds" ... ## $ correct : int 1 1 0 0 0 0 1 1 0 0 ... ## $ age : num 2 2 2 2 2.13 2.13 2.13 2.13 2.32 2.32 ... ## $ condition: chr "Label" "Label" "Label" "Label" ... ``` ] .panel[.panel-name[`glimpse()`] `glimpse()` is very similar to `str()` but is a tidyverse function, and it shows you a little more raw data ``` r glimpse(ps_data) ``` ``` ## Rows: 588 ## Columns: 5 ## $ subid <chr> "M22", "M22", "M22", "M22", "T22", "T22", "T22", "T22", "T17… ## $ item <chr> "faces", "houses", "pasta", "beds", "beds", "faces", "houses… ## $ correct <int> 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, … ## $ age <dbl> 2.00, 2.00, 2.00, 2.00, 2.13, 2.13, 2.13, 2.13, 2.32, 2.32, … ## $ condition <chr> "Label", "Label", "Label", "Label", "Label", "Label", "Label… ``` ] ] --- # Viewing data A `tibble` is much like the data frame in base R, but **it has nicer printing methods**. -- As a result, you only have to call a `tibble` to see much of the information you would be interested in. --- # Viewing data ``` r ps_data <- tibble(ps_data) ps_data ``` ``` ## # A tibble: 588 × 5 ## subid item correct age condition ## <chr> <chr> <int> <dbl> <chr> ## 1 M22 faces 1 2 Label ## 2 M22 houses 1 2 Label ## 3 M22 pasta 0 2 Label ## 4 M22 beds 0 2 Label ## 5 T22 beds 0 2.13 Label ## 6 T22 faces 0 2.13 Label ## 7 T22 houses 1 2.13 Label ## 8 T22 pasta 1 2.13 Label ## 9 T17 pasta 0 2.32 Label ## 10 T17 faces 0 2.32 Label ## # ℹ 578 more rows ``` --- class: yourturn # Your turn 3
−
+
04
:
00
1. Take a look at `another_df`, which should be in your Global Environment. Click the "View" button in the Environment pane and also use `View()` in your Console. 2. Now look at some summary information about `another_df` using `str()` and `glimpse()`. *Hint*. You will need to load the tidyverse package first in order to use `glimpse()`. 3. Lastly, find the number of rows and columns in `another_df` using `nrow()` and `ncol()`, respectively. Make sure your answers match the summary information given to you above. --- class: solution # Solution .panelset[ .panel[.panel-name[Q1] ``` r View(another_df) ``` ] .panel[.panel-name[Q2] ``` r library(tidyverse) str(another_df) ``` ``` ## 'data.frame': 32 obs. of 4 variables: ## $ subid : chr "A001" "A001" "A001" "A001" ... ## $ stimuli: chr "A" "B" "C" "D" ... ## $ correct: int 0 0 1 0 1 1 1 1 0 0 ... ## $ age : num 2.5 2.5 2.5 2.5 2.75 2.75 2.75 2.75 3.6 3.6 ... ``` ``` r glimpse(another_df) ``` ``` ## Rows: 32 ## Columns: 4 ## $ subid <chr> "A001", "A001", "A001", "A001", "B002", "B002", "B002", "B002"… ## $ stimuli <chr> "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", "A… ## $ correct <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1,… ## $ age <dbl> 2.50, 2.50, 2.50, 2.50, 2.75, 2.75, 2.75, 2.75, 3.60, 3.60, 3.… ``` ] .panel[.panel-name[Q3] ``` r nrow(another_df) ``` ``` ## [1] 32 ``` ``` r ncol(another_df) ``` ``` ## [1] 4 ``` ] ] --- ## The Art of File Naming Replace `paper_final_FINAL_REALLY.docx` with a consistent pattern: ``` YYYY-MM-DD_ProjectName_DocumentType.ext ``` **Examples** ``` 2025-01-15_DissertationProposal.docx 2025-02-03_SurveyData_Raw.csv 2025-03-10_LitReview_Submitted.pdf ``` --- ## Why this helps - **Chronological ordering**: files sort by date automatically - **Clarity**: purpose is obvious from the name - **Searchability**: easier to filter/find by date or keywords - **Version control**: clean history even without Git/cloud - Pro tip: Implement a method for backing up your data **before** it's lost --- ## Collaboration & versions - Add initials for the last editor when useful: `2025-04-18_InterviewAnalysis_JS.docx` - Prefer **cloud version history** (Drive/OneDrive) instead of making duplicates --- ## Structuring Research Projects Use a numbered, reproducible layout. Everything lives under the root **ProjectName**. ```text ProjectName/ ├─ 01_Data/ │ ├─ Raw/ # original, unmodified data │ ├─ Processed/ # cleaned and transformed data │ └─ Metadata/ # codebooks, data dictionaries ├─ 02_Scripts/ │ ├─ Data_Cleaning/ # code to process raw data │ ├─ Analysis/ # statistical analysis scripts │ └─ Visualization/ # code for figures and tables ├─ 03_Output/ │ ├─ Tables/ # generated results tables │ ├─ Figures/ # plots, diagrams, visualizations │ └─ Models/ # saved statistical models ├─ 04_Documents/ │ ├─ Drafts/ # working papers, draft manuscripts │ ├─ Final/ # submitted versions │ └─ Feedback/ # reviewer comments, notes └─ README.txt # project overview, navigation guide ``` > Within each folder, apply the same **file naming** convention. --- ## Why this structure works - **Reproducibility**: trace from raw data → outputs - **Clarity**: inputs, code, and results are separated - **Collaboration**: teammates know where things live - **Publication**: easier to prep data/code for journals --- class: yourturn, center, middle # Q & A
−
+
05
:
00
--- class: yourturn, center, middle # Break!
−
+
10
:
00