Lv cleaning edits2 4 (!5) · Merge requests · the-graph-courses / further-data-analysis-with-r-staging

Laure Vancau requested to merge lv-cleaning-edits2-4 into master Jun 02, 2022

Global comments:

I really like your recap of the previous data cleaning steps in “Our Data”. How about turning it into a list and putting it inside a recap div?
All the lessons knit perfectly: very good.
Your general flow for the chapter:

  Familiarize yourself with the data set (Chapter 0)

  Check for structural errors (Chapter 1: column names; Chapter 2: remove empty)

  Check for data irregularities (Chapter 3: deduplication; Chapter 4: transformations)

We need to rework lesson 4 (please see comments below)

Lesson 2:

What about partially incomplete columns like sequaela?

Pipeline proposition: I think it would be good to explain here why checking for empty rows and eliminating them matters, for example when calculating a percentage. So you check for empty rows at the start, then you do some filtering and selecting, and then you want to perform an operation, BUT you should first check again that you do not have empty rows. (Do you see my point?)
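
A minimal sketch of this point, assuming dplyr and janitor are available. The toy data frame, its column names and its values are made up for illustration and are not taken from the yaounde dataset:

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: row 3 is entirely empty, row 5 only has an id
toy <- data.frame(
  id     = c(1, 2, NA, 4, 5),
  sex    = c("Male", "Female", NA, "Female", NA),
  tested = c("Yes", "No", NA, "Yes", NA)
)

# First check: drop rows where every column is NA (removes row 3)
toy_clean <- toy %>% remove_empty(which = "rows")

# Selecting columns can create new empty rows: the row with id 5
# is now NA in every remaining column
toy_sub <- toy_clean %>% select(sex, tested)

# Percentage computed without re-checking: the empty row inflates the denominator
pct_before <- 100 * sum(toy_sub$tested == "Yes", na.rm = TRUE) / nrow(toy_sub)

# Second check, right before the operation
toy_sub_clean <- toy_sub %>% remove_empty(which = "rows")
pct_after <- 100 * sum(toy_sub_clean$tested == "Yes", na.rm = TRUE) / nrow(toy_sub_clean)
```

Here `pct_before` and `pct_after` differ only because of the empty row that appeared after `select()`, which is the situation the second check is meant to catch.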

Lesson 3:

I like your vector and dataset approach: it makes things very clear. Add-ons:

Explain the output of distinct with id_ind
Add an example with distinct and two variables
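
The two add-ons above could look something like this sketch. Only `id_ind` comes from the lesson; the `visit` and `weight` columns and all values are hypothetical:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  id_ind = c("A", "A", "B", "B", "C"),
  visit  = c(1, 1, 1, 2, 1),
  weight = c(60, 61, 72, 72, 55)
)

# One variable: returns one row per unique id_ind, and only that column
by_id <- distinct(toy, id_ind)

# .keep_all = TRUE keeps the other columns, taking the first row found for each id_ind
by_id_all <- distinct(toy, id_ind, .keep_all = TRUE)

# Two variables: a row is dropped only if BOTH id_ind and visit repeat
by_two <- distinct(toy, id_ind, visit, .keep_all = TRUE)
```

With one variable, individual B appears once; with two variables, B keeps both visits, while the repeated A / visit 1 row is dropped.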

Pipeline proposition:

Investigate duplicates with get_dupes
Remove with duplicated, unique or distinct
For duplicates based on specific variables, use distinct
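
A rough sketch of this pipeline, assuming dplyr and janitor are loaded and using base R's `duplicated()` for the base removal step. The `age` column and all values are made up:

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: the two "A" rows are exact duplicates,
# the two "C" rows share an id but differ in age
toy <- data.frame(
  id_ind = c("A", "A", "B", "C", "C"),
  age    = c(30, 30, 41, 25, 26)
)

# 1. Investigate duplicates
dupes_full <- get_dupes(toy)          # no variables given: compares all columns
dupes_id   <- get_dupes(toy, id_ind)  # duplicates on id_ind only

# 2. Remove exact duplicates (three equivalent options on this data)
no_exact_base   <- toy[!duplicated(toy), ]
no_exact_unique <- unique(toy)
no_exact_dplyr  <- distinct(toy)

# 3. Remove duplicates based on a specific variable
one_per_id <- distinct(toy, id_ind, .keep_all = TRUE)
```

`get_dupes()` adds a `dupe_count` column, which makes it convenient for looking at duplicates before deciding how to remove them.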

Lesson 4:

To me, this lesson lacks structure. I do not think it makes sense to repeat what was already introduced in the mutate chapter.

The goal of this lesson is: how do you clean INSIDE columns?

Here is what I think is relevant from what you have:

Transforming to factors those columns that should be factors (the ones you listed : sex, age_category, education, occupation, is_drug_parac, is_drug_antibio etc)
Recoding
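
As an illustration of these two points, here is a small sketch with dplyr. The column names `sex` and `age_category` come from the list above, but the values and the new labels are hypothetical, and `recode()` is only one of several ways to do this:

```r
library(dplyr)

# Hypothetical toy data with character columns that should be factors
toy <- data.frame(
  sex          = c("Male", "Female", "Female"),
  age_category = c("15 - 29", "30 - 44", "15 - 29"),
  stringsAsFactors = FALSE
)

# Transform several columns to factors in one step
toy_fct <- toy %>%
  mutate(across(c(sex, age_category), as.factor))

# Recode the levels of one column (old value = new value)
toy_rec <- toy_fct %>%
  mutate(sex = recode(sex, "Male" = "M", "Female" = "F"))
```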

Here are the necessary add-ons: (If the yaounde dataset does not have the material for it already then you can create it either when preparing the data or within the lesson)

Recode using replace_na
Use unique on different columns to look at how the variables are encoded and see whether there are errors. Maybe some values are “Unknown” instead of “NA” (when truthfully they are the same). (If there is nothing to comment on, then you need to introduce some.) Correct them with gsub and na_if (see here, for the gender column for example: https://towardsdatascience.com/data-cleaning-in-r-made-simple-1b77303b0b17)
Zeros instead of NA/null values
Check for invalid values (in age for example: https://towardsdatascience.com/data-cleaning-in-r-made-simple-1b77303b0b17)
Obvious inconsistencies (x_nonnegative <- x >= 0 for height, weight)
For numerical columns, check that they are not Inf or another special value with is.finite, which determines which values are “regular” values.
Numeric values stored as text/character data types
Misspellings / White space in string columns using gsub
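
Several of the add-ons above can be sketched on one small example, assuming dplyr and tidyr. The data is entirely made up, and the age bounds (0 to 120), the choice to treat a weight of 0 as missing, and the "Not recorded" label are assumptions for illustration only:

```r
library(dplyr)
library(tidyr)

# Hypothetical messy data
toy <- data.frame(
  sex    = c(" male", "Female", "Unknown", NA, "femal"),
  age    = c("34", "27", "-5", "41", "130"),  # numeric values stored as text
  weight = c(70, 0, 65, Inf, 58),
  stringsAsFactors = FALSE
)

# Look at the encoding of a column to spot errors
unique(toy$sex)

# Obvious inconsistencies: flag negative values
weight_nonnegative <- toy$weight >= 0

cleaned <- toy %>%
  mutate(
    sex = trimws(sex),                       # white space
    sex = gsub("^male$", "Male", sex),       # inconsistent capitalisation
    sex = gsub("^femal$", "Female", sex),    # misspelling
    sex = na_if(sex, "Unknown"),             # "Unknown" and NA mean the same here
    age = as.numeric(age),                   # text to numeric
    age = if_else(age >= 0 & age <= 120, age, NA_real_),  # invalid values (assumed bounds)
    weight = na_if(weight, 0),               # zero used instead of NA (assumption)
    weight = if_else(is.finite(weight), weight, NA_real_), # Inf and other special values
    sex_label = replace_na(sex, "Not recorded")  # explicit label for missing values
  )
```

Each line of the `mutate()` call handles one kind of problem, so the lesson could introduce them one at a time.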

Edited Jun 02, 2022 by Laure Vancau
