This function is designed to make it easy to perform some data integrity checks, specifically checking for values that are impossible or unrealistic. These values can then be replaced by another value, or the offending cases can be deleted from the dataframe.

checkDataIntegrity(
  x,
  dat,
  newValue = NA,
  removeCases = FALSE,
  validValueSuffix = "_validValue",
  newValueSuffix = "_newValue",
  totalVarName = "numberOfInvalidValues",
  append = TRUE,
  replace = TRUE,
  silent = FALSE,
  rmarkdownOutput = FALSE,
  callingSelf = FALSE
)

Arguments

x

This can be either a vector or a list. If it is a vector, it should have two elements, the first one being a regular expression matching one or more variables in the dataframe specified in dat, and second one being the condition the matching variables have to satisfy. If it is a list, it should be a list of such vectors. The conditions should start with a Comparison operator followed by a value (e.g. "<30" or ">=0).

dat

The dataframe containing the variables of which we should check the integrity.

newValue

The new value to be assigned to cases not satisfying the specified conditions.

removeCases

Whether to delete cases that do not satisfy the criterion from the dataframe (if FALSE, they're not deleted, but the offending value is replaced by newValue).

validValueSuffix

Suffix to append to variable names when creating variable names for new variables that contain TRUE and FALSE to specify for each original variable whether its value satisfied the specified criterion.

newValueSuffix

If replace is FALSE, original values are not replaced, but instead new variables are created where the offending values have been replaced. This suffix is appended to each original variable name to create the new variable name.

totalVarName

This is the name of a variable that contains, for each case, the total number of invalid values among all variables checked.

append

Whether to append the columns to the dataframe, or only return the new columns.

replace

Whether to replace the offending values with the value specified in newValue or whether to create new columns (see newValueSuffix).

silent

Whether to display the log, or only set it as attribute of the returned dataframe.

rmarkdownOutput

Whether to format the log so that it's ready to be included in RMarkdown reports.

callingSelf

For internal use; whether the function calls itself.

Value

The dataframe with the corrections, and the log stored in attribute checkDataIntegrity_log.

Author

Gjalt-Jorn Peters

Maintainer: Gjalt-Jorn Peters gjalt-jorn@userfriendlyscience.com

Examples


### Default behavior: return dataframe with
### offending values replaced by NA

checkDataIntegrity(c('mpg', '<30'),
                   mtcars);
#> Matching cases to criterion '<30' for all variables matching regular expression 'mpg'.
#> In 4 rows, for variable 'mpg', replacing values that do not satisfy criterion 'mpg<30' with 'NA'.
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128              NA   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic           NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla        NA   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa          NA   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#>                     mpg_validValue numberOfInvalidValues
#> Mazda RX4                     TRUE                     0
#> Mazda RX4 Wag                 TRUE                     0
#> Datsun 710                    TRUE                     0
#> Hornet 4 Drive                TRUE                     0
#> Hornet Sportabout             TRUE                     0
#> Valiant                       TRUE                     0
#> Duster 360                    TRUE                     0
#> Merc 240D                     TRUE                     0
#> Merc 230                      TRUE                     0
#> Merc 280                      TRUE                     0
#> Merc 280C                     TRUE                     0
#> Merc 450SE                    TRUE                     0
#> Merc 450SL                    TRUE                     0
#> Merc 450SLC                   TRUE                     0
#> Cadillac Fleetwood            TRUE                     0
#> Lincoln Continental           TRUE                     0
#> Chrysler Imperial             TRUE                     0
#> Fiat 128                     FALSE                     1
#> Honda Civic                  FALSE                     1
#> Toyota Corolla               FALSE                     1
#> Toyota Corona                 TRUE                     0
#> Dodge Challenger              TRUE                     0
#> AMC Javelin                   TRUE                     0
#> Camaro Z28                    TRUE                     0
#> Pontiac Firebird              TRUE                     0
#> Fiat X1-9                     TRUE                     0
#> Porsche 914-2                 TRUE                     0
#> Lotus Europa                 FALSE                     1
#> Ford Pantera L                TRUE                     0
#> Ferrari Dino                  TRUE                     0
#> Maserati Bora                 TRUE                     0
#> Volvo 142E                    TRUE                     0

### Check two conditions, and instead of returning the
### dataframe with the results appended, only return the
### columns indicating which cases 'pass', what the new
### values would be, and how many invalid values were
### found for each case (to easily remove cases that
### provided many invalid values)

checkDataIntegrity(list(c('mpg', '<30'),
                        c('gear', '<5')),
                   mtcars,
                   append=FALSE);
#> Matching cases to criterion '<30' for all variables matching regular expression 'mpg'.
#> In 4 rows, for variable 'mpg', replacing values that do not satisfy criterion 'mpg<30' with 'NA'.
#> Matching cases to criterion '<5' for all variables matching regular expression 'gear'.
#> In 5 rows, for variable 'gear', replacing values that do not satisfy criterion 'gear<5' with 'NA'.
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128              NA   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic           NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla        NA   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1   NA    2
#> Lotus Europa          NA   4  95.1 113 3.77 1.513 16.90  1  1   NA    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1   NA    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1   NA    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1   NA    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#>                     mpg_validValue numberOfInvalidValues gear_validValue
#> Mazda RX4                     TRUE                     0            TRUE
#> Mazda RX4 Wag                 TRUE                     0            TRUE
#> Datsun 710                    TRUE                     0            TRUE
#> Hornet 4 Drive                TRUE                     0            TRUE
#> Hornet Sportabout             TRUE                     0            TRUE
#> Valiant                       TRUE                     0            TRUE
#> Duster 360                    TRUE                     0            TRUE
#> Merc 240D                     TRUE                     0            TRUE
#> Merc 230                      TRUE                     0            TRUE
#> Merc 280                      TRUE                     0            TRUE
#> Merc 280C                     TRUE                     0            TRUE
#> Merc 450SE                    TRUE                     0            TRUE
#> Merc 450SL                    TRUE                     0            TRUE
#> Merc 450SLC                   TRUE                     0            TRUE
#> Cadillac Fleetwood            TRUE                     0            TRUE
#> Lincoln Continental           TRUE                     0            TRUE
#> Chrysler Imperial             TRUE                     0            TRUE
#> Fiat 128                     FALSE                     1            TRUE
#> Honda Civic                  FALSE                     1            TRUE
#> Toyota Corolla               FALSE                     1            TRUE
#> Toyota Corona                 TRUE                     0            TRUE
#> Dodge Challenger              TRUE                     0            TRUE
#> AMC Javelin                   TRUE                     0            TRUE
#> Camaro Z28                    TRUE                     0            TRUE
#> Pontiac Firebird              TRUE                     0            TRUE
#> Fiat X1-9                     TRUE                     0            TRUE
#> Porsche 914-2                 TRUE                     1           FALSE
#> Lotus Europa                 FALSE                     2           FALSE
#> Ford Pantera L                TRUE                     1           FALSE
#> Ferrari Dino                  TRUE                     1           FALSE
#> Maserati Bora                 TRUE                     1           FALSE
#> Volvo 142E                    TRUE                     0            TRUE