Data Analytics and Visualization using R

In this document we use R, an open-source programming language for statistical computing and graphics, to automate data analysis and visualization.

Hypotheses

H0: There is no significant difference in wage between Information and Industrial workers
H1: There is a significant difference in wage between Information and Industrial workers
H0: Education, race, and job class have no significant influence on salary/wage
H1: Education, race, and job class have a significant influence on salary/wage

The variables used in this data analysis are summarized below.

Table 1 Demographic Characteristics

Characteristic	N = 3,000¹
education
1. < HS Grad	268 (8.9%)
2. HS Grad	971 (32%)
3. Some College	650 (22%)
4. College Grad	685 (23%)
5. Advanced Degree	426 (14%)
age	42 (34, 51)
health
1. <=Good	858 (29%)
2. >=Very Good	2,142 (71%)
¹ n (%); Median (IQR)

Table 2 Distribution by Marital Status and Health Insurance

Characteristic	N = 3,000¹
maritl
1. Never Married	648 (22%)
2. Married	2,074 (69%)
3. Widowed	19 (0.6%)
4. Divorced	204 (6.8%)
5. Separated	55 (1.8%)
health_ins
1. Yes	2,083 (69%)
2. No	917 (31%)
¹ n (%)

Table 3 Socio-economic Factors

Characteristic	N = 3,000¹
wage	105 (85, 129)
jobclass
1. Industrial	1,544 (51%)
2. Information	1,456 (49%)
race
1. White	2,480 (83%)
2. Black	293 (9.8%)
3. Asian	190 (6.3%)
4. Other	37 (1.2%)
¹ Median (IQR); n (%)

Characteristic	N = 3,000¹
region
1. New England	0 (0%)
2. Middle Atlantic	3,000 (100%)
3. East North Central	0 (0%)
4. West North Central	0 (0%)
5. South Atlantic	0 (0%)
6. East South Central	0 (0%)
7. West South Central	0 (0%)
8. Mountain	0 (0%)
9. Pacific	0 (0%)
¹ n (%)

Characteristic	N = 3,000¹
year
2003	513 (17%)
2004	485 (16%)
2005	447 (15%)
2006	392 (13%)
2007	386 (13%)
2008	388 (13%)
2009	389 (13%)
¹ n (%)

Linear regression analysis

Parameter                 | Coefficient |           95% CI | t(2998) |      p | Std. Coef. | Std. Coef. 95% CI |      Fit
-------------------------------------------------------------------------------------------------------------------------
(Intercept)               |      103.32 | [101.28, 105.36] |   99.43 | < .001 |      -0.20 |    [-0.25, -0.15] |         
jobclass [2. Information] |       17.27 | [ 14.35,  20.20] |   11.58 | < .001 |       0.41 |    [ 0.34,  0.48] |         
                          |             |                  |         |        |            |                   |         
AIC                       |             |                  |         |        |            |                   | 30774.50
AICc                      |             |                  |         |        |            |                   | 30774.51
BIC                       |             |                  |         |        |            |                   | 30792.52
R2                        |             |                  |         |        |            |                   |     0.04
R2 (adj.)                 |             |                  |         |        |            |                   |     0.04
Sigma                     |             |                  |         |        |            |                   |    40.83

In this section we used the report() function from the report package to generate a narrative for the simple linear regression of wage on jobclass, a binary predictor.

We fitted a linear model (estimated using OLS) to predict wage with jobclass (formula: wage ~ jobclass). The model explains a statistically significant and weak proportion of variance (R2 = 0.04, F(1, 2998) = 134.07, p < .001, adj. R2 = 0.04). The model’s intercept, corresponding to jobclass = 1. Industrial, is at 103.32 (95% CI [101.28, 105.36], t(2998) = 99.43, p < .001). Within this model:

The effect of jobclass [2. Information] is statistically significant and positive (beta = 17.27, 95% CI [14.35, 20.20], t(2998) = 11.58, p < .001; Std. beta = 0.41, 95% CI [0.34, 0.48])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation.
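As a rough consistency check, the figures in the narrative above can be re-derived from one another with a few lines of base R. This is only a sketch: it uses the rounded values printed in the report rather than the raw data, so small rounding differences are expected, and all variable names are illustrative.

```r
# Values copied from the regression output above (rounded as reported)
intercept <- 103.32   # mean wage for the reference level, 1. Industrial
beta_info <- 17.27    # coefficient for jobclass [2. Information]
ci_low    <- 14.35    # lower bound of the 95% CI for beta_info
ci_high   <- 20.20    # upper bound of the 95% CI for beta_info
df_resid  <- 2998     # residual degrees of freedom

# Predicted mean wage for Information workers = intercept + coefficient
pred_info <- intercept + beta_info

# Recover the standard error from the width of the 95% Wald CI
t_crit <- qt(0.975, df_resid)
se     <- (ci_high - ci_low) / (2 * t_crit)

# t statistic; with a single predictor, the model F statistic equals t squared
t_stat <- beta_info / se
f_stat <- t_stat^2

# R-squared implied by F with 1 and df_resid degrees of freedom
r2 <- f_stat / (f_stat + df_resid)

print(round(c(pred_info = pred_info, se = se, t = t_stat, F = f_stat, R2 = r2), 3))
```

The recovered t and F values should land close to the reported t(2998) = 11.58 and F(1, 2998) = 134.07, and the implied R2 is consistent with the reported 0.04, which is why the effect is significant yet explains little of the variance in wage.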

Multiple linear regression analysis

In this model, wage/salary is the dependent variable, and we explore the influence of the employee's job class, education, and race on it.

Table of multiple linear regression analysis

Characteristic	Beta	95% CI¹	p-value
jobclass
1. Industrial	—	—
2. Information	6.0	3.2, 8.8	<0.001
education
1. < HS Grad	—	—
2. HS Grad	11	5.9, 16	<0.001
3. Some College	22	17, 28	<0.001
4. College Grad	38	33, 43	<0.001
5. Advanced Degree	63	58, 69	<0.001
race
1. White	—	—
2. Black	-7.5	-12, -3.1	<0.001
3. Asian	-4.0	-9.4, 1.5	0.2
4. Other	-12	-23, 0.33	0.057
¹ CI = Confidence Interval
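Each Beta in the table is an adjusted difference in expected wage relative to the reference level of its variable (1. Industrial, 1. < HS Grad, 1. White). The sketch below, which assumes an additive model without interactions (as the table suggests) and uses illustrative variable names, shows how the coefficients combine. The intercept is not reported in the table, so only differences between profiles can be computed, not absolute wages.

```r
# Coefficients copied from the table above; reference levels have Beta = 0
beta <- c(
  "jobclass: 2. Information"     = 6.0,
  "education: 4. College Grad"   = 38,
  "education: 5. Advanced Degree" = 63,
  "race: 2. Black"               = -7.5
)

# Adjusted wage difference between an Information worker who is a college
# graduate and Black, and the reference profile (Industrial, < HS Grad, White)
profile <- c("jobclass: 2. Information",
             "education: 4. College Grad",
             "race: 2. Black")
diff_vs_reference <- sum(beta[profile])

# Adjusted gap between Advanced Degree and College Grad, other factors held fixed
diff_adv_vs_college <- unname(beta["education: 5. Advanced Degree"] -
                              beta["education: 4. College Grad"])

print(c(diff_vs_reference = diff_vs_reference,
        diff_adv_vs_college = diff_adv_vs_college))
```

Note that the jobclass coefficient drops from 17.27 in the simple model to 6.0 here, which suggests that much of the raw wage gap between job classes is associated with differences in education and race.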

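# Packages assumed by the calls below: ISLR (Wage data), dplyr (%>%, select,
# group_by), SmartEDA (ExpCatViz), sjPlot (plot_frq, plot_grid, plot_xtab, tab_xtab)
library(ISLR)
library(dplyr)
library(SmartEDA)
library(sjPlot)

# Bar plots of the selected categorical variables against the target, education
ExpCatViz(
  Wage %>%
    select(education, jobclass),
  target = "education"
)

# Frequency plot of education
plot_frq(Wage$education)

# Frequency plots of education within each race group, arranged in a grid
Wage %>%
  group_by(race) %>%
  plot_frq(education) %>%
  plot_grid()

# Optionally save a stored plot object p:
# save_plot(filename = "myplot.png", fig = p, width = 30, height = 19)

# Stacked bar chart of jobclass within each education level (row percentages)
plot_xtab(x = Wage$education,
          grp = Wage$jobclass,
          margin = "row",
          bar.pos = "stack",
          show.summary = TRUE,
          coord.flip = TRUE)

# Cross-tabulation of education by jobclass with row percentages
tab_xtab(var.row = Wage$education,
         var.col = Wage$jobclass,
         show.row.prc = TRUE)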

education	jobclass		Total
education	1. Industrial	2. Information	Total
1. < HS Grad	190 70.9 %	78 29.1 %	268 100 %
2. HS Grad	636 65.5 %	335 34.5 %	971 100 %
3. Some College	342 52.6 %	308 47.4 %	650 100 %
4. College Grad	274 40 %	411 60 %	685 100 %
5. Advanced Degree	102 23.9 %	324 76.1 %	426 100 %
Total	1544 51.5 %	1456 48.5 %	3000 100 %
χ²=282.643 · df=4 · Cramér's V=0.307 · p<0.001
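The summary statistics under the cross-tabulation can be reproduced from the cell counts alone. The following base R sketch re-enters the counts from the table above (object names are illustrative) and computes the row percentages, Pearson's chi-squared test, and Cramér's V.

```r
# Cell counts copied from the education-by-jobclass table above
counts <- matrix(
  c(190,  78,
    636, 335,
    342, 308,
    274, 411,
    102, 324),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    education = c("1. < HS Grad", "2. HS Grad", "3. Some College",
                  "4. College Grad", "5. Advanced Degree"),
    jobclass  = c("1. Industrial", "2. Information")
  )
)

# Row percentages, as shown in the table
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Pearson's chi-squared test of independence
test  <- chisq.test(counts)
chi2  <- unname(test$statistic)
df    <- unname(test$parameter)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n         <- sum(counts)
cramers_v <- sqrt(chi2 / (n * (min(dim(counts)) - 1)))

print(row_pct)
print(round(c(chi2 = chi2, df = df, cramers_v = cramers_v), 3))
print(test$p.value)
```

A Cramér's V of about 0.31 indicates a moderate association: the share of Information jobs rises steadily with education level, from 29.1 % among those without a high school diploma to 76.1 % among those with an advanced degree.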

Analyses were conducted using the R statistical language (version 4.3.2; R Core Team, 2023) on Windows 10 x64 (build 19045), using the packages lme4 (version 1.1.35.1; Bates D et al., 2015), Matrix (version 1.6.1.1; Bates D et al., 2023), likert (version 1.3.5; Bryer J, Speerschneider K, 2016), glmulti (version 1.0.8; Calcagno V, 2020), xtable (version 1.8.4; Dahl D et al., 2019), SmartEDA (version 0.3.9; Dayanand Ubrangala et al., 2022), effects (version 4.2.2; Fox J, Weisberg S, 2019), carData (version 3.0.5; Fox J et al., 2022), lubridate (version 1.9.3; Grolemund G, Wickham H, 2011), DiagrammeR (version 1.0.10; Iannone R, 2023), ISLR (version 1.4; James G et al., 2021), lmerTest (version 3.1.3; Kuznetsova A et al., 2017), sjPlot (version 2.8.15; Lüdecke D, 2023), performance (version 0.10.8; Lüdecke D et al., 2021), report (version 0.5.8; Makowski D et al., 2023), e1071 (version 1.7.14; Meyer D et al., 2023), leaps (version 3.1; Lumley T, based on Fortran code by Miller A, 2020), tibble (version 3.2.1; Müller K, Wickham H, 2023), dlookr (version 0.6.3; Ryu C, 2024), gtsummary (version 1.7.2; Sjoberg D et al., 2021), rJava (version 1.0.11; Urbanek S, 2024), ggplot2 (version 3.4.4; Wickham H, 2016), forcats (version 1.0.0; Wickham H, 2023), stringr (version 1.5.1; Wickham H, 2023), tidyverse (version 2.0.0; Wickham H et al., 2019), dplyr (version 1.1.4; Wickham H et al., 2023), purrr (version 1.0.2; Wickham H, Henry L, 2023), readr (version 2.1.5; Wickham H et al., 2024) and tidyr (version 1.3.1; Wickham H et al., 2024).

References

Bates D, Mächler M, Bolker B, Walker S (2015). “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01 https://doi.org/10.18637/jss.v067.i01.
Bates D, Maechler M, Jagan M (2023). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.6-1.1, https://CRAN.R-project.org/package=Matrix.
Bryer J, Speerschneider K (2016). likert: Analysis and Visualization Likert Items. R package version 1.3.5, https://CRAN.R-project.org/package=likert.
Calcagno V (2020). glmulti: Model Selection and Multimodel Inference Made Easy. R package version 1.0.8, https://CRAN.R-project.org/package=glmulti.
Dahl D, Scott D, Roosen C, Magnusson A, Swinton J (2019). xtable: Export Tables to LaTeX or HTML. R package version 1.8-4, https://CRAN.R-project.org/package=xtable.
Dayanand Ubrangala, R K, Prasad Kondapalli R, Putatunda S (2022). SmartEDA: Summarize and Explore the Data. R package version 0.3.9, https://CRAN.R-project.org/package=SmartEDA.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, 3rd edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/index.html.
Fox J, Weisberg S (2018). “Visualizing Fit and Lack of Fit in Complex Regression Models with Predictor Effect Plots and Partial Residuals.” Journal of Statistical Software, 87(9), 1-27. doi:10.18637/jss.v087.i09 https://doi.org/10.18637/jss.v087.i09.
Fox J (2003). “Effect Displays in R for Generalised Linear Models.” Journal of Statistical Software, 8(15), 1-27. doi:10.18637/jss.v008.i15 https://doi.org/10.18637/jss.v008.i15.
Fox J, Hong J (2009). “Effect Displays in R for Multinomial and Proportional-Odds Logit Models: Extensions to the effects Package.” Journal of Statistical Software, 32(1), 1-24. doi:10.18637/jss.v032.i01 https://doi.org/10.18637/jss.v032.i01.
Fox J, Weisberg S, Price B (2022). carData: Companion to Applied Regression Data Sets. R package version 3.0-5, https://CRAN.R-project.org/package=carData.
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Iannone R (2023). DiagrammeR: Graph/Network Visualization. R package version 1.0.10, https://CRAN.R-project.org/package=DiagrammeR.
James G, Witten D, Hastie T, Tibshirani R (2021). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.4, https://CRAN.R-project.org/package=ISLR.
Kuznetsova A, Brockhoff PB, Christensen RHB (2017). “lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical Software, 82(13), 1-26. doi:10.18637/jss.v082.i13 https://doi.org/10.18637/jss.v082.i13.
Lüdecke D (2023). sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.15, https://CRAN.R-project.org/package=sjPlot.
Lüdecke D, Ben-Shachar M, Patil I, Waggoner P, Makowski D (2021). “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software, 6(60), 3139. doi:10.21105/joss.03139 https://doi.org/10.21105/joss.03139.
Makowski D, Lüdecke D, Patil I, Thériault R, Ben-Shachar M, Wiernik B (2023). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://easystats.github.io/report/.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2023). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-14, https://CRAN.R-project.org/package=e1071.
Lumley T, based on Fortran code by Miller A (2020). leaps: Regression Subset Selection. R package version 3.1, https://CRAN.R-project.org/package=leaps.
Müller K, Wickham H (2023). tibble: Simple Data Frames. R package version 3.2.1, https://CRAN.R-project.org/package=tibble.
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Ryu C (2024). dlookr: Tools for Data Diagnosis, Exploration, Transformation. R package version 0.6.3, https://CRAN.R-project.org/package=dlookr.
Sjoberg D, Whiting K, Curry M, Lavery J, Larmarange J (2021). “Reproducible Summary Tables with the gtsummary Package.” The R Journal, 13, 570-580. doi:10.32614/RJ-2021-053 https://doi.org/10.32614/RJ-2021-053.
Urbanek S (2024). rJava: Low-Level R to Java Interface. R package version 1.0-11, https://CRAN.R-project.org/package=rJava.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
Wickham H (2023). forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0, https://CRAN.R-project.org/package=forcats.
Wickham H (2023). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1, https://CRAN.R-project.org/package=stringr.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.
Wickham H, Henry L (2023). purrr: Functional Programming Tools. R package version 1.0.2, https://CRAN.R-project.org/package=purrr.
Wickham H, Hester J, Bryan J (2024). readr: Read Rectangular Text Data. R package version 2.1.5, https://CRAN.R-project.org/package=readr.
Wickham H, Vaughan D, Girlich M (2024). tidyr: Tidy Messy Data. R package version 1.3.1, https://CRAN.R-project.org/package=tidyr.