
% A rapid method that creates many corrected errors, has efficient error correction, and leaves
% few uncorrected errors can still be considered a successful method, since it produces
% accurate text in relatively little time. pp. 56 MacKenzie
\section{Results of the Main User Study}
\label{sec:results}
This section presents the statistical analysis of the data obtained in the
main within-subject user study (n = 24), which consisted of five repeated
measurements. Because the data came from related, dependent groups, we used
\textit{\gls{rmANOVA}} if all required assumptions were met and
\textit{Friedman's Test} otherwise. To identify the specific pairs of treatments
that differed significantly, we ran either \textit{Dependent T-Tests} or
\textit{Wilcoxon Signed Rank Tests} (both with \textit{Holm correction
  (sequentially rejective Bonferroni test)} \cite{holm_correction}) as post-hoc
tests \cite{field_stats, downey_stats}. The reliability of the two sub-scales
(hedonic and pragmatic quality) in the \glsfirst{UEQ-S} was estimated using
\textit{Cronbach's alpha} \cite{tavakol_cronbachs_alpha}. Results are
reported as statistically significant at an $\alpha$-level of 0.05 ($p < 0.05$).
Where confidence intervals are presented, they are 95\,\% intervals. Normality
of data or residuals was checked by visual assessment of \gls{Q-Q} plots and
additionally with the \textit{Shapiro-Wilk Test}. Further, we used
\textit{Mauchly's Test for Sphericity} to evaluate whether the variances of the
differences between the compared groups varied significantly \cite{field_stats,
  downey_stats}.
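
For reference, the Holm correction mentioned above can be summarized as
follows. The notation is introduced here for illustration only: $m$ denotes the
number of pairwise comparisons and $p_{(i)}$ the $i$-th smallest unadjusted p
value. With $p_{(1)} \leq \dots \leq p_{(m)}$, the adjusted p values are
\[
  \tilde{p}_{(i)} = \max_{j \leq i} \, \min\bigl(1, (m - j + 1)\, p_{(j)}\bigr),
  \qquad i = 1, \dots, m,
\]
and a comparison is considered significant if $\tilde{p}_{(i)} < 0.05$. If all
$m = 6$ pairs of the four test keyboards are compared, the smallest unadjusted p
value is therefore multiplied by six, the second smallest by five, and so on.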

\subsection{Own Keyboard}
\label{sec:res_OPC}
As mentioned in Section \ref{sec:main_design}, the keyboard \textit{Own} was
used as a reference for some metrics captured during the experiment. Since the
measurements with \textit{Own} took place at the start (T0\_1) and end (T0\_2)
of the experiment, we compared the results of both typing tests to detect
possible variations in performance due to fatigue. Dependent T-tests showed no
significant difference in \glsfirst{KSPS} between T0\_1 (M = 5.39, sd = 1.49)
and T0\_2 (M = 5.47, sd = 1.48, t = -1.53, p = 0.139). \glsfirst{UER} was
negligible overall in both T0\_1 (M = 0.005, sd = 0.013, 85th percentile =
0.0051) and T0\_2 (M = 0.008, sd = 0.028, 85th percentile = 0.0052).
\glsfirst{WPM} showed a trend towards significance between T0\_1 (M = 54.2, sd
= 14.7) and T0\_2 (M = 53.0, sd = 14.5, t = 1.92, p = 0.067). The dependent
T-tests did reveal statistically significant differences in \glsfirst{AdjWPM}
between T0\_1 (M = 53.9, sd = 14.5) and T0\_2 (M = 52.5, sd = 14.3, t = 2.44,
p = 0.023), in \glsfirst{CER} between T0\_1 (M = 0.057, sd = 0.028) and T0\_2
(M = 0.078, sd = 0.038, t = -3.54, p = 0.002) and in \glsfirst{TER} between
T0\_1 (M = 0.063, sd = 0.031) and T0\_2 (M = 0.086, sd = 0.039, t = -4.27, p =
0.0003). Because of these differences, we decided to use, for each participant,
the mean of T0\_1 and T0\_2 for every metric as the reference value to compute
the \textit{\gls{OPC}} for the test keyboards (\textit{Athena, Aphrodite, Nyx}
and \textit{Hera}). This value was later used to compare the performance of the
individual test keyboards with that of the participant's own, familiar
keyboard.
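
Assuming that \gls{OPC} is computed as a simple ratio, as the description above
suggests, its value for a metric $x$, participant $i$ and test keyboard $k$ can
be sketched as follows (the symbols $x_{i,k}$ and $\bar{x}_{i,\mathrm{Own}}$
are notation introduced here for illustration):
\[
  \bar{x}_{i,\mathrm{Own}} = \frac{x_{i,\mathrm{T0\_1}} + x_{i,\mathrm{T0\_2}}}{2},
  \qquad
  \mathrm{OPC}_{x}(i, k) = \frac{x_{i,k}}{\bar{x}_{i,\mathrm{Own}}} \cdot 100\,\% .
\]
As a purely hypothetical illustration, a participant who averages 54
\gls{WPM} with \textit{Own} and reaches 51 \gls{WPM} with a test keyboard would
obtain $51 / 54 \cdot 100\,\% \approx 94.4\,\%$. Values above 100\,\% indicate
that the metric was higher with the test keyboard than with \textit{Own}.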

Additionally, using a dependent T-test, we compared the muscle activity (\% of
\glsfirst{MVC}) and found a significant difference in left flexor
(\glsfirst{FDP} \& \glsfirst{FDS}) \%\gls{MVC} between T0\_1 (M = 12.0, sd =
8.27) and T0\_2 (M = 8.53, sd = 7.16, t = 3.18, p = 0.004). Residuals for the
right flexor (\gls{FDP} \& \gls{FDS}) were not normally distributed; we
therefore used the Wilcoxon Signed Rank Test and found a significant difference
between T0\_1 (M = 10.8, sd = 8.18, Med = 9.52) and T0\_2 (M = 7.71, sd = 6.08,
Med = 5.32, p = 0.021). Note that we had to remove two erroneous measurements
for the right flexor (n = 22). No significant differences were found in left or
right extensor (\glsfirst{ED}) \%\gls{MVC} between T0\_1 and T0\_2. All results
are listed in Table \ref{tbl:res_own_before_after}.

\begin{table}[H]
  \centering
  \small
  \ra{1.3}
  \begin{tabular}{?l^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Y          & Comparison    & Statistic & p            & Estimate & CI             & Hypothesis \\
    \midrule
    \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}}                                     \\
    WPM          & T0\_1 - T0\_2 & 1.92      & $0.07^\dagger$ & 1.18     & [-0.09, 2.45]  & two-tailed \\
    AdjWPM       & T0\_1 - T0\_2 & 2.44      & $0.02^*$       & 1.35     & [0.21, 2.50]   & two-tailed \\
    KSPS         & T0\_1 - T0\_2 & -1.53     & 0.14           & -0.08    & [-0.19, 0.03]  & two-tailed \\
    CER          & T0\_1 - T0\_2 & -3.54     & $0.002^*$      & -0.02    & [-0.03, -0.01] & two-tailed \\
    TER          & T0\_1 - T0\_2 & -4.27     & $0.0003^*$     & -0.02    & [-0.03, -0.01] & two-tailed \\
    $\%MVC_{LF}$ & T0\_1 - T0\_2 & 3.18      & $0.004^*$      & 3.44     & [1.20, 5.68]   & two-tailed \\
    $\%MVC_{LE}$ & T0\_1 - T0\_2 & 1.44      & 0.163          & 0.956    & [-0.42, 2.33]  & two-tailed \\
    \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}}                            \\
    $\%MVC_{RF}$ & T0\_1 - T0\_2 & 197       & $0.021^*$      & 1.83     & [0.39, 3.93]   & two-tailed \\
    $\%MVC_{RE}$ & T0\_1 - T0\_2 & 173       & 0.527          & 0.28     & [-0.58, 0.91]  & two-tailed \\
    \bottomrule
  \end{tabular}
  \caption{Statistical analysis of differences between typing tests T0\_1 and
    T0\_2 for keyboard \textit{Own}. For $\%MVC_{RF}$ two erroneous measurements
    were removed (n = 22). Statistically significant differences ($p < 0.05$) are
    marked with an asterisk and p values indicating a trend towards significance
    are denoted with $\dagger$. Confidence intervals are given for the estimate
    of the difference in means (T-test) and the difference of the location
    parameter (Wilcoxon). The subscripts LF, RF, LE and RE stand for left or
    right forearm flexor or extensor muscles}
  \label{tbl:res_own_before_after}
\end{table}
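
The confidence intervals for the dependent T-tests in Table
\ref{tbl:res_own_before_after} can be related to the reported statistics as
follows. This is a sketch based on the standard formulas for the paired T-test;
$\bar{d}$ (mean of the paired differences), $s_d$ (their standard deviation)
and $n$ (number of pairs) are notation introduced here for illustration:
\[
  t = \frac{\bar{d}}{s_d / \sqrt{n}},
  \qquad
  \mathrm{CI}_{95\,\%} = \bar{d} \pm t_{0.975,\, n - 1} \cdot \frac{s_d}{\sqrt{n}} .
\]
For \gls{WPM}, $\bar{d} = 1.18$ and $t = 1.92$ imply a standard error of
$1.18 / 1.92 \approx 0.615$. With $t_{0.975,\, 23} \approx 2.069$ this gives
$1.18 \pm 1.27$, which agrees with the reported interval $[-0.09, 2.45]$.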

We also evaluated the means of \glsfirst{KCQ} questions 8 to 12, which concerned
perceived fatigue in fingers, wrists, arms, shoulders and neck respectively
(7-point Likert scale), as well as the slopes (improving, deteriorating, stable)
of the \gls{UX Curve}s drawn by each participant after the whole experiment, to
identify possible differences in perceived fatigue from T0\_1 to T0\_2. As shown
in Figure \ref{fig:res_own_per_fat}, participants reported in the \gls{KCQ}
slight improvements in finger (diff = 0.33) and wrist (diff = 0.33) fatigue in
T0\_2 compared to T0\_1, no difference in arm fatigue (diff = 0) and very
slightly increased fatigue in shoulder (diff = -0.12) and neck (diff = -0.13) in
T0\_2 compared to T0\_1. Sixteen of the twenty-four \gls{UX Curve}s regarding
overall perceived fatigue had a positive slope when measured from the start of
T0\_1 to the end of T0\_2 ($\pm$ 1 mm). The subjective reports of decreased
finger and wrist fatigue support the decrease in muscle activity of the flexor
muscles described above.

\begin{figure}[H]
  \centering
  \includegraphics[width=0.98\textwidth]{images/res_own_per_fat}
  \caption{Trends for reported fatigue through the \gls{KCQ} (questions 8:
    finger, 9: wrist, 10: arm, 11: shoulder, 12: neck) and histogram for the
    slopes (IM: improving, DE: deteriorating, ST: stable) of \gls{UX Curve}s
    concerning perceived fatigue. The curves were evaluated by looking at the y
    value of the starting point for T0\_1 and comparing it to the y value of the
    end point for T0\_2 with a margin of $\pm$ 1 mm}
  \label{fig:res_own_per_fat}
\end{figure}
\subsection{Performance Metrics}
% As briefly mentioned in the last section, the individual measurements were then converted into
% percentage values of the mean of the reference values gathered from typing tests
% with keyboard \textit{Own} (\gls{OPC}).
\label{sec:res_perf}
\subsubsection{Typing Speed}
\label{sec:res_typing_speed}
The typing speed for each individual keyboard and typing test was automatically
captured with the help of the typing test functionality offered by
\glsfirst{GoTT}. We captured \gls{WPM}, \gls{AdjWPM} and \gls{KSPS} according to
the formulas mentioned in Section \ref{sec:meas_perf}. We used the mean of the
results for both typing tests performed with each keyboard to conduct the
following statistical analysis. A \gls{rmANOVA} indicated differences between
at least two of the test keyboards (\textit{Athena, Aphrodite, Nyx} and
\textit{Hera}) in terms of \gls{WPM} (F(3, 69) = 6.036, p = 0.001). We
performed dependent T-tests with Holm correction and found significant
differences between \textit{Aphrodite} (M = 51.5, sd = 14.0) and \textit{Nyx}
(M = 49.4, sd = 13.3, t = 3.33, p = 0.014), \textit{Athena} (M = 51.5, sd =
14.2) and \textit{Nyx} (M = 49.4, sd = 13.3, t = 2.76, p = 0.044) and
\textit{Hera} (M = 51.9, sd = 14.6) and \textit{Nyx} (M = 49.4, sd = 13.3, t =
3.54, p = 0.010). The \gls{rmANOVA}s for \gls{AdjWPM} (F(3, 69) = 6.197, p =
0.0009) and for \gls{KSPS} (F(3, 69) = 3.566, p = 0.018) were likewise
significant. All relevant results of the post-hoc tests and the summary of the
performance data can be found in Tables \ref{tbl:sum_tkbs_speed} and
\ref{tbl:res_tkbs_speed}. We further examined which of the four test keyboards
was the fastest for each participant and found that \textit{Hera} was the
fastest keyboard in terms of \gls{WPM} for 46\,\% (11) of the twenty-four
subjects. Additionally, we analyzed the \gls{WPM} percentage of \textit{Own}
(\gls{OPC}) for all test keyboards to determine which keyboards exceeded the
performance of the participant's own keyboard. We found that three subjects
reached \gls{OPC}\_\gls{WPM} values greater than 100\,\% with all four test
keyboards. Also, \textit{Athena, Aphrodite} and \textit{Hera} exceeded 100\,\%
of \gls{OPC}\_\gls{WPM} eight, seven and six times respectively. Detailed
results are presented in Figure \ref{fig:max_opc_wpm}.

\begin{table}[H]
  \centering
  \footnotesize
  \ra{1.2}
  \toprule
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{\gls{WPM}}}                               \\
      \rowstyle{\itshape}
      Pseud.    & Mean  & Min   & Max   & SD    & SE   \\
      \midrule
      Athena    & 51.47 & 17.96 & 73.86 & 14.21 & 2.90 \\
      Aphrodite & 51.46 & 20.76 & 76.36 & 14.01 & 2.86 \\
      Nyx       & 49.39 & 20.80 & 74.26 & 13.28 & 2.71 \\
      Hera      & 51.87 & 18.10 & 76.06 & 14.55 & 2.97 \\
    \end{tabular}
  }
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{\gls{AdjWPM}}}                               \\
      \rowstyle{\itshape}
      Pseud.     & Mean  & Min   & Max   & SD    & SE   \\
      \midrule
      Athena    & 51.04 & 17.94 & 73.19 & 14.07 & 2.87 \\
      Aphrodite & 50.97 & 20.76 & 75.78 & 13.95 & 2.85 \\
      Nyx       & 48.84 & 20.80 & 73.62 & 13.17 & 2.69 \\
      Hera      & 51.32 & 18.06 & 75.14 & 14.40 & 2.94 \\
    \end{tabular}
  }
  \begin{tabular}{?r^l^l^l^l^l^l^l}
    \\
    \multicolumn{6}{c}{\textbf{\gls{KSPS}}}    \\
    \rowstyle{\itshape}
    Pseud.    & Mean & Min  & Max  & SD   & SE   \\
    \midrule
    Athena    & 5.23 & 1.68 & 7.94 & 1.54 & 0.31 \\
    Aphrodite & 5.32 & 2.00 & 8.14 & 1.50 & 0.31 \\
    Nyx       & 5.31 & 1.95 & 8.15 & 1.48 & 0.30 \\
    Hera      & 5.37 & 1.72 & 8.15 & 1.57 & 0.32 \\
  \end{tabular}
  \bottomrule
  \caption{Summaries for \glsfirst{WPM}, \glsfirst{AdjWPM} and \glsfirst{KSPS} for the test keyboards}
  \label{tbl:sum_tkbs_speed}
\end{table}
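
The standard errors (SE) in Table \ref{tbl:sum_tkbs_speed} are consistent with
the usual definition $\mathrm{SE} = \mathrm{SD} / \sqrt{n}$, which is stated
here as an assumption about how the column was computed. For \gls{WPM} with
\textit{Athena} and $n = 24$:
\[
  \mathrm{SE} = \frac{14.21}{\sqrt{24}} \approx \frac{14.21}{4.899} \approx 2.90 .
\]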

\begin{table}[H]
  \centering
  \small
  \ra{1.3}
  \begin{tabular}{?l^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Y      & Comparison         & Statistic & p             & Estimate & CI             & Hypothesis \\
    \midrule
    \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}}                                       \\
    WPM    & Athena - Nyx       & 2.765     & $0.044^*$       & 2.083    & [0.52, 3.64]   & two-tailed \\
    WPM    & Aphrodite - Nyx    & 3.332     & $0.014^*$       & 2.069    & [0.78, 3.35]   & two-tailed \\
    WPM    & Hera - Nyx         & 3.541     & $0.010^*$       & 2.479    & [1.03, 3.93]   & two-tailed \\
    AdjWPM & Athena - Nyx       & 2.868     & $0.035^*$       & 2.200    & [0.61, 3.79]   & two-tailed \\
    AdjWPM & Aphrodite - Nyx    & 3.443     & $0.011^*$       & 2.132    & [0.85, 3.41]   & two-tailed \\
    AdjWPM & Hera - Nyx         & 3.515     & $0.011^*$       & 2.475    & [1.02, 3.93]   & two-tailed \\
    KSPS   & Athena - Hera      & -2.834    & $0.056^\dagger$ & -0.145   & [-0.25, -0.04] & two-tailed \\
    KSPS   & Aphrodite - Athena & 2.566     & $0.086^\dagger$ & 0.095    & [0.02, 0.17]   & two-tailed \\
    \bottomrule
  \end{tabular}
  \caption{Relevant post-hoc results of speed related metrics for the test
    keyboards. Significant p values are denoted with * and p values indicating a
    trend towards significance are marked with $\dagger$. Confidence intervals
    are given for the estimate in the difference in means}
  \label{tbl:res_tkbs_speed}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/max_opc_wpm}
  \caption{The left graph shows the fastest keyboard in terms of \gls{WPM} for
    each participant. The right graph shows which keyboards were even faster
    than the participant's own keyboard (\gls{OPC}\_\gls{WPM} $>$ 100\,\%)}
  \label{fig:max_opc_wpm}
\end{figure}

\subsubsection{Error Rate}
\label{sec:res_error_rate}
\gls{GoTT} also automatically tracked various error-related metrics, of which
we analyzed \glsfirst{UER}, \glsfirst{CER} and \glsfirst{TER}. Since we were
interested in whether higher actuation forces lead to lower error rates than
lower actuation forces, we conducted one-tailed post-hoc tests for the
following statistical analyses. As in Section \ref{sec:res_typing_speed}, we
used the means of the results from both typing tests for each keyboard to
conduct the analysis. Friedman's Test for \gls{TER} ($\chi^2$(3) = 25.4, p =
0.00001) and the \gls{rmANOVA} for \gls{CER} (F(3, 69) = 13.355, p = 0.0000408
(\gls{GG})) revealed differences for at least two test keyboards. Friedman's
Test for \gls{UER} ($\chi^2$(3) = 2.59, p = 0.46) yielded no statistically
significant difference. Note that the 90th percentile of \gls{UER} for all
keyboards was still below 1\,\%. Summaries for the individual metrics and
results for all post-hoc tests can be found in Tables \ref{tbl:sum_tkbs_err}
and \ref{tbl:res_tkbs_err}. Furthermore, we compared the \gls{TER} of all test
keyboards for each participant and found that \textit{Athena} was the keyboard
with which participants typed most accurately. Because two participants scored
identical \gls{TER}s with two test keyboards each, the total number of
``1st-placed'' keyboards increased to twenty-six. Lastly, we compared the test
keyboards to the participants' own keyboards and found that eleven
participants scored lower \gls{TER}s with \textit{Athena} than with
\textit{Own} (\gls{OPC}). All data are shown in Figure
\ref{fig:max_opc_ter}.

\begin{table}[H]
  \centering
  \footnotesize
  \ra{1.2}
  \toprule
  \begin{tabular}{?r^l^l^l^l^l}
    \multicolumn{6}{c}{\textbf{\gls{TER}}}       \\
    \rowstyle{\itshape}
    Pseud.    & Mean & Min  & Max  & SD   & SE   \\
    \midrule
    Athena    & 0.08 & 0.02 & 0.17 & 0.03 & 0.01 \\
    Aphrodite & 0.09 & 0.02 & 0.20 & 0.04 & 0.01 \\
    Nyx       & 0.11 & 0.03 & 0.25 & 0.06 & 0.01 \\
    Hera      & 0.09 & 0.02 & 0.21 & 0.04 & 0.01 \\
  \end{tabular}
  \\
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{\gls{UER}}}       \\
      \rowstyle{\itshape}
      Pseud.    & Mean & Min  & Max  & SD   & SE   \\
      \midrule
      Athena    & 0.01 & 0.00 & 0.14 & 0.03 & 0.01 \\
      Aphrodite & 0.01 & 0.00 & 0.17 & 0.03 & 0.01 \\
      Nyx       & 0.01 & 0.00 & 0.21 & 0.04 & 0.01 \\
      Hera      & 0.01 & 0.00 & 0.18 & 0.04 & 0.01 \\
    \end{tabular}
  }
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{\gls{CER}}}       \\
      \rowstyle{\itshape}
      Pseud.    & Mean & Min  & Max  & SD   & SE   \\
      \midrule
      Athena    & 0.07 & 0.02 & 0.13 & 0.03 & 0.01 \\
      Aphrodite & 0.08 & 0.02 & 0.18 & 0.04 & 0.01 \\
      Nyx       & 0.10 & 0.03 & 0.23 & 0.05 & 0.01 \\
      Hera      & 0.08 & 0.02 & 0.14 & 0.04 & 0.01 \\
    \end{tabular}
  }
  \bottomrule
  \caption{Descriptive statistics for \glsfirst{TER}, \glsfirst{UER} and
    \glsfirst{CER} for the test keyboards}
  \label{tbl:sum_tkbs_err}
\end{table}

\begin{table}[H]
  \centering
  \small
  \ra{1.3}
  \begin{tabular}{?l^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Y   & Comparison         & Statistic & p         & Estimate & CI            & Hypothesis \\
    \midrule
    \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}}                  \\
    TER & Athena - Hera      & 38.0      & $0.004^*$   & -0.011   & ]-Inf, -0.01] & less       \\
    TER & Athena - Aphrodite & 58.5      & $0.009^*$   & -0.012   & ]-Inf, 0]     & less       \\
    TER & Athena - Nyx       & 18.0      & $0.00009^*$ & -0.027   & ]-Inf, -0.02] & less       \\
    TER & Aphrodite - Nyx    & 35.5      & $0.002^*$   & -0.018   & ]-Inf, -0.01] & less       \\
    TER & Hera - Aphrodite   & 181.0     & 0.816       & 0.002    & ]-Inf, 0.01]  & less       \\
    TER & Hera - Nyx         & 29.5      & $0.002^*$   & -0.016   & ]-Inf, -0.01] & less       \\
    \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}}                                 \\
    CER & Athena - Hera      & -2.796    & $0.015^*$   & -0.011   & ]-Inf, 0]     & less       \\
    CER & Athena - Aphrodite & -2.772    & $0.015^*$   & -0.011   & ]-Inf, 0]     & less       \\
    CER & Athena - Nyx       & -4.356    & $0.0007^*$  & -0.030   & ]-Inf, -0.02] & less       \\
    CER & Aphrodite - Nyx    & -3.821    & $0.002^*$   & -0.019   & ]-Inf, -0.01] & less       \\
    CER & Hera - Aphrodite   & 0.050     & 0.520       & 0.000    & ]-Inf, 0.01]  & less       \\
    CER & Hera - Nyx         & -3.825    & $0.002^*$   & -0.019   & ]-Inf, -0.01] & less       \\
    \bottomrule
  \end{tabular}
  \caption{Post-hoc results of error rates for the test keyboards. Significant p
    values are denoted with *. Confidence intervals are given for the estimate
    in the difference in means (T-test) and difference of the location parameter
    (Wilcoxon)}
  \label{tbl:res_tkbs_err}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/max_opc_ter}
  \caption{The left graph shows the keyboard with the lowest \gls{TER} for each
    participant. The right graph shows which keyboards were more accurate than
    the participant's own keyboard (\gls{OPC}\_\gls{TER} $<$ 100\,\%)}
  \label{fig:max_opc_ter}
\end{figure}

\subsection{Muscle Activity Measurements}
\label{sec:res_muscle_activity}
We utilized the \gls{EMG} device described in Section \ref{sec:main_design} to
gather data about the muscle activities (\% of \glsfirst{MVC}) during typing
tests for the extensor and flexor muscles of both forearms. For our analysis, we
used the mean values of the results for both typing tests with each keyboard.
Note that we had to remove two erroneous measurements concerning the right
flexor muscle (n = 22). We found no significant differences in \%\gls{MVC}
between the test keyboards in either the flexor or the extensor \gls{EMG}
measurements. Further, we analyzed the effect of the individual keyboards on
\%\gls{MVC} separately for the first and second typing tests (Tn\_1 \& Tn\_2,
$n = 1, \dots, 4$), but did not find any statistically significant results
either. We also analyzed possible differences between \%\gls{MVC} measurements
of the first and second typing tests for each individual keyboard, using either
dependent T-tests or Wilcoxon Signed Rank Tests. There were no statistically
significant differences in \%\gls{MVC} between the first and the second typing
test for any keyboard/muscle combination. The summaries for all test keyboards
of the mean values for both typing tests combined can be found in Table
\ref{tbl:sum_tkbs_emg}. Lastly, we created histograms (Figure
\ref{fig:max_emg_tkbs}) for each of the observed muscle groups that show the
number of times a keyboard yielded the highest \%\gls{MVC} out of all keyboards
for each participant. We found that \textit{Athena} most frequently
($\approx$45\,\%) produced the highest extensor muscle activity for both arms.
The highest muscle activity for both flexor muscle groups was evenly
distributed among all test keyboards, with the slight exception of
\textit{Nyx}, which produced the highest \%\gls{MVC} in only $\approx$14\,\% of
participants.

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/max_emg_tkbs}
  \caption{Histograms for all \gls{EMG} measurements that show the keyboard with
    the highest mean \% of \glsfirst{MVC} out of all four keyboards for each
    participant}
  \label{fig:max_emg_tkbs}
\end{figure}

\begin{table}[H]
  \centering
  \footnotesize
  \ra{1.2}
  \toprule
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{Left Flexor \%\gls{MVC}}}          \\
      \rowstyle{\itshape}
      Pseud.      & Mean & Min  & Max   & SD   & SE   \\
      \midrule
      Athena      & 9.90 & 0.94 & 41.91 & 9.03 & 1.84 \\
      Aphrodite   & 8.82 & 0.26 & 23.10 & 6.37 & 1.30 \\
      Nyx         & 8.84 & 2.13 & 24.37 & 6.65 & 1.36 \\
      Hera        & 9.98 & 2.82 & 25.18 & 6.91 & 1.41 \\
    \end{tabular}
  }
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{Right Flexor \%\gls{MVC}} \textit{(n = 22)}}     \\
      \rowstyle{\itshape}
      Pseud.    & Mean & Min  & Max   & SD   & SE   \\
      \midrule
      Athena    & 9.69 & 2.13 & 23.88 & 5.67 & 1.21 \\
      Aphrodite & 9.33 & 2.15 & 16.96 & 4.51 & 0.96 \\
      Nyx       & 8.60 & 1.68 & 16.16 & 4.43 & 0.94 \\
      Hera      & 9.26 & 1.42 & 20.39 & 5.75 & 1.23 \\
    \end{tabular}
  }
  \\
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{Left Extensor \%\gls{MVC}}}     \\
      \rowstyle{\itshape}
      Pseud.    & Mean  & Min  & Max   & SD   & SE   \\
      \midrule
      Athena    & 12.24 & 5.17 & 18.98 & 4.11 & 0.84 \\
      Aphrodite & 11.60 & 4.80 & 16.86 & 3.67 & 0.75 \\
      Nyx       & 11.43 & 5.14 & 16.45 & 3.87 & 0.79 \\
      Hera      & 11.73 & 4.80 & 21.05 & 4.10 & 0.84 \\
    \end{tabular}
  }
  \parbox{.49\linewidth}{
    \begin{tabular}{?r^l^l^l^l^l^l^l}
      \multicolumn{6}{c}{\textbf{Right Extensor \%\gls{MVC}}}    \\
      \rowstyle{\itshape}
      Pseud.    & Mean  & Min  & Max   & SD   & SE   \\
      \midrule
      Athena    & 10.78 & 3.34 & 17.58 & 3.86 & 0.79 \\
      Aphrodite & 10.66 & 3.56 & 19.05 & 4.41 & 0.90 \\
      Nyx       & 10.57 & 3.81 & 21.55 & 4.33 & 0.88 \\
      Hera      & 10.79 & 4.11 & 19.50 & 4.09 & 0.83 \\
    \end{tabular}
  }
  \bottomrule
  \caption{Descriptive statistics for the \textit{mean values of} measured
    muscle activity (\% of \glsfirst{MVC}) in \textit{both typing tests}
    conducted with each keyboard.}
  \label{tbl:sum_tkbs_emg}
\end{table}
\pagebreak
\subsection{Questionnaires}
\label{sec:res_questionnaires}
\subsubsection{Keyboard Comfort Questionnaire}
\label{sec:res_kcq}
The \glsfirst{KCQ} was filled out by the participants after each individual
typing test. The questionnaire featured twelve questions regarding the
previously used keyboard which are labelled as follows:

\begin{table}[H]
  \centering
  \ra{0.8}
  \small
  \begin{tabular}{llll}
    \textbf{KCQ1:} & \textit{``Required operating force during usage?''} & \textbf{KCQ7:} & \textit{``Ease of use?''}              \\
    \textbf{KCQ2:} & \textit{``Perceived uniformity during usage?''}     & \textbf{KCQ8:} & \textit{``Fatigue of the fingers?''}   \\
    \textbf{KCQ3:} & \textit{``Effort required during usage?''}          & \textbf{KCQ9:} & \textit{``Fatigue of the wrists?''}    \\
    \textbf{KCQ4:} & \textit{``Perceived accuracy?''}                    & \textbf{KCQ10:}  & \textit{``Fatigue of the arms?''}      \\
    \textbf{KCQ5:} & \textit{``Acceptability of speed?''}                & \textbf{KCQ11:}  & \textit{``Fatigue of the shoulders?''} \\
    \textbf{KCQ6:} & \textit{``Overall satisfaction?''}                  & \textbf{KCQ12:}  & \textit{``Fatigue of the neck?''}      \\
  \end{tabular}
\end{table}

All questions featured a 7-point Likert scale where 1 always denoted the worst
and 7 the best possible experience \cite{iso9241-411}. We conducted Friedman's
Tests for all questions and found differences for at least two of the test
keyboards in \textit{KCQ3} ($\chi^2$(3) = 9.49, p = 0.024), \textit{KCQ4}
($\chi^2$(3) = 18.4, p = 0.0004), \textit{KCQ6} ($\chi^2$(3) = 10.2, p = 0.017)
and \textit{KCQ8} ($\chi^2$(3) = 12.0, p = 0.0075). Further, we noticed a trend
towards significance for question \textit{KCQ1} ($\chi^2$(3) = 7.02, p =
0.071). The mean values for all answers can be seen in Figure
\ref{fig:kcq_tkbs_res} and the post-hoc tests for the relevant questions are
shown in Table \ref{tbl:res_kcq}.

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/kcq_tkbs_res}
  \caption{Means of the responses for all questions of the \glsfirst{KCQ}}
  \label{fig:kcq_tkbs_res}
\end{figure}


\begin{table}[H]
  \centering
  \small
  \ra{1.3}
  \begin{tabular}{?l^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Y    & Comparison         & Statistic & p             & Estimate & CI             & Hypothesis \\
    \midrule
    \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}}                        \\
    KCQ1 & Aphrodite - Athena & 191.5     & $0.051^\dagger$ & 1.5      & [0.5, 2.5]     & two-tailed \\
    \midrule
    KCQ3 & Aphrodite - Athena & 209.5     & $0.03^*$        & 1.25     & [0.25, 2]      & two-tailed \\
    KCQ3 & Athena - Hera      & 37.0      & $0.022^*$       & -1.25    & [-2, -0.5]     & two-tailed \\
    KCQ3 & Athena - Nyx       & 31.0      & $0.03^*$        & -1.5     & [-2.5, -0.5]   & two-tailed \\
    \midrule
    KCQ4 & Aphrodite - Nyx    & 161.5     & $0.038^*$       & 1.5      & [0.75, 2.5]    & two-tailed \\
    KCQ4 & Athena - Hera      & 168.5     & $0.072^\dagger$ & 1.0      & [0.25, 1.5]    & two-tailed \\
    KCQ4 & Athena - Nyx       & 193.5     & $0.006^*$       & 2.0      & [1, 2.75]      & two-tailed \\
    \midrule
    KCQ6 & Aphrodite - Nyx    & 240.000   & $0.061^\dagger$ & 1.0      & [0.25, 1.75]   & two-tailed \\
    \midrule
    KCQ8 & Athena - Hera      & 18.000    & $0.007^*$       & -1.25    & [-1.75, -0.75] & two-tailed \\
    KCQ8 & Athena - Nyx       & 12.500    & $0.007^*$       & -1.25    & [-2, -0.75]    & two-tailed \\
    \bottomrule
  \end{tabular}
  \caption{Post-hoc tests for questions from the \gls{KCQ}. Statistically
    significant differences ($p < 0.05$) are marked with an asterisk and p values
    indicating a trend towards significance are denoted with
    $\dagger$. Confidence intervals are given for the difference of the location
    parameter}
  \label{tbl:res_kcq}
\end{table}
\subsubsection{User Experience Questionnaire (Short)}
\label{sec:res_ueqs}
In addition to the \gls{KCQ}, we utilized the \glsfirst{UEQ-S}. It featured
eight questions on a 7-point Likert scale, which formed two scales (pragmatic,
hedonic). We also added one extra question that could be answered on a
\glsfirst{VAS} from 0 to 100. The survey was filled out after both tests with a
keyboard had been completed. The questions of our modified \gls{UEQ-S} were
labelled as follows:

\begin{table}[H]
  \centering
  \ra{0.8}
  \small
  \begin{tabular}{llll}
    \multicolumn{2}{c}{Pragmatic Scale} & \multicolumn{2}{c}{Hedonic Scale}                                                                       \\
    \\
    \textbf{PRA1:}                      & \textit{``Obstructive or Supportive?''} & \textbf{HED1:} & \textit{``Boring or Exciting?''}             \\
    \textbf{PRA2:}                      & \textit{``Complicated or Easy?''}       & \textbf{HED2:} & \textit{``Not interesting or Interesting?''} \\
    \textbf{PRA3:}                      & \textit{``Inefficient or Efficient?''}  & \textbf{HED3:} & \textit{``Conventional or Inventive?''}      \\
    \textbf{PRA4:}                      & \textit{``Confusing or Clear?''}        & \textbf{HED4:} & \textit{``Usual or Leading Edge?''}          \\
    \\
    \multicolumn{4}{c}{Additional Question (\gls{VAS})}                                                                                             \\
    \\
    \textbf{SATI:}                      & \multicolumn{3}{l}{\textit{``How satisfied have you been with this keyboard?''}}
  \end{tabular}
\end{table}

The 7-point Likert scale items (PRA1-4, HED1-4) were then transformed to
represent a scale from -3 to +3, where -3 represented the left term and +3 the
right term of the ``or'' questions. Both sub-scales, pragmatic ($\alpha$ =
0.90)\footnote{PRA: Athena ($\alpha$ = 0.83), Aphrodite ($\alpha$ = 0.95), Nyx
  ($\alpha$ = 0.90), Hera ($\alpha$ = 0.85)} and hedonic ($\alpha$ =
0.88)\footnote{HED: Athena ($\alpha$ = 0.89), Aphrodite ($\alpha$ = 0.89), Nyx
  ($\alpha$ = 0.91), Hera ($\alpha$ = 0.90)}, exceeded the recommended threshold
for Cronbach's alpha of $\alpha > 0.7$ \cite{schrepp_ueq_handbook}. The mean
values for all responses of the \gls{UEQ-S} can be seen in Figure
\ref{fig:ueq_tkbs_res} and the individual responses to the additional question
(SATI) are presented in Figure \ref{fig:res_tkbs_sati}. We conducted
\gls{rmANOVA}s for both sub-scales but found no statistically significant
differences for either the pragmatic scale (F(3, 69) = 3.254, p = 0.06;
post-hoc tests did not reveal any tendencies) or the hedonic scale (F(3, 69) =
0.425, p = 0.74). In contrast, the \gls{rmANOVA} for the additional question
\textit{SATI} indicated statistically significant differences (F(3, 69) =
3.254, p = 0.027). In this case, we decided to use Wilcoxon Signed Rank Tests
for our post-hoc analysis because of our interest in the difference of medians
and the relatively high power of this test in analyzing \gls{VAS} data
\cite{heller_vas}. The results and summaries for the test keyboards can be
found in Tables \ref{tbl:res_tkbs_sati} and \ref{tbl:sum_tkbs_sati}.
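
For reference, Cronbach's alpha for a scale consisting of $k$ items is commonly
defined as
\[
  \alpha = \frac{k}{k - 1}
  \left( 1 - \frac{\sum_{i=1}^{k} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}} \right),
\]
where $\sigma^{2}_{Y_i}$ is the variance of item $i$ and $\sigma^{2}_{X}$ the
variance of the summed scale score (notation introduced here for illustration).
For each of the two \gls{UEQ-S} sub-scales, $k = 4$.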

\begin{figure}[H]
  \centering
  \includegraphics[width=0.92\textwidth]{images/ueq_tkbs_res}
  \caption{Means of the responses for all questions of the \glsfirst{UEQ-S}}
  \label{fig:ueq_tkbs_res}
\end{figure}

\begin{table}[H]
  \centering
  \small
  \ra{1.2}
  \begin{tabular}{?l^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Y    & Comparison         & Statistic & p             & Estimate & CI           & Hypothesis \\
    \midrule
    \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}}                      \\
    SATI & Aphrodite - Nyx    & 217.0     & $0.046^*$       & 14.0     & [5, Inf[     & greater    \\
    SATI & Aphrodite - Athena & 201.5     & $0.046^*$       & 12.5     & [4.5, Inf[   & greater    \\
    SATI & Nyx - Athena       & 125.5     & 1.0             & -3.0     & [-11.5, Inf[ & greater    \\
    SATI & Hera - Athena      & 205.5     & 0.174           & 8.5      & [0, Inf[     & greater    \\
    SATI & Hera - Aphrodite   & 118.5     & 1.0             & -2.5     & [-12.5, Inf[ & greater    \\
    SATI & Hera - Nyx         & 223.5     & $0.074^\dagger$ & 12.5     & [2.5, Inf[   & greater    \\
    \bottomrule
  \end{tabular}
  \caption{Post-hoc tests for the additional question \textit{``How satisfied
      have you been with this keyboard?''}. Statistically significant
    differences ($p < 0.05$) are marked with $*$ and $p$-values indicating a
    trend towards significance are denoted with $\dagger$. Confidence intervals
    are given for the difference of the location parameter. We only tested
    keyboards with lower actuation force against keyboards with higher actuation
    force. The first comparison of Aphrodite (50\,g) and Nyx (35\,g) was added
    because of the noticeable differences in the visual assessment of Figure
    \ref{fig:res_tkbs_sati}}
  \label{tbl:res_tkbs_sati}
\end{table}

\begin{table}[H]
  \centering
  \footnotesize
  \ra{1.1}
  \begin{tabular}{?r^l^l^l^l^l^l^l}
    \toprule
    \rowstyle{\itshape}
    Pseud.    & Mean  & Median & Min   & Max   & SD    & SE   \\
    \midrule
    Athena    & 54.12 & 50.00  & 1.00  & 95.00 & 25.43 & 5.19 \\
    Aphrodite & 65.08 & 71.50  & 10.00 & 94.00 & 22.56 & 4.61 \\
    Nyx       & 51.42 & 55.00  & 0.00  & 90.00 & 23.40 & 4.78 \\
    Hera      & 63.29 & 70.00  & 12.00 & 92.00 & 19.95 & 4.07 \\
    \bottomrule
  \end{tabular}
  \caption{Descriptive statistics for the additional question \textit{``How
      satisfied have you been with this keyboard?''} for all four test
    keyboards}
  \label{tbl:sum_tkbs_sati}
\end{table}


\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/sati_tkbs_res}
  \caption{Responses for the additional question \textit{``How satisfied have
      you been with this keyboard?''} with the means across all participants
    represented as horizontal lines}
  \label{fig:res_tkbs_sati}
\end{figure}


\subsection{UX Curves and Semi-Structured Interviews}
\label{sec:res_uxc}
To give all participants the chance to recapitulate the whole experiment and
give retrospective feedback about each individual keyboard, we conducted a
semi-structured interview, which included drawing \gls{UX Curve}s for perceived
fatigue and perceived typing speed. We evaluated each curve by measuring the y
position of its \gls{SP} and the y position of the respective \gls{EP} to
determine the slope of that curve. Slopes are defined as improving if \gls{SP}
$<$ \gls{EP}, deteriorating if \gls{SP} $>$ \gls{EP}, and stable if \gls{SP} $=$
\gls{EP} (margin of $\pm 1$\,mm). One curve can either represent one typing test
(C1 or C2) or the whole experience with one keyboard over the course of both
typing tests (C12). All curves are shown in Appendix \ref{app:uxc} and the
resulting slopes for all curve types are shown in Figure \ref{fig:res_uxc}.

During the semi-structured interview, we asked the participants to rank the
keyboards from 1 (favorite) to 5 (least favorite). If in doubt, participants
were allowed to place two keyboards on the same rank. Further, we asked some
participants (n = 19) to also rank the keyboards from lowest actuation force (1)
to highest actuation force (5). The participants' own keyboard was placed first
four times more often than any other keyboard. \textit{Hera} was the only
keyboard that was never placed fifth and, except for \textit{Own}, was the
keyboard most frequently ranked in the top three. The ranking of the perceived
actuation force revealed that participants were able to identify \textit{Nyx}
(35\,g) and \textit{Athena} (80\,g) as the keyboards with the lowest and highest
actuation force, respectively. All results for both rankings are visualized in
Figure \ref{fig:res_interview}.

Lastly, we analyzed the recordings of all interviews and found several similar
statements about specific keyboards. Twelve participants noted that, because of
the new form factor of the test keyboards, additional familiarization was
required to feel comfortable. Nine of those specifically mentioned the height of
the keyboard as the main difference. Fourteen subjects made statements such as
\textit{``Because Nyx had such a low resistance, I kept making mistakes!''}.
Four participants explicitly noted that \textit{Hera} felt very pleasant, and
two subjects mentioned \textit{``I had really good flow.''} and \textit{``It
  somehow just felt right''}. Ten participants reported that typing on
\textit{Athena} was exhausting. \textit{Aphrodite} was not mentioned as often as
the other keyboards, which could be related to a comment made by two subjects:
\textit{``It felt very similar to my own keyboard''}.

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/res_uxc}
  \caption{\centering Evaluation of \gls{UX Curve} slopes for perceived fatigue and perceived
    speed. \\
    \textit{DE:} deteriorating, \textit{IM:} improving, \textit{ST:} stable}
  \label{fig:res_uxc}
\end{figure}
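
The slope classification evaluated in Figure \ref{fig:res_uxc} can be sketched
compactly as follows. Let $y_{\mathrm{SP}}$ and $y_{\mathrm{EP}}$ denote the
measured y positions (in mm) of the \gls{SP} and the \gls{EP} of a curve, let
$\Delta = y_{\mathrm{EP}} - y_{\mathrm{SP}}$ be their difference, and let
$\delta = 1$\,mm be the tolerance. These symbols are introduced here purely for
illustration, and the sketch assumes that the margin applies symmetrically
around zero:
% Assumption: Delta and delta are illustrative symbols, not part of the
% original evaluation protocol; the symmetric use of the 1 mm margin is assumed.
\[
  \mathrm{slope}(\Delta) =
  \left\{
    \begin{array}{ll}
      \mbox{improving (IM)}     & \mbox{if } \Delta > \delta, \\
      \mbox{stable (ST)}        & \mbox{if } |\Delta| \le \delta, \\
      \mbox{deteriorating (DE)} & \mbox{if } \Delta < -\delta.
    \end{array}
  \right.
\]
Under these assumptions, a hypothetical curve with $\Delta = 0.5$\,mm would be
counted as stable, whereas a curve with $\Delta = 3$\,mm would be counted as
improving.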

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{images/res_interview}
  \caption{Rankings for favorite keyboard and perceived required actuation force
    for all keyboards including \textit{Own}. The graphs show the number of
    times a keyboard was placed at a certain rank}
  \label{fig:res_interview}
\end{figure}