% A rapid method that creates many corrected errors, has efficient error correction, and leaves
% few uncorrected errors can still be considered a successful method, since it produces
% accurate text in relatively little time. pp. 56 MacKenzie
\section{Results}
\label{sec:results}
This section presents the statistical analysis of the data obtained in the main within-subject user study (n = 24), which consisted of five repeated measurements. Because the data came from related, dependent groups, we used \textit{\gls{rmANOVA}} if all required assumptions were met and \textit{Friedman's Test} otherwise. To identify the specific pairs of treatments that differed significantly, we ran either \textit{Dependent T-Tests} or \textit{Wilcoxon Signed Rank Tests} (both with \textit{Holm correction (sequentially rejective Bonferroni test)} \cite{holm_correction}) as post-hoc tests \cite{field_stats, downey_stats}. The reliability of the two sub-scales (hedonic and pragmatic quality) of the \glsfirst{UEQ-S} was estimated using \textit{Cronbach's alpha} \cite{tavakol_cronbachs_alpha}. Results are reported as statistically significant at an $\alpha$-level of 0.05 ($p < 0.05$). We report 95\% confidence intervals for selected results. Normality of the data or residuals was checked by visual assessment of \gls{Q-Q} plots and, additionally, with the \textit{Shapiro-Wilk} Test. Further, we used \textit{Mauchly's Test for Sphericity} to evaluate whether the variances of the differences between the compared groups differed significantly \cite{field_stats, downey_stats}.

\subsection{Own Keyboard}
\label{sec:res_OPC}
As mentioned in Section \ref{sec:main_design}, the keyboard \textit{Own} was used as a reference for some metrics captured during the experiment. Since the measurements with \textit{Own} took place at the start (T0\_1) and end (T0\_2) of the experiment, we compared the results of both typing tests to detect possible variations in performance due to fatigue.
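
As a brief reminder of how the dependent T-tests reported below are computed, the following sketch uses notation introduced here purely for illustration ($d_i$, $\bar{d}$ and $s_d$ are not used elsewhere in this work). With $d_i$ denoting the difference between the T0\_1 and T0\_2 values of participant $i$, $\bar{d}$ the mean and $s_d$ the standard deviation of these differences, the test statistic and the two-tailed 95\% confidence interval are
\[
  t = \frac{\bar{d}}{s_d / \sqrt{n}},
  \qquad
  \bar{d} \pm t_{0.975,\,n-1} \, \frac{s_d}{\sqrt{n}},
\]
with $n - 1$ degrees of freedom. For $n = 24$, $t_{0.975,\,23} \approx 2.069$. As a plausibility check with the rounded WPM values from Table \ref{tbl:res_own_before_after} ($\bar{d} = 1.18$, $t = 1.92$), the standard error is approximately $1.18 / 1.92 \approx 0.615$, which gives a margin of about $2.069 \cdot 0.615 \approx 1.27$ and hence an interval of roughly $[-0.09, 2.45]$, in agreement with the reported one.
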
Using dependent T-tests, we found no significant difference in \glsfirst{KSPS} between T0\_1 (M = 5.39, sd = 1.49) and T0\_2 (M = 5.47, sd = 1.48, t = -1.53, p = 0.139). \glsfirst{UER} was negligible overall for both T0\_1 (M = 0.005, sd = 0.013, 85th percentile = 0.0051) and T0\_2 (M = 0.008, sd = 0.028, 85th percentile = 0.0052), and \glsfirst{WPM} showed a trend towards significance between T0\_1 (M = 54.2, sd = 14.7) and T0\_2 (M = 53.0, sd = 14.5, t = 1.92, p = 0.067). Dependent T-tests further revealed statistically significant differences in \glsfirst{AdjWPM} between T0\_1 (M = 53.9, sd = 14.5) and T0\_2 (M = 52.5, sd = 14.3, t = 2.44, p = 0.023), in \glsfirst{CER} between T0\_1 (M = 0.057, sd = 0.028) and T0\_2 (M = 0.078, sd = 0.038, t = -3.54, p = 0.002) and in \glsfirst{TER} between T0\_1 (M = 0.063, sd = 0.031) and T0\_2 (M = 0.086, sd = 0.039, t = -4.27, p = 0.0003). Because of these differences, we used the mean of T0\_1 and T0\_2 for each metric and participant as the reference value to compute the \textit{\gls{OPC}} for the test keyboards (\textit{Athena, Aphrodite, Nyx} and \textit{Hera}). This value was later used to compare the performance of the individual test keyboards with that of the participant's own, familiar keyboard. Additionally, using a dependent T-test, we compared the muscle activity (\% of \glsfirst{MVC}) and found a significant difference in left flexor (\glsfirst{FDP} \& \glsfirst{FDS}) \%\gls{MVC} between T0\_1 (M = 12.0, sd = 8.27) and T0\_2 (M = 8.53, sd = 7.16, t = 3.18, p = 0.004). The residuals of the right flexor (\gls{FDP} \& \gls{FDS}) were not normally distributed; we therefore used the Wilcoxon Signed Rank Test and found a significant difference between T0\_1 (M = 10.8, sd = 8.18, Med = 9.52) and T0\_2 (M = 7.71, sd = 6.08, Med = 5.32, p = 0.021). Note that we had to remove two erroneous measurements for the right flexor (n = 22).
No significant differences have been found in left or right extensor (\glsfirst{ED}) \%\gls{MVC} between T0\_1 and T0\_2. All results can be observed in Table \ref{tbl:res_own_before_after}. \begin{table}[H] \centering \small \ra{1.3} \begin{tabular}{?l^l^l^l^l^l^l^l} \toprule \rowstyle{\itshape} Y & Comparison & Statistic & p & Estimate & CI & Hypothesis \\ \midrule \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}} \\ WPM & T0\_1 - T0\_2 & 1.92 & 0.07^\dagger & 1.18 & [-0.09, 2.45] & two-tailed \\ AdjWPM & T0\_1 - T0\_2 & 2.44 & 0.02^* & 1.35 & [0.21, 2.50] & two-tailed \\ KSPS & T0\_1 - T0\_2 & -1.53 & 0.14 & -0.08 & [-0.19, 0.03] & two-tailed \\ CER & T0\_1 - T0\_2 & -3.54 & 0.002^* & -0.02 & [-0.03, -0.01] & two-tailed \\ TER & T0\_1 - T0\_2 & -4.27 & 0.0003^* & -0.02 & [-0.03, -0.01] & two-tailed \\ \%MVC_{LF} & T0\_1 - T0\_2 & 3.18 & 0.004^* & 3.44 & [1.20, 5.68] & two-tailed \\ \%MVC_{LE} & T0\_1 - T0\_2 & 1.44 & 0.163 & 0.956 & [-0.42, 2.33] & two-tailed \\ \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}} \\ \%MVC_{RF} & T0\_1 - T0\_2 & 197 & 0.021^* & 1.83 & [0.39, 3.93] & two-tailed \\ \%MVC_{RE} & T0\_1 - T0\_2 & 173 & 0.527 & 0.28 & [-0.58, 0.91] & two-tailed \\ \bottomrule \end{tabular} \caption{Statistical analysis of differences between typing tests T0\_1 and T0\_2 for keyboard \textit{Own}. For $\%MVC_{RF}$ two erroneous measurements were removed (n = 22). Statistically significant differences (p < 0.05) are marked with an asterisk and p values indicating a trend towards significance are denoted with $\dagger$. Confidence intervals are given for the estimate in the difference in means (T-test) and difference of the location parameter (Wilcoxon). 
The subscripts LF, RF, LE and RE stand for left or right forearm flexor or extensor muscles}
\label{tbl:res_own_before_after}
\end{table}
We also evaluated the means of \glsfirst{KCQ} questions 8 to 12, which concerned perceived fatigue in fingers, wrists, arms, shoulders and neck, respectively (7-point Likert scale), and the slopes (improving, deteriorating, stable) of the UX-curves drawn by each participant after the whole experiment, to identify possible differences in perceived fatigue from T0\_1 to T0\_2. As shown in Figure \ref{fig:res_own_per_fat}, participants reported in the \gls{KCQ} slight improvements in finger (diff = 0.33) and wrist (diff = 0.33) fatigue in T0\_2 compared to T0\_1, no difference in arm fatigue (diff = 0) and very slightly increased fatigue in shoulders (diff = -0.12) and neck (diff = -0.13). Sixteen of the twenty-four UX-curves regarding overall perceived fatigue had a positive slope when measured from the start of T0\_1 to the end of T0\_2 ($\pm$ 1 mm). The subjective reports of decreased finger and wrist fatigue are consistent with the decrease in flexor muscle activity described above.
\begin{figure}[H]
\centering
\includegraphics[width=1.0\textwidth]{images/res_own_per_fat}
\caption{Trends for reported fatigue through the \gls{KCQ} (questions 8: finger, 9: wrist, 10: arm, 11: shoulder, 12: neck) and histogram for the slopes (IM: improving, DE: deteriorating, ST: stable) of UX-curves concerning perceived fatigue. The curves were evaluated by comparing the y value of the starting point for T0\_1 with the y value of the end point for T0\_2 with a margin of $\pm$ 1 mm}
\label{fig:res_own_per_fat}
\end{figure}
\subsection{Performance Metrics}
% As briefly mentioned in the last section, the individual measurements were then converted into
% percentage values of the mean of the reference values gathered from typing tests
% with keyboard \textit{Own} (\gls{OPC}).
\label{sec:res_perf}
\subsubsection{Typing Speed}
\label{sec:res_typing_speed}
The typing speed for each individual keyboard and typing test was captured automatically by the typing test functionality of \glsfirst{GoTT}. We captured \gls{WPM}, \gls{AdjWPM} and \gls{KSPS} according to the formulas given in Section \ref{sec:meas_perf}. For the following statistical analysis, we used the mean of the results of both typing tests performed with each keyboard. A \gls{rmANOVA} revealed differences between at least two of the test keyboards (\textit{Athena, Aphrodite, Nyx} and \textit{Hera}) in terms of \gls{WPM} (F(3, 69) = 6.036, p = 0.001). We performed dependent T-tests with Holm correction and found significant differences between \textit{Aphrodite} (M = 51.5, sd = 14.0) and \textit{Nyx} (M = 49.4, sd = 13.3, t = 3.33, p = 0.014), \textit{Athena} (M = 51.5, sd = 14.2) and \textit{Nyx} (M = 49.4, sd = 13.3, t = 2.76, p = 0.044) and \textit{Hera} (M = 51.9, sd = 14.6) and \textit{Nyx} (M = 49.4, sd = 13.3, t = 3.53, p = 0.01). Further, the \gls{rmANOVA}s for \gls{AdjWPM} (F(3, 69) = 6.197, p = 0.0009) and \gls{KSPS} (F(3, 69) = 3.566, p = 0.018) also indicated differences between at least two of the test keyboards. All relevant results of the post-hoc tests and the summary of the performance data can be found in Tables \ref{tbl:sum_tkbs_speed} and \ref{tbl:res_tkbs_speed}.
\begin{table}[H]
\centering
\footnotesize
\ra{1.2}
\toprule
\parbox{.49\linewidth}{
\begin{tabular}{?r^l^l^l^l^l}
\multicolumn{6}{c}{\textbf{\gls{WPM}}} \\
\rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\
\midrule
Athena & 51.47 & 17.96 & 73.86 & 14.21 & 2.90 \\
Aphrodite & 51.46 & 20.76 & 76.36 & 14.01 & 2.86 \\
Nyx & 49.39 & 20.80 & 74.26 & 13.28 & 2.71 \\
Hera & 51.87 & 18.10 & 76.06 & 14.55 & 2.97 \\
\end{tabular}
}
\parbox{.49\linewidth}{
\begin{tabular}{?r^l^l^l^l^l^l^l}
\multicolumn{6}{c}{\textbf{\gls{AdjWPM}}} \\
\rowstyle{\itshape} Pseud.
& Mean & Min & Max & SD & SE \\ \midrule Athena & 51.04 & 17.94 & 73.19 & 14.07 & 2.87 \\ Aphrodite & 50.97 & 20.76 & 75.78 & 13.95 & 2.85 \\ Nyx & 48.84 & 20.80 & 73.62 & 13.17 & 2.69 \\ Hera & 51.32 & 18.06 & 75.14 & 14.40 & 2.94 \\ \end{tabular} } \begin{tabular}{?r^l^l^l^l^l^l^l} \\ \multicolumn{6}{c}{\textbf{\gls{KSPS}}} \\ \rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\ \midrule Athena & 5.23 & 1.68 & 7.94 & 1.54 & 0.31 \\ Aphrodite & 5.32 & 2.00 & 8.14 & 1.50 & 0.31 \\ Nyx & 5.31 & 1.95 & 8.15 & 1.48 & 0.30 \\ Hera & 5.37 & 1.72 & 8.15 & 1.57 & 0.32 \\ \end{tabular} \bottomrule \caption{Summaries for \glsfirst{WPM}, \glsfirst{AdjWPM} and \glsfirst{KSPS} for the test keyboards} \label{tbl:sum_tkbs_speed} \end{table} \begin{table}[H] \centering \small \ra{1.3} \begin{tabular}{?l^l^l^l^l^l^l^l} \toprule \rowstyle{\itshape} Y & Comparison & Statistic & p & Estimate & CI & Hypothesis \\ \midrule \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}} \\ WPM & Athena - Nyx & 2.765 & 0.044^* & 2.083 & [0.52, 3.64] & two-tailed \\ WPM & Aphrodite - Nyx & 3.332 & 0.014^* & 2.069 & [0.78, 3.35] & two-tailed \\ WPM & Hera - Nyx & 3.541 & 0.010^* & 2.479 & [1.03, 3.93] & two-tailed \\ AdjWPM & Athena - Nyx & 2.868 & 0.035^* & 2.200 & [0.61, 3.79] & two-tailed \\ AdjWPM & Aphrodite - Nyx & 3.443 & 0.011^* & 2.132 & [0.85, 3.41] & two-tailed \\ AdjWPM & Hera - Nyx & 3.515 & 0.011^* & 2.475 & [1.02, 3.93] & two-tailed \\ KSPS & Athena - Hera & -2.834 & 0.056^\dagger & -0.145 & [-0.25, -0.04] & two-tailed \\ KSPS & Aphrodite - Athena & 2.566 & 0.086^\dagger & 0.095 & [0.02, 0.17] & two-tailed \\ \bottomrule \end{tabular} \caption{Relevant post-hoc results of speed related metrics for the test keyboards. Significant p values are denoted with * and p values indicating a trend towards significance are marked with $\dagger$. 
Confidence intervals are given for the estimate in the difference in means}
\label{tbl:res_tkbs_speed}
\end{table}
\subsubsection{Error Rate}
\label{sec:res_error_rate}
\gls{GoTT} also automatically tracked various error-related metrics, of which we analyzed \glsfirst{UER}, \glsfirst{CER} and \glsfirst{TER}. Since we were interested in whether higher actuation forces lead to lower error rates than lower actuation forces, we conducted one-tailed post-hoc tests for the following statistical analyses. As in Section \ref{sec:res_typing_speed}, we used the means of the results of both typing tests for each keyboard to conduct the analysis. Friedman's Test for \gls{TER} ($\chi^2$(3) = 25.4, p = 0.00001) and the \gls{rmANOVA} for \gls{CER} (F(3, 69) = 13.355, p = 0.0000408 (\gls{GG})) revealed differences for at least two test keyboards. Friedman's Test for \gls{UER} ($\chi^2$(3) = 2.59, p = 0.46) yielded no statistically significant difference. Note that the 90th percentile of \gls{UER} was below 1\% for all keyboards. Summaries of the individual metrics and the results of all post-hoc tests can be found in Tables \ref{tbl:sum_tkbs_err} and \ref{tbl:res_tkbs_err}.
\begin{table}[H]
\centering
\footnotesize
\ra{1.2}
\toprule
\begin{tabular}{?r^l^l^l^l^l}
\multicolumn{6}{c}{\textbf{\gls{TER}}} \\
\rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\
\midrule
Athena & 0.08 & 0.02 & 0.17 & 0.03 & 0.01 \\
Aphrodite & 0.09 & 0.02 & 0.20 & 0.04 & 0.01 \\
Nyx & 0.11 & 0.03 & 0.25 & 0.06 & 0.01 \\
Hera & 0.09 & 0.02 & 0.21 & 0.04 & 0.01 \\
\end{tabular} \\
\parbox{.49\linewidth}{
\begin{tabular}{?r^l^l^l^l^l^l^l}
\multicolumn{6}{c}{\textbf{\gls{UER}}} \\
\rowstyle{\itshape} Pseud.
& Mean & Min & Max & SD & SE \\ \midrule Athena & 0.01 & 0.00 & 0.14 & 0.03 & 0.01 \\ Aphrodite & 0.01 & 0.00 & 0.17 & 0.03 & 0.01 \\ Nyx & 0.01 & 0.00 & 0.21 & 0.04 & 0.01 \\ Hera & 0.01 & 0.00 & 0.18 & 0.04 & 0.01 \\ \end{tabular} } \parbox{.49\linewidth}{ \begin{tabular}{?r^l^l^l^l^l^l^l} \multicolumn{6}{c}{\textbf{\gls{CER}}} \\ \rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\ \midrule Athena & 0.07 & 0.02 & 0.13 & 0.03 & 0.01 \\ Aphrodite & 0.08 & 0.02 & 0.18 & 0.04 & 0.01 \\ Nyx & 0.10 & 0.03 & 0.23 & 0.05 & 0.01 \\ Hera & 0.08 & 0.02 & 0.14 & 0.04 & 0.01 \\ \end{tabular} } \bottomrule \caption{Summaries for \glsfirst{TER}, \glsfirst{UER} and \glsfirst{CER} for the test keyboards} \label{tbl:sum_tkbs_err} \end{table} \begin{table}[H] \centering \small \ra{1.3} \begin{tabular}{?l^l^l^l^l^l^l^l} \toprule \rowstyle{\itshape} Y & Comparison & Statistic & p & Estimate & CI & Hypothesis \\ \midrule \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}} \\ TER & Athena - Hera & 38.0 & 0.004^* & -0.011 & [-Inf, -0.01] & less \\ TER & Athena - Aphrodite & 58.5 & 0.009^* & -0.012 & [-Inf, 0] & less \\ TER & Athena - Nyx & 18.0 & 0.00009^* & -0.027 & [-Inf, -0.02] & less \\ TER & Aphrodite - Nyx & 35.5 & 0.002^* & -0.018 & [-Inf, -0.01] & less \\ TER & Hera - Aphrodite & 181.0 & 0.816 & 0.002 & [-Inf, 0.01] & less \\ TER & Hera - Nyx & 29.5 & 0.002^* & -0.016 & [-Inf, -0.01] & less \\ \multicolumn{6}{l}{\textbf{Parametric (Dependent T-test)}} \\ CER & Athena - Hera & -2.796 & 0.015^* & -0.011 & [-Inf, 0] & less \\ CER & Athena - Aphrodite & -2.772 & 0.015^* & -0.011 & [-Inf, 0] & less \\ CER & Athena - Nyx & -4.356 & 0.0007^* & -0.030 & [-Inf, -0.02] & less \\ CER & Aphrodite - Nyx & -3.821 & 0.002^* & -0.019 & [-Inf, -0.01] & less \\ CER & Hera - Aphrodite & 0.050 & 0.520 & 0.000 & [-Inf, 0.01] & less \\ CER & Hera - Nyx & -3.825 & 0.002^* & -0.019 & [-Inf, -0.01] & less \\ \bottomrule \end{tabular} \caption{Post-hoc results of error rates 
for the test keyboards. Significant p values are denoted with *. Confidence intervals are given for the estimate in the difference in means (T-test) and difference of the location parameter (Wilcoxon)}
\label{tbl:res_tkbs_err}
\end{table}
\subsection{Muscle Activity}
\label{sec:res_muscle_activity}
We used the \gls{EMG} device described in Section \ref{sec:main_design} to record the muscle activity (\% of \glsfirst{MVC}) of the extensor and flexor muscles of both forearms during the typing tests. For our analysis, we used the mean values of the results of both typing tests with each keyboard. Note that we had to remove two erroneous measurements concerning the right flexor muscle (n = 22). We found no significant differences in \%\gls{MVC} between any of the test keyboards in either the flexor or the extensor \gls{EMG} measurements. Further, we analyzed the effect of the individual keyboards on \%\gls{MVC} separately for the first and second typing tests (Tn\_1 \& Tn\_2, n := 1, ..., 4), but did not find any statistically significant results either. Lastly, we analyzed possible differences between the \%\gls{MVC} measurements of the first and second typing tests for each individual keyboard, using either dependent T-tests or Wilcoxon Signed Rank Tests. There were no statistically significant differences in \%\gls{MVC} between the first and the second typing test for any keyboard/muscle combination. The summaries of the mean values of both typing tests combined can be found for all test keyboards in Table \ref{tbl:sum_tkbs_emg}.
\begin{table}[H]
\centering
\footnotesize
\ra{1.2}
\toprule
\parbox{.49\linewidth}{
\begin{tabular}{?r^l^l^l^l^l}
\multicolumn{6}{c}{\textbf{Left Flexor \%\gls{MVC}}} \\
\rowstyle{\itshape} Pseud.
& Mean & Min & Max & SD & SE \\ \midrule Athena & 9.90 & 0.94 & 41.91 & 9.03 & 1.84 \\ Aphrodite & 8.82 & 0.26 & 23.10 & 6.37 & 1.30 \\ Nyx & 8.84 & 2.13 & 24.37 & 6.65 & 1.36 \\ Hera & 9.98 & 2.82 & 25.18 & 6.91 & 1.41 \\ \end{tabular} } \parbox{.49\linewidth}{ \begin{tabular}{?r^l^l^l^l^l^l^l} \multicolumn{6}{c}{\textbf{Right Flexor \%\gls{MVC}} \textit{(n = 22)}} \\ \rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\ \midrule Athena & 9.69 & 2.13 & 23.88 & 5.67 & 1.21 \\ Aphrodite & 9.33 & 2.15 & 16.96 & 4.51 & 0.96 \\ Nyx & 8.60 & 1.68 & 16.16 & 4.43 & 0.94 \\ Hera & 9.26 & 1.42 & 20.39 & 5.75 & 1.23 \\ \end{tabular} } \\ \parbox{.49\linewidth}{ \begin{tabular}{?r^l^l^l^l^l^l^l} \multicolumn{6}{c}{\textbf{Left Extensor \%\gls{MVC}}} \\ \rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\ \midrule Athena & 12.24 & 5.17 & 18.98 & 4.11 & 0.84 \\ Aphrodite & 11.60 & 4.80 & 16.86 & 3.67 & 0.75 \\ Nyx & 11.43 & 5.14 & 16.45 & 3.87 & 0.79 \\ Hera & 11.73 & 4.80 & 21.05 & 4.10 & 0.84 \\ \end{tabular} } \parbox{.49\linewidth}{ \begin{tabular}{?r^l^l^l^l^l^l^l} \multicolumn{6}{c}{\textbf{Right Extensor \%\gls{MVC}}} \\ \rowstyle{\itshape} Pseud. & Mean & Min & Max & SD & SE \\ \midrule Athena & 10.78 & 3.34 & 17.58 & 3.86 & 0.79 \\ Aphrodite & 10.66 & 3.56 & 19.05 & 4.41 & 0.90 \\ Nyx & 10.57 & 3.81 & 21.55 & 4.33 & 0.88 \\ Hera & 10.79 & 4.11 & 19.50 & 4.09 & 0.83 \\ \end{tabular} } \bottomrule \caption{Summaries for the \textit{mean values of} measured muscle activity (\% of \glsfirst{MVC}) in \textit{both typing tests} conducted with each keyboard.} \label{tbl:sum_tkbs_emg} \end{table} \pagebreak \subsection{Questionnaires} \label{sec:res_questionnaires} \subsubsection{Keyboard Comfort Questionnaire} \label{sec:res_kcq} The \glsfirst{KCQ} was filled out by the participants after each individual typing test. 
The questionnaire featured twelve questions regarding the previously used keyboard, which are labelled as follows:
\begin{table}[H]
\centering
\ra{0.8}
\small
\begin{tabular}{llll}
\textbf{KCQ1:} & \textit{``Required operating force during usage?''} & \textbf{KCQ7:} & \textit{``Ease of use?''} \\
\textbf{KCQ2:} & \textit{``Perceived uniformity during usage?''} & \textbf{KCQ8:} & \textit{``Fatigue of the fingers?''} \\
\textbf{KCQ3:} & \textit{``Effort required during usage?''} & \textbf{KCQ9:} & \textit{``Fatigue of the wrists?''} \\
\textbf{KCQ4:} & \textit{``Perceived accuracy?''} & \textbf{KCQ10:} & \textit{``Fatigue of the arms?''} \\
\textbf{KCQ5:} & \textit{``Acceptability of speed?''} & \textbf{KCQ11:} & \textit{``Fatigue of the shoulders?''} \\
\textbf{KCQ6:} & \textit{``Overall satisfaction?''} & \textbf{KCQ12:} & \textit{``Fatigue of the neck?''} \\
\end{tabular}
\end{table}
All questions featured a 7-point Likert scale where 1 always denoted the worst and 7 the best possible experience \cite{iso9241-411}. We conducted Friedman's Tests for all questions and found differences for at least two of the test keyboards in \textit{KCQ3} ($\chi^2$(3) = 9.49, p = 0.024), \textit{KCQ4} ($\chi^2$(3) = 18.4, p = 0.0004), \textit{KCQ6} ($\chi^2$(3) = 10.2, p = 0.017) and \textit{KCQ8} ($\chi^2$(3) = 12.0, p = 0.0075). Further, we noticed a trend towards significance for question \textit{KCQ1} ($\chi^2$(3) = 7.02, p = 0.071). The mean values of all answers can be seen in Figure \ref{fig:kcq_tkbs_res} and the post-hoc tests for the relevant questions are shown in Table \ref{tbl:res_kcq}.
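
The Holm correction applied to the post-hoc tests can be sketched as follows. The notation ($m$, $p_{(i)}$, $\tilde{p}_{(i)}$) is introduced here for illustration only, and the family size $m = 6$ assumes that all pairwise comparisons among the four test keyboards were corrected together, which is an assumption and not stated explicitly. The $m$ unadjusted p values are sorted in ascending order, $p_{(1)} \le \dots \le p_{(m)}$, and the adjusted values are
\[
  \tilde{p}_{(i)} = \max_{j \le i} \, \min\bigl\{1, \; (m - j + 1) \, p_{(j)}\bigr\},
  \qquad i = 1, \dots, m.
\]
A comparison is considered significant if $\tilde{p}_{(i)} < \alpha$. Under this assumption, the smallest unadjusted p value is multiplied by 6, the second smallest by 5, and so on, while the maximum ensures that the adjusted values never decrease along the ordering.
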
\begin{figure}[H] \centering \includegraphics[width=1.0\textwidth]{images/kcq_tkbs_res} \caption{Means of the responses for all questions of the \glsfirst{KCQ}} \label{fig:kcq_tkbs_res} \end{figure} \begin{table}[H] \centering \small \ra{1.3} \begin{tabular}{?l^l^l^l^l^l^l^l} \toprule \rowstyle{\itshape} Y & Comparison & Statistic & p & Estimate & CI & Hypothesis \\ \midrule \multicolumn{6}{l}{\textbf{Non Parametric (Wilcoxon Signed Rank Test)}} \\ KCQ1 & Aphrodite - Athena & 191.5 & 0.051^\dagger & 1.5 & [0.5, 2.5] & two-tailed \\ \midrule KCQ3 & Aphrodite - Athena & 209.5 & 0.03^* & 1.25 & [0.25, 2] & two-tailed \\ KCQ3 & Athena - Hera & 37.0 & 0.022^* & -1.25 & [-2, -0.5] & two-tailed \\ KCQ3 & Athena - Nyx & 31.0 & 0.03^* & -1.5 & [-2.5, -0.5] & two-tailed \\ \midrule KCQ4 & Aphrodite - Nyx & 161.5 & 0.038^* & 1.5 & [0.75, 2.5] & two-tailed \\ KCQ4 & Athena - Hera & 168.5 & 0.072^\dagger & 1.0 & [0.25, 1.5] & two-tailed \\ KCQ4 & Athena - Nyx & 193.5 & 0.006^* & 2.0 & [1, 2.75] & two-tailed \\ \bottomrule \end{tabular} \caption{Post-hoc tests for questions from the \gls{KCQ}. Statistically significant differences (p < 0.05) are marked with an asterisk and p values indicating a trend towards significance are denoted with $\dagger$. Confidence intervals are given for the difference of the location parameter} \label{tbl:res_kcq} \end{table} \subsubsection{User Experience Questionnaire (Short)} \label{sec:res_ueqs}